Reinforcement Learning Under Epistemic Uncertainty: From Markov Models to Evidential Decision Processes

Reinforcement Learning Under Epistemic Uncertainty

From Markov Models to Evidential Decision Processes

This research project investigates how sequential decision-making under uncertainty can be reformulated when classical probabilistic assumptions are insufficient.

The core thesis is straightforward:

Classical Markov Decision Processes (MDP) and Hidden Markov Models (HMM) are computationally powerful but epistemically fragile.

When uncertainty is not purely stochastic but partially epistemic (model uncertainty, sensor ambiguity, structural ignorance), Bayesian posterior inference becomes overconfident. This project develops a mathematically grounded framework that integrates:

Markovian structure (HMC / PMC / TMC),
Dempster–Shafer theory of evidence,
Reinforcement Learning (RL),

into a unified Evidential–Markov Decision framework.

1. Motivation: Limits of Classical HMM/MDP

In classical Hidden Markov Chains (HMC), the joint distribution factorizes as:

\[p(x, y) = p(x_1)\prod_{i=2}^n p(x_i|x_{i-1}) \prod_{i=1}^n p(y_i|x_i)\]

This structure ensures that posterior marginals $p(x_i

y)$ remain tractable via forward–backward recursion.

However, in real industrial systems:

Sensors are imperfect or partially reliable,
Priors are imprecise,
Observations may correspond to sets of hypotheses,
Degradation models drift over time.

Strict Bayesian updating forces epistemic uncertainty into precise probabilities — producing artificial confidence.

This is structurally incorrect.

2. Theoretical Backbone: Markov Chains + Evidence Theory

The theoretical basis relies on extending:

Hidden Markov Chains (HMC),
Pairwise Markov Chains (PMC),
Triplet Markov Chains (TMC),

using Dempster–Shafer fusion.

The key insight from evidential Markov models:

Replace probabilistic priors with belief assignments,
Replace likelihoods with evidential observations,
Preserve Markov factorization,
Retain linear-time computability of marginals.

The crucial structural condition is:

\[p(v)\propto\prod_{i=1}^{n-1}q_i(v_i, v_{i+1})\]

If this factorization holds, posterior marginals remain computable in linear time — even under belief-based fusion.

This preserves tractability while increasing epistemic robustness.

3. From Evidential Markov Chains to DST–MDP

The central move of this project:

Replace transition probabilities in MDP with belief assignments.

Instead of:

\[P(s'|s,a)\]

we define:

\[M_{s,a}(A), \quad A \subseteq \mathcal{S}\]

where $M_{s,a}$ is a basic belief assignment over subsets of states.

This leads to:

Belief-based transition dynamics,
Imprecise value functions,
Robust policy evaluation,
Explicit representation of ignorance.

The Bellman equation generalizes into a belief-space operator.

Key research question:

Under what structural constraints does a belief-MDP remain computationally tractable?

The answer lies in Markov factorization inherited from PMC/TMC theory.

4. Industrial Motivation

In industrial decision systems:

Diagnostics are uncertain,
Sensors are heterogeneous,
Safety constraints are strict,
Model misspecification is unavoidable.

A purely probabilistic RL agent may:

Overfit to incorrect priors,
Exploit spurious correlations,
Fail under distribution shift.

An evidential RL agent:

Separates aleatory and epistemic uncertainty,
Represents ignorance explicitly,
Enables conservative, safety-aware policies.

This directly supports:

Predictive Maintenance (PdM),
Hierarchical RL under constraints,
Safe RL in partially observable environments.

5. Computational Feasibility

The concern is obvious: belief models explode combinatorially.

However, if:

\[M(v) \propto \prod_{i=1}^{n-1} q_i(v_i, v_{i+1})\]

then marginal computation requires:

\[\mathcal{O}(n)\]

operations in sequence length $n$.

The complexity increases with the size of the belief state space, but remains linear in horizon length.

Thus:

Epistemic generalization does not necessarily imply exponential blow-up,
Structural discipline preserves scalability.

6. Research Directions

6.1 Formal DST–MDP Theory

Define belief-space Bellman operators,
Study contraction properties,
Analyze convergence guarantees.

6.2 Robust Policy Optimization

Conservative policy iteration,
Risk-sensitive objectives under belief bounds,
Worst-case value estimation.

6.3 Multi-Agent Extensions

Evidential MARL,
Decentralized sensor fusion,
Distributed belief coordination.

6.4 Hierarchical Integration

Strategic layer: belief abstraction,
Tactical layer: probabilistic execution,
Bridge via SMDP options.

7. Conceptual Position

This work sits at the intersection of:

Stochastic processes,
Decision theory,
Epistemic uncertainty modeling,
Reinforcement learning.

It challenges the implicit assumption that:

All uncertainty is probabilistic.

Instead, it asks:

What if part of the uncertainty reflects structural ignorance rather than stochastic noise?

8. Strategic Objective

The long-term goal is the development of a mathematically rigorous framework for:

Robust Reinforcement Learning under Epistemic Uncertainty.

This requires:

Maintaining Markov tractability,
Preserving computational feasibility,
Extending beyond strict Bayesian formalism,
Embedding robustness directly into policy evaluation.

If classical RL optimizes under risk,
this project aims to optimize under uncertainty about the risk model itself.

Updated on 15 Feb 2025.