Maksim Loginov

A PhD Research Project: Hierarchical Decision-Making Under Uncertainty

2026-03-14T00:00:00+00:00

On a recently accepted paper, a proof-of-concept implementation, and the broader direction of the research

Recently the thoughts of Marvin had to step aside for a while: most of my time was devoted to working on research papers that will eventually form part of my PhD dissertation. Fortunately, the effort was worthwhile — the papers have been accepted after peer review. This makes it a good moment to briefly summarize the core ideas behind the research project and the direction in which it is evolving.

The main article can be found here (Russian): https://www.elibrary.ru/item.asp?id=89021776

A proof-of-concept implementation is also available: https://www.elibrary.ru/item.asp?id=88851718

If anyone is interested in reading the full versions, feel free to contact me. Below is a short overview of the conceptual motivation behind the project.

Decision-Making Under Uncertainty

At its core, the problem addressed in this research arises from a simple observation: humans constantly face decisions. Some are trivial, others are strategic, and many are made under incomplete information with consequences that unfold over time. For this reason, decision theory should not be viewed merely as a managerial or applied discipline. Its foundations lie in mathematics, particularly in probability theory, statistics, stochastic processes, and optimization.

The main difficulty is that decisions rarely occur in fully known environments. Both humans and complex technical systems operate under uncertainty. In fact, uncertainty is not merely noise around an otherwise deterministic world; it is a fundamental property of real systems. This is why plans often fail even when they appear logically sound: the environment does not necessarily evolve according to a simple linear model. As a consequence, overly simplified models quickly reach their limits. Real systems evolve with delays, branching processes, hidden states, partial observability, and irregular timing of events. Uncertainty should therefore be treated not as a nuisance, but as a central object of analysis.

Closely related to uncertainty is the notion of entropy. Although the term is best known from physics, Shannon’s information theory interprets entropy as a measure of uncertainty: the average amount of information required to describe the state of a system. The less predictable a system is, the more information is required to understand and manage it.
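To make the link between predictability and information concrete, here is a minimal sketch of Shannon entropy for a discrete distribution. The two example distributions are hypothetical and only illustrate the contrast between a predictable and an unpredictable system.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over four equipment conditions.
predictable = [0.97, 0.01, 0.01, 0.01]    # almost always in the first condition
unpredictable = [0.25, 0.25, 0.25, 0.25]  # all four conditions equally likely

print(shannon_entropy(predictable))    # well below 1 bit
print(shannon_entropy(unpredictable))  # 2.0 bits, the maximum for four outcomes
```

The uniform case needs the most bits per observation, which matches the claim that less predictable systems require more information to describe.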

Reinforcement Learning and Markov Decision Processes

This perspective naturally leads to reinforcement learning (RL), one of the key paradigms of modern machine learning. In general terms, reinforcement learning studies how an agent interacts with an environment, receives feedback in the form of rewards, and gradually learns strategies that maximize long-term performance. However, RL is not merely about selecting profitable actions. It describes a dynamic process in which states, actions, observations, and rewards evolve over time.

A common introductory example is the multi-armed bandit problem. In this scenario, an agent must choose between several options without knowing in advance which one is best. The agent must simultaneously explore new possibilities and exploit accumulated knowledge. This illustrates a fundamental tension in decision-making: the balance between exploration and exploitation.

Extending this idea leads to Markov decision processes (MDPs). In an MDP framework we consider:

an agent
an environment
a set of states
a set of actions
a reward function
and a probabilistic transition function.

The transition function is particularly important. An action does not guarantee a specific outcome; instead, it changes the probability distribution of future states. This probabilistic structure makes the model far more realistic. The world does not respond deterministically. It responds with variability, delays, noisy observations, and unexpected deviations.
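A probabilistic transition function can be sketched as a lookup table from (state, action) to a distribution over next states. The three-state degradation model below, including its state names, action names, and probabilities, is a made-up illustration and not the model from the paper.

```python
import random

# Hypothetical model: TRANSITIONS[state][action] maps next_state -> probability.
TRANSITIONS = {
    "healthy": {
        "run":      {"healthy": 0.90, "worn": 0.10},
        "maintain": {"healthy": 1.00},
    },
    "worn": {
        "run":      {"worn": 0.70, "failed": 0.30},
        "maintain": {"healthy": 0.95, "worn": 0.05},
    },
    "failed": {
        "run":      {"failed": 1.00},
        "maintain": {"healthy": 0.80, "failed": 0.20},
    },
}

def step(state, action, rng):
    """Sample the next state from P(. | state, action)."""
    dist = TRANSITIONS[state][action]
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [step("worn", "run", rng) for _ in range(10_000)]
print(samples.count("failed") / len(samples))  # empirical estimate of the 0.30 entry
```

The same action ("run") from the same state ("worn") yields different outcomes across samples, which is exactly the point: the action shifts probabilities rather than fixing a result.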

Beyond Classical Reinforcement Learning

Despite its strength, classical RL is still insufficient for many real-world systems. In practice:

the environment is only partially observable
events occur at irregular times
relevant variables may be hidden
and computational resources are limited.

A simple state-action transition model is often not expressive enough to capture these realities. This is where the present research project begins. The goal is not to restate reinforcement learning theory, but to adapt it to real decision-making environments where uncertainty, time irregularity, and limited information are fundamental constraints.

The practical motivation for this research lies in industrial maintenance and equipment management. In such contexts, the question of when to intervene in the operation of equipment is critical. Intervening too late can lead to failures and costly repairs, while intervening too early can waste resources and reduce operational efficiency. The key question can be formulated simply: When should a system intervene in the operation of equipment in order to prevent failures while avoiding unnecessary resource expenditure?

In practice, this question is much more complex than it appears. Equipment rarely switches directly from “working” to “broken”. Instead it degrades gradually, operates under changing loads, accumulates hidden defects, and may display ambiguous early signals of future failure. Therefore decisions must rely not only on current observations but also on predictions about possible future trajectories.
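The too-early versus too-late trade-off can be illustrated with the classical age-replacement model from reliability theory, which is a much simpler setting than the one studied in the project. A unit is replaced preventively at age T or correctively at failure, and the long-run cost per unit time is (c_prev * R(T) + c_fail * (1 - R(T))) divided by the integral of R(t) from 0 to T. The Weibull parameters and the costs below are assumptions chosen only for illustration.

```python
import math

def reliability(t, shape, scale):
    """Weibull survival function R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def cost_rate(T, shape, scale, c_prev, c_fail, steps=2000):
    """Long-run cost per unit time when replacing at age T or at failure."""
    dt = T / steps
    # Trapezoidal rule for the expected cycle length: integral of R(t) on [0, T].
    area = sum(
        0.5 * (reliability(i * dt, shape, scale)
               + reliability((i + 1) * dt, shape, scale)) * dt
        for i in range(steps)
    )
    r_T = reliability(T, shape, scale)
    expected_cost = c_prev * r_T + c_fail * (1.0 - r_T)
    return expected_cost / area

# Assumed values: wear-out behavior (shape > 1), failure ten times costlier than prevention.
params = dict(shape=3.0, scale=1000.0, c_prev=1.0, c_fail=10.0)
candidates = range(100, 2001, 50)  # candidate replacement ages in hours
best_T = min(candidates, key=lambda T: cost_rate(T, **params))
print(best_T, cost_rate(best_T, **params))
```

The minimum lies strictly inside the candidate range: replacing very early wastes remaining life, and replacing very late pays for failures. The project's setting adds hidden states and irregular timing on top of this basic tension.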

Hierarchical Decision-Making

To address this challenge, the project introduces a separation between strategic and tactical levels of decision-making. This structure is closely related to hierarchical reinforcement learning. At the strategic level, the system operates with higher-level behavioral scenarios — often described as options in hierarchical RL. These represent classes of actions such as:

initiating preventive maintenance
reallocating resources
switching to a conservative operating mode
postponing intervention
preparing for repair operations.

At the tactical level, the system reacts to the immediate operational context: resource availability, local constraints, sensor signals, and evolving conditions in the environment. In simple terms:

the strategic layer decides what general course of action should be taken
the tactical layer decides how that action should be implemented in the current situation.
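The division of labor between the two layers can be sketched with hand-written rules standing in for learned policies. Every name, threshold, and field below is hypothetical; in the actual framework both layers would be learned rather than coded by hand.

```python
from dataclasses import dataclass

@dataclass
class Context:
    """Hypothetical tactical context observed at execution time."""
    failure_risk: float    # estimated near-term probability of failure
    crew_available: bool   # is a maintenance crew free right now?
    spare_in_stock: bool   # is the required spare part on hand?

def strategic_policy(failure_risk):
    """Strategic layer: decide WHAT general course of action (option) to take."""
    if failure_risk > 0.6:
        return "preventive_maintenance"
    if failure_risk > 0.3:
        return "conservative_mode"
    return "postpone_intervention"

def tactical_policy(option, ctx):
    """Tactical layer: decide HOW to carry out the option in the current context."""
    if option == "preventive_maintenance":
        if not ctx.spare_in_stock:
            return "order_spare_and_reduce_load"
        if not ctx.crew_available:
            return "queue_for_next_free_crew"
        return "start_maintenance_now"
    if option == "conservative_mode":
        return "reduce_load"
    return "continue_normal_operation"

ctx = Context(failure_risk=0.7, crew_available=True, spare_in_stock=False)
option = strategic_policy(ctx.failure_risk)
print(option, "->", tactical_policy(option, ctx))
# preventive_maintenance -> order_spare_and_reduce_load
```

The strategic layer sees only a coarse summary (here, a single risk estimate), while the tactical layer resolves the same option differently depending on crews and spares.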

This separation is not merely conceptual elegance. It also addresses a practical computational issue. Flat decision spaces quickly become intractable as the number of states and actions grows. Hierarchical structures help organize the decision space and can substantially reduce the computational cost of planning and learning. One of the central goals of the project is therefore to create a system that is not only intelligent but also computationally economical. A theoretically optimal model that requires excessive computational resources is rarely useful in real engineering environments.

Human-in-the-Loop

Another fundamental question arises in automated decision systems: Should machines be allowed to fully determine what constitutes a good or bad decision? In critical technical systems, completely removing the human operator from the decision loop would be methodologically questionable. Machines may be consistent and computationally efficient, but humans still provide elements that algorithms struggle to replicate: contextual understanding, professional intuition, and the ability to interpret unusual situations beyond the available data.

For this reason the project explicitly introduces a Human-in-the-Loop architecture. This mechanism does not simply allow an operator to confirm decisions. Instead it integrates human expertise directly into critical decision points. In this framework, the human is no longer an external supervisor of the algorithm. Rather, the operator becomes part of a coupled human-machine decision system. Decisions emerge from the interaction between computational models, observed data, and human judgement. In this sense automation does not replace the decision-maker but forms a symbiotic decision architecture that strengthens system reliability where purely statistical models may fall short.

Toward Economical Decision Systems

Although the current research focuses primarily on hierarchical reinforcement learning, a natural extension of the work leads toward ideas related to Karl Friston’s perspective on efficient behavior and uncertainty management. Friston’s work emphasizes the minimization of free energy, a quantity that upper-bounds surprise (the negative log-probability of observations) and can therefore be read as a measure of uncertainty. In this view, effective decision-making is not just about maximizing rewards, but also about reducing uncertainty and maintaining computational economy. This aligns with the goals of the current research project, which seeks a decision system that is not only effective but also computationally efficient and robust under uncertainty.

This does not mean that the current project is built on active inference or the free-energy principle. Such a claim would be premature. However, the underlying logic of the research points toward a broader question: Can decision-making be understood not only as reward maximization, but also as a process of reducing uncertainty while maintaining computational economy?

In this view, effective decisions should balance:

goal achievement
uncertainty reduction
the cost of representing and computing policies.
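One hypothetical way to write this balance as a single objective is shown below. It is only a sketch: the symbols are assumptions introduced here, not notation from the paper. Here b_t is the agent's belief over hidden states at time t, H is Shannon entropy, C(π) is some measure of the cost of representing and computing the policy π, and the weights λ_H, λ_C ≥ 0 set the trade-off.

```latex
J(\pi) \;=\;
\underbrace{\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right]}_{\text{goal achievement}}
\;-\; \lambda_{H}\,
\underbrace{\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, H(b_t)\right]}_{\text{remaining uncertainty}}
\;-\; \lambda_{C}\,
\underbrace{C(\pi)}_{\text{policy cost}}
```

Setting both weights to zero recovers ordinary discounted reward maximization, so the extra terms can be read as optional regularizers rather than a replacement for the reward.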

Future research may therefore move toward architectures where control systems operate as economical predictive-control loops, minimizing unnecessary complexity while maintaining reliable decision performance. Such a direction could connect hierarchical reinforcement learning with broader principles of information-efficient decision systems.

Current Status and Next Steps

The acceptance of the article and the publication of the proof-of-concept represent an important milestone. They indicate that the conceptual framework has passed an initial stage of external evaluation. However, this should not be seen as a final result. Rather, the project currently represents a transition from an internal research idea to a structured and publicly discussable framework.

The next steps will involve further development of the theoretical model, more extensive simulations, and eventually real-world applications in industrial settings. The most important outcome at this stage is not the completion of the system, but the confirmation that the research direction is coherent, mathematically grounded, and practically motivated.

The broader vision is to create a decision system that can be applied in various domains where uncertainty, time irregularity, and limited information are fundamental challenges. This could include not only industrial maintenance but also areas such as healthcare, finance, and autonomous systems. The ultimate goal is to contribute to the development of intelligent decision systems that are both effective and computationally efficient, capable of operating reliably in complex and uncertain environments.

Launching the StrataMar Project

2026-03-07T00:00:00+00:00

Over the past several years my research has been focused on reinforcement learning, decision theory, and the broader problem of decision-making under uncertainty.

Many real-world systems—industrial infrastructure, maintenance operations, logistics networks, and complex technical installations—operate in environments where decisions must be made with incomplete information, changing conditions, and limited resources.

Traditional optimization approaches often assume stable models of the environment. In practice, however, industrial systems rarely behave in such a predictable way. Uncertainty, delayed effects of decisions, and partial observability are the rule rather than the exception.

These observations gradually led to the idea of building a research and engineering platform for studying adaptive decision-making systems.

This idea became the foundation of the StrataMar project.

What is StrataMar

StrataMar is a research initiative focused on the development of intelligent decision-support systems for complex environments.

The project explores how modern reinforcement learning methods can be combined with probabilistic reasoning, constrained optimization, and simulation-based experimentation in order to support decision-making in uncertain and dynamic systems.

Rather than focusing purely on prediction or purely on static optimization, the project investigates adaptive decision policies that evolve through interaction with simulated environments.

In practical terms, StrataMar aims to provide a framework where intelligent agents can learn how to operate in complex systems while respecting operational constraints such as safety, resource limitations, and economic costs.

Why Reinforcement Learning

Reinforcement learning provides a natural mathematical framework for modeling sequential decision processes.

In this framework an agent interacts with an environment over time and gradually improves its policy by observing the consequences of its actions.

This paradigm is particularly suitable for systems where:

decisions are sequential
system dynamics are uncertain
the environment evolves over time
actions have delayed consequences

Such problems naturally correspond to Markov Decision Processes and their extensions.
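As a minimal illustration of an agent improving its policy through interaction, the sketch below runs tabular Q-learning on a toy four-state chain. The environment, its 10% slip probability, and all hyperparameters are assumptions made for this example; none of it comes from the StrataMar platform.

```python
import random

N_STATES, GOAL = 4, 3   # hypothetical chain: states 0..3, state 3 is terminal
ACTIONS = (0, 1)        # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def env_step(state, action, rng):
    """Move along the chain; with probability 0.1 the action has no effect."""
    if rng.random() < 0.1:
        nxt = state
    else:
        nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def choose_action(q_row, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: (q_row[a], rng.random()))

def train(episodes=2000, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q[state], rng)
            nxt, reward, done = env_step(state, action, rng)
            # Q-learning target: bootstrap from the best next action unless terminal.
            target = reward + (0.0 if done else GAMMA * max(q[nxt]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
            if done:
                break
    return q

q_table = train()
greedy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy)  # greedy action per non-terminal state
```

The reward arrives only at the goal, so the value of earlier "move right" decisions is learned through bootstrapping. This is the delayed-consequence property from the list above in its simplest form.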

However, applying reinforcement learning in real operational environments raises additional challenges. Industrial systems often involve safety constraints, partial observability, and uncertain system dynamics.

Addressing these issues requires extending classical reinforcement learning approaches with ideas from decision theory and robust optimization.

Research Directions

The StrataMar project explores several interconnected research directions.

Decision-Making Under Uncertainty

Real systems rarely provide perfect information. Sensors can be noisy, system states may be partially observable, and unexpected events can occur.

For this reason the project investigates methods for representing and reasoning about uncertainty within reinforcement learning frameworks.
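One standard way to represent uncertainty about a partially observable state is a belief, meaning a probability distribution over hidden states that is updated with a discrete Bayes filter. The sketch below assumes a three-state degradation model and a binary alarm sensor. All probabilities are illustrative and not taken from the project.

```python
STATES = ("healthy", "degraded", "critical")

# Hypothetical models; every number here is an assumption for illustration.
TRANSITION = {   # P(next | current) over one time step without intervention
    "healthy":  {"healthy": 0.90, "degraded": 0.09, "critical": 0.01},
    "degraded": {"healthy": 0.00, "degraded": 0.85, "critical": 0.15},
    "critical": {"healthy": 0.00, "degraded": 0.00, "critical": 1.00},
}
ALARM_PROB = {"healthy": 0.05, "degraded": 0.40, "critical": 0.90}  # P(alarm | state)

def belief_update(belief, alarm):
    """One step of a discrete Bayes filter: predict, then correct."""
    # Predict: push the belief through the transition model.
    predicted = {
        s2: sum(belief[s1] * TRANSITION[s1][s2] for s1 in STATES) for s2 in STATES
    }
    # Correct: weight each state by the likelihood of the observation.
    likelihood = {s: ALARM_PROB[s] if alarm else 1.0 - ALARM_PROB[s] for s in STATES}
    unnormalized = {s: predicted[s] * likelihood[s] for s in STATES}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}

belief = {"healthy": 1.0, "degraded": 0.0, "critical": 0.0}
for alarm in (False, True, True):
    belief = belief_update(belief, alarm)
    print({s: round(p, 3) for s, p in belief.items()})
```

A policy can then act on the belief instead of on a single assumed state, which is the basic idea behind POMDP-style formulations.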

Hierarchical and Multi-Agent Systems

Large technical systems are rarely controlled by a single decision process. Instead they consist of multiple interacting subsystems operating at different temporal and functional scales.

StrataMar explores hierarchical decision architectures where different layers are responsible for strategic planning, operational control, and local decision processes.

Constrained and Robust Reinforcement Learning

In industrial environments decisions must satisfy operational constraints such as safety requirements, resource availability, and economic limitations.

The project studies reinforcement learning methods that explicitly incorporate constraints and robustness considerations.
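A simple way to enforce hard constraints is action masking: actions that violate a rule are excluded before the policy chooses. The sketch below is a minimal version with made-up safety rules, action names, and action values.

```python
import math

ACTIONS = ("run_full_load", "run_reduced_load", "inspect", "shutdown")

def allowed_actions(temperature_c, crew_available):
    """Hypothetical safety rules expressed as a boolean mask over ACTIONS."""
    return [
        temperature_c < 90.0,  # full load only below the (assumed) temperature limit
        True,                  # reduced load is always permitted
        crew_available,        # inspection needs a free crew
        True,                  # shutdown is always permitted
    ]

def masked_greedy(q_values, mask):
    """Pick the highest-valued action among those the mask allows."""
    scores = [q if ok else -math.inf for q, ok in zip(q_values, mask)]
    return ACTIONS[scores.index(max(scores))]

q_values = [5.0, 3.0, 4.0, -1.0]  # illustrative action values from some learned policy
print(masked_greedy(q_values, allowed_actions(95.0, crew_available=False)))
# run_reduced_load
```

Without the mask the agent would pick "run_full_load". With it, the two forbidden actions are never candidates, regardless of how highly they are valued.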

Simulation-Driven Experimentation

Before intelligent decision systems can be deployed in real-world environments they must be extensively tested.

A central component of the StrataMar initiative is therefore the development of simulation environments that allow experimentation with different decision policies and system configurations.

What Makes the Approach Different

Many existing industrial AI systems focus primarily on prediction. They estimate the probability of equipment failure, detect anomalies in sensor data, or forecast demand.

While prediction is useful, it does not directly answer the more important operational question:

What decision should be taken next?

Other systems rely on classical optimization techniques. These methods are effective when system models are well understood and relatively stable, but they often struggle in environments characterized by uncertainty and dynamic changes.

The StrataMar approach attempts to bridge this gap.

Instead of focusing only on prediction or only on optimization, the project investigates decision architectures that combine:

sequential decision modeling
reinforcement learning
uncertainty-aware reasoning
constrained optimization
simulation-based validation

The goal is not simply to train agents that perform well in benchmark environments, but to explore the design of robust decision-support systems capable of operating in uncertain real-world conditions.

Architectural Perspective

A distinguishing feature of the StrataMar project is its architectural perspective.

Rather than designing a single intelligent agent, the system is conceptualized as a hierarchical decision architecture.

Different layers of the system operate at different time scales and levels of abstraction:

strategic planning layers
operational decision agents
monitoring and simulation layers

This structure reflects the way complex technical systems are typically managed and enables coordination between multiple decision processes.

Relation to Ongoing Research

The StrataMar initiative is closely connected to my ongoing research on reinforcement learning and decision-making under uncertainty.

The broader objective of this work is to explore how ideas from reinforcement learning, probabilistic reasoning, and decision theory can be transformed into practical engineering tools.

In this sense StrataMar serves both as:

a research platform for experimentation
a prototype architecture for future decision-support systems

Looking Ahead

The project is currently in its early stages.

The immediate focus is on the development of the simulation environment and the core architecture of the platform.

Future work will include:

experimental studies of reinforcement learning algorithms
investigation of uncertainty modeling techniques
development of hierarchical and multi-agent decision architectures
exploration of real-world application scenarios

The long-term objective is to better understand how adaptive decision systems can operate in complex environments characterized by uncertainty, constraints, and evolving system dynamics.

This article marks the beginning of the StrataMar research initiative.

Project Links

Website: https://stratamar.net
Research Blog: https://logmaks.github.io

Final Vision of a Hierarchical Multi-Agent Reinforcement Learning Framework for Maintenance and Repair (MRO)

2026-02-15T00:00:00+00:00

Abstract

This paper considers a hierarchical multi-agent reinforcement learning approach for optimizing technical maintenance and repair under uncertainty. A model is proposed that integrates strategic planning and tactical coordination with a safe learning mechanism.

Short Description

This paper proposes a hierarchical multi-agent reinforcement learning (HRL–MARL) framework for maintenance and repair (MRO) optimization in industrial environments under partial observability, stochastic failures, and hard operational constraints. The production–maintenance system is modeled as a Dec-POMDP, where multiple agents (equipment units and maintenance resources) act on local observations augmented with predictive maintenance (PdM) signals (e.g., RUL estimates and failure probabilities). To ensure safe operation and regulatory compliance, the decision process is constrained via a CMDP formulation using action masking (shielding) and Lagrangian penalty updates.
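The Lagrangian part of a CMDP formulation can be sketched generically. The policy is trained on a penalized reward r - λc, and the multiplier λ is adjusted by dual ascent so that the average constraint cost approaches its limit. This shows the standard technique, not the paper's exact update. The function names, learning rate, and cost trace are assumptions.

```python
def lagrangian_reward(reward, cost, lam):
    """Penalized reward r - lambda * c used as the policy's training signal."""
    return reward - lam * cost

def update_multiplier(lam, avg_cost, cost_limit, lr=0.05):
    """Dual ascent: raise lambda while the constraint is violated, lower it otherwise."""
    return max(0.0, lam + lr * (avg_cost - cost_limit))

# Made-up trace of the average constraint cost per training batch.
observed_costs = [4.0, 3.5, 3.0, 2.2, 1.8, 1.5]
cost_limit = 2.0

lam = 0.0
history = []
for avg_cost in observed_costs:
    lam = update_multiplier(lam, avg_cost, cost_limit)
    history.append(round(lam, 3))
print(history)
```

The multiplier grows while the cost exceeds the limit and shrinks once the policy becomes safer than required. The projection onto non-negative values keeps the penalty from turning into a bonus.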

The control architecture is split into two time scales: a strategic level (SMDP over options) selects high-level maintenance decisions (e.g., maintenance windows, prioritization, resource allocation), while a tactical level executes these options through multi-agent coordination under CTDE (centralized training, decentralized execution). A human-in-the-loop approval operator is introduced at the strategic level to validate or reject proposed options; rejections trigger a safe fallback policy and are also incorporated into the reward to discourage actions that increase operator intervention.
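The approval step described above can be sketched as a small gate around the strategic decision. The option names, the stub rewards, and the size of the rejection penalty are all hypothetical. The operator is simulated by a function, whereas in practice this would be a human decision.

```python
# Stub environment rewards per executed option (illustrative numbers only).
REWARDS = {
    "open_maintenance_window": 2.0,
    "emergency_shutdown": -3.0,
    "keep_current_schedule": 0.5,
}

def execute(option):
    """Stand-in for running an option in the environment and observing its reward."""
    return REWARDS[option]

def operator_approves(option):
    """Stand-in for the human operator; this one vetoes emergency shutdowns."""
    return option != "emergency_shutdown"

def review_and_execute(proposed, approve, run,
                       fallback="keep_current_schedule", penalty=1.0):
    """Return (executed_option, learning_reward) after operator review."""
    if approve(proposed):
        return proposed, run(proposed)
    # Rejected: execute the safe fallback and penalize the proposal
    # so the agent learns to avoid options that trigger intervention.
    return fallback, run(fallback) - penalty

print(review_and_execute("open_maintenance_window", operator_approves, execute))
print(review_and_execute("emergency_shutdown", operator_approves, execute))
```

The rejected proposal never reaches the equipment, and the penalty reaches the learner. This way the operator shapes future proposals rather than only filtering current ones.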

A simulation study on a digital twin of an industrial equipment fleet compares the proposed method against baseline maintenance policies (reactive and schedule-based). Results indicate a shift from emergency repairs toward more preventive planning under limited resources, reducing failures while respecting constraints, and improving robustness in high-uncertainty regimes.

Concept note / early-stage research (2025): Marginal Hierarchical Reinforcement Learning for Asymmetric Multi-Agent Systems

2025-11-07T00:00:00+00:00

Summary

In classical multi-agent reinforcement learning (MARL), agents share the same learning framework but operate under fixed computational and communication assumptions. Real systems are rarely homogeneous. Each autonomous unit—drone, robot, vehicle—has different resources, observation limits, and safety requirements. The proposed concept, Marginal Hierarchical Reinforcement Learning (M-HRL), treats computation as a scarce, allocatable resource.

Abstract

In classical multi-agent reinforcement learning (MARL), agents share the same learning framework but operate under fixed computational and communication assumptions. Real-world systems, however, are rarely homogeneous. Each autonomous unit—be it a drone, robot, or vehicle—possesses different resources, observation limits, and safety requirements. This asymmetry introduces a challenge that conventional MARL fails to address: how to coordinate when computation itself becomes a scarce and distributable resource.

The proposed concept, Marginal Hierarchical Reinforcement Learning (M-HRL), extends the standard hierarchical RL paradigm beyond temporal or spatial decomposition. Instead of structuring policies around subgoals, it structures them around computational priorities. A leader agent dynamically allocates the available “budget” between safety and task-oriented layers across the coalition. The safety layer ensures collision-free motion, while the task layer focuses on mission optimization. This yields a lexicographic hierarchy of control: safety dominates reward pursuit.

Unlike pure swarm algorithms, which rely solely on local interactions, M-HRL maintains a soft form of centrality. Leadership is not permanent; it migrates based on connection quality and agent reliability. Thus, the system avoids both extremes: full decentralization (which risks incoherence) and total centralization (which risks fragility). This creates a self-stabilizing architecture—one that can tolerate communication loss and recover coordination through distributed election.
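Migrating leadership can be sketched as a re-election that runs whenever link conditions change. The scoring rule (link quality times reliability), the reachability threshold, and the agent names below are assumptions. The concept note does not specify the election rule.

```python
def elect_leader(agents, min_link_quality=0.3):
    """Pick the reachable agent with the highest link_quality * reliability score."""
    reachable = {
        name: a for name, a in agents.items()
        if a["link_quality"] >= min_link_quality
    }
    if not reachable:
        return None  # no agent can currently act as leader
    return max(
        reachable,
        key=lambda n: reachable[n]["link_quality"] * reachable[n]["reliability"],
    )

# Hypothetical coalition with values in [0, 1].
agents = {
    "drone_a": {"link_quality": 0.9, "reliability": 0.80},
    "drone_b": {"link_quality": 0.7, "reliability": 0.95},
    "rover_c": {"link_quality": 0.5, "reliability": 0.90},
}

print(elect_leader(agents))           # drone_a
agents["drone_a"]["link_quality"] = 0.1  # simulated link degradation
print(elect_leader(agents))           # drone_b (handover)
```

Because any agent can run the same rule on shared link statistics, leadership is a role that moves through the coalition rather than a fixed property of one unit.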

Philosophically, this design echoes Arrow’s Impossibility Theorem: with three or more alternatives, no rule for aggregating individual preference rankings can satisfy unrestricted domain, Pareto efficiency, and independence of irrelevant alternatives without being dictatorial. The leader’s computational authority is a pragmatic compromise, an “engineered dictatorship” of resources that preserves collective safety under rational constraints. Marginal HRL thus stands at the intersection of control theory, game theory, and machine intelligence, representing a step toward adaptive, resource-aware collectives capable of both autonomy and coherence.

Key Idea

M-HRL extends hierarchical RL beyond temporal/spatial subgoals toward computational priorities:

A leader forms a dynamic coalition and allocates the available budget between two layers:
- Safety Layer — collision avoidance with hard priority.
- Task Layer — mission optimization with adaptive share.
The overall control is lexicographic: safety dominates reward pursuit.
Leadership is transferable under link loss or degraded quality (handover).

Architecture (sketch)

CTDE training; decentralized execution.
Coalition selection + budget allocation: \(b^S \ge b_{\min},\; b^T \ge 0\).
Merged action \(a = \Phi(a^S, a^T)\), safety-first.
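The budget constraint and the safety-first merge can be sketched as follows. The sketch assumes that the two shares sum to a fixed total, that the safety share grows linearly with a risk estimate in [0, 1], and that the safety layer returns None when it has no objection. None of these details is fixed by the concept note.

```python
def allocate_budget(total, b_min, risk):
    """Split a compute budget so that b_safety >= b_min and b_task >= 0."""
    if total < b_min:
        raise ValueError("total budget cannot cover the safety minimum")
    risk = min(1.0, max(0.0, risk))            # clamp the risk estimate to [0, 1]
    b_safety = b_min + risk * (total - b_min)  # safety share grows with risk
    b_task = total - b_safety                  # the task layer gets the remainder
    return b_safety, b_task

def merge_actions(a_safety, a_task):
    """Phi(a^S, a^T): the safety action wins whenever the safety layer issues one."""
    return a_safety if a_safety is not None else a_task

print(allocate_budget(total=10.0, b_min=2.0, risk=0.25))  # (4.0, 6.0)
print(merge_actions(None, "go_to_waypoint"))              # go_to_waypoint
print(merge_actions("brake", "go_to_waypoint"))           # brake
```

The merge is lexicographic in the sense used above: the task action is consulted only when the safety layer is silent, so no amount of task reward can outvote a safety action.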

References:

Loginov, M. (2025). Marginal Hierarchical Reinforcement Learning for Asymmetric Multi-Agent Systems (Concept Note). For collaboration or feedback, please use the contact page of this site.

Hello, Journal

2025-09-26T00:00:00+00:00

Hello! I’m Maksim Loginov, a researcher and developer passionate about Reinforcement Learning (RL), Multi-Agent Systems, and intelligent automation. My work focuses on applying advanced machine learning methods to optimize complex industrial processes — particularly in maintenance planning, diagnostics, and reliability management.

I combine my background in Python, C++, and data analysis with a strong interest in AI research and system design. Currently, I’m exploring how hierarchical and safe reinforcement learning can enhance decision-making in real-world industrial environments.

On this page, I’ll share my ongoing projects, research notes, and experiments — from simulation-based studies to practical implementations. If you’re interested in AI, RL, or industrial automation, you’re in the right place.

Welcome aboard — and stay tuned for updates!