| MDP & Environment |
Agent-Environment Interaction Loop |
 |
Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state |
All RL algorithms |
| MDP & Environment |
Markov Decision Process (MDP) Tuple |
 |
(S, A, P, R, γ) with transition dynamics P(s′\|s,a) and reward function R(s,a,s′) |
All RL algorithms |
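As a concrete anchor for the tuple above, here is a minimal sketch of a hypothetical two-state MDP written out as plain Python data; the states, actions, and numbers are invented purely for illustration.

```python
# Hypothetical two-state MDP expressed as the (S, A, P, R, gamma) tuple.
S = ["s0", "s1"]                       # state set
A = ["stay", "move"]                   # action set
P = {                                  # P[s][a] -> {s_next: probability}
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}
R = {                                  # R[s][a] -> expected immediate reward
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.0, "move": 0.0},
}
gamma = 0.99                           # discount factor
```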
| MDP & Environment |
State Transition Graph |
 |
Full probabilistic transitions between discrete states |
Gridworld, Taxi, Cliff Walking |
| MDP & Environment |
Trajectory / Episode Sequence |
 |
Sequence of (s₀, a₀, r₁, s₁, …, s_T) |
Monte Carlo, episodic tasks |
| MDP & Environment |
Continuous State/Action Space Visualization |
 |
High-dimensional spaces (e.g., robot joints, pixel inputs) |
Continuous-control tasks (MuJoCo, PyBullet) |
| MDP & Environment |
Reward Function / Landscape |
 |
Scalar reward as function of state/action |
All algorithms; especially reward shaping |
| MDP & Environment |
Discount Factor (γ) Effect |
 |
How future rewards are weighted |
All discounted MDPs |
| Value & Policy |
State-Value Function V(s) |
 |
Expected return from state s under policy π |
Value-based methods |
| Value & Policy |
Action-Value Function Q(s,a) |
 |
Expected return from state-action pair |
Q-learning family |
| Value & Policy |
Policy π(s) or π(a\|s) |
 
Mapping from states to actions (deterministic or stochastic) |
Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps |
| Value & Policy |
Advantage Function A(s,a) |
 |
Q(s,a) – V(s) |
A2C, PPO, SAC, TD3 |
| Value & Policy |
Optimal Value Function V* / Q* |
 |
Solution to Bellman optimality |
Value iteration, Q-learning |
| Dynamic Programming |
Policy Evaluation Backup |
 |
Iterative update of V using Bellman expectation |
Policy iteration |
| Dynamic Programming |
Policy Improvement |
 |
Greedy policy update over Q |
Policy iteration |
| Dynamic Programming |
Value Iteration Backup |
 |
Update using Bellman optimality |
Value iteration |
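A minimal sketch of the Bellman-optimality backup over a tabular MDP, assuming the dictionary layout from the MDP sketch above (P[s][a] maps next states to probabilities, R[s][a] is the expected reward); an illustration, not a production implementation.

```python
def value_iteration(S, A, P, R, gamma=0.99, theta=1e-8):
    """Sweep V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ] until the
    largest change in a sweep falls below theta."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            backup = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in A
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < theta:
            return V
```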
| Dynamic Programming |
Policy Iteration Full Cycle |
 |
Evaluation → Improvement loop |
Classic DP methods |
| Monte Carlo |
Monte Carlo Backup |
 |
Update using full episode return G_t |
First-visit / every-visit MC |
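A minimal first-visit Monte Carlo sketch; the episode format, a list of (state, reward) pairs meaning (s_t, r_{t+1}), is an assumption made here for illustration.

```python
def first_visit_mc(episode, V, counts, gamma=0.99):
    """Compute G_t backwards over one finished episode, then move V(s) toward the
    return observed at the first visit of each state."""
    G, tagged = 0.0, []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        tagged.append((state, G))
    tagged.reverse()
    seen = set()
    for state, G in tagged:
        if state in seen:                      # first-visit: later visits are ignored
            continue
        seen.add(state)
        counts[state] = counts.get(state, 0) + 1
        V[state] = V.get(state, 0.0) + (G - V.get(state, 0.0)) / counts[state]
```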
| Monte Carlo |
Monte Carlo Tree (MCTS) |
 |
Search tree with selection, expansion, simulation, backprop |
AlphaGo, AlphaZero |
| Monte Carlo |
Importance Sampling Ratio |
 |
Off-policy correction ρ = π(a\|s) / b(a\|s) |
Off-policy MC, off-policy TD |
| Temporal Difference |
TD(0) Backup |
 |
Bootstrapped update using R + γV(s′) |
TD learning |
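A minimal tabular TD(0) sketch (the dictionary-backed value table is an assumption for illustration): the update bootstraps on the current estimate of V(s′) instead of waiting for the full return.

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """V(s) <- V(s) + alpha * [ r + gamma * V(s') - V(s) ]."""
    bootstrap = 0.0 if done else gamma * V.get(s_next, 0.0)
    td_error = r + bootstrap - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```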
| Temporal Difference |
Bootstrapping (general) |
 |
Using estimated future value instead of full return |
All TD methods |
| Temporal Difference |
n-step TD Backup |
 |
Multi-step return G_t^{(n)} |
n-step TD, TD(λ) |
| Temporal Difference |
TD(λ) & Eligibility Traces |
 |
Decaying trace z_t for credit assignment |
TD(λ), SARSA(λ), Q(λ) |
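A minimal accumulating-trace TD(λ) step, assuming dictionary-backed tables for V and the traces z; the point is that one TD error updates every recently visited state in proportion to its decaying trace.

```python
def td_lambda_step(V, z, s, r, s_next, done, alpha=0.1, gamma=0.99, lam=0.9):
    """Accumulate a trace for the visited state, then spread the TD error over all
    traced states and decay every trace by gamma * lambda."""
    delta = r + (0.0 if done else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)
    z[s] = z.get(s, 0.0) + 1.0
    for state in list(z):
        V[state] = V.get(state, 0.0) + alpha * delta * z[state]
        z[state] *= gamma * lam
```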
| Temporal Difference |
SARSA Update |
 |
On-policy TD control |
SARSA |
| Temporal Difference |
Q-Learning Update |
 |
Off-policy TD control |
Q-learning, Deep Q-Network |
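A minimal tabular Q-learning sketch (Q stored as a dict keyed by (state, action), an assumption for illustration): the target bootstraps on max over a′ of Q(s′, a′), which is what makes the update off-policy.

```python
def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]."""
    best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
```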
| Temporal Difference |
Expected SARSA |
 |
Expectation over next action under policy |
Expected SARSA |
| Temporal Difference |
Double Q-Learning / Double DQN |
 |
Two separate Q estimators to reduce overestimation |
Double DQN, TD3 |
| Temporal Difference |
Dueling DQN Architecture |
 |
Separate streams for state value V(s) and advantage A(s,a) |
Dueling DQN |
| Temporal Difference |
Prioritized Experience Replay |
 |
Sampling transitions with probability proportional to TD error, with importance-sampling correction |
Prioritized DQN, Rainbow |
| Temporal Difference |
Rainbow DQN Components |
 |
All extensions combined (Double, Dueling, PER, etc.) |
Rainbow DQN |
| Function Approximation |
Linear Function Approximation |
 |
Feature vector φ(s) → wᵀφ(s) |
Tabular → linear FA |
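A minimal semi-gradient TD(0) sketch with a linear value function V(s) = wᵀφ(s); the feature function phi is assumed to be supplied by the caller and to return a NumPy vector.

```python
import numpy as np

def semi_gradient_td0(w, phi, s, r, s_next, done, alpha=0.01, gamma=0.99):
    """Linear value estimate wᵀφ(s); the gradient of the estimate w.r.t. w is just φ(s)."""
    v_s = w @ phi(s)
    v_next = 0.0 if done else w @ phi(s_next)
    delta = r + gamma * v_next - v_s          # TD error
    w += alpha * delta * phi(s)               # in-place update of the weight vector
    return w
```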
| Function Approximation |
Neural Network Layers (MLP, CNN, RNN, Transformer) |
 |
Full deep network for value/policy |
DQN, A3C, PPO, Decision Transformer |
| Function Approximation |
Computation Graph / Backpropagation Flow |
 |
Gradient flow through network |
All deep RL |
| Function Approximation |
Target Network |
 |
Frozen copy of Q-network for stability |
DQN, DDQN, SAC, TD3 |
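A minimal sketch of the two common ways of refreshing a target network, assuming parameters are stored as dicts of NumPy arrays (an illustration-only layout, not any library's API).

```python
def hard_update(target, online):
    """DQN-style: copy the online weights into the frozen target every N steps."""
    for name in online:
        target[name] = online[name].copy()

def soft_update(target, online, tau=0.005):
    """Polyak averaging (DDPG/TD3/SAC-style): target <- tau*online + (1-tau)*target."""
    for name in online:
        target[name] = tau * online[name] + (1.0 - tau) * target[name]
```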
| Policy Gradients |
Policy Gradient Theorem |
 |
∇_θ J(θ) = E[∇_θ log π(a\|s) · Q^π(s,a)] |
Flow diagram from reward → log-prob → gradient |
| Policy Gradients |
REINFORCE Update |
 |
Monte-Carlo policy gradient |
REINFORCE |
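A minimal REINFORCE sketch in plain Python: given per-step log-probabilities and rewards from one episode (assumed inputs), it builds the Monte-Carlo returns and the surrogate loss whose gradient is the REINFORCE estimator.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * G_{t+1}, computed backwards over one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def reinforce_loss(log_probs, returns):
    """Minimizing -sum_t log pi(a_t|s_t) * G_t pushes probability toward high-return actions."""
    return -sum(lp * G for lp, G in zip(log_probs, returns))
```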
| Policy Gradients |
Baseline / Advantage Subtraction |
 |
Subtract b(s) to reduce variance |
All modern PG |
| Policy Gradients |
Trust Region (TRPO) |
 |
KL-divergence constraint on policy update |
TRPO |
| Policy Gradients |
Proximal Policy Optimization (PPO) |
 |
Clipped surrogate objective |
PPO, PPO-Clip |
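A minimal NumPy sketch of the clipped surrogate; the per-sample new/old log-probabilities and advantages are assumed to come from the caller's rollout buffer.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Mean over samples of min( r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t ),
    with r_t = pi_new(a|s) / pi_old(a|s)."""
    ratio = np.exp(log_prob_new - log_prob_old)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage).mean()
```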
| Actor-Critic |
Actor-Critic Architecture |
 |
Separate or shared actor (policy) + critic (value) networks |
A2C, A3C, SAC, TD3 |
| Actor-Critic |
Advantage Actor-Critic (A2C/A3C) |
 |
Synchronous/asynchronous multi-worker |
A2C/A3C |
| Actor-Critic |
Soft Actor-Critic (SAC) |
 |
Entropy-regularized policy + twin critics |
SAC |
| Actor-Critic |
Twin Delayed DDPG (TD3) |
 |
Twin critics + delayed policy + target smoothing |
TD3 |
| Exploration |
ε-Greedy Strategy |
 |
Probability ε of random action |
DQN family |
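A minimal ε-greedy sketch over a dict-backed Q-table (layout assumed for illustration).

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore uniformly at random; otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```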
| Exploration |
Softmax / Boltzmann Exploration |
 |
Temperature τ in softmax |
Softmax policies |
| Exploration |
Upper Confidence Bound (UCB) |
 |
Optimism in face of uncertainty |
UCB1, bandits |
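A minimal UCB1-style selection sketch for a bandit, assuming running mean values and pull counts per arm; untried arms are forced first.

```python
import math

def ucb_action(values, counts, t, c=2.0):
    """argmax_a [ mean(a) + c * sqrt( ln t / N(a) ) ]; the bonus shrinks as an arm is pulled."""
    for a, n in counts.items():
        if n == 0:
            return a                                   # optimism: try untested arms first
    return max(values, key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]))
```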
| Exploration |
Intrinsic Motivation / Curiosity |
 |
Prediction error as intrinsic reward |
ICM, RND, Curiosity-driven RL |
| Exploration |
Entropy Regularization |
 |
Bonus term αH(π) |
SAC, maximum-entropy RL |
| Hierarchical RL |
Options Framework |
 |
High-level policy over options (temporally extended actions) |
Option-Critic |
| Hierarchical RL |
Feudal Networks / Hierarchical Actor-Critic |
 |
Manager-worker hierarchy |
Feudal RL |
| Hierarchical RL |
Skill Discovery |
 |
Unsupervised emergence of reusable skills |
DIAYN, VALOR |
| Model-Based RL |
Learned Dynamics Model |
 |
P̂(s′\|s,a) approximating the true environment dynamics |
Separate model network diagram (often RNN or transformer) |
| Model-Based RL |
Model-Based Planning |
 |
Rollouts inside learned model |
MuZero, DreamerV3 |
| Model-Based RL |
Imagination-Augmented Agents (I2A) |
 |
Imagination module + policy |
I2A |
| Offline RL |
Offline Dataset |
 |
Fixed batch of trajectories |
BC, CQL, IQL |
| Offline RL |
Conservative Q-Learning (CQL) |
 |
Penalty on out-of-distribution actions |
CQL |
| Multi-Agent RL |
Multi-Agent Interaction Graph |
 |
Agents communicating or competing |
MARL, MADDPG |
| Multi-Agent RL |
Centralized Training Decentralized Execution (CTDE) |
 |
Shared critic during training |
QMIX, VDN, MADDPG |
| Multi-Agent RL |
Cooperative / Competitive Payoff Matrix |
 |
Joint reward for multiple agents |
Prisoner's Dilemma, multi-agent gridworlds |
| Inverse RL / IRL |
Reward Inference |
 |
Infer reward from expert demonstrations |
IRL, GAIL |
| Inverse RL / IRL |
Generative Adversarial Imitation Learning (GAIL) |
 |
Discriminator vs. policy generator |
GAIL, AIRL |
| Meta-RL |
Meta-RL Architecture |
 |
Outer loop (meta-policy) + inner loop (task adaptation) |
MAML for RL, RL² |
| Meta-RL |
Task Distribution Visualization |
 |
Multiple MDPs sampled from meta-distribution |
Meta-RL benchmarks |
| Advanced / Misc |
Experience Replay Buffer |
 |
Stored (s,a,r,s′,done) tuples |
DQN and all off-policy deep RL |
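A minimal uniform replay buffer sketch (deque-backed, illustration only; real implementations add prioritization, n-step returns, and so on).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) tuples with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```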
| Advanced / Misc |
State Visitation / Occupancy Measure |
 |
Frequency of visiting each state |
All algorithms (analysis) |
| Advanced / Misc |
Learning Curve |
 |
Average episodic return vs. episodes / steps |
Standard performance reporting |
| Advanced / Misc |
Regret / Cumulative Regret |
 |
Sub-optimality accumulated |
Bandits and online RL |
| Advanced / Misc |
Attention Mechanisms (Transformers in RL) |
 |
Attention weights |
Decision Transformer, Trajectory Transformer |
| Advanced / Misc |
Diffusion Policy |
 |
Denoising diffusion process for action generation |
Diffusion-RL policies |
| Advanced / Misc |
Graph Neural Networks for RL |
 |
Node/edge message passing |
Graph RL, relational RL |
| Advanced / Misc |
World Model / Latent Space |
 |
Encoder-decoder dynamics in latent space |
Dreamer, PlaNet |
| Advanced / Misc |
Convergence Analysis Plots |
 |
Error / value change over iterations |
DP, TD, value iteration |
| Advanced / Misc |
RL Algorithm Taxonomy |
 |
Comprehensive classification of algorithms |
All RL |
| Advanced / Misc |
Probabilistic Graphical Model (RL as Inference) |
 |
Formalizing RL as probabilistic inference |
Control as Inference, MaxEnt RL |
| Value & Policy |
Distributional RL (C51 / Categorical) |
 |
Representing return as a probability distribution |
C51, QR-DQN, IQN |
| Exploration |
Hindsight Experience Replay (HER) |
 |
Learning from failures by relabeling goals |
Sparse reward robotics, HER |
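A minimal sketch of "future"-strategy hindsight relabeling; the transition format (s, a, r, s_next, goal) and the goal-conditioned reward_fn are assumptions made here for illustration.

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """For each step, also store k copies whose goal is a state actually reached later
    in the same episode, with the reward recomputed for that achieved goal."""
    out = []
    for t, (s, a, r, s_next, goal) in enumerate(episode):
        out.append((s, a, r, s_next, goal))                  # original transition
        future = episode[t:]
        for _ in range(min(k, len(future))):
            _, _, _, achieved, _ = random.choice(future)     # a goal we actually reached
            out.append((s, a, reward_fn(s_next, achieved), s_next, achieved))
    return out
```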
| Model-Based RL |
Dyna-Q Architecture |
 |
Integration of real experience and model-based planning |
Dyna-Q, Dyna-2 |
| Function Approximation |
Noisy Networks (Parameter Noise) |
 |
Stochastic weights for exploration |
Noisy DQN, Rainbow |
| Exploration |
Intrinsic Curiosity Module (ICM) |
 |
Reward based on prediction error |
Curiosity-driven exploration, ICM |
| Temporal Difference |
V-trace (IMPALA) |
 |
Asynchronous off-policy importance sampling |
IMPALA, V-trace |
| Multi-Agent RL |
QMIX Mixing Network |
 |
Monotonic value function factorization |
QMIX, VDN |
| Advanced / Misc |
Saliency Maps / Attention on State |
 |
Visualizing what the agent "sees" or prioritizes |
Interpretability, Atari RL |
| Exploration |
Action Selection Noise (OU vs Gaussian) |
 |
Temporal correlation in exploration noise |
DDPG, TD3 |
| Advanced / Misc |
t-SNE / UMAP State Embeddings |
 |
Dimension reduction of high-dim neural states |
Interpretability, SRL |
| Advanced / Misc |
Loss Landscape Visualization |
 |
Optimization surface geometry |
Training stability analysis |
| Advanced / Misc |
Success Rate vs Steps |
 |
Percentage of successful episodes |
Goal-conditioned RL, Robotics |
| Advanced / Misc |
Hyperparameter Sensitivity Heatmap |
 |
Performance across parameter grids |
Hyperparameter tuning |
| Dynamics |
Action Persistence (Frame Skipping) |
 |
Temporal abstraction by repeating actions |
Atari RL, Robotics |
| Model-Based RL |
MuZero Dynamics Search Tree |
 |
Planning with learned transition and value functions |
MuZero, Gumbel MuZero |
| Deep RL |
Policy Distillation |
 |
Compressing knowledge from teacher to student |
Kickstarting, multitask learning |
| Transformers |
Decision Transformer Token Sequence |
 |
Casting RL as sequence modeling over (return-to-go, state, action) tokens |
Decision Transformer, TT |
| Advanced / Misc |
Performance Profiles (rliable) |
 |
Robust aggregate performance metrics |
Reliable RL evaluation |
| Safety RL |
Safety Shielding / Barrier Functions |
 |
Hard constraints on the action space |
Constrained MDPs, Safe RL |
| Training |
Automated Curriculum Learning |
 |
Progressively increasing task difficulty |
Curriculum RL, ALP-GMM |
| Sim-to-Real |
Domain Randomization |
 |
Generalizing across environment variations |
Robotics, Sim-to-Real |
| Alignment |
RL with Human Feedback (RLHF) |
 |
Aligning agents with human preferences |
ChatGPT, InstructGPT |
| Neuro-inspired RL |
Successor Representation (SR) |
 |
Predictive state representations |
SR-Dyna, Neuro-RL |
| Inverse RL / IRL |
Maximum Entropy IRL |
 |
Probability distribution over trajectories |
MaxEnt IRL, Ziebart |
| Theory |
Information Bottleneck |
 |
Mutual information $I(S;Z)$ and $I(Z;A)$ balance |
VIB-RL, Information Theory |
| Evolutionary RL |
Evolutionary Strategies Population |
 |
Population-based parameter search |
OpenAI-ES, Salimans |
| Safety RL |
Control Barrier Functions (CBF) |
 |
Set-theoretic safety guarantees |
CBF-RL, Control Theory |
| Exploration |
Count-based Exploration Heatmap |
 |
Visitation frequency and intrinsic bonus |
MBIE-EB, RND |
| Exploration |
Thompson Sampling Posteriors |
 |
Direct uncertainty-based action selection |
Bandits, Bayesian RL |
| Multi-Agent RL |
Adversarial RL Interaction |
 |
Competition between protagonist and antagonist |
Robust RL, RARL |
| Hierarchical RL |
Hierarchical Subgoal Trajectory |
 |
Decomposing long-horizon tasks |
Subgoal RL, HIRO |
| Offline RL |
Offline Action Distribution Shift |
 |
Mismatch between dataset and current policy |
CQL, IQL, D4RL |
| Exploration |
Random Network Distillation (RND) |
 |
Prediction error as intrinsic reward |
RND, OpenAI |
| Offline RL |
Batch-Constrained Q-learning (BCQ) |
 |
Constraining actions to behavior dataset |
BCQ, Fujimoto |
| Training |
Population-Based Training (PBT) |
 |
Evolutionary hyperparameter optimization |
PBT, DeepMind |
| Deep RL |
Recurrent State Flow (DRQN/R2D2) |
 |
Temporal dependency in state-action value |
DRQN, R2D2 |
| Theory |
Belief State in POMDPs |
 |
Probability distribution over hidden states |
POMDPs, Belief Space |
| Multi-Objective RL |
Multi-Objective Pareto Front |
 |
Balancing conflicting reward signals |
MORL, Pareto Optimal |
| Theory |
Differential Value (Average Reward RL) |
 |
Values relative to average gain |
Average Reward RL, Mahadevan |
| Infrastructure |
Distributed RL Cluster (Ray/RLLib) |
 |
Parallelizing experience collection |
Ray, RLLib, Ape-X |
| Evolutionary RL |
Neuroevolution Topology Evolution |
 |
Evolving neural network architectures |
NEAT, HyperNEAT |
| Continual RL |
Elastic Weight Consolidation (EWC) |
 |
Preventing catastrophic forgetting |
EWC, Kirkpatrick |
| Theory |
Successor Features (SF) |
 |
Generalizing predictive representations |
SF-Dyna, Barreto |
| Safety |
Adversarial State Noise (Perception) |
 |
Attacks on agent observation space |
Adversarial RL, Huang |
| Imitation Learning |
Behavioral Cloning (Imitation) |
 |
Direct supervised learning from experts |
BC, DAGGER |
| Relational RL |
Relational Graph State Representation |
 |
Modeling objects and their relations |
Relational MDPs, BoxWorld |
| Quantum RL |
Quantum RL Circuit (PQC) |
 |
Gate-based quantum policy networks |
Quantum RL, PQC |
| Symbolic RL |
Symbolic Policy Tree |
 |
Policies as mathematical expressions |
Symbolic RL, GP |
| Control |
Differentiable Physics Gradient Flow |
 |
Gradient-based planning through simulators |
Brax, Isaac Gym |
| Multi-Agent RL |
MARL Communication Channel |
 |
Information exchange between agents |
CommNet, DIAL |
| Safety |
Lagrangian Constraint Landscape |
 |
Constrained optimization boundaries |
Constrained RL, CPO |
| Hierarchical RL |
MAXQ Task Hierarchy |
 |
Recursive task decomposition |
MAXQ, Dietterich |
| Agentic AI |
ReAct Agentic Cycle |
 |
Reasoning-Action loops for LLMs |
ReAct, Agentic LLM |
| Bio-inspired RL |
Synaptic Plasticity RL |
 |
Hebbian-style synaptic weight updates |
Hebbian RL, STDP |
| Control |
Guided Policy Search (GPS) |
 |
Distilling trajectories into a policy |
GPS, Levine |
| Robotics |
Sim-to-Real Jitter & Latency |
 |
Temporal robustness in transfer |
Sim-to-Real, Robustness |
| Policy Gradients |
Deterministic Policy Gradient (DDPG) Flow |
 |
Gradient flow for deterministic policies |
DDPG |
| Model-Based RL |
Dreamer Latent Imagination |
 |
Learning and planning in latent space |
Dreamer (V1-V3) |
| Deep RL |
UNREAL Auxiliary Tasks |
 |
Learning from non-reward signals |
UNREAL, A3C extension |
| Offline RL |
Implicit Q-Learning (IQL) Expectile |
 |
In-sample learning via expectile regression |
IQL |
| Model-Based RL |
Prioritized Sweeping |
 |
Planning prioritized by TD error |
Sutton & Barto classic MBRL |
| Imitation Learning |
DAgger Expert Loop |
 |
Training on expert labels in agent-visited states |
DAgger |
| Representation |
Self-Predictive Representations (SPR) |
 |
Consistency between predicted and target latents |
SPR, sample-efficient RL |
| Multi-Agent RL |
Joint Action Space |
 |
Cartesian product of individual actions |
MARL theory, Game Theory |
| Multi-Agent RL |
Dec-POMDP Formal Model |
 |
Decentralized partially observable MDP |
Multi-agent coordination |
| Theory |
Bisimulation Metric |
 |
State equivalence based on transitions/rewards |
State abstraction, bisimulation theory |
| Theory |
Potential-Based Reward Shaping |
 |
Reward transformation preserving optimal policy |
Sutton & Barto, Ng et al. |
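A minimal potential-based shaping sketch; the potential function Φ is assumed to be supplied by the caller, and the γΦ(s′) − Φ(s) form is what preserves the optimal policy.

```python
def shaped_reward(r, s, s_next, potential, done, gamma=0.99):
    """r' = r + gamma * Phi(s') - Phi(s); terminal states use Phi = 0."""
    next_phi = 0.0 if done else potential(s_next)
    return r + gamma * next_phi - potential(s)
```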
| Training |
Transfer RL: Source to Target |
 |
Reusing knowledge across different MDPs |
Transfer Learning, Distillation |
| Deep RL |
Multi-Task Backbone Arch |
 |
Single agent learning multiple tasks |
Multi-task RL, IMPALA |
| Bandits |
Contextual Bandit Pipeline |
 |
Decision making given context but no transitions |
Personalization, Ad-tech |
| Theory |
Theoretical Regret Bounds |
 |
Analytical performance guarantees |
Online Learning, Bandits |
| Value-based |
Soft Q Boltzmann Probabilities |
 |
Probabilistic action selection from Q-values |
$\pi(a\|s) \propto \exp(Q/\tau)$ |
| Robotics |
Autonomous Driving RL Pipeline |
 |
End-to-end or modular driving stack |
Wayve, Tesla, Comma.ai |
| Policy |
Policy action gradient comparison |
 |
Comparison of gradient derivation types |
PG Theorem vs DPG Theorem |
| Inverse RL / IRL |
IRL: Feature Expectation Matching |
 |
Comparing expert vs. learner feature visitation frequencies |
$\mu(\pi^*) - \mu(\pi)$ |
| Imitation Learning |
Apprenticeship Learning Loop |
 |
Training to match expert performance via reward inference |
Apprenticeship Learning |
| Theory |
Active Inference Loop |
 |
Agents minimizing surprise (free energy) |
Free Energy Principle, Friston |
| Theory |
Bellman Residual Landscape |
 |
Training surface of the Bellman error |
TD learning, fitted Q-iteration |
| Model-Based RL |
Plan-to-Explore Uncertainty Map |
 |
Systematic exploration in learned world models |
Plan-to-Explore, Sekar et al. |
| Safety RL |
Robust RL Uncertainty Set |
 |
Optimizing for the worst-case environment transition |
Robust MDPs, minimax RL |
| Training |
HPO Bayesian Opt Cycle |
 |
Automating hyperparameter selection with GP |
Hyperparameter Optimization |
| Applied RL |
Slate RL Recommendation |
 |
Optimizing list/slate of items for users |
Recommender Systems, Ie et al. |
| Multi-Agent RL |
Fictitious Play Interaction |
 |
Belief-based learning in games |
Game Theory, Brown (1951) |
| Conceptual |
Universal RL Framework Diagram |
 |
High-level summary of RL components |
All RL |
| Offline RL |
Offline Density Ratio Estimator |
 |
Estimating $w(s,a)$ for off-policy data |
Importance Sampling, Offline RL |
| Continual RL |
Continual Task Interference Heatmap |
 |
Measuring negative transfer between tasks |
Lifelong Learning, EWC |
| Safety RL |
Lyapunov Stability Safe Set |
 |
Invariant sets for safe control |
Lyapunov RL, Chow et al. |
| Applied RL |
Molecular RL (Atom Coordinates) |
 |
RL for molecular design/protein folding |
Chemistry RL, AlphaFold-style |
| Architecture |
MoE Multi-task Architecture |
 |
Scaling models with mixture of experts |
MoE-RL, Sparsity |
| Direct Policy Search |
CMA-ES Policy Search |
 |
Evolutionary strategy for policy weights |
ES for RL, Salimans |
| Alignment |
Elo Rating Preference Plot |
 |
Measuring agent strength over time |
AlphaZero, League training |
| Explainable RL |
Explainable RL (SHAP Attribution) |
 |
Local attribution of features to agent actions |
Interpretability, SHAP/LIME |
| Meta-RL |
PEARL Context Encoder |
 |
Learning latent task representations |
PEARL, Rakelly et al. |
| Applied RL |
Medical RL Therapy Pipeline |
 |
Personalized medicine and dosing |
Healthcare RL, ICU Sepsis |
| Applied RL |
Supply Chain RL Pipeline |
 |
Optimizing stock levels and orders |
Logistics, Inventory Management |
| Robotics |
Sim-to-Real SysID Loop |
 |
Closing the reality gap via parameter estimation |
System Identification, Robotics |
| Architecture |
Transformer World Model |
 |
Sequence-to-sequence dynamics modeling |
DreamerV3, Transframer |
| Applied RL |
Network Traffic RL |
 |
Optimizing data packet routing in graphs |
Networking, Traffic Engineering |
| Training |
RLHF: PPO with Reference Policy |
 |
Keeping the fine-tuned policy close to a frozen reference policy (KL penalty) |
InstructGPT, Llama 2/3 |
| Multi-Agent RL |
PSRO Meta-Game Update |
 |
Reaching Nash equilibrium in large games |
PSRO, Lanctot et al. |
| Multi-Agent RL |
DIAL: Differentiable Comm |
 |
End-to-end learning of communication protocols |
DIAL, Foerster et al. |
| Batch RL |
Fitted Q-Iteration Loop |
 |
Data-driven iteration with a supervised regressor |
Ernst et al. (2005) |
| Safety RL |
CMDP Feasible Region |
 |
Constrained optimization within a safety budget |
Constrained MDPs, Altman |
| Control |
MPC vs RL Planning |
 |
Comparison of control paradigms |
Control Theory vs RL |
| AutoML |
Learning to Optimize (L2O) |
 |
Using RL to learn an optimization update rule |
L2O, Li & Malik |
| Applied RL |
Smart Grid RL Management |
 |
Optimizing energy supply and demand |
Energy RL, Smart Grids |
| Applied RL |
Quantum State Tomography RL |
 |
RL for quantum state estimation |
Quantum RL, Neural Tomography |
| Applied RL |
RL for Chip Placement |
 |
Placing components on silicon grids |
Google Chip Placement |
| Applied RL |
RL Compiler Optimization (MLGO) |
 |
Inlining and sizing in compilers |
MLGO, LLVM |
| Applied RL |
RL for Theorem Proving |
 |
Automated reasoning and proof search |
LeanRL, AlphaProof |
| Modern RL |
Diffusion-QL Offline RL |
 |
Policy as reverse diffusion process |
$\pi(a\|s,k)$ with noise injection |
| Principles |
Fairness-reward Pareto Frontier |
 |
Balancing equity and returns |
Fair RL, Jabbari et al. |
| Principles |
Differentially Private RL |
 |
Privacy-preserving training |
DP-RL, Agarwal et al. |
| Applied RL |
Smart Agriculture RL |
 |
Optimizing crop yield and resources |
Precision Agriculture |
| Applied RL |
Climate Mitigation RL (Grid) |
 |
Environmental control policies |
ClimateRL, Carbon Control |
| Applied RL |
AI Education (Knowledge Tracing) |
 |
Personalized learning paths |
ITS, Bayesian Knowledge Tracing |
| Modern RL |
Decision SDE Flow |
 |
RL in continuous stochastic systems |
Neural SDEs, Control |
| Control |
Differentiable physics (Brax) |
 |
Gradients through simulators |
Brax, PhysX, MuJoCo |
| Applied RL |
Wireless Beamforming RL |
 |
Optimizing antenna signal directions |
5G/6G Networking |
| Applied RL |
Quantum Error Correction RL |
 |
Correcting noise in quantum circuits |
Quantum Computing RL |
| Multi-Agent RL |
Mean Field RL Interaction |
 |
Large population agent dynamics |
MF-RL, Yang et al. |
| HRL |
Goal-GAN Curriculum |
 |
Automatic goal generation |
Goal-GAN, Florensa et al. |
| Modern RL |
JEPA: Predictive Architecture |
 |
LeCun's world model framework |
JEPA, I-JEPA |
| Offline RL |
CQL Value Penalty Landscape |
 |
Conservatism in value functions |
CQL, Kumar et al. |
| Applied RL |
Causal Inverse RL Graph |
 
DAG with $S, A, R$ and latent $U$ |
Causal RL |
| Quantum RL |
VQE-RL Optimization |
 |
Quantum circuit param tuning |
VQE, Quantum RL |
| Applied RL |
De-novo Drug Discovery RL |
 |
Generating optimized lead molecules |
Drug Discovery, Molecule RL |
| Applied RL |
Traffic Signal Coordination RL |
 |
Multi-intersection coordination |
IntelliLight, PressLight |
| Applied RL |
Mars Rover Pathfinding RL |
 |
Navigation on rough terrain |
Space RL, Mars Rover |
| Applied RL |
Sports Player Movement RL |
 |
Predicting/Optimizing player actions |
Sports Analytics, Ghosting |
| Applied RL |
Cryptography Attack RL |
 |
Searching for keys/vulnerabilities |
Crypto-RL, Learning to Attack |
| Applied RL |
Humanitarian Resource RL |
 |
Disaster response allocation |
AI for Good, Resource RL |
| Applied RL |
Video Compression RL (RD) |
 |
Optimizing bit-rate vs distortion |
Learned Video Compression |
| Applied RL |
Kubernetes Auto-scaling RL |
 |
Cloud resource management |
Cloud RL, K8s Scaling |
| Applied RL |
Fluid Dynamics Flow Control RL |
 |
Airfoil/Turbulence control |
Aero-RL, Flow Control |
| Applied RL |
Structural Optimization RL |
 |
Topology/Material design |
Structural RL, Topology Opt |
| Applied RL |
Human Decision Modeling |
 |
Prospect Theory in RL |
Behavioral RL, Prospect Theory |
| Applied RL |
Semantic Parsing RL |
 |
Language to Logic transformation |
Semantic Parsing, Seq2Seq-RL |
| Applied RL |
Music Melody RL |
 |
Reward-based melody generation |
Music-RL, Magenta |
| Applied RL |
Plasma Fusion Control RL |
 |
Magnetic control of Tokamaks |
DeepMind Fusion, Tokamak RL |
| Applied RL |
Carbon Capture RL cycle |
 |
Adsorption/Desorption optimization |
Carbon Capture, Green RL |
| Applied RL |
Swarm Robotics RL |
 |
Decentralized swarm coordination |
Swarm-RL, Multi-Robot |
| Applied RL |
Legal Compliance RL Game |
 |
Regulatory games |
Legal-RL, RegTech |
| Physics RL |
Physics-Informed RL (PINN) |
 |
Constraint-based RL loss |
PINN-RL, SciML |
| Modern RL |
Neuro-Symbolic RL |
 |
Combining logic and neural nets |
Neuro-Symbolic, Logic RL |
| Applied RL |
DeFi Liquidity Pool RL |
 |
Yield farming/Liquidity balancing |
DeFi-RL, AMM Optimization |
| Neuro RL |
Dopamine Reward Prediction Error |
 |
Biological RL signal curves |
Neuroscience-RL, Wolfram Schultz |
| Robotics |
Proprioceptive Sensory-Motor RL |
 |
Low-level joint control |
Proprioceptive RL, Unitree |
| Applied RL |
AR Object Placement RL |
 |
AR visual overlay optimization |
AR-RL, Visual Overlay |
| Reco RL |
Sequential Bundle RL |
 |
Recommendation item grouping |
Bundle-RL, E-commerce |
| Theoretical |
Online Gradient Descent vs RL |
 |
Gradient-based learning comparison |
Online Learning, Regret |
| Modern RL |
Active Learning: Query RL |
 |
Query-based sample selection |
Active-RL, Query Opt |
| Modern RL |
Federated RL global Aggregator |
 |
Privacy-preserving distributed RL |
Federated-RL, FedAvg-RL |
| Conceptual |
Ultimate Universal RL Mastery Diagram |
 |
Final summary of 230 items |
Absolute Mastery Milestone |