| MDP & Environment |
Agent-Environment Interaction Loop |
 |
Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state |
All RL algorithms |
| MDP & Environment |
Markov Decision Process (MDP) Tuple |
 |
(S, A, P, R, γ) with transition dynamics P(s′\|s,a) and reward function R(s,a,s′) |
All RL algorithms |
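As a concrete anchor for the tuple above, here is a minimal sketch of a hypothetical two-state MDP written out as plain Python data; the states, actions, and numbers are invented purely for illustration.

```python
# Hypothetical two-state MDP expressed as the (S, A, P, R, gamma) tuple.
S = ["s0", "s1"]                       # state set
A = ["stay", "move"]                   # action set
P = {                                  # P[s][a] -> {s_next: probability}
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}
R = {                                  # R[s][a] -> expected immediate reward
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.0, "move": 0.0},
}
gamma = 0.99                           # discount factor
```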
| MDP & Environment |
State Transition Graph |
 |
Full probabilistic transitions between discrete states |
Gridworld, Taxi, Cliff Walking |
| MDP & Environment |
Trajectory / Episode Sequence |
 |
Sequence of (s₀, a₀, r₁, s₁, …, s_T) |
Monte Carlo, episodic tasks |
| MDP & Environment |
Continuous State/Action Space Visualization |
 |
High-dimensional spaces (e.g., robot joints, pixel inputs) |
Continuous-control tasks (MuJoCo, PyBullet) |
| MDP & Environment |
Reward Function / Landscape |
 |
Scalar reward as function of state/action |
All algorithms; especially reward shaping |
| MDP & Environment |
Discount Factor (γ) Effect |
 |
How future rewards are weighted |
All discounted MDPs |
| Value & Policy |
State-Value Function V(s) |
 |
Expected return from state s under policy π |
Value-based methods |
| Value & Policy |
Action-Value Function Q(s,a) |
 |
Expected return from state-action pair |
Q-learning family |
| Value & Policy |
Policy π(s) or π(a\|s) |
 
Mapping from states to actions (deterministic or stochastic) |
Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps |
| Value & Policy |
Advantage Function A(s,a) |
 |
Q(s,a) – V(s) |
A2C, PPO, SAC, TD3 |
| Value & Policy |
Optimal Value Function V* / Q* |
 |
Solution to Bellman optimality |
Value iteration, Q-learning |
| Dynamic Programming |
Policy Evaluation Backup |
 |
Iterative update of V using Bellman expectation |
Policy iteration |
| Dynamic Programming |
Policy Improvement |
 |
Greedy policy update over Q |
Policy iteration |
| Dynamic Programming |
Value Iteration Backup |
 |
Update using Bellman optimality |
Value iteration |
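A minimal sketch of the Bellman-optimality backup over a tabular MDP, assuming the dictionary layout from the MDP sketch above (P[s][a] maps next states to probabilities, R[s][a] is the expected reward); an illustration, not a production implementation.

```python
def value_iteration(S, A, P, R, gamma=0.99, theta=1e-8):
    """Sweep V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ] until the
    largest change in a sweep falls below theta."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            backup = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in A
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < theta:
            return V
```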
| Dynamic Programming |
Policy Iteration Full Cycle |
 |
Evaluation → Improvement loop |
Classic DP methods |
| Monte Carlo |
Monte Carlo Backup |
 |
Update using full episode return G_t |
First-visit / every-visit MC |
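A minimal first-visit Monte Carlo sketch; the episode format, a list of (state, reward) pairs meaning (s_t, r_{t+1}), is an assumption made here for illustration.

```python
def first_visit_mc(episode, V, counts, gamma=0.99):
    """Compute G_t backwards over one finished episode, then move V(s) toward the
    return observed at the first visit of each state."""
    G, tagged = 0.0, []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        tagged.append((state, G))
    tagged.reverse()
    seen = set()
    for state, G in tagged:
        if state in seen:                      # first-visit: later visits are ignored
            continue
        seen.add(state)
        counts[state] = counts.get(state, 0) + 1
        V[state] = V.get(state, 0.0) + (G - V.get(state, 0.0)) / counts[state]
```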
| Monte Carlo |
Monte Carlo Tree (MCTS) |
 |
Search tree with selection, expansion, simulation, backprop |
AlphaGo, AlphaZero |
| Monte Carlo |
Importance Sampling Ratio |
 |
Off-policy correction ρ = π(a\|s) / b(a\|s) |
Off-policy MC, off-policy TD |
| Temporal Difference |
TD(0) Backup |
 |
Bootstrapped update using R + γV(s′) |
TD learning |
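A minimal tabular TD(0) sketch (the dictionary-backed value table is an assumption for illustration): the update bootstraps on the current estimate of V(s′) instead of waiting for the full return.

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """V(s) <- V(s) + alpha * [ r + gamma * V(s') - V(s) ]."""
    bootstrap = 0.0 if done else gamma * V.get(s_next, 0.0)
    td_error = r + bootstrap - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```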
| Temporal Difference |
Bootstrapping (general) |
 |
Using estimated future value instead of full return |
All TD methods |
| Temporal Difference |
n-step TD Backup |
 |
Multi-step return G_t^{(n)} |
n-step TD, TD(λ) |
| Temporal Difference |
TD(λ) & Eligibility Traces |
 |
Decaying trace z_t for credit assignment |
TD(λ), SARSA(λ), Q(λ) |
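A minimal accumulating-trace TD(λ) step, assuming dictionary-backed tables for V and the traces z; the point is that one TD error updates every recently visited state in proportion to its decaying trace.

```python
def td_lambda_step(V, z, s, r, s_next, done, alpha=0.1, gamma=0.99, lam=0.9):
    """Accumulate a trace for the visited state, then spread the TD error over all
    traced states and decay every trace by gamma * lambda."""
    delta = r + (0.0 if done else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)
    z[s] = z.get(s, 0.0) + 1.0
    for state in list(z):
        V[state] = V.get(state, 0.0) + alpha * delta * z[state]
        z[state] *= gamma * lam
```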
| Temporal Difference |
SARSA Update |
 |
On-policy TD control |
SARSA |
| Temporal Difference |
Q-Learning Update |
 |
Off-policy TD control |
Q-learning, Deep Q-Network |
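A minimal tabular Q-learning sketch (Q stored as a dict keyed by (state, action), an assumption for illustration): the target bootstraps on max over a′ of Q(s′, a′), which is what makes the update off-policy.

```python
def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]."""
    best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
```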
| Temporal Difference |
Expected SARSA |
 |
Expectation over next action under policy |
Expected SARSA |
| Temporal Difference |
Double Q-Learning / Double DQN |
 |
Two separate Q estimators to reduce overestimation |
Double DQN, TD3 |
| Temporal Difference |
Dueling DQN Architecture |
 |
Separate streams for state value V(s) and advantage A(s,a) |
Dueling DQN |
| Temporal Difference |
Prioritized Experience Replay |
 |
Sampling transitions with probability proportional to TD error, with importance-sampling correction |
Prioritized DQN, Rainbow |
| Temporal Difference |
Rainbow DQN Components |
 |
All extensions combined (Double, Dueling, PER, etc.) |
Rainbow DQN |
| Function Approximation |
Linear Function Approximation |
 |
Feature vector φ(s) → wᵀφ(s) |
Tabular → linear FA |
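A minimal semi-gradient TD(0) sketch with a linear value function V(s) = wᵀφ(s); the feature function phi is assumed to be supplied by the caller and to return a NumPy vector.

```python
import numpy as np

def semi_gradient_td0(w, phi, s, r, s_next, done, alpha=0.01, gamma=0.99):
    """Linear value estimate wᵀφ(s); the gradient of the estimate w.r.t. w is just φ(s)."""
    v_s = w @ phi(s)
    v_next = 0.0 if done else w @ phi(s_next)
    delta = r + gamma * v_next - v_s          # TD error
    w += alpha * delta * phi(s)               # in-place update of the weight vector
    return w
```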
| Function Approximation |
Neural Network Layers (MLP, CNN, RNN, Transformer) |
 |
Full deep network for value/policy |
DQN, A3C, PPO, Decision Transformer |
| Function Approximation |
Computation Graph / Backpropagation Flow |
 |
Gradient flow through network |
All deep RL |
| Function Approximation |
Target Network |
 |
Frozen copy of Q-network for stability |
DQN, DDQN, SAC, TD3 |
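A minimal sketch of the two common ways of refreshing a target network, assuming parameters are stored as dicts of NumPy arrays (an illustration-only layout, not any library's API).

```python
def hard_update(target, online):
    """DQN-style: copy the online weights into the frozen target every N steps."""
    for name in online:
        target[name] = online[name].copy()

def soft_update(target, online, tau=0.005):
    """Polyak averaging (DDPG/TD3/SAC-style): target <- tau*online + (1-tau)*target."""
    for name in online:
        target[name] = tau * online[name] + (1.0 - tau) * target[name]
```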
| Policy Gradients |
Policy Gradient Theorem |
 |
∇_θ J(θ) = E[∇_θ log π(a\|s) · Q^π(s,a)] |
Flow diagram from reward → log-prob → gradient |
| Policy Gradients |
REINFORCE Update |
 |
Monte-Carlo policy gradient |
REINFORCE |
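A minimal REINFORCE sketch in plain Python: given per-step log-probabilities and rewards from one episode (assumed inputs), it builds the Monte-Carlo returns and the surrogate loss whose gradient is the REINFORCE estimator.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * G_{t+1}, computed backwards over one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def reinforce_loss(log_probs, returns):
    """Minimizing -sum_t log pi(a_t|s_t) * G_t pushes probability toward high-return actions."""
    return -sum(lp * G for lp, G in zip(log_probs, returns))
```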
| Policy Gradients |
Baseline / Advantage Subtraction |
 |
Subtract b(s) to reduce variance |
All modern PG |
| Policy Gradients |
Trust Region (TRPO) |
 |
KL-divergence constraint on policy update |
TRPO |
| Policy Gradients |
Proximal Policy Optimization (PPO) |
 |
Clipped surrogate objective |
PPO, PPO-Clip |
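A minimal NumPy sketch of the clipped surrogate; the per-sample new/old log-probabilities and advantages are assumed to come from the caller's rollout buffer.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Mean over samples of min( r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t ),
    with r_t = pi_new(a|s) / pi_old(a|s)."""
    ratio = np.exp(log_prob_new - log_prob_old)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage).mean()
```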
| Actor-Critic |
Actor-Critic Architecture |
 |
Separate or shared actor (policy) + critic (value) networks |
A2C, A3C, SAC, TD3 |
| Actor-Critic |
Advantage Actor-Critic (A2C/A3C) |
 |
Synchronous/asynchronous multi-worker |
A2C/A3C |
| Actor-Critic |
Soft Actor-Critic (SAC) |
 |
Entropy-regularized policy + twin critics |
SAC |
| Actor-Critic |
Twin Delayed DDPG (TD3) |
 |
Twin critics + delayed policy + target smoothing |
TD3 |
| Exploration |
ε-Greedy Strategy |
 |
Probability ε of random action |
DQN family |
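A minimal ε-greedy sketch over a dict-backed Q-table (layout assumed for illustration).

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore uniformly at random; otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```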
| Exploration |
Softmax / Boltzmann Exploration |
 |
Temperature τ in softmax |
Softmax policies |
| Exploration |
Upper Confidence Bound (UCB) |
 |
Optimism in face of uncertainty |
UCB1, bandits |
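A minimal UCB1-style selection sketch for a bandit, assuming running mean values and pull counts per arm; untried arms are forced first.

```python
import math

def ucb_action(values, counts, t, c=2.0):
    """argmax_a [ mean(a) + c * sqrt( ln t / N(a) ) ]; the bonus shrinks as an arm is pulled."""
    for a, n in counts.items():
        if n == 0:
            return a                                   # optimism: try untested arms first
    return max(values, key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]))
```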
| Exploration |
Intrinsic Motivation / Curiosity |
 |
Prediction error as intrinsic reward |
ICM, RND, Curiosity-driven RL |
| Exploration |
Entropy Regularization |
 |
Bonus term αH(π) |
SAC, maximum-entropy RL |
| Hierarchical RL |
Options Framework |
 |
High-level policy over options (temporally extended actions) |
Option-Critic |
| Hierarchical RL |
Feudal Networks / Hierarchical Actor-Critic |
 |
Manager-worker hierarchy |
Feudal RL |
| Hierarchical RL |
Skill Discovery |
 |
Unsupervised emergence of reusable skills |
DIAYN, VALOR |
| Model-Based RL |
Learned Dynamics Model |
 |
P̂(s′\|s,a) approximating the true environment dynamics |
Separate model network diagram (often RNN or transformer) |
| Model-Based RL |
Model-Based Planning |
 |
Rollouts inside learned model |
MuZero, DreamerV3 |
| Model-Based RL |
Imagination-Augmented Agents (I2A) |
 |
Imagination module + policy |
I2A |
| Offline RL |
Offline Dataset |
 |
Fixed batch of trajectories |
BC, CQL, IQL |
| Offline RL |
Conservative Q-Learning (CQL) |
 |
Penalty on out-of-distribution actions |
CQL |
| Multi-Agent RL |
Multi-Agent Interaction Graph |
 |
Agents communicating or competing |
MARL, MADDPG |
| Multi-Agent RL |
Centralized Training Decentralized Execution (CTDE) |
 |
Shared critic during training |
QMIX, VDN, MADDPG |
| Multi-Agent RL |
Cooperative / Competitive Payoff Matrix |
 |
Joint reward for multiple agents |
Prisoner's Dilemma, multi-agent gridworlds |
| Inverse RL / IRL |
Reward Inference |
 |
Infer reward from expert demonstrations |
IRL, GAIL |
| Inverse RL / IRL |
Generative Adversarial Imitation Learning (GAIL) |
 |
Discriminator vs. policy generator |
GAIL, AIRL |
| Meta-RL |
Meta-RL Architecture |
 |
Outer loop (meta-policy) + inner loop (task adaptation) |
MAML for RL, RL² |
| Meta-RL |
Task Distribution Visualization |
 |
Multiple MDPs sampled from meta-distribution |
Meta-RL benchmarks |
| Advanced / Misc |
Experience Replay Buffer |
 |
Stored (s,a,r,s′,done) tuples |
DQN and all off-policy deep RL |
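A minimal uniform replay buffer sketch (deque-backed, illustration only; real implementations add prioritization, n-step returns, and so on).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) tuples with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```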
| Advanced / Misc |
State Visitation / Occupancy Measure |
 |
Frequency of visiting each state |
All algorithms (analysis) |
| Advanced / Misc |
Learning Curve |
 |
Average episodic return vs. episodes / steps |
Standard performance reporting |
| Advanced / Misc |
Regret / Cumulative Regret |
 |
Sub-optimality accumulated |
Bandits and online RL |
| Advanced / Misc |
Attention Mechanisms (Transformers in RL) |
 |
Attention weights |
Decision Transformer, Trajectory Transformer |
| Advanced / Misc |
Diffusion Policy |
 |
Denoising diffusion process for action generation |
Diffusion-RL policies |
| Advanced / Misc |
Graph Neural Networks for RL |
 |
Node/edge message passing |
Graph RL, relational RL |
| Advanced / Misc |
World Model / Latent Space |
 |
Encoder-decoder dynamics in latent space |
Dreamer, PlaNet |
| Advanced / Misc |
Convergence Analysis Plots |
 |
Error / value change over iterations |
DP, TD, value iteration |
| Advanced / Misc |
RL Algorithm Taxonomy |
 |
Comprehensive classification of algorithms |
All RL |
| Advanced / Misc |
Probabilistic Graphical Model (RL as Inference) |
 |
Formalizing RL as probabilistic inference |
Control as Inference, MaxEnt RL |
| Value & Policy |
Distributional RL (C51 / Categorical) |
 |
Representing return as a probability distribution |
C51, QR-DQN, IQN |
| Exploration |
Hindsight Experience Replay (HER) |
 |
Learning from failures by relabeling goals |
Sparse reward robotics, HER |
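A minimal sketch of "future"-strategy hindsight relabeling; the transition format (s, a, r, s_next, goal) and the goal-conditioned reward_fn are assumptions made here for illustration.

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """For each step, also store k copies whose goal is a state actually reached later
    in the same episode, with the reward recomputed for that achieved goal."""
    out = []
    for t, (s, a, r, s_next, goal) in enumerate(episode):
        out.append((s, a, r, s_next, goal))                  # original transition
        future = episode[t:]
        for _ in range(min(k, len(future))):
            _, _, _, achieved, _ = random.choice(future)     # a goal we actually reached
            out.append((s, a, reward_fn(s_next, achieved), s_next, achieved))
    return out
```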
| Model-Based RL |
Dyna-Q Architecture |
 |
Integration of real experience and model-based planning |
Dyna-Q, Dyna-2 |
| Function Approximation |
Noisy Networks (Parameter Noise) |
 |
Stochastic weights for exploration |
Noisy DQN, Rainbow |
| Exploration |
Intrinsic Curiosity Module (ICM) |
 |
Reward based on prediction error |
Curiosity-driven exploration, ICM |
| Temporal Difference |
V-trace (IMPALA) |
 |
Asynchronous off-policy importance sampling |
IMPALA, V-trace |
| Multi-Agent RL |
QMIX Mixing Network |
 |
Monotonic value function factorization |
QMIX, VDN |
| Advanced / Misc |
Saliency Maps / Attention on State |
 |
Visualizing what the agent "sees" or prioritizes |
Interpretability, Atari RL |
| Exploration |
Action Selection Noise (OU vs Gaussian) |
 |
Temporal correlation in exploration noise |
DDPG, TD3 |
| Advanced / Misc |
t-SNE / UMAP State Embeddings |
 |
Dimension reduction of high-dim neural states |
Interpretability, SRL |
| Advanced / Misc |
Loss Landscape Visualization |
 |
Optimization surface geometry |
Training stability analysis |
| Advanced / Misc |
Success Rate vs Steps |
 |
Percentage of successful episodes |
Goal-conditioned RL, Robotics |
| Advanced / Misc |
Hyperparameter Sensitivity Heatmap |
 |
Performance across parameter grids |
Hyperparameter tuning |
| Dynamics |
Action Persistence (Frame Skipping) |
 |
Temporal abstraction by repeating actions |
Atari RL, Robotics |
| Model-Based RL |
MuZero Dynamics Search Tree |
 |
Planning with learned transition and value functions |
MuZero, Gumbel MuZero |
| Deep RL |
Policy Distillation |
 |
Compressing knowledge from teacher to student |
Kickstarting, multitask learning |
| Transformers |
Decision Transformer Token Sequence |
 |
Casting RL as sequence modeling over (return-to-go, state, action) tokens |
Decision Transformer, TT |
| Advanced / Misc |
Performance Profiles (rliable) |
 |
Robust aggregate performance metrics |
Reliable RL evaluation |
| Safety RL |
Safety Shielding / Barrier Functions |
 |
Hard constraints on the action space |
Constrained MDPs, Safe RL |
| Training |
Automated Curriculum Learning |
 |
Progressively increasing task difficulty |
Curriculum RL, ALP-GMM |
| Sim-to-Real |
Domain Randomization |
 |
Generalizing across environment variations |
Robotics, Sim-to-Real |
| Alignment |
RL with Human Feedback (RLHF) |
 |
Aligning agents with human preferences |
ChatGPT, InstructGPT |
| Neuro-inspired RL |
Successor Representation (SR) |
 |
Predictive state representations |
SR-Dyna, Neuro-RL |
| Inverse RL / IRL |
Maximum Entropy IRL |
 |
Probability distribution over trajectories |
MaxEnt IRL, Ziebart |
| Theory |
Information Bottleneck |
 |
Mutual information $I(S;Z)$ and $I(Z;A)$ balance |
VIB-RL, Information Theory |
| Evolutionary RL |
Evolutionary Strategies Population |
 |
Population-based parameter search |
OpenAI-ES, Salimans |
| Safety RL |
Control Barrier Functions (CBF) |
 |
Set-theoretic safety guarantees |
CBF-RL, Control Theory |
| Exploration |
Count-based Exploration Heatmap |
 |
Visitation frequency and intrinsic bonus |
MBIE-EB, RND |
| Exploration |
Thompson Sampling Posteriors |
 |
Direct uncertainty-based action selection |
Bandits, Bayesian RL |
| Multi-Agent RL |
Adversarial RL Interaction |
 |
Competition between protagonist and antagonist |
Robust RL, RARL |
| Hierarchical RL |
Hierarchical Subgoal Trajectory |
 |
Decomposing long-horizon tasks |
Subgoal RL, HIRO |
| Offline RL |
Offline Action Distribution Shift |
 |
Mismatch between dataset and current policy |
CQL, IQL, D4RL |
| Exploration |
Random Network Distillation (RND) |
 |
Prediction error as intrinsic reward |
RND, OpenAI |
| Offline RL |
Batch-Constrained Q-learning (BCQ) |
 |
Constraining actions to behavior dataset |
BCQ, Fujimoto |
| Training |
Population-Based Training (PBT) |
 |
Evolutionary hyperparameter optimization |
PBT, DeepMind |
| Deep RL |
Recurrent State Flow (DRQN/R2D2) |
 |
Temporal dependency in state-action value |
DRQN, R2D2 |
| Theory |
Belief State in POMDPs |
 |
Probability distribution over hidden states |
POMDPs, Belief Space |
| Multi-Objective RL |
Multi-Objective Pareto Front |
 |
Balancing conflicting reward signals |
MORL, Pareto Optimal |
| Theory |
Differential Value (Average Reward RL) |
 |
Values relative to average gain |
Average Reward RL, Mahadevan |
| Infrastructure |
Distributed RL Cluster (Ray/RLLib) |
 |
Parallelizing experience collection |
Ray, RLLib, Ape-X |
| Evolutionary RL |
Neuroevolution Topology Evolution |
 |
Evolving neural network architectures |
NEAT, HyperNEAT |
| Continual RL |
Elastic Weight Consolidation (EWC) |
 |
Preventing catastrophic forgetting |
EWC, Kirkpatrick |
| Theory |
Successor Features (SF) |
 |
Generalizing predictive representations |
SF-Dyna, Barreto |
| Safety |
Adversarial State Noise (Perception) |
 |
Attacks on agent observation space |
Adversarial RL, Huang |
| Imitation Learning |
Behavioral Cloning (Imitation) |
 |
Direct supervised learning from experts |
BC, DAGGER |
| Relational RL |
Relational Graph State Representation |
 |
Modeling objects and their relations |
Relational MDPs, BoxWorld |
| Quantum RL |
Quantum RL Circuit (PQC) |
 |
Gate-based quantum policy networks |
Quantum RL, PQC |
| Symbolic RL |
Symbolic Policy Tree |
 |
Policies as mathematical expressions |
Symbolic RL, GP |
| Control |
Differentiable Physics Gradient Flow |
 |
Gradient-based planning through simulators |
Brax, Isaac Gym |
| Multi-Agent RL |
MARL Communication Channel |
 |
Information exchange between agents |
CommNet, DIAL |
| Safety |
Lagrangian Constraint Landscape |
 |
Constrained optimization boundaries |
Constrained RL, CPO |
| Hierarchical RL |
MAXQ Task Hierarchy |
 |
Recursive task decomposition |
MAXQ, Dietterich |
| Agentic AI |
ReAct Agentic Cycle |
 |
Reasoning-Action loops for LLMs |
ReAct, Agentic LLM |
| Bio-inspired RL |
Synaptic Plasticity RL |
 |
Hebbian-style synaptic weight updates |
Hebbian RL, STDP |
| Control |
Guided Policy Search (GPS) |
 |
Distilling trajectories into a policy |
GPS, Levine |
| Robotics |
Sim-to-Real Jitter & Latency |
 |
Temporal robustness in transfer |
Sim-to-Real, Robustness |
| Policy Gradients |
Deterministic Policy Gradient (DDPG) Flow |
 |
Gradient flow for deterministic policies |
DDPG |
| Model-Based RL |
Dreamer Latent Imagination |
 |
Learning and planning in latent space |
Dreamer (V1-V3) |
| Deep RL |
UNREAL Auxiliary Tasks |
 |
Learning from non-reward signals |
UNREAL, A3C extension |
| Offline RL |
Implicit Q-Learning (IQL) Expectile |
 |
In-sample learning via expectile regression |
IQL |
| Model-Based RL |
Prioritized Sweeping |
 |
Planning prioritized by TD error |
Sutton & Barto classic MBRL |
| Imitation Learning |
DAgger Expert Loop |
 |
Training on expert labels in agent-visited states |
DAgger |
| Representation |
Self-Predictive Representations (SPR) |
 |
Consistency between predicted and target latents |
SPR, sample-efficient RL |
| Multi-Agent RL |
Joint Action Space |
 |
Cartesian product of individual actions |
MARL theory, Game Theory |
| Multi-Agent RL |
Dec-POMDP Formal Model |
 |
Decentralized partially observable MDP |
Multi-agent coordination |
| Theory |
Bisimulation Metric |
 |
State equivalence based on transitions/rewards |
State abstraction, bisimulation theory |
| Theory |
Potential-Based Reward Shaping |
 |
Reward transformation preserving optimal policy |
Sutton & Barto, Ng et al. |
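A minimal potential-based shaping sketch; the potential function Φ is assumed to be supplied by the caller, and the γΦ(s′) − Φ(s) form is what preserves the optimal policy.

```python
def shaped_reward(r, s, s_next, potential, done, gamma=0.99):
    """r' = r + gamma * Phi(s') - Phi(s); terminal states use Phi = 0."""
    next_phi = 0.0 if done else potential(s_next)
    return r + gamma * next_phi - potential(s)
```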
| Training |
Transfer RL: Source to Target |
 |
Reusing knowledge across different MDPs |
Transfer Learning, Distillation |
| Deep RL |
Multi-Task Backbone Arch |
 |
Single agent learning multiple tasks |
Multi-task RL, IMPALA |
| Bandits |
Contextual Bandit Pipeline |
 |
Decision making given context but no transitions |
Personalization, Ad-tech |
| Theory |
Theoretical Regret Bounds |
 |
Analytical performance guarantees |
Online Learning, Bandits |
| Value-based |
Soft Q Boltzmann Probabilities |
 |
Probabilistic action selection from Q-values |
$\pi(a\|s) \propto \exp(Q/\tau)$ |
| Robotics |
Autonomous Driving RL Pipeline |
 |
End-to-end or modular driving stack |
Wayve, Tesla, Comma.ai |
| Policy |
Policy action gradient comparison |
 |
Comparison of gradient derivation types |
PG Theorem vs DPG Theorem |
| Inverse RL / IRL |
IRL: Feature Expectation Matching |
 |
Comparing expert vs. learner feature visitation frequencies |
$\mu(\pi^*) - \mu(\pi)$ |
| Imitation Learning |
Apprenticeship Learning Loop |
 |
Training to match expert performance via reward inference |
Apprenticeship Learning |
| Theory |
Active Inference Loop |
 |
Agents minimizing surprise (free energy) |
Free Energy Principle, Friston |
| Theory |
Bellman Residual Landscape |
 |
Training surface of the Bellman error |
TD learning, fitted Q-iteration |
| Model-Based RL |
Plan-to-Explore Uncertainty Map |
 |
Systematic exploration in learned world models |
Plan-to-Explore, Sekar et al. |
| Safety RL |
Robust RL Uncertainty Set |
 |
Optimizing for the worst-case environment transition |
Robust MDPs, minimax RL |
| Training |
HPO Bayesian Opt Cycle |
 |
Automating hyperparameter selection with GP |
Hyperparameter Optimization |
| Applied RL |
Slate RL Recommendation |
 |
Optimizing list/slate of items for users |
Recommender Systems, Ie et al. |
| Multi-Agent RL |
Fictitious Play Interaction |
 |
Belief-based learning in games |
Game Theory, Brown (1951) |
| Conceptual |
Universal RL Framework Diagram |
 |
High-level summary of RL components |
All RL |
| Offline RL |
Offline Density Ratio Estimator |
 |
Estimating $w(s,a)$ for off-policy data |
Importance Sampling, Offline RL |
| Continual RL |
Continual Task Interference Heatmap |
 |
Measuring negative transfer between tasks |
Lifelong Learning, EWC |
| Safety RL |
Lyapunov Stability Safe Set |
 |
Invariant sets for safe control |
Lyapunov RL, Chow et al. |
| Applied RL |
Molecular RL (Atom Coordinates) |
 |
RL for molecular design/protein folding |
Chemistry RL, AlphaFold-style |
| Architecture |
MoE Multi-task Architecture |
 |
Scaling models with mixture of experts |
MoE-RL, Sparsity |
| Direct Policy Search |
CMA-ES Policy Search |
 |
Evolutionary strategy for policy weights |
ES for RL, Salimans |
| Alignment |
Elo Rating Preference Plot |
 |
Measuring agent strength over time |
AlphaZero, League training |
| Explainable RL |
Explainable RL (SHAP Attribution) |
 |
Local attribution of features to agent actions |
Interpretability, SHAP/LIME |
| Meta-RL |
PEARL Context Encoder |
 |
Learning latent task representations |
PEARL, Rakelly et al. |
| Applied RL |
Medical RL Therapy Pipeline |
 |
Personalized medicine and dosing |
Healthcare RL, ICU Sepsis |
| Applied RL |
Supply Chain RL Pipeline |
 |
Optimizing stock levels and orders |
Logistics, Inventory Management |
| Robotics |
Sim-to-Real SysID Loop |
 |
Closing the reality gap via parameter estimation |
System Identification, Robotics |
| Architecture |
Transformer World Model |
 |
Sequence-to-sequence dynamics modeling |
DreamerV3, Transframer |
| Applied RL |
Network Traffic RL |
 |
Optimizing data packet routing in graphs |
Networking, Traffic Engineering |
| Training |
RLHF: PPO with Reference Policy |
 |
Keeping the fine-tuned policy close to a frozen reference policy (KL penalty) |
InstructGPT, Llama 2/3 |
| Multi-Agent RL |
PSRO Meta-Game Update |
 |
Reaching Nash equilibrium in large games |
PSRO, Lanctot et al. |
| Multi-Agent RL |
DIAL: Differentiable Comm |
 |
End-to-end learning of communication protocols |
DIAL, Foerster et al. |
| Batch RL |
Fitted Q-Iteration Loop |
 |
Data-driven iteration with a supervised regressor |
Ernst et al. (2005) |
| Safety RL |
CMDP Feasible Region |
 |
Constrained optimization within a safety budget |
Constrained MDPs, Altman |
| Control |
MPC vs RL Planning |
 |
Comparison of control paradigms |
Control Theory vs RL |
| AutoML |
Learning to Optimize (L2O) |
 |
Using RL to learn an optimization update rule |
L2O, Li & Malik |
| Applied RL |
Smart Grid RL Management |
 |
Optimizing energy supply and demand |
Energy RL, Smart Grids |
| Applied RL |
Quantum State Tomography RL |
 |
RL for quantum state estimation |
Quantum RL, Neural Tomography |
| Applied RL |
RL for Chip Placement |
 |
Placing components on silicon grids |
Google Chip Placement |
| Applied RL |
RL Compiler Optimization (MLGO) |
 |
Inlining and sizing in compilers |
MLGO, LLVM |
| Applied RL |
RL for Theorem Proving |
 |
Automated reasoning and proof search |
LeanRL, AlphaProof |
| Modern RL |
Diffusion-QL Offline RL |
 |
Policy as reverse diffusion process |
$\pi(a\|s,k)$ with noise injection |
| Principles |
Fairness-reward Pareto Frontier |
 |
Balancing equity and returns |
Fair RL, Jabbari et al. |
| Principles |
Differentially Private RL |
 |
Privacy-preserving training |
DP-RL, Agarwal et al. |
| Applied RL |
Smart Agriculture RL |
 |
Optimizing crop yield and resources |
Precision Agriculture |
| Applied RL |
Climate Mitigation RL (Grid) |
 |
Environmental control policies |
ClimateRL, Carbon Control |
| Applied RL |
AI Education (Knowledge Tracing) |
 |
Personalized learning paths |
ITS, Bayesian Knowledge Tracing |
| Modern RL |
Decision SDE Flow |
 |
RL in continuous stochastic systems |
Neural SDEs, Control |
| Control |
Differentiable physics (Brax) |
 |
Gradients through simulators |
Brax, PhysX, MuJoCo |
| Applied RL |
Wireless Beamforming RL |
 |
Optimizing antenna signal directions |
5G/6G Networking |
| Applied RL |
Quantum Error Correction RL |
 |
Correcting noise in quantum circuits |
Quantum Computing RL |
| Multi-Agent RL |
Mean Field RL Interaction |
 |
Large population agent dynamics |
MF-RL, Yang et al. |
| HRL |
Goal-GAN Curriculum |
 |
Automatic goal generation |
Goal-GAN, Florensa et al. |
| Modern RL |
JEPA: Predictive Architecture |
 |
LeCun's world model framework |
JEPA, I-JEPA |
| Offline RL |
CQL Value Penalty Landscape |
 |
Conservatism in value functions |
CQL, Kumar et al. |
| Applied RL |
Causal Inverse RL Graph |
 
DAG with $S, A, R$ and latent $U$ |
Causal RL |
| Quantum RL |
VQE-RL Optimization |
 |
Quantum circuit param tuning |
VQE, Quantum RL |
| Applied RL |
De-novo Drug Discovery RL |
 |
Generating optimized lead molecules |
Drug Discovery, Molecule RL |
| Applied RL |
Traffic Signal Coordination RL |
 |
Multi-intersection coordination |
IntelliLight, PressLight |
| Applied RL |
Mars Rover Pathfinding RL |
 |
Navigation on rough terrain |
Space RL, Mars Rover |
| Applied RL |
Sports Player Movement RL |
 |
Predicting/Optimizing player actions |
Sports Analytics, Ghosting |
| Applied RL |
Cryptography Attack RL |
 |
Searching for keys/vulnerabilities |
Crypto-RL, Learning to Attack |
| Applied RL |
Humanitarian Resource RL |
 |
Disaster response allocation |
AI for Good, Resource RL |
| Applied RL |
Video Compression RL (RD) |
 |
Optimizing bit-rate vs distortion |
Learned Video Compression |
| Applied RL |
Kubernetes Auto-scaling RL |
 |
Cloud resource management |
Cloud RL, K8s Scaling |
| Applied RL |
Fluid Dynamics Flow Control RL |
 |
Airfoil/Turbulence control |
Aero-RL, Flow Control |
| Applied RL |
Structural Optimization RL |
 |
Topology/Material design |
Structural RL, Topology Opt |
| Applied RL |
Human Decision Modeling |
 |
Prospect Theory in RL |
Behavioral RL, Prospect Theory |
| Applied RL |
Semantic Parsing RL |
 |
Language to Logic transformation |
Semantic Parsing, Seq2Seq-RL |
| Applied RL |
Music Melody RL |
 |
Reward-based melody generation |
Music-RL, Magenta |
| Applied RL |
Plasma Fusion Control RL |
 |
Magnetic control of Tokamaks |
DeepMind Fusion, Tokamak RL |
| Applied RL |
Carbon Capture RL cycle |
 |
Adsorption/Desorption optimization |
Carbon Capture, Green RL |
| Applied RL |
Swarm Robotics RL |
 |
Decentralized swarm coordination |
Swarm-RL, Multi-Robot |
| Applied RL |
Legal Compliance RL Game |
 |
Regulatory games |
Legal-RL, RegTech |
| Physics RL |
Physics-Informed RL (PINN) |
 |
Constraint-based RL loss |
PINN-RL, SciML |
| Modern RL |
Neuro-Symbolic RL |
 |
Combining logic and neural nets |
Neuro-Symbolic, Logic RL |
| Applied RL |
DeFi Liquidity Pool RL |
 |
Yield farming/Liquidity balancing |
DeFi-RL, AMM Optimization |
| Neuro RL |
Dopamine Reward Prediction Error |
 |
Biological RL signal curves |
Neuroscience-RL, Wolfram Schultz |
| Robotics |
Proprioceptive Sensory-Motor RL |
 |
Low-level joint control |
Proprioceptive RL, Unitree |
| Applied RL |
AR Object Placement RL |
 |
AR visual overlay optimization |
AR-RL, Visual Overlay |
| Reco RL |
Sequential Bundle RL |
 |
Recommendation item grouping |
Bundle-RL, E-commerce |
| Theoretical |
Online Gradient Descent vs RL |
 |
Gradient-based learning comparison |
Online Learning, Regret |
| Modern RL |
Active Learning: Query RL |
 |
Query-based sample selection |
Active-RL, Query Opt |
| Modern RL |
Federated RL global Aggregator |
 |
Privacy-preserving distributed RL |
Federated-RL, FedAvg-RL |
| Conceptual |
Ultimate Universal RL Mastery Diagram |
 |
Final summary of 230 items |
Absolute Mastery Milestone |