Council of Quants
RL Multi-Agent Swarm · Episodic Memory · Auto-Execute
Online RL Training Loop
Debate episodes are stored, rewarded after market movement, then used to weight future decisions.
Episodes
0
Rewarded
0
Pending
0
λ
0.01
State
market vector
Debate
3 agents
Memory
0 episodes
Policy
meta-controller
Reward
delayed
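The five stages above (state → debate → memory → policy, with a delayed reward) can be sketched as a single training step. This is a minimal illustration, not the app's implementation: the agent callables, the majority-vote meta-controller, and the dict-based episode schema are all assumptions.

```python
def run_training_step(market_vector, memory, agents):
    """One pass through the loop: state -> debate -> memory -> policy.
    The reward is delayed, so the episode is stored with reward=None."""
    # Debate: each agent maps the market state to a vote (hypothetical API).
    arguments = [agent(market_vector) for agent in agents]
    # Policy: a meta-controller aggregates the votes; majority vote stands in
    # for the real meta-controller here.
    action = max(set(arguments), key=arguments.count)
    # Memory: store the state-action tuple; the reward is filled in later,
    # after the reward window.
    memory.append({"state": market_vector, "arguments": arguments,
                   "action": action, "reward": None})
    return action
```

A later pass over `memory` would fill in each pending `reward` once the window elapses.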
State-action tuple
Each debate stores market state, agent arguments, meta-action, confidence, and entry price.
Delayed reward
After the reward window closes, the realised price movement determines whether the action was correct.
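Marking an episode at the end of its reward window might look like the sketch below. The ±1 reward scheme, the `threshold` parameter, and the dict field names are assumptions for illustration.

```python
def mark_reward(episode, exit_price, threshold=0.0):
    """Compare realised price movement with the action taken and write the
    delayed reward back into the episode (assumed +1/-1 scheme)."""
    move = (exit_price - episode["entry_price"]) / episode["entry_price"]
    if episode["meta_action"] == "buy":
        correct = move > threshold       # price had to rise
    elif episode["meta_action"] == "sell":
        correct = move < -threshold      # price had to fall
    else:                                # "hold": price had to stay put
        correct = abs(move) <= threshold
    episode["reward"] = 1.0 if correct else -1.0
    return episode["reward"]
```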
Policy update
Future debates retrieve similar episodes and weight agents by observed accuracy.
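The retrieval-and-weighting step can be sketched as nearest-neighbour lookup over rewarded episodes, with each agent scored by how often it sided with the rewarded outcome. Cosine similarity, the choice of `k`, and the episode field names (`state`, `votes`, `action`, `reward`) are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two market vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def agent_weights(current_state, episodes, k=5):
    """Retrieve the k rewarded episodes most similar to the current state and
    weight each agent by its observed accuracy on them (assumed scheme)."""
    rewarded = [e for e in episodes if e["reward"] is not None]
    nearest = sorted(rewarded,
                     key=lambda e: cosine(current_state, e["state"]),
                     reverse=True)[:k]
    hits, totals = {}, {}
    for ep in nearest:
        for agent, vote in ep["votes"].items():
            totals[agent] = totals.get(agent, 0) + 1
            # Accurate if it backed a rewarded action, or opposed a penalised one.
            if (vote == ep["action"]) == (ep["reward"] > 0):
                hits[agent] = hits.get(agent, 0) + 1
    return {a: hits.get(a, 0) / totals[a] for a in totals}
```

The returned per-agent weights can then scale each agent's influence in the next debate.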
Learning trace
No training episodes yet. Run Swarm to create the first memory.
RL-Optimised Multi-Agent Swarm
Three adversarial agents · Episodic memory · Meta-controller · Auto-execute