Online Evolution of Agentic Memory from Continuous Feedback
¹Oregon State University · ²AG2 AI · ³Penn State University
Large language model (LLM) agents are increasingly equipped with memory: stored experiences and reusable guidance that can improve task-solving performance. Recent self-evolving systems update memory based on interaction outcomes, but most existing evolution pipelines are built for static train/test splits and only approximate online learning by folding static benchmarks into a simulated stream, making them brittle under true distribution shift and continuous feedback.
We introduce Live-Evo, an online self-evolving memory system that learns from a stream of incoming data over time. Live-Evo decouples what happened from how to use it via an Experience Bank and a Meta-Guideline Bank, compiling a task-adaptive guideline from retrieved experiences for each new task.
Evaluated on Prophet Arena, a future-prediction benchmark spanning 10 weeks and 500 tasks, Live-Evo attains the lowest weekly Brier scores (lower is better) of all baselines:
| Method | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Live-Evo (Ours) | 0.19 | 0.17 | 0.16 | 0.10 | 0.17 | 0.12 | 0.19 | 0.15 | 0.13 | 0.10 | 0.14 |
| ReMem | 0.19 | 0.23 | 0.14 | 0.11 | 0.21 | 0.18 | 0.19 | 0.17 | 0.15 | 0.11 | 0.16 |
| Qwen Deep Research | 0.17 | 0.22 | 0.23 | 0.15 | 0.22 | 0.22 | 0.21 | 0.19 | 0.20 | 0.13 | 0.20 |
| GPT-4.1-mini | 0.18 | 0.20 | 0.31 | 0.18 | 0.24 | 0.26 | 0.23 | 0.25 | 0.23 | 0.15 | 0.22 |
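The scores above are Brier scores, so lower is better: a forecast's Brier score is the squared gap between its probability vector and the realized one-hot outcome. A minimal sketch (conventions differ on averaging vs. summing over outcome classes; the mean form is an assumption here):

```python
def brier_score(forecast, outcome):
    """Mean squared error between forecast probabilities and the
    one-hot encoding of the realized outcome (lower is better)."""
    assert len(forecast) == len(outcome)
    return sum((p - o) ** 2 for p, o in zip(forecast, outcome)) / len(forecast)

# A confident, correct forecast scores near 0; a 50/50 hedge scores 0.25.
print(brier_score([0.8, 0.2], [1, 0]))  # ≈ 0.04
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
```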
Investing $100 per week, Live-Evo grows the stake to $1,408 versus $1,247 without experience, a $161 advantage over 10 weeks.
Live-Evo consistently outperforms the base agent, especially during volatile periods (Weeks 5-6).
| Base Model | Brier Score (↓) | Improvement vs. no memory | Market Return (×) | Improvement vs. no memory |
|---|---|---|---|---|
| GPT-4.1-mini | 0.14 | +20.8% | 1.46 | +12.9% |
| GPT-4.1 | 0.17 | +3.0% | 1.18 | +4.4% |
| GPT-5-mini | 0.15 | +4.5% | 1.36 | +1.6% |
| Qwen3-8B | 0.18 | +3.5% | 1.21 | +0.5% |
Example prediction events:
- Cincinnati Bengals vs Pittsburgh Steelers
- Will the Fed cut rates in September 2025?
- Who will win the 2025 Virginia gubernatorial election?
- Will Apple beat Q3 2025 revenue estimates?
Live-Evo operates through a four-stage evolutionary loop: Retrieve, Compile, Act, and Update.
Given a task, the agent generates queries to retrieve relevant question-experience pairs from the Experience Bank, weighted by learned experience scores.
Retrieved experiences are compiled into a task-specific guideline, instructed by Meta-Guidelines that encode meta-heuristics for combining historical insights.
The agent performs ContrastiveEval by producing two independent predictions: one guided by the compiled guideline and one as a memory-free baseline.
Experience weights are updated based on observed performance gaps. If the guideline underperforms, a new meta-guideline is generated.
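The four-stage loop can be sketched as a toy, runnable Python implementation. The keyword-overlap retrieval, string-concatenation "compile" step, additive weight update, and `predict` stub below are all simplifying assumptions for illustration, not the actual system:

```python
from dataclasses import dataclass, field

def brier(forecast, outcome):
    # Mean squared error between forecast probabilities and one-hot outcome.
    return sum((p - o) ** 2 for p, o in zip(forecast, outcome)) / len(forecast)

@dataclass
class Experience:
    question: str
    lesson: str        # distilled failure reason / improvement suggestion
    weight: float = 1.0

@dataclass
class ExperienceBank:
    items: list = field(default_factory=list)

    def retrieve(self, query, k=2):
        # Toy retrieval: keyword overlap with the stored question,
        # scaled by the learned experience weight.
        scored = [(len(set(query.split()) & set(e.question.split())) * e.weight, e)
                  for e in self.items]
        return [e for s, e in sorted(scored, key=lambda t: -t[0])[:k] if s > 0]

    def update(self, used, gap, lr=0.3):
        # Reinforce experiences when memory helped (gap > 0), penalize otherwise.
        for e in used:
            e.weight = max(0.0, e.weight + lr * gap)

def live_evo_step(task, outcome, bank, predict):
    used = bank.retrieve(task)                    # 1. Retrieve
    guideline = " ".join(e.lesson for e in used)  # 2. Compile (toy: concatenate)
    guided = predict(task, guideline)             # 3. Act: guided prediction...
    baseline = predict(task, "")                  #    ...and memory-free baseline
    gap = brier(baseline, outcome) - brier(guided, outcome)  # >0: memory helped
    bank.update(used, gap)                        # 4. Update experience weights
    return guided, gap
```

Here `predict` stands in for the LLM agent: ContrastiveEval is the pair of calls with and without the guideline, and `gap` is the Brier improvement used to reweight the contributing experiences.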
**Experience Bank.** Stores past task interactions in a structured, reusable form. Each experience includes the question, failure reason, improvement suggestions, and missed information.

**Meta-Guideline Bank.** Stores higher-level composition instructions that specify how to transform retrieved experiences into task-adaptive guidelines under different conditions.
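In code, entries in the two banks might look like the following records; the field names paraphrase the description above and are illustrative, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One entry in the Experience Bank: what happened on a past task."""
    question: str                 # the original task/question
    failure_reason: str           # why the prediction went wrong
    improvement_suggestions: str  # what to do differently next time
    missed_information: str       # evidence the agent failed to use
    score: float = 1.0            # learned weight, updated from feedback

@dataclass
class MetaGuideline:
    """One entry in the Meta-Guideline Bank: how to combine experiences."""
    condition: str    # when this composition rule applies
    instruction: str  # how to turn retrieved experiences into a guideline

exp = Experience(
    question="Will the Fed cut rates in September 2025?",
    failure_reason="Over-weighted a single analyst forecast",
    improvement_suggestions="Aggregate multiple macro indicators",
    missed_information="Latest CPI release",
)
print(exp.score)  # starts at the default weight of 1.0
```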
- Separates what happened (Experience Bank) from how to use it (Meta-Guideline Bank)
- Quantifies the causal impact of memory by comparing guided vs. memory-free predictions
- Reinforces helpful experiences and gradually forgets misleading ones, like human memory
- Only commits new experiences after re-evaluation confirms measurable improvement
See how Live-Evo processes a prediction task through its four-stage pipeline.

**Task:** Which professional football team, Cincinnati or Pittsburgh, will win the game scheduled for Oct 16, 2025?

**Memory-free baseline:** Relied on Pittsburgh's 4-1 record and winning streak, and heavily weighted betting odds favoring the Steelers.

**Memory-guided prediction:** Identified conditions for a close 33-31 victory, weighing resilience despite injuries and home advantage more heavily than static odds.

**Update:** Retrieved experiences improved the prediction (Brier improvement: 0.2829); the memory-guided prediction outperformed the baseline.
Positive experiences (green) are reinforced over time, while negative experiences (red) are gradually forgotten.
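A minimal way to get this reinforce-and-forget behavior is multiplicative updating with a forgetting floor; the specific factors below are illustrative assumptions, not the system's actual update rule:

```python
def update_weight(weight, helped, reinforce=1.2, decay=0.8, floor=0.05):
    """Multiply the weight up when the experience helped the prediction,
    down when it misled it; weights that fall below `floor` are dropped
    (the experience is effectively forgotten)."""
    w = weight * (reinforce if helped else decay)
    return w if w >= floor else 0.0

# An experience that keeps misleading decays toward being forgotten,
# while occasional wins pull its weight back up.
w = 1.0
for helped in [True, False, False, True]:
    w = update_weight(w, helped)
print(round(w, 4))  # ≈ 0.9216
```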
@article{zhang2025liveevo,
title={Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback},
author={Zhang, Yaolun and Wu, Yiran and Yu, Yijiong and Wu, Qingyun and Wang, Huazheng},
journal={arXiv preprint arXiv:2501.xxxxx},
year={2025}
}