Live-Evo

Online Evolution of Agentic Memory from Continuous Feedback

Yaolun Zhang1,2,*, Yiran Wu2,3,*, Yijiong Yu1, Qingyun Wu2,3, Huazheng Wang1,2

1Oregon State University   2AG2 AI   3Penn State University

The Live-Evo-powered AG2 agent ranks #1 on market return and has held that position for 5 weeks! View Prophet Arena Leaderboard →

Abstract

Large language model (LLM) agents are increasingly equipped with memory: stored experiences and reusable guidance that can improve task-solving performance. Recent self-evolving systems update memory based on interaction outcomes, but most existing evolution pipelines are designed for static train/test splits and only approximate online learning by folding static benchmarks into sequential batches, making them brittle under true distribution shift and continuous feedback.

We introduce LIVE-EVO, an online self-evolving memory system that learns from a stream of incoming data over time. LIVE-EVO decouples what happened from how to use it via an Experience Bank and a Meta-Guideline Bank, compiling task-adaptive guidelines from retrieved experiences for each task.

Performance

Evaluated on Prophet Arena, a future-prediction benchmark spanning 10 weeks with 500 tasks, Live-Evo significantly outperforms all baselines.

Brier Score Results (Lower is Better)

| Method | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Live-Evo (Ours) | 0.19 | 0.17 | 0.16 | 0.10 | 0.17 | 0.12 | 0.19 | 0.15 | 0.13 | 0.10 | 0.14 |
| ReMem | 0.19 | 0.23 | 0.14 | 0.11 | 0.21 | 0.18 | 0.19 | 0.17 | 0.15 | 0.11 | 0.16 |
| Qwen Deep Research | 0.17 | 0.22 | 0.23 | 0.15 | 0.22 | 0.22 | 0.21 | 0.19 | 0.20 | 0.13 | 0.20 |
| GPT-4.1-mini | 0.18 | 0.20 | 0.31 | 0.18 | 0.24 | 0.26 | 0.23 | 0.25 | 0.23 | 0.15 | 0.22 |
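For reference, a minimal sketch of the standard multi-class Brier score (lower is better); Prophet Arena's exact scoring or normalization may differ:

```python
def brier_score(probs: dict[str, float], outcome: str) -> float:
    """Multi-class Brier score: squared error between the predicted
    probability vector and the one-hot encoding of the realized outcome."""
    return sum((p - (1.0 if name == outcome else 0.0)) ** 2
               for name, p in probs.items())

# An 80/20 call on a binary event that resolves to "A":
print(brier_score({"A": 0.8, "B": 0.2}, "A"))  # ~0.08
```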

Cumulative Portfolio Value

[Figure: cumulative portfolio value over the 10-week stream]

Investing $100 per week, Live-Evo grows the portfolio to $1,408 versus $1,247 without experience, a $161 advantage over 10 weeks.

Brier Score Comparison

[Figure: weekly Brier scores, Live-Evo vs. base agent]

Live-Evo consistently outperforms the base agent, especially during volatile periods (Weeks 5-6).

Generalization Across Models

| Base Model | Brier Score | Brier Improvement | Market Return | Return Improvement |
|---|---|---|---|---|
| GPT-4.1-mini | 0.14 | +20.8% | 1.46 | +12.9% |
| GPT-4.1 | 0.17 | +3.0% | 1.18 | +4.4% |
| GPT-5-mini | 0.15 | +4.5% | 1.36 | +1.6% |
| Qwen3-8B | 0.18 | +3.5% | 1.21 | +0.5% |

Check Your Prediction Accuracy

Did you beat the market? Make your own call on each event below, then check it against the actual outcome.

  • Sports / NFL: Cincinnati Bengals vs Pittsburgh Steelers. Who will win the game?
  • Economics: Fed Rate Decision. Will the Fed cut rates in September 2025?
  • Politics: Virginia Governor Race. Who will win the 2025 Virginia gubernatorial election?
  • Technology: Apple Earnings. Will Apple beat Q3 2025 revenue estimates?

Our Framework

LIVE-EVO operates through a four-stage evolutionary loop: Retrieve, Compile, Act, and Update.

[Figure: Live-Evo framework architecture]

Step 1: Retrieve

Given a task, the agent generates queries to retrieve relevant question-experience pairs from the Experience Bank, weighted by learned experience scores.
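A rough sketch of this step; the embedding-times-weight scoring and the `embedding`/`weight` fields are our illustration, not necessarily the released implementation:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, bank: list[dict], k: int = 5) -> list[dict]:
    """Return the top-k experiences, ranked by cosine similarity to the
    query embedding scaled by each experience's learned usefulness weight."""
    def score(exp: dict) -> float:
        sim = float(query_vec @ exp["embedding"]) / (
            np.linalg.norm(query_vec) * np.linalg.norm(exp["embedding"]) + 1e-8
        )
        return sim * exp["weight"]  # weight biases retrieval toward helpful memories

    return sorted(bank, key=score, reverse=True)[:k]
```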

Step 2: Compile

Retrieved experiences are compiled into a task-specific guideline, instructed by Meta-Guidelines that encode meta-heuristics for combining historical insights.
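A minimal sketch of guideline compilation, assuming a generic `llm(prompt) -> str` completion function; the prompt wording here is illustrative:

```python
def compile_guideline(llm, task: str, experiences: list[dict],
                      meta_guidelines: list[str]) -> str:
    """Ask the LLM to synthesize a task-specific guideline from retrieved
    experiences, following the meta-guidelines' composition instructions."""
    prompt = (
        "Meta-guidelines for combining past experiences:\n"
        + "\n".join(f"- {m}" for m in meta_guidelines)
        + f"\n\nCurrent task: {task}\n\nRetrieved experiences:\n"
        + "\n".join(f"- Failure: {e['failure_reason']}; Fix: {e['improvement']}"
                    for e in experiences)
        + "\n\nCompile a concise guideline tailored to the current task."
    )
    return llm(prompt)  # llm: any prompt-in, text-out completion function
```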

Step 3: Act

The agent performs ContrastiveEval by producing two independent predictions: one guided by the compiled guideline and one as a memory-free baseline.
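A sketch of ContrastiveEval under the same assumptions, reusing `brier_score` from the earlier sketch; `parse_probs` is a hypothetical helper that extracts predicted probabilities from the model's text output:

```python
def contrastive_eval(llm, parse_probs, task: str, guideline: str,
                     outcome: str) -> tuple[float, float]:
    """Score a memory-guided and a memory-free prediction on the same task;
    the gap between the two Brier scores estimates memory's causal impact."""
    guided = llm(f"{guideline}\n\n{task}")  # prediction with compiled guideline
    baseline = llm(task)                    # independent memory-free prediction
    return (brier_score(parse_probs(guided), outcome),
            brier_score(parse_probs(baseline), outcome))
```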

Step 4: Update

Experience weights are updated based on observed performance gaps. If the guideline underperforms, a new meta-guideline is generated.
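One possible form of the update rule; the constants `lr` and `decay` and the exact arithmetic are illustrative of the reinforcement-and-decay idea, not the paper's exact equations:

```python
def update_weights(retrieved: list[dict], guided_brier: float,
                   baseline_brier: float, lr: float = 0.1,
                   decay: float = 0.98) -> bool:
    """Reinforce retrieved experiences when memory helped, let them decay
    otherwise; return True when a new meta-guideline should be generated."""
    gap = baseline_brier - guided_brier  # positive gap means memory helped
    for exp in retrieved:
        exp["weight"] = decay * exp["weight"] + lr * gap
    return gap < 0  # guideline underperformed the memory-free baseline
```

Multiplicative decay means experiences that stop helping fade gradually rather than being deleted outright.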

Experience Bank

Stores past task interactions in a structured, reusable form. Each experience includes the question, failure reason, improvement suggestions, and missed information.
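One plausible record layout, mirroring the fields listed above (names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Experience:
    question: str        # the original task or question
    failure_reason: str  # why the earlier attempt went wrong
    improvement: str     # suggested fix for similar future tasks
    missed_info: str     # information the agent failed to gather
    weight: float = 1.0  # learned usefulness score, updated online
```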

Meta-Guideline Bank

Stores higher-level composition instructions that specify how to transform retrieved experiences into task-adaptive guidelines under different conditions.
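And a possible structured form for meta-guidelines (the compile sketch above treats them as plain strings for simplicity):

```python
from dataclasses import dataclass

@dataclass
class MetaGuideline:
    condition: str    # when this composition rule applies (e.g., task domain)
    instruction: str  # how to turn retrieved experiences into a guideline
```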

Why Live-Evo?

Traditional Self-Evolving Memory

Train Set → Build Memory → Test on Test Set
  • Static train/test split
  • Memory frozen at test time
  • Brittle under distribution shift
  • Cannot adapt to new patterns

Live-Evo (Our Approach)

Week 1 → Week 2 → ... → Week k (memory continuously evolves)
  • Continuous memory evolution
  • Learns from streaming feedback
  • Adapts to distribution shift
  • Reinforcement-and-decay dynamics
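Putting the sketches from the framework section together, the streaming loop might look like the following; `embed` and `parse_probs` remain hypothetical helpers, and in deployment the update runs only once each event's outcome has resolved:

```python
def run_stream(weeks, bank, meta_bank, llm, embed, parse_probs):
    """Process tasks week by week: retrieve, compile, act, and update,
    so memory keeps evolving as feedback arrives."""
    for week in weeks:
        for task in week:
            exps = retrieve(embed(task["question"]), bank)
            guideline = compile_guideline(llm, task["question"], exps, meta_bank)
            # Outcomes resolve after predictions; the update below runs
            # once the event's ground truth becomes available.
            guided, baseline = contrastive_eval(llm, parse_probs,
                                                task["question"], guideline,
                                                task["outcome"])
            if update_weights(exps, guided, baseline):
                meta_bank.append(llm("Propose a new meta-guideline for "
                                     "tasks like: " + task["question"]))
```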

Key Innovations

Decoupled Memory

Separates what happened (Experience Bank) from how to use it (Meta-Guideline Bank)

Contrastive Evaluation

Quantifies the causal impact of memory by comparing guided vs. memory-free predictions

Adaptive Weighting

Reinforces helpful experiences and gradually forgets misleading ones, like human memory

Selective Acquisition

Only commits new experiences after re-evaluation confirms measurable improvement
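A sketch of that acquisition gate, where `reevaluate` is a stand-in for re-running the task with and without the candidate experience:

```python
def maybe_commit(bank: list, candidate, reevaluate) -> bool:
    """Add a candidate experience to the bank only if re-running the task
    with it yields a measurably better Brier score."""
    brier_with, brier_without = reevaluate(candidate)
    if brier_with < brier_without:  # measurable improvement confirmed
        bank.append(candidate)
        return True
    return False
```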

Example

See how Live-Evo processes a prediction task through its four-stage pipeline.

Task Input

Question:

Which professional football team, Cincinnati or Pittsburgh, will win the game scheduled for Oct 16, 2025?

Options: Cincinnati | Pittsburgh

Retrieve Relevant Experiences

Experience #1 (Sports/NCAAF)
Failure Reason: Over-relied on pre-game betting odds and recent season trends without accounting for roster changes or "home advantage" dynamics.

Improvement: Incorporate dynamic, up-to-date info (roster, coaching) as the event approaches. Avoid static betting odds.
Weight: 1.45
Experience #2 (Sports/MLS)
Failure Reason: Failed to update prediction to reflect rescheduling of the match, basing probabilities on outdated timing.

Improvement: Always verify the event date and confirm the prediction is relative to the current schedule.
Weight: 1.32

Compile Task-Specific Guideline

Synthesized Guideline

  • Dynamic Information: Prioritize authoritative sources (e.g., official injury reports, press releases) close to the game date over early betting odds or historical reputation.
  • Schedule Verification: Implement a workflow step to confirm the exact game date and update contextual data to avoid outdated inputs.
  • Scenario Analysis: Explicitly model the impact of key player absences (e.g., injury reports) and home vs. away advantages.

Contrastive Evaluation

Without Memory

Cincinnati: 35%
Pittsburgh: 65%

Relied on Pittsburgh's 4-1 record and winning streak. Heavily weighted betting odds favoring Steelers.

Brier Score: 0.5329

With Memory Guideline

Cincinnati: 55%
Pittsburgh: 45%

Identified conditions for a close 33-31 victory. Weighed resilience despite injuries and home advantage more heavily than static odds.

Brier Score: 0.2500
Ground Truth: Cincinnati won!

Update Memory

  • Experience weights increased: the retrieved experiences improved the prediction (Brier improvement: 0.2829).
  • No new meta-guideline needed: the memory-guided prediction outperformed the baseline.

Experience Weight Evolution

Positive experiences (green) are reinforced over time, while negative experiences (red) are gradually forgotten.

Citation

@article{zhang2025liveevo,
  title={Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback},
  author={Zhang, Yaolun and Wu, Yiran and Yu, Yijiong and Wu, Qingyun and Wang, Huazheng},
  journal={arXiv preprint arXiv:2501.xxxxx},
  year={2025}
}