Live-Evo

Online Evolution of Agentic Memory from Continuous Feedback

Yaolun Zhang1,2,*, Yiran Wu2,3,*, Yijiong Yu1, Qingyun Wu2,3, Huazheng Wang1,2

1Oregon State University   2AG2 AI   3Penn State University

The Live-Evo-powered AG2 agent ranks #1 on market return and has held that position for 5 weeks! View Prophet Arena Leaderboard →

Abstract

Large language model (LLM) agents are increasingly equipped with memory: stored experiences and reusable guidance that can improve task-solving performance. Recent self-evolving systems update memory based on interaction outcomes, but most existing evolution pipelines are designed for static train/test splits and only approximate online learning by folding static benchmarks into sequential batches, making them brittle under true distribution shift and continuous feedback.

We introduce LIVE-EVO, an online self-evolving memory system that learns from a stream of incoming data over time. LIVE-EVO decouples what happened from how to use it via an Experience Bank and a Meta-Guideline Bank, compiling task-adaptive guidelines from retrieved experiences for each task.

Performance

Evaluated on Prophet Arena, a future-prediction benchmark spanning 10 weeks with 500 tasks, Live-Evo significantly outperforms all baselines.

Brier Score Results (Lower is Better)

| Method | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Live-Evo (Ours) | 0.19 | 0.17 | 0.16 | 0.10 | 0.17 | 0.12 | 0.19 | 0.15 | 0.13 | 0.10 | 0.14 |
| ReMem | 0.19 | 0.23 | 0.14 | 0.11 | 0.21 | 0.18 | 0.19 | 0.17 | 0.15 | 0.11 | 0.16 |
| Qwen Deep Research | 0.17 | 0.22 | 0.23 | 0.15 | 0.22 | 0.22 | 0.21 | 0.19 | 0.20 | 0.13 | 0.20 |
| GPT-4.1-mini | 0.18 | 0.20 | 0.31 | 0.18 | 0.24 | 0.26 | 0.23 | 0.25 | 0.23 | 0.15 | 0.22 |
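For reference, a minimal sketch of the standard multi-class Brier score (lower is better); Prophet Arena's exact scoring or normalization may differ:

```python
def brier_score(probs: dict[str, float], outcome: str) -> float:
    """Multi-class Brier score: squared error between the predicted
    probability vector and the one-hot encoding of the realized outcome."""
    return sum((p - (1.0 if name == outcome else 0.0)) ** 2
               for name, p in probs.items())

# An 80/20 call on a binary event that resolves to "A":
print(brier_score({"A": 0.8, "B": 0.2}, "A"))  # ~0.08
```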

Cumulative Portfolio Value

[Figure: cumulative portfolio value over the 10-week stream]

Investing $100 per week, Live-Evo grows the portfolio to $1,408 versus $1,247 without experience, a $161 advantage over 10 weeks.

Brier Score Comparison

[Figure: weekly Brier scores, Live-Evo vs. base agent]

Live-Evo consistently outperforms the base agent, especially during volatile periods (Weeks 5-6).

Generalization Across Models

| Base Model | Brier Score | Brier Improvement | Market Return | Return Improvement |
|---|---|---|---|---|
| GPT-4.1-mini | 0.14 | +20.8% | 1.46 | +12.9% |
| GPT-4.1 | 0.17 | +3.0% | 1.18 | +4.4% |
| GPT-5-mini | 0.15 | +4.5% | 1.36 | +1.6% |
| Qwen3-8B | 0.18 | +3.5% | 1.21 | +0.5% |

Check Your Prediction Accuracy

Did you beat the market? Make your own call on each event below, then check it against the actual outcome.

  • Sports / NFL: Cincinnati Bengals vs Pittsburgh Steelers. Who will win the game?
  • Economics: Fed Rate Decision. Will the Fed cut rates in September 2025?
  • Politics: Virginia Governor Race. Who will win the 2025 Virginia gubernatorial election?
  • Technology: Apple Earnings. Will Apple beat Q3 2025 revenue estimates?

Our Framework

LIVE-EVO operates through a four-stage evolutionary loop: Retrieve, Compile, Act, and Update.

[Figure: Live-Evo framework architecture]

Step 1: Retrieve

Given a task, the agent generates queries to retrieve relevant question-experience pairs from the Experience Bank, weighted by learned experience scores.
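A rough sketch of this step; the embedding-times-weight scoring and the `embedding`/`weight` fields are our illustration, not necessarily the released implementation:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, bank: list[dict], k: int = 5) -> list[dict]:
    """Return the top-k experiences, ranked by cosine similarity to the
    query embedding scaled by each experience's learned usefulness weight."""
    def score(exp: dict) -> float:
        sim = float(query_vec @ exp["embedding"]) / (
            np.linalg.norm(query_vec) * np.linalg.norm(exp["embedding"]) + 1e-8
        )
        return sim * exp["weight"]  # weight biases retrieval toward helpful memories

    return sorted(bank, key=score, reverse=True)[:k]
```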

Step 2: Compile

Retrieved experiences are compiled into a task-specific guideline, instructed by Meta-Guidelines that encode meta-heuristics for combining historical insights.
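A minimal sketch of guideline compilation, assuming a generic `llm(prompt) -> str` completion function; the prompt wording here is illustrative:

```python
def compile_guideline(llm, task: str, experiences: list[dict],
                      meta_guidelines: list[str]) -> str:
    """Ask the LLM to synthesize a task-specific guideline from retrieved
    experiences, following the meta-guidelines' composition instructions."""
    prompt = (
        "Meta-guidelines for combining past experiences:\n"
        + "\n".join(f"- {m}" for m in meta_guidelines)
        + f"\n\nCurrent task: {task}\n\nRetrieved experiences:\n"
        + "\n".join(f"- Failure: {e['failure_reason']}; Fix: {e['improvement']}"
                    for e in experiences)
        + "\n\nCompile a concise guideline tailored to the current task."
    )
    return llm(prompt)  # llm: any prompt-in, text-out completion function
```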

Step 3: Act

The agent performs ContrastiveEval by producing two independent predictions: one guided by the compiled guideline and one as a memory-free baseline.
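A sketch of ContrastiveEval under the same assumptions, reusing `brier_score` from the earlier sketch; `parse_probs` is a hypothetical helper that extracts predicted probabilities from the model's text output:

```python
def contrastive_eval(llm, parse_probs, task: str, guideline: str,
                     outcome: str) -> tuple[float, float]:
    """Score a memory-guided and a memory-free prediction on the same task;
    the gap between the two Brier scores estimates memory's causal impact."""
    guided = llm(f"{guideline}\n\n{task}")  # prediction with compiled guideline
    baseline = llm(task)                    # independent memory-free prediction
    return (brier_score(parse_probs(guided), outcome),
            brier_score(parse_probs(baseline), outcome))
```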

Step 4: Update

Experience weights are updated based on observed performance gaps. If the guideline underperforms, a new meta-guideline is generated.
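One possible form of the update rule; the constants `lr` and `decay` and the exact arithmetic are illustrative of the reinforcement-and-decay idea, not the paper's exact equations:

```python
def update_weights(retrieved: list[dict], guided_brier: float,
                   baseline_brier: float, lr: float = 0.1,
                   decay: float = 0.98) -> bool:
    """Reinforce retrieved experiences when memory helped, let them decay
    otherwise; return True when a new meta-guideline should be generated."""
    gap = baseline_brier - guided_brier  # positive gap means memory helped
    for exp in retrieved:
        exp["weight"] = decay * exp["weight"] + lr * gap
    return gap < 0  # guideline underperformed the memory-free baseline
```

Multiplicative decay means experiences that stop helping fade gradually rather than being deleted outright.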

Experience Bank

Stores past task interactions in a structured, reusable form. Each experience includes the question, failure reason, improvement suggestions, and missed information.
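One plausible record layout, mirroring the fields listed above (names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Experience:
    question: str        # the original task or question
    failure_reason: str  # why the earlier attempt went wrong
    improvement: str     # suggested fix for similar future tasks
    missed_info: str     # information the agent failed to gather
    weight: float = 1.0  # learned usefulness score, updated online
```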

Meta-Guideline Bank

Stores higher-level composition instructions that specify how to transform retrieved experiences into task-adaptive guidelines under different conditions.
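And a possible structured form for meta-guidelines (the compile sketch above treats them as plain strings for simplicity):

```python
from dataclasses import dataclass

@dataclass
class MetaGuideline:
    condition: str    # when this composition rule applies (e.g., task domain)
    instruction: str  # how to turn retrieved experiences into a guideline
```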

Why Live-Evo?

Traditional Self-Evolving Memory

Train Set → Build Memory → Test on Test Set
  • Static train/test split
  • Memory frozen at test time
  • Brittle under distribution shift
  • Cannot adapt to new patterns

Live-Evo (Our Approach)

Week 1 → Week 2 → ... → Week k (memory continuously evolves)
  • Continuous memory evolution
  • Learns from streaming feedback
  • Adapts to distribution shift
  • Reinforcement-and-decay dynamics
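Putting the sketches from the framework section together, the streaming loop might look like the following; `embed` and `parse_probs` remain hypothetical helpers, and in deployment the update runs only once each event's outcome has resolved:

```python
def run_stream(weeks, bank, meta_bank, llm, embed, parse_probs):
    """Process tasks week by week: retrieve, compile, act, and update,
    so memory keeps evolving as feedback arrives."""
    for week in weeks:
        for task in week:
            exps = retrieve(embed(task["question"]), bank)
            guideline = compile_guideline(llm, task["question"], exps, meta_bank)
            # Outcomes resolve after predictions; the update below runs
            # once the event's ground truth becomes available.
            guided, baseline = contrastive_eval(llm, parse_probs,
                                                task["question"], guideline,
                                                task["outcome"])
            if update_weights(exps, guided, baseline):
                meta_bank.append(llm("Propose a new meta-guideline for "
                                     "tasks like: " + task["question"]))
```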

Key Innovations

Decoupled Memory

Separates what happened (Experience Bank) from how to use it (Meta-Guideline Bank)

Contrastive Evaluation

Quantifies the causal impact of memory by comparing guided vs. memory-free predictions

Adaptive Weighting

Reinforces helpful experiences and gradually forgets misleading ones, like human memory

Selective Acquisition

Only commits new experiences after re-evaluation confirms measurable improvement
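A sketch of that acquisition gate, where `reevaluate` is a stand-in for re-running the task with and without the candidate experience:

```python
def maybe_commit(bank: list, candidate, reevaluate) -> bool:
    """Add a candidate experience to the bank only if re-running the task
    with it yields a measurably better Brier score."""
    brier_with, brier_without = reevaluate(candidate)
    if brier_with < brier_without:  # measurable improvement confirmed
        bank.append(candidate)
        return True
    return False
```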

Example

See how Live-Evo processes a prediction task through its four-stage pipeline.

Task Input

Question:

Which professional football team, Cincinnati or Pittsburgh, will win the game scheduled for Oct 16, 2025?

Options: Cincinnati | Pittsburgh

Retrieve Relevant Experiences

Experience #1 (Sports/NCAAF)
Failure Reason: Over-relied on pre-game betting odds and recent season trends without accounting for roster changes or "home advantage" dynamics.

Improvement: Incorporate dynamic, up-to-date info (roster, coaching) as the event approaches. Avoid static betting odds.
Weight: 1.45
Experience #2 (Sports/MLS)
Failure Reason: Failed to update prediction to reflect rescheduling of the match, basing probabilities on outdated timing.

Improvement: Always verify the event date and confirm the prediction is relative to the current schedule.
Weight: 1.32

Compile Task-Specific Guideline

Synthesized Guideline

  • Dynamic Information: Prioritize authoritative sources (e.g., official injury reports, press releases) close to the game date over early betting odds or historical reputation.
  • Schedule Verification: Implement a workflow step to confirm the exact game date and update contextual data to avoid outdated inputs.
  • Scenario Analysis: Explicitly model the impact of key player absences (e.g., injury reports) and home vs. away advantages.

Contrastive Evaluation

Without Memory

Cincinnati: 35%
Pittsburgh: 65%

Relied on Pittsburgh's 4-1 record and winning streak. Heavily weighted betting odds favoring Steelers.

Brier Score: 0.5329

With Memory Guideline

Cincinnati: 55%
Pittsburgh: 45%

Identified conditions for a close 33-31 victory. Weighed resilience despite injuries and home advantage more heavily than static odds.

Brier Score: 0.2500
Ground Truth: Cincinnati won!

Update Memory

  • Experience weights increased: the retrieved experiences improved the prediction (Brier improvement: 0.2829).
  • No new meta-guideline needed: the memory-guided prediction outperformed the baseline.

Experience Weight Evolution

Positive experiences (green) are reinforced over time, while negative experiences (red) are gradually forgotten.

Citation

@article{zhang2025liveevo,
  title={Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback},
  author={Zhang, Yaolun and Wu, Yiran and Yu, Yijiong and Wu, Qingyun and Wang, Huazheng},
  journal={arXiv preprint arXiv:2501.xxxxx},
  year={2025}
}