MassGen development is case-driven. This case study covers the introduction of persistent memory with semantic retrieval, which enables agents to build cumulative knowledge across multi-turn sessions and to self-evolve through research-to-implementation workflows.
Two-turn research-to-implementation workflow:

**Turn 1 (Research):**

```
Use crawl4ai to research the latest multi-agent AI papers and techniques from 2025.
Focus on: coordination mechanisms, voting strategies, tool-use patterns, and architectural innovations.
```

**Turn 2 (Implementation):**

```
Based on the multi-agent research from earlier, which techniques should we implement in MassGen
to make it more state-of-the-art? Consider MassGen's current architecture and what would be most impactful.
```
This prompt tests whether agents can carry concrete findings from Turn 1's research into specific, actionable recommendations in Turn 2.
Multi-turn conversations were already supported in MassGen, but without persistent memory. Turn 2 only had access to the raw conversation history from Turn 1.

Limitation: no semantic search, no fact extraction, and no persistent knowledge base across sessions.
Before Persistent Memory (multi-turn only):
âś… What worked:
❌ What was missing:
Baseline Turn 2 Result (session log_20251029_064846, 132 lines):
Persistent memory should enable:
To enable self-evolution through research-to-implementation:
v0.1.5 - Introduction of the persistent memory system:

- Persistent memory module (`massgen/memory/_persistent.py`)
- Custom fact extraction prompts (`massgen/memory/_fact_extraction_prompts.py`)
- `memory.persistent_memory` config section for mem0 configuration
- Integration in `ChatAgent`
Config: `massgen/configs/memory/gpt5mini_gemini_research_to_implementation.yaml`
Key Memory Settings:

```yaml
memory:
  enabled: true
  persistent_memory:
    enabled: true                               # 🆕 NEW: Persistent memory
    session_name: "research_to_implementation"  # Cross-turn continuity
    vector_store: "qdrant"
    llm:
      provider: "openai"
      model: "gpt-4.1-nano-2025-04-14"          # Fact extraction
    embedding:
      provider: "openai"
      model: "text-embedding-3-small"           # Vector embeddings
    qdrant:
      mode: "server"
      host: "localhost"
      port: 6333
    retrieval:
      limit: 10                                 # Number of facts to retrieve
```
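Once loaded, this YAML is just a nested mapping. As a minimal sketch of how the required settings could be checked before the memory system starts (the function names and the exact nesting of `qdrant`/`retrieval` under `persistent_memory` are assumptions for illustration, not MassGen's API):

```python
# Hypothetical validator for the memory settings shown above.
REQUIRED_PATHS = [
    ("memory", "enabled"),
    ("memory", "persistent_memory", "enabled"),
    ("memory", "persistent_memory", "session_name"),
    ("memory", "persistent_memory", "vector_store"),
    ("memory", "persistent_memory", "qdrant", "port"),
    ("memory", "persistent_memory", "retrieval", "limit"),
]

def get_path(cfg, path):
    """Walk a nested dict; return None if any key along the path is missing."""
    node = cfg
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def validate_memory_config(cfg):
    """Return dotted paths of required settings that are absent."""
    return [".".join(p) for p in REQUIRED_PATHS if get_path(cfg, p) is None]

# Mirrors the YAML above as a parsed dict.
config = {
    "memory": {
        "enabled": True,
        "persistent_memory": {
            "enabled": True,
            "session_name": "research_to_implementation",
            "vector_store": "qdrant",
            "qdrant": {"mode": "server", "host": "localhost", "port": 6333},
            "retrieval": {"limit": 10},
        },
    }
}

print(validate_memory_config(config))  # [] → nothing missing
```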
Prerequisites:

```bash
# Start Qdrant server
docker run -d -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/.massgen/qdrant_storage:/qdrant/storage:z \
  qdrant/qdrant

# Start crawl4ai (for web scraping)
docker run -d -p 11235:11235 --name crawl4ai \
  --shm-size=1g unclecode/crawl4ai:latest
```
Run Session:

```bash
uv run massgen --config @examples/memory/gpt5mini_gemini_research_to_implementation.yaml
```
Turn 1 Prompt:

```
Use crawl4ai to research the latest multi-agent AI papers and techniques from 2025.
Focus on: coordination mechanisms, voting strategies, tool-use patterns, and architectural innovations.
```

Turn 2 Prompt (in same session):

```
Based on the multi-agent research from earlier, which techniques should we implement in MassGen
to make it more state-of-the-art? Consider MassGen's current architecture and what would be most impactful.
```
- Session: `session_20251029_072105`
- Duration: 11 minutes across 2 turns
Memory Stats:
Watch the recorded demo:
Persistent memory dramatically improved Turn 2’s ability to provide specific, actionable recommendations by retrieving relevant research findings from Turn 1.
Turn 1 - Research Phase (5 minutes):
Example facts stored:
“Multi-layer memory folding that includes short-term windows, episodic timelines, and semantic summaries allows agents to manage large contexts efficiently, reducing token usage while maintaining factual recall, which is crucial for long-horizon tasks and fine-tuning.”
“In 2025, multi-agent and agentic-AI systems evolved from ad-hoc multi-LLM setups to using structured workflows including hierarchical planning, task graphs, and planner-executor separations, which improve coherence, scalability, and fault tolerance.”
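The first stored fact describes multi-layer memory folding. A toy sketch of that layering, assuming a bounded short-term window whose evicted turns fold into a summary layer (the `summarize` stub stands in for an LLM call; the class and names are invented for illustration):

```python
from collections import deque

class FoldingMemory:
    """Two-layer memory: verbatim recent turns plus a folded summary layer."""

    def __init__(self, window=3):
        self.short_term = deque(maxlen=window)  # recent turns, kept verbatim
        self.summary = []                       # folded older content

    def add(self, message):
        # Before the window evicts its oldest turn, fold it into the summary.
        if len(self.short_term) == self.short_term.maxlen:
            self.summary.append(self.summarize(self.short_term[0]))
        self.short_term.append(message)

    @staticmethod
    def summarize(message):
        return message[:30]                     # stub for an LLM summary call

    def context(self):
        return {"summary": self.summary, "recent": list(self.short_term)}

mem = FoldingMemory(window=2)
for msg in ["turn 1: research findings...",
            "turn 2: voting strategies...",
            "turn 3: implementation plan..."]:
    mem.add(msg)

print(mem.context())
```

With a window of 2, the third turn pushes the first one out of the verbatim window and into the summary layer, which is the token-saving behavior the fact describes.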
Turn 2 - Implementation Phase (6 minutes):
Agents analyzed MassGen's codebase (`massgen/` and `docs/` directories).

Example automatic memory retrieval:
When Turn 2 starts, system automatically searches memories and injects relevant facts:
Retrieved fact:
“Using argumentation frameworks with evidence scoring, proficiency or reputation-weighted voting, multi-stage consensus, and human-in-the-loop arbitration are advanced voting strategies in 2025…”
This fact (from Turn 1 research) was automatically added to Turn 2’s context, enabling:
```
1) Evidence‑aware, proficiency‑weighted voting + Judge (High impact, low→medium effort)
- Replace naive majority with weighted aggregation using per‑agent proficiency scores plus evidence strength…
```
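The retrieve-and-inject step at the start of Turn 2 can be sketched as follows. This is an illustrative stand-in: MassGen uses mem0 with Qdrant and real embeddings (`text-embedding-3-small`), while the bag-of-words "embedding", the `FactStore` class, and the sample facts here are toys invented for the example:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: token counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FactStore:
    def __init__(self):
        self.facts = []  # (text, embedding) pairs

    def add(self, text):
        self.facts.append((text, embed(text)))

    def search(self, query, limit=10):  # mirrors retrieval.limit in the config
        q = embed(query)
        ranked = sorted(self.facts, key=lambda f: cosine(q, f[1]), reverse=True)
        return [text for text, _ in ranked[:limit]]

store = FactStore()
store.add("Advanced voting strategies in 2025 include evidence scoring and "
          "proficiency-weighted voting with multi-stage consensus.")
store.add("Multi-layer memory folding reduces token usage on long-horizon tasks.")

# The Turn 2 prompt acts as the query; top matches are injected into context.
retrieved = store.search("which voting techniques should we implement", limit=1)
context = "Relevant facts from memory:\n" + "\n".join(f"- {f}" for f in retrieved)
print(context)
```

The same shape applies at scale: embed the new user message, rank stored facts by similarity, and prepend the top `limit` matches to the agent's context before generation.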
No changes to voting shipped in this release; standard MassGen voting applied. The improvement came from what agents could reference during answer generation, not from how they voted.
Turn 2 Quality Comparison (Both sessions have Turn 1 answer in context):
Without Persistent Memory (log_20251029_064846, 132 lines):
Sample from baseline:

```
1) Pluggable Voting & Aggregation + Adaptive Early Stopping
- Where to change: add massgen/voting.py (or massgen/voting/ package)
- Suggested API / design sketch:
  - Aggregator (base)
    - add_vote(agent_id, result, confidence, trajectory, metadata)
```
With Persistent Memory (log_20251029_072105, 110 lines):
Referenced real repository files: `workflow_toolkits/vote.py`, `coordination_tracker.py`, `orchestrator.py`.

Sample from memory-enabled:
```
Top recommendations (what + where to change in repo)

1) Evidence‑aware, proficiency‑weighted voting + Judge (High impact, low→medium effort)

Where to implement (explicit paths):
- workflow_toolkits/vote.py — extend to accept evidence payloads and compute weighted scores
- message_templates.py — add evidence schema to agent message format
- coordination_tracker.py — track per‑agent proficiency/calibration
- orchestrator.py — surface evidence into coordinator logs and call Judge

Concrete implementation steps:
1. Extend message template with evidence: {claims:[...], tool_outputs:[...], confidence:float}
2. Implement per‑agent scoreboard (moving average success) in coordination_tracker.py
3. Update vote.py: compute final_score = α*proficiency + β*evidence_score + γ*vote_strength
4. Create Judge agent that can (a) fetch supporting sources, (b) re-run tool calls...
```
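The weighted formula in step 3 can be sketched in a few lines. The `Scoreboard` and `aggregate_votes` names, the neutral 0.5 prior, and the weight values are all hypothetical; this is not MassGen code, just the shape the agent's recommendation describes:

```python
class Scoreboard:
    """Per-agent proficiency as an exponential moving average of success."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.scores = {}

    def record(self, agent_id, success):
        prev = self.scores.get(agent_id, 0.5)  # neutral prior for unseen agents
        self.scores[agent_id] = (1 - self.alpha) * prev + self.alpha * float(success)

    def proficiency(self, agent_id):
        return self.scores.get(agent_id, 0.5)

def aggregate_votes(votes, board, a=0.4, b=0.4, g=0.2):
    """votes: list of (agent_id, candidate, evidence_score, vote_strength).

    Each vote contributes a*proficiency + b*evidence_score + g*vote_strength
    to its candidate's total; the highest-scoring candidate wins.
    """
    totals = {}
    for agent_id, candidate, evidence, strength in votes:
        score = a * board.proficiency(agent_id) + b * evidence + g * strength
        totals[candidate] = totals.get(candidate, 0.0) + score
    return max(totals, key=totals.get)

board = Scoreboard()
board.record("agent_a", True)    # agent_a has a good track record
board.record("agent_b", False)

votes = [
    ("agent_a", "answer_1", 0.9, 1.0),  # strong evidence
    ("agent_b", "answer_2", 0.2, 1.0),  # weak evidence
]
print(aggregate_votes(votes, board))  # answer_1
```

Here `answer_1` wins both because its backer has a higher moving-average proficiency (0.65 vs. 0.35) and because it carries stronger evidence, exactly the two signals the recommendation wants to combine with plain vote strength.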
Key Difference:
The memory-enabled version provides:
- Real file paths (`workflow_toolkits/vote.py` vs. the baseline's invented "add `massgen/voting.py`")

This specificity comes from:
Memory example:
“Coordination mechanisms that improve long-term coherence include hierarchical recursive planning, task decomposition with DAG structures, and planner-executor systems that maintain shared memory and intermediate artifacts.”
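The DAG-structured task decomposition this fact describes can be illustrated with Python's standard `graphlib`: subtasks declare their dependencies, and a planner-executor runs them in topological order while accumulating shared intermediate artifacts. The task names and the stub executor are invented for the example:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
tasks = {
    "gather_sources": set(),
    "summarize": {"gather_sources"},
    "rank_techniques": {"summarize"},
    "draft_plan": {"summarize", "rank_techniques"},
}

def run(task, artifacts):
    # Stand-in executor: a real system would dispatch to an agent or tool,
    # passing it the artifacts produced by its dependencies.
    artifacts[task] = f"output of {task} (inputs: {sorted(tasks[task])})"

artifacts = {}
for task in TopologicalSorter(tasks).static_order():
    run(task, artifacts)

print(list(artifacts))  # dependency-respecting execution order
```

The `artifacts` dict plays the role of the shared memory of intermediate results that the fact credits with improving long-term coherence.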
Retrieval Performance:
Cost Analysis:
Before (multi-turn with conversation history):
After (persistent memory + conversation history):
Within this session, memory enabled:
The architecture supports future cross-session retrieval, though not demonstrated in this case study.
Persistent memory enables:
Memory Quality (Current: 72% good, 28% system internals):
The custom fact extraction prompts significantly improve memory quality, but ~28% of stored facts are still system internals (voting details, agent comparisons, meta-instructions). Planned improvements:
Cross-Session Loading:
Retrieval Intelligence: