
MassGen v0.1.5: Persistent Memory with Semantic Retrieval

MassGen is focused on case-driven development. This case study demonstrates the introduction of persistent memory with semantic retrieval, enabling agents to build cumulative knowledge across multi-turn sessions and achieve true self-evolution through research-to-implementation workflows.



📋 PLANNING PHASE

📝 Evaluation Design

Prompt

Two-turn research-to-implementation workflow:

Turn 1 (Research):

Use crawl4ai to research the latest multi-agent AI papers and techniques from 2025.
Focus on: coordination mechanisms, voting strategies, tool-use patterns, and architectural innovations.

Turn 2 (Implementation):

Based on the multi-agent research from earlier, which techniques should we implement in MassGen
to make it more state-of-the-art? Consider MassGen's current architecture and what would be most impactful.

This prompt tests whether agents can:

  1. Research external sources and store findings
  2. Retrieve relevant research in a follow-up turn
  3. Apply research to self-improvement recommendations

Baseline Config

Multi-turn conversations were already supported in MassGen, but without persistent memory. Turn 2 only had access to Turn 1’s final answer, carried forward as raw conversation history.

Limitation: No semantic search, no fact extraction, no persistent knowledge base across sessions.

🔧 Evaluation Analysis

Results & Failure Modes

Before Persistent Memory (multi-turn only):

✅ What worked:

  1. Standard multi-turn continuity: Turn 2 had Turn 1’s full answer in conversation history
  2. Agents still produced a structured recommendation list (132 lines)

❌ What was missing:

  1. No semantic augmentation: Turn 2 had Turn 1’s answer but no additional extracted facts
  2. No structured knowledge: Research stored only as raw conversation text
  3. Generic recommendations: Without structured facts, recommendations lacked specificity

Baseline Turn 2 Result (session log_20251029_064846, 132 lines): see the excerpt under “The Final Answer” below.

Success Criteria

Persistent memory should enable:

  1. Automatic Fact Extraction: System extracts structured facts from Turn 1 research
  2. Semantic Augmentation: Turn 2 gets Turn 1’s answer PLUS relevant extracted facts automatically
  3. Persistent Storage: Facts stored in vector database
  4. More Specific Recommendations: Turn 2 provides concrete file paths, implementation steps, grounded in both research and current architecture

🎯 Desired Features

To enable self-evolution through research-to-implementation:

  1. Fact Extraction: Automatically extract important facts from conversations
  2. Vector Storage: Store facts with embeddings in persistent vector database
  3. Semantic Retrieval: Automatically retrieve relevant facts based on context
  4. Cross-Turn Continuity: Facts from Turn 1 available in Turn 2
  5. Quality Extraction: Custom prompts to ensure useful, self-contained facts
  6. Multi-Agent Support: Concurrent fact storage from multiple agents

🚀 TESTING PHASE

📦 Implementation Details

Version

v0.1.5 - Introduction of persistent memory system

✨ New Features

  1. PersistentMemory Integration (massgen/memory/_persistent.py)
    • Wraps mem0’s AsyncMemory with MassGen-specific logic (see the sketch after this feature list)
    • Automatic fact extraction on turn completion
    • Semantic retrieval via vector search
    • Metadata tracking (session_id, agent_id, turn number)
  2. Custom Fact Extraction Prompts (massgen/memory/_fact_extraction_prompts.py)
    • MASSGEN_UNIVERSAL_FACT_EXTRACTION_PROMPT designed for quality facts
    • Intended to filter out: agent comparisons, voting details, file paths, system internals
    • Focuses on: domain knowledge, insights, capabilities, recommendations
    • Enforces self-contained facts (understandable without original context)
  3. Qdrant Vector Store Integration
    • Server mode support for multi-agent concurrency
    • Vector similarity search with metadata filtering
    • Persistent storage across sessions
  4. Memory Configuration YAML
    • memory.persistent_memory section for mem0 configuration
    • LLM and embedding model settings
    • Qdrant connection parameters
    • Retrieval and compression settings
  5. Automatic Recording & Retrieval (in ChatAgent)
    • Records facts after each turn completion
    • Retrieves relevant facts when context window approaches limit
    • Injects as system message: “Relevant memories: …”
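
To make the record/retrieve flow concrete, below is a minimal sketch of a wrapper in the spirit of massgen/memory/_persistent.py. It assumes mem0's AsyncMemory add/search interface and its dict-shaped search results; the class and method names are illustrative, not the actual MassGen implementation.

from mem0 import AsyncMemory

class PersistentMemorySketch:
    """Illustrative wrapper: record facts after a turn, retrieve them semantically later."""

    def __init__(self, memory: AsyncMemory, session_name: str, limit: int = 10):
        self._memory = memory
        self._session = session_name
        self._limit = limit

    async def record_turn(self, agent_id: str, turn: int, messages: list[dict]) -> None:
        # mem0 runs LLM-based fact extraction over the raw conversation messages
        await self._memory.add(
            messages,
            user_id=self._session,
            metadata={"agent_id": agent_id, "turn": turn},
        )

    async def retrieve(self, query: str) -> list[str]:
        # Vector similarity search over previously stored facts
        # (result shape assumed from mem0's dict-based API: {"results": [{"memory": ...}, ...]})
        results = await self._memory.search(query, user_id=self._session, limit=self._limit)
        return [hit["memory"] for hit in results.get("results", [])]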

New Config

massgen/configs/memory/gpt5mini_gemini_research_to_implementation.yaml

Key Memory Settings:

memory:
  enabled: true

  persistent_memory:
    enabled: true  # 🆕 NEW: Persistent memory
    session_name: "research_to_implementation"  # Cross-turn continuity
    vector_store: "qdrant"

    llm:
      provider: "openai"
      model: "gpt-4.1-nano-2025-04-14"  # Fact extraction

    embedding:
      provider: "openai"
      model: "text-embedding-3-small"  # Vector embeddings

    qdrant:
      mode: "server"
      host: "localhost"
      port: 6333

  retrieval:
    limit: 10  # Number of facts to retrieve
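
For orientation, here is a sketch of how the YAML above might translate into a mem0 configuration dict. The schema shown follows mem0's documented from_config layout; the mapping from MassGen's keys is an assumption, not the actual loader code.

from mem0 import AsyncMemory

# Rough equivalent of the memory.persistent_memory section above (mapping assumed)
MEM0_CONFIG = {
    "llm": {"provider": "openai", "config": {"model": "gpt-4.1-nano-2025-04-14"}},
    "embedder": {"provider": "openai", "config": {"model": "text-embedding-3-small"}},
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "localhost", "port": 6333},
    },
}

def build_memory() -> AsyncMemory:
    # from_config wires up the extraction LLM, embedder, and Qdrant client in one call
    return AsyncMemory.from_config(MEM0_CONFIG)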

Command

Prerequisites:

# Start Qdrant server
docker run -d -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/.massgen/qdrant_storage:/qdrant/storage:z \
  qdrant/qdrant

# Start crawl4ai (for web scraping)
docker run -d -p 11235:11235 --name crawl4ai \
  --shm-size=1g unclecode/crawl4ai:latest
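
Before launching the session, it can be worth confirming the Qdrant server is actually reachable. A small check using the qdrant-client package (this helper is not part of MassGen):

from qdrant_client import QdrantClient

def check_qdrant(host: str = "localhost", port: int = 6333) -> None:
    # Raises a connection error if the server is down; otherwise lists collections
    client = QdrantClient(host=host, port=port)
    print(client.get_collections())

if __name__ == "__main__":
    check_qdrant()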

Run Session:

uv run massgen --config @examples/memory/gpt5mini_gemini_research_to_implementation.yaml

Turn 1 Prompt:

Use crawl4ai to research the latest multi-agent AI papers and techniques from 2025.
Focus on: coordination mechanisms, voting strategies, tool-use patterns, and architectural innovations.

Turn 2 Prompt (in same session):

Based on the multi-agent research from earlier, which techniques should we implement in MassGen
to make it more state-of-the-art? Consider MassGen's current architecture and what would be most impactful.

🤖 Agents

Session: session_20251029_072105
Duration: 11 minutes across 2 turns
Memory Stats: 54 facts recorded in Turn 1; 10 facts retrieved in Turn 2

🎥 Demo

Watch the recorded demo:

MassGen Case Study


📊 EVALUATION & ANALYSIS

Results

Persistent memory dramatically improved Turn 2’s ability to provide specific, actionable recommendations by retrieving relevant research findings from Turn 1.

The Collaborative Process

Turn 1 - Research Phase (5 minutes):

  1. Agents used crawl4ai to scrape arXiv
  2. Retrieved 20+ papers on multi-agent systems from late 2025
  3. Analyzed coordination mechanisms, voting strategies, tool patterns, architectures
  4. Generated comprehensive research summary (~133 lines)
  5. 🆕 Memory recorded 54 facts automatically

Example facts stored:

“Multi-layer memory folding that includes short-term windows, episodic timelines, and semantic summaries allows agents to manage large contexts efficiently, reducing token usage while maintaining factual recall, which is crucial for long-horizon tasks and fine-tuning.”

“In 2025, multi-agent and agentic-AI systems evolved from ad-hoc multi-LLM setups to using structured workflows including hierarchical planning, task graphs, and planner-executor separations, which improve coherence, scalability, and fault tolerance.”

Turn 2 - Implementation Phase (6 minutes):

  1. Agents have Turn 1’s full answer in context (standard multi-turn)
  2. 🆕 System automatically retrieves 10 relevant facts from Turn 1 via semantic search
  3. 🆕 Facts injected as system message: “Relevant memories: …” (see the sketch after this list)
  4. Read MassGen codebase (massgen/ and docs/ directories)
  5. Cross-referenced Turn 1 answer + retrieved facts + current architecture
  6. Generated prioritized implementation plan (~110 lines)
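
Conceptually, steps 2–3 reduce to something like the following sketch, reusing the hypothetical wrapper from the Implementation Details section; the “Relevant memories: …” string mirrors the injection format described above, but the function itself is illustrative.

async def augment_with_memories(
    memory: PersistentMemorySketch, user_prompt: str, history: list[dict]
) -> list[dict]:
    # Step 2: semantic search over stored Turn 1 facts, keyed on the new prompt
    facts = await memory.retrieve(user_prompt)
    messages = list(history)
    if facts:
        # Step 3: inject retrieved facts as a system message ahead of the user turn
        messages.append({"role": "system", "content": "Relevant memories: " + "; ".join(facts)})
    messages.append({"role": "user", "content": user_prompt})
    return messages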

Example automatic memory retrieval:

When Turn 2 starts, system automatically searches memories and injects relevant facts:

Retrieved fact:

“Using argumentation frameworks with evidence scoring, proficiency or reputation-weighted voting, multi-stage consensus, and human-in-the-loop arbitration are advanced voting strategies in 2025…”

This fact (from Turn 1 research) was automatically added to Turn 2’s context, enabling:

“1) Evidence‑aware, proficiency‑weighted voting + Judge (High impact, low→medium effort)”

The Voting Pattern

No changes were made to voting in this release; standard MassGen voting applied. The improvement came from what agents could reference during answer generation, not from how they voted.

The Final Answer

Turn 2 Quality Comparison (Both sessions have Turn 1 answer in context):

Without Persistent Memory (log_20251029_064846, 132 lines):

Sample from baseline:

1) Pluggable Voting & Aggregation + Adaptive Early Stopping
- Where to change: add massgen/voting.py (or massgen/voting/ package)
- Suggested API / design sketch:
  - Aggregator (base)
    - add_vote(agent_id, result, confidence, trajectory, metadata)

With Persistent Memory (log_20251029_072105, 110 lines):

Sample from memory-enabled:

Top recommendations (what + where to change in repo)

1) Evidence‑aware, proficiency‑weighted voting + Judge (High impact, low→medium effort)
Where to implement (explicit paths):
  - workflow_toolkits/vote.py — extend to accept evidence payloads and compute weighted scores
  - message_templates.py — add evidence schema to agent message format
  - coordination_tracker.py — track per‑agent proficiency/calibration
  - orchestrator.py — surface evidence into coordinator logs and call Judge

Concrete implementation steps:
  1. Extend message template with evidence: {claims:[...], tool_outputs:[...], confidence:float}
  2. Implement per‑agent scoreboard (moving average success) in coordination_tracker.py
  3. Update vote.py: compute final_score = α*proficiency + β*evidence_score + γ*vote_strength
  4. Create Judge agent that can (a) fetch supporting sources, (b) re-run tool calls...
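
For illustration, the weighted-score formula from that excerpt as a small function. The weights and names are the agent's proposal, not implemented MassGen code:

def weighted_vote_score(
    proficiency: float,
    evidence_score: float,
    vote_strength: float,
    alpha: float = 0.5,  # weights here are placeholders; the agent left them unspecified
    beta: float = 0.3,
    gamma: float = 0.2,
) -> float:
    # final_score = α*proficiency + β*evidence_score + γ*vote_strength
    return alpha * proficiency + beta * evidence_score + gamma * vote_strength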

Key Difference:

The memory-enabled version provides:

  1. Explicit file paths for each recommendation (e.g., workflow_toolkits/vote.py, coordination_tracker.py)
  2. Concrete, numbered implementation steps rather than abstract API sketches
  3. Impact/effort prioritization (e.g., “High impact, low→medium effort”)

This specificity comes from:

  1. Turn 1 research stored as structured facts (automatic)
  2. Turn 2 has Turn 1 answer PLUS 10 relevant facts (automatic semantic retrieval)
  3. Facts provide additional semantic context beyond raw conversation history
  4. Agents combine: Turn 1 answer + extracted facts + codebase analysis = concrete actionable plan

Memory System Performance

Memory example:

“Coordination mechanisms that improve long-term coherence include hierarchical recursive planning, task decomposition with DAG structures, and planner-executor systems that maintain shared memory and intermediate artifacts.”

Retrieval Performance:

Cost Analysis:

🎯 Conclusion

Why Persistent Memory Improves Self-Evolution

Before (multi-turn with conversation history): Turn 2 saw only Turn 1’s raw answer, so recommendations stayed generic.

After (persistent memory + conversation history): Turn 2 saw Turn 1’s answer plus 10 semantically retrieved facts, grounding its recommendations in structured research findings.

The Compound Effect

Within this session, memory enabled:

  1. Automatic extraction of 54 structured facts from Turn 1’s research
  2. Semantic retrieval of the 10 most relevant facts at the start of Turn 2
  3. More specific, repo-grounded implementation recommendations

The architecture supports future cross-session retrieval, though not demonstrated in this case study.

Broader Implications

Persistent memory enables:

  1. Self-Evolution: Agents can learn about themselves through research-to-implementation
  2. Research-to-Implementation: Bridge external research to internal development
  3. Semantic Augmentation: Additional structured facts supplement conversation history
  4. Knowledge Storage: Facts persist in vector database for future retrieval
  5. Improved Specificity: Extracted facts lead to more concrete, actionable recommendations

Future Improvements

Memory Quality (Current: 72% good, 28% system internals):

The custom fact extraction prompts significantly improve memory quality, but ~28% of stored facts are still system internals (voting details, agent comparisons, meta-instructions). Planned improvements:

Cross-Session Loading:

Retrieval Intelligence:


📌 Status Tracker