MassGen development is case-driven. This case study covers the introduction of persistent memory with semantic retrieval, which enables agents to build cumulative knowledge across multi-turn sessions and to self-evolve through research-to-implementation workflows.
Two-turn research-to-implementation workflow:

**Turn 1 (Research):**

```
Use crawl4ai to research the latest multi-agent AI papers and techniques from 2025.
Focus on: coordination mechanisms, voting strategies, tool-use patterns, and architectural innovations.
```

**Turn 2 (Implementation):**

```
Based on the multi-agent research from earlier, which techniques should we implement in MassGen
to make it more state-of-the-art? Consider MassGen's current architecture and what would be most impactful.
```
This prompt tests whether agents can carry concrete findings from Turn 1's research into specific, actionable recommendations in Turn 2.
Multi-turn conversations were already supported in MassGen, but without persistent memory. Turn 2 only had access to the raw conversation history from Turn 1.

Limitation: no semantic search, no fact extraction, and no persistent knowledge base across sessions.
Before Persistent Memory (multi-turn only):
âś… What worked:
❌ What was missing:
Baseline Turn 2 Result (session log_20251029_064846, 132 lines):
Persistent memory should enable:
To enable self-evolution through research-to-implementation:
v0.1.5 - Introduction of the persistent memory system:

- Persistent memory module (`massgen/memory/_persistent.py`)
- Custom fact extraction prompts (`massgen/memory/_fact_extraction_prompts.py`)
- `memory.persistent_memory` config section for mem0 configuration
- Integration in `ChatAgent`
Config: `massgen/configs/memory/gpt5mini_gemini_research_to_implementation.yaml`
Key Memory Settings:

```yaml
memory:
  enabled: true
  persistent_memory:
    enabled: true                               # 🆕 NEW: Persistent memory
    session_name: "research_to_implementation"  # Cross-turn continuity
    vector_store: "qdrant"
    llm:
      provider: "openai"
      model: "gpt-4.1-nano-2025-04-14"          # Fact extraction
    embedding:
      provider: "openai"
      model: "text-embedding-3-small"           # Vector embeddings
    qdrant:
      mode: "server"
      host: "localhost"
      port: 6333
    retrieval:
      limit: 10                                 # Number of facts to retrieve
```
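Once loaded, this YAML is just a nested mapping. As a minimal sketch of how the required settings could be checked before the memory system starts (the function names and the exact nesting of `qdrant`/`retrieval` under `persistent_memory` are assumptions for illustration, not MassGen's API):

```python
# Hypothetical validator for the memory settings shown above.
REQUIRED_PATHS = [
    ("memory", "enabled"),
    ("memory", "persistent_memory", "enabled"),
    ("memory", "persistent_memory", "session_name"),
    ("memory", "persistent_memory", "vector_store"),
    ("memory", "persistent_memory", "qdrant", "port"),
    ("memory", "persistent_memory", "retrieval", "limit"),
]

def get_path(cfg, path):
    """Walk a nested dict; return None if any key along the path is missing."""
    node = cfg
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def validate_memory_config(cfg):
    """Return dotted paths of required settings that are absent."""
    return [".".join(p) for p in REQUIRED_PATHS if get_path(cfg, p) is None]

# Mirrors the YAML above as a parsed dict.
config = {
    "memory": {
        "enabled": True,
        "persistent_memory": {
            "enabled": True,
            "session_name": "research_to_implementation",
            "vector_store": "qdrant",
            "qdrant": {"mode": "server", "host": "localhost", "port": 6333},
            "retrieval": {"limit": 10},
        },
    }
}

print(validate_memory_config(config))  # [] → nothing missing
```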
Prerequisites:

```bash
# Start Qdrant server
docker run -d -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/.massgen/qdrant_storage:/qdrant/storage:z \
  qdrant/qdrant

# Start crawl4ai (for web scraping)
docker run -d -p 11235:11235 --name crawl4ai \
  --shm-size=1g unclecode/crawl4ai:latest
```
Run Session:

```bash
uv run massgen --config @examples/memory/gpt5mini_gemini_research_to_implementation.yaml
```
Turn 1 Prompt:

```
Use crawl4ai to research the latest multi-agent AI papers and techniques from 2025.
Focus on: coordination mechanisms, voting strategies, tool-use patterns, and architectural innovations.
```

Turn 2 Prompt (in same session):

```
Based on the multi-agent research from earlier, which techniques should we implement in MassGen
to make it more state-of-the-art? Consider MassGen's current architecture and what would be most impactful.
```
- Session: `session_20251029_072105`
- Duration: 11 minutes across 2 turns
Memory Stats:
Watch the recorded demo:
Persistent memory dramatically improved Turn 2’s ability to provide specific, actionable recommendations by retrieving relevant research findings from Turn 1.
Turn 1 - Research Phase (5 minutes):
Example facts stored:
“Multi-layer memory folding that includes short-term windows, episodic timelines, and semantic summaries allows agents to manage large contexts efficiently, reducing token usage while maintaining factual recall, which is crucial for long-horizon tasks and fine-tuning.”
“In 2025, multi-agent and agentic-AI systems evolved from ad-hoc multi-LLM setups to using structured workflows including hierarchical planning, task graphs, and planner-executor separations, which improve coherence, scalability, and fault tolerance.”
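The first stored fact describes multi-layer memory folding. A toy sketch of that layering, assuming a bounded short-term window whose evicted turns fold into a summary layer (the `summarize` stub stands in for an LLM call; the class and names are invented for illustration):

```python
from collections import deque

class FoldingMemory:
    """Two-layer memory: verbatim recent turns plus a folded summary layer."""

    def __init__(self, window=3):
        self.short_term = deque(maxlen=window)  # recent turns, kept verbatim
        self.summary = []                       # folded older content

    def add(self, message):
        # Before the window evicts its oldest turn, fold it into the summary.
        if len(self.short_term) == self.short_term.maxlen:
            self.summary.append(self.summarize(self.short_term[0]))
        self.short_term.append(message)

    @staticmethod
    def summarize(message):
        return message[:30]                     # stub for an LLM summary call

    def context(self):
        return {"summary": self.summary, "recent": list(self.short_term)}

mem = FoldingMemory(window=2)
for msg in ["turn 1: research findings...",
            "turn 2: voting strategies...",
            "turn 3: implementation plan..."]:
    mem.add(msg)

print(mem.context())
```

With a window of 2, the third turn pushes the first one out of the verbatim window and into the summary layer, which is the token-saving behavior the fact describes.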
Turn 2 - Implementation Phase (6 minutes):
Agents analyzed MassGen's codebase (`massgen/` and `docs/` directories).

Example automatic memory retrieval:
When Turn 2 starts, system automatically searches memories and injects relevant facts:
Retrieved fact:
“Using argumentation frameworks with evidence scoring, proficiency or reputation-weighted voting, multi-stage consensus, and human-in-the-loop arbitration are advanced voting strategies in 2025…”
This fact (from Turn 1 research) was automatically added to Turn 2’s context, enabling:
```
1) Evidence‑aware, proficiency‑weighted voting + Judge (High impact, low→medium effort)
- Replace naive majority with weighted aggregation using per‑agent proficiency scores plus evidence strength…
```
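The retrieve-and-inject step at the start of Turn 2 can be sketched as follows. This is an illustrative stand-in: MassGen uses mem0 with Qdrant and real embeddings (`text-embedding-3-small`), while the bag-of-words "embedding", the `FactStore` class, and the sample facts here are toys invented for the example:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: token counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FactStore:
    def __init__(self):
        self.facts = []  # (text, embedding) pairs

    def add(self, text):
        self.facts.append((text, embed(text)))

    def search(self, query, limit=10):  # mirrors retrieval.limit in the config
        q = embed(query)
        ranked = sorted(self.facts, key=lambda f: cosine(q, f[1]), reverse=True)
        return [text for text, _ in ranked[:limit]]

store = FactStore()
store.add("Advanced voting strategies in 2025 include evidence scoring and "
          "proficiency-weighted voting with multi-stage consensus.")
store.add("Multi-layer memory folding reduces token usage on long-horizon tasks.")

# The Turn 2 prompt acts as the query; top matches are injected into context.
retrieved = store.search("which voting techniques should we implement", limit=1)
context = "Relevant facts from memory:\n" + "\n".join(f"- {f}" for f in retrieved)
print(context)
```

The same shape applies at scale: embed the new user message, rank stored facts by similarity, and prepend the top `limit` matches to the agent's context before generation.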
No changes to voting shipped in this release; standard MassGen voting applied. The improvement came from what agents could reference during answer generation, not from how they voted.
Turn 2 Quality Comparison (Both sessions have Turn 1 answer in context):
Without Persistent Memory (log_20251029_064846, 132 lines):
Sample from baseline:

```
1) Pluggable Voting & Aggregation + Adaptive Early Stopping
- Where to change: add massgen/voting.py (or massgen/voting/ package)
- Suggested API / design sketch:
  - Aggregator (base)
    - add_vote(agent_id, result, confidence, trajectory, metadata)
```
With Persistent Memory (log_20251029_072105, 110 lines):
Referenced real repository files: `workflow_toolkits/vote.py`, `coordination_tracker.py`, `orchestrator.py`.

Sample from memory-enabled:
```
Top recommendations (what + where to change in repo)

1) Evidence‑aware, proficiency‑weighted voting + Judge (High impact, low→medium effort)

Where to implement (explicit paths):
- workflow_toolkits/vote.py — extend to accept evidence payloads and compute weighted scores
- message_templates.py — add evidence schema to agent message format
- coordination_tracker.py — track per‑agent proficiency/calibration
- orchestrator.py — surface evidence into coordinator logs and call Judge

Concrete implementation steps:
1. Extend message template with evidence: {claims:[...], tool_outputs:[...], confidence:float}
2. Implement per‑agent scoreboard (moving average success) in coordination_tracker.py
3. Update vote.py: compute final_score = α*proficiency + β*evidence_score + γ*vote_strength
4. Create Judge agent that can (a) fetch supporting sources, (b) re-run tool calls...
```
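The weighted formula in step 3 can be sketched in a few lines. The `Scoreboard` and `aggregate_votes` names, the neutral 0.5 prior, and the weight values are all hypothetical; this is not MassGen code, just the shape the agent's recommendation describes:

```python
class Scoreboard:
    """Per-agent proficiency as an exponential moving average of success."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.scores = {}

    def record(self, agent_id, success):
        prev = self.scores.get(agent_id, 0.5)  # neutral prior for unseen agents
        self.scores[agent_id] = (1 - self.alpha) * prev + self.alpha * float(success)

    def proficiency(self, agent_id):
        return self.scores.get(agent_id, 0.5)

def aggregate_votes(votes, board, a=0.4, b=0.4, g=0.2):
    """votes: list of (agent_id, candidate, evidence_score, vote_strength).

    Each vote contributes a*proficiency + b*evidence_score + g*vote_strength
    to its candidate's total; the highest-scoring candidate wins.
    """
    totals = {}
    for agent_id, candidate, evidence, strength in votes:
        score = a * board.proficiency(agent_id) + b * evidence + g * strength
        totals[candidate] = totals.get(candidate, 0.0) + score
    return max(totals, key=totals.get)

board = Scoreboard()
board.record("agent_a", True)    # agent_a has a good track record
board.record("agent_b", False)

votes = [
    ("agent_a", "answer_1", 0.9, 1.0),  # strong evidence
    ("agent_b", "answer_2", 0.2, 1.0),  # weak evidence
]
print(aggregate_votes(votes, board))  # answer_1
```

Here `answer_1` wins both because its backer has a higher moving-average proficiency (0.65 vs. 0.35) and because it carries stronger evidence, exactly the two signals the recommendation wants to combine with plain vote strength.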
Key Difference:
The memory-enabled version provides:
- Real file paths (`workflow_toolkits/vote.py` vs. the baseline's invented "add `massgen/voting.py`")

This specificity comes from:
Memory example:
“Coordination mechanisms that improve long-term coherence include hierarchical recursive planning, task decomposition with DAG structures, and planner-executor systems that maintain shared memory and intermediate artifacts.”
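The DAG-structured task decomposition this fact describes can be illustrated with Python's standard `graphlib`: subtasks declare their dependencies, and a planner-executor runs them in topological order while accumulating shared intermediate artifacts. The task names and the stub executor are invented for the example:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
tasks = {
    "gather_sources": set(),
    "summarize": {"gather_sources"},
    "rank_techniques": {"summarize"},
    "draft_plan": {"summarize", "rank_techniques"},
}

def run(task, artifacts):
    # Stand-in executor: a real system would dispatch to an agent or tool,
    # passing it the artifacts produced by its dependencies.
    artifacts[task] = f"output of {task} (inputs: {sorted(tasks[task])})"

artifacts = {}
for task in TopologicalSorter(tasks).static_order():
    run(task, artifacts)

print(list(artifacts))  # dependency-respecting execution order
```

The `artifacts` dict plays the role of the shared memory of intermediate results that the fact credits with improving long-term coherence.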
Retrieval Performance:
Cost Analysis:
Before (multi-turn with conversation history):
After (persistent memory + conversation history):
Within this session, memory enabled:
The architecture supports future cross-session retrieval, though not demonstrated in this case study.
Persistent memory enables:
Memory Quality (Current: 72% good, 28% system internals):
The custom fact extraction prompts significantly improve memory quality, but ~28% of stored facts are still system internals (voting details, agent comparisons, meta-instructions). Planned improvements:
Cross-Session Loading:
Retrieval Intelligence: