
MassGen v0.1.3: Downloading and Analyzing Existing MassGen Case Study Videos

MassGen is focused on case-driven development. This case study demonstrates MassGen v0.1.3's multimodal understanding capabilities by having agents analyze their own case study videos to identify improvements and automation opportunities: a meta-level demonstration of self-evolution.

๐Ÿค Contributing

To help guide future versions of MassGen, we encourage anyone to submit an issue using the corresponding case-study issue template, which is based on the "PLANNING PHASE" section of this document.


📋 PLANNING PHASE

๐Ÿ“ Evaluation Design

Prompt

The prompt tests whether MassGen agents can analyze their own documentation and videos to propose concrete improvements:

Download recent MassGen case study videos listed in the case study md files, analyze them, find out how to improve them and automate their creation.

This prompt requires agents to:

  1. Read local case study documentation (docs/case_studies)
  2. Extract YouTube video URLs from markdown files
  3. Download multiple videos using command-line execution (yt-dlp)
  4. Analyze video metadata and content
  5. Identify patterns, strengths, and weaknesses
  6. Propose concrete improvements to case study quality
  7. Suggest automation strategies for future case study creation
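Steps 2 and 3 are mechanical enough to script. Below is a minimal sketch of the URL-extraction step in Python; the regex and directory layout are illustrative assumptions, not MassGen's actual implementation:

```python
import re
from pathlib import Path

# Match both youtube.com/watch?v=... and youtu.be/... links (11-char video IDs).
YOUTUBE_URL = re.compile(
    r"https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)[\w-]{11}"
)

def extract_video_urls(case_study_dir: str) -> list[str]:
    """Collect unique YouTube URLs from every markdown file in a directory."""
    urls: list[str] = []
    for md_file in sorted(Path(case_study_dir).glob("*.md")):
        for match in YOUTUBE_URL.finditer(md_file.read_text(encoding="utf-8")):
            if match.group(0) not in urls:
                urls.append(match.group(0))
    return urls

if __name__ == "__main__":
    for url in extract_video_urls("docs/case_studies"):
        print(url)
```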

Baseline Config

Prior to v0.1.3, MassGen agents had no direct way to understand visual content. They could read local documentation and run command-line tools, but they could not watch a video, inspect an image, or listen to audio: any visual or audio analysis had to be performed by a human and handed back to the agents as text.

Baseline Command

# Pre-v0.1.3: No multimodal understanding capability
# Would need to manually:
# 1. Watch all case study videos
# 2. Write detailed text descriptions
# 3. Identify patterns manually
# 4. Suggest improvements based on human analysis
# Then provide those descriptions to agents

uv run massgen \
  --config massgen/configs/basic/multi/two_agents_gpt5.yaml \
  "Based on these summaries of recent MassGen case studies: [manual text summaries], suggest improvements and automation strategies"

🔧 Evaluation Analysis

Results & Failure Modes

Without multimodal understanding tools and autonomous video downloading, users would face:

No Direct Video Understanding: agents could not watch or interpret the case study videos themselves, so any insight grounded in the videos' actual content was out of reach.

Manual Analysis Bottleneck: a human had to watch every video, write detailed text descriptions, and feed those summaries to the agents before analysis could begin.

Limited Self-Evolution: MassGen could not study its own demo recordings to propose improvements, blocking the meta-level feedback loop this case study targets.

Success Criteria

The multimodal understanding tools would be considered successful if agents can:

  1. Autonomous Discovery: Find and extract video URLs from local documentation without human guidance
  2. Video Download: Use command-line tools (yt-dlp) to download videos autonomously
  3. Metadata Analysis: Extract and analyze video metadata (title, duration, formats)
  4. Concrete Improvements: Propose specific, actionable improvements to case study quality
  5. Automation Strategy: Suggest detailed strategies for automating case study creation
  6. Artifact Creation: Generate reusable scripts and documentation

🎯 Desired Features

To achieve the success criteria above, v0.1.3 needs to implement:

  1. understand_video Tool: Extract frames from video files and analyze using vision-capable models
  2. understand_image Tool: Analyze static images and screenshots
  3. understand_audio Tool: Process audio content (for video narration, podcasts, etc.)
  4. understand_file Tool: Automatically detect file type and route to appropriate analyzer
  5. Command Line Integration: Enable agents to download videos using tools like yt-dlp
  6. Docker Execution Mode: Provide isolated environment with necessary dependencies (ffmpeg, yt-dlp)
  7. Context Path Support: Allow agents to read local documentation directories
  8. Workspace-Aware Analysis: Tools should work with files in agent workspaces
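Feature 1 implies a frame-extraction step before any vision model is involved. Here is a conceptual sketch of that step using ffmpeg via subprocess; it illustrates the approach, not the actual understand_video implementation:

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: float = 0.2) -> list[Path]:
    """Sample frames from a video at a fixed rate (default: one frame every 5 s).

    The sampled frames can then be sent to a vision-capable model for analysis.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",           # sampling rate, in frames per second
            str(out / "frame_%04d.png"),   # numbered output images
        ],
        check=True,
        capture_output=True,
    )
    return sorted(out.glob("frame_*.png"))
```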

🚀 TESTING PHASE

📦 Implementation Details

Version

MassGen v0.1.3 (October 24, 2025)

✨ New Features

MassGen v0.1.3 introduces Custom Multimodal Understanding Tools: understand_video, understand_image, understand_audio, and understand_file.

Additional v0.1.3 Features: Docker-based command-line execution and context path support for reading local documentation.

New Configuration

Configuration file: massgen/configs/tools/custom_tools/multimodal_tools/youtube_video_analysis.yaml

Key features demonstrated: custom tool registration (understand_video), Docker command-line execution with bridge networking, and a read-only context path for docs/case_studies.

agents:
  - id: "agent_a"
    backend:
      type: "openai"
      model: "gpt-5-mini"
      reasoning:
        effort: "medium"
        summary: "auto"
      custom_tools:
        - name: ["understand_video"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_video.py"
          function: ["understand_video"]
      enable_mcp_command_line: true
      command_line_execution_mode: docker
      command_line_docker_enable_sudo: true
      command_line_docker_network_mode: "bridge"
      cwd: "workspace1"

  - id: "agent_b"
    backend:
      type: "claude_code"
      model: "claude-sonnet-4-5-20250929"
      custom_tools:
        - name: ["understand_video"]
          category: "multimodal"
          path: "massgen/tool/_multimodal_tools/understand_video.py"
          function: ["understand_video"]
      enable_mcp_command_line: true
      command_line_execution_mode: docker
      command_line_docker_enable_sudo: true
      command_line_docker_network_mode: "bridge"
      cwd: "workspace2"

orchestrator:
  context_paths:
    - path: "docs/case_studies"
      permission: "read"

Why Docker execution mode? It gives agents an isolated environment with the required dependencies (ffmpeg, yt-dlp) and bridge networking for downloads, without touching the host system.

Why custom_tools? Registering understand_video gives both agents the ability to extract frames from downloaded videos and analyze them with vision-capable models.

Why read access to docs/case_studies? The case study markdown files are where the YouTube video URLs live; read-only access lets agents discover the videos without being able to modify the documentation.

Command

Running the YouTube Video Analysis:

uv run massgen \
  --config massgen/configs/tools/custom_tools/multimodal_tools/youtube_video_analysis.yaml \
  "Download recent MassGen case study videos listed in the case study md files, analyze them, find out how to improve them and automate their creation."

What Happens:

  1. Discovery: Agents read local case study files from docs/case_studies directory
  2. Extraction: Agents extract YouTube video URLs from markdown files (found 17 videos)
  3. Download: Agents use yt-dlp command to download videos and metadata
  4. Analysis: Agents analyze metadata (title, duration, formats, thumbnails)
  5. Pattern Recognition: Agents identify common patterns across case studies
  6. Script Creation: Agents create reusable Python scripts for automation
  7. Requirements: Agents generate requirements.txt for reproducibility
  8. Collaboration: Agents vote on best comprehensive analysis
  9. Output: Winning answer with improvement recommendations and automation plan
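Step 3 amounts to driving yt-dlp from a script. A minimal sketch of fetching metadata without downloading the full video (the --dump-json and --skip-download flags are real yt-dlp options; the wrapper itself is illustrative):

```python
import json
import subprocess

def fetch_metadata(url: str) -> dict:
    """Ask yt-dlp for a video's metadata as JSON, without downloading media."""
    result = subprocess.run(
        ["yt-dlp", "--dump-json", "--skip-download", url],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)

if __name__ == "__main__":
    meta = fetch_metadata("https://youtu.be/<VIDEO_ID>")  # replace with a real URL
    print(meta["title"], "-", meta["duration"], "seconds")
```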

🤖 Agents

Both agents have identical capabilities, ensuring diverse perspectives on video analysis while maintaining consistent tooling. They can read local case study documentation to discover videos, download them autonomously, and collaborate through MassGen's voting mechanism.

🎥 Demo

Watch the v0.1.3 Multimodal Video Analysis demonstration:

MassGen v0.1.3 Multimodal Video Analysis Demo

Session Logs: .massgen/massgen_logs/log_20251024_075151

Duration: ~24 minutes
Coordination Events: 23
Restarts: 5 total (Agent A: 3, Agent B: 2)
Answers: 2 total (1 per agent)
Votes: 2 total (unanimous for Agent A)


📊 EVALUATION & ANALYSIS

Results

The Collaborative Process

Both agents approached the meta-analysis task with complementary strategies:

Agent A (gpt-5-mini) - Action-Oriented Approach: prioritized producing working artifacts, including Python download scripts, a video manifest, and a requirements.txt for reproducibility.

Agent B (claude_code) - Analysis-Oriented Approach: prioritized examining the case study documentation and reasoning about patterns across the videos.

Key Discoveries: 17 YouTube video URLs extracted from the case study markdown files.

Technical Challenges Encountered: some video downloads failed; the agents diagnosed root causes and recommended upgrading yt-dlp and retrying with cookies rather than giving up.

The Voting Pattern

The voting revealed clear recognition of comprehensive, actionable deliverables:

Round 1 - Initial Vote: both agents converged on Agent A's answer.

Final Outcome: Agent A's answer was selected unanimously.

Voting Statistics: 2 votes total (unanimous for Agent A), 2 answers (1 per agent), 5 restarts (Agent A: 3, Agent B: 2).

The Final Answer

Agent A's winning response included:

1. Comprehensive Artifact Delivery: reusable Python download scripts, a video manifest, and a requirements.txt for reproducibility.

2. Video Discovery Results: 17 YouTube video URLs extracted from the markdown files in docs/case_studies.

3. Technical Root-Cause Analysis: a diagnosis of why some downloads failed, with concrete fixes (upgrade yt-dlp, retry with cookies).

4. Practical Improvement Recommendations:

Creative & Metadata Improvements:

Discoverability Enhancements:

5. Automation Pipeline Proposal:

Two parallel streams:

Pipeline Components:

Suggested Repository Layout:

tools/video_pipeline/
  - generate_from_md.py
  - download_and_analyze.py
  - transcribe.py
  - upload_youtube.py
  - templates/intro.mp4, outro.mp4, music_bg.mp3
.github/workflows/build_videos.yml
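
As a rough illustration of this layout, the generate_from_md.py entry point might start as little more than a wrapper around the pandoc command listed under the reproducible commands below; only the script name comes from the proposal, the function and CLI shape are assumptions:

```python
import subprocess
import sys
from pathlib import Path

def markdown_to_slides(md_path: str, out_dir: str) -> Path:
    """Render a case study markdown file to reveal.js slides via pandoc."""
    out = Path(out_dir) / (Path(md_path).stem + ".html")
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pandoc", md_path, "-t", "revealjs", "-s", "-o", str(out)],
        check=True,
    )
    return out

if __name__ == "__main__":
    slides = markdown_to_slides(sys.argv[1], "build/slides")
    print(f"Slides written to {slides}")
```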

6. Reproducible Commands:

# Install dependencies
sudo apt-get install -y ffmpeg
pip install -U yt-dlp

# Download videos
python3 download_videos_and_analyze.py

# Transcribe video
ffmpeg -i video.mp4 -ar 16000 -ac 1 audio.wav
whisper --model small --language en audio.wav --output_format srt

# Generate slides from markdown
pandoc case-study.md -t revealjs -s -o slides.html

# Assemble video
ffmpeg -loop 1 -i slide1.png -i narration.mp3 -c:v libx264 -c:a aac -shortest out.mp4

7. Success Metrics:

8. Prioritized Next Steps:

  1. Upgrade yt-dlp and retry downloads with cookies (high impact)
  2. Transcribe successfully downloaded videos with Whisper (high impact)
  3. Prototype one automated video from markdown (medium effort, high ROI)
  4. Create GitHub Actions workflow for CI/CD

🎯 Conclusion

This case study demonstrates MassGen v0.1.3's new capabilities for downloading and analyzing multimedia content. Agents successfully:

✅ Discovered and extracted 17 YouTube video URLs from local case study documentation
✅ Downloaded video metadata autonomously using command-line tools (yt-dlp)
✅ Analyzed video content including titles, durations, formats, and thumbnails
✅ Created reusable scripts (Python download scripts, manifests, requirements.txt)
✅ Generated actionable recommendations for improving case study videos
✅ Proposed an automation pipeline for future video creation and processing

Key Achievements:

  1. End-to-End Automation: Agents completed the entire workflow from discovery to actionable recommendations without human intervention

  2. Practical Deliverables: Generated immediately usable scripts and documentation that can automate future case study video creation

  3. Tool Integration: Successfully combined multiple capabilities:
    • Reading local documentation (context paths)
    • Command-line execution (yt-dlp)
    • MCP tools (filesystem, workspace management)
    • Custom multimodal tools (understand_video)
    • Docker isolation with network access
  4. Problem-Solving: When downloads failed, agents diagnosed root causes and proposed multiple solutions rather than giving up

Impact on MassGen Development:

This case study validates the v0.1.3 multimodal features and demonstrates how agents can autonomously discover, download, and analyze multimedia content, then turn that analysis into concrete development recommendations.

The automation pipeline proposed by agents could reduce case study video creation time from hours to minutes, while maintaining consistency and quality. This demonstrates practical applications of multimodal understanding for content management and documentation workflows.

Future Directions:

Based on this session, potential future enhancements include:

This case study exemplifies how agents can autonomously download, analyze, and generate insights from real-world multimedia content, demonstrating practical applications of multimodal understanding for content analysis and workflow automation.


📌 Status Tracker

Related Issues: TBD
Related PRs: TBD
Version: v0.1.3
Date: October 24, 2025