DEV Community

Bernabé Puente Moure


How I Built an Autonomous Dataset Generator with CrewAI + Ollama (72-hour run, 1,065 entries)

Memory leak fixes and data quality drift

Background

I needed high-quality instruction datasets for fine-tuning local LLMs, but commercial options were prohibitively expensive ($500-2,000 for decent quality datasets).

So I asked myself: What if I could build a system that generates these datasets autonomously while I sleep?

Result: 1,065 professional instruction/response pairs in 72 hours, 100% local, zero API costs.

Here's exactly how I did it.


The Problem with Existing Datasets

When you want to fine-tune a local LLM for specific tasks, you face three options:

  1. Use generic datasets → Don't match your domain
  2. Manual creation → Exhausting, slow, doesn't scale
  3. Buy commercial datasets → Expensive ($500-2,000), limited customization

I wanted code instruction pairs with:

  • Python/AI/ML/DevOps/Database topics
  • Chain-of-Thought reasoning (not just answers)
  • Real-world complexity (no template responses)
  • Continuous generation (24/7 if possible)

None of the existing options delivered all four.


The Solution: Multi-Agent Autonomous System

I built a three-agent system inspired by academic research workflows:

┌──────────────────┐
│   Curator        │ ← Selects topics from knowledge base
│   (Qwen 7B)      │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Producer       │ ← Generates instruction + response + CoT
│   (Qwen 7B)      │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Critic         │ ← Validates logic, checks hallucinations
│   (DeepSeek 1.5B)│
└────────┬─────────┘
         │
         ▼
    [Accept/Reject]

Agent Roles

Curator Agent:

  • Reads ChromaDB vector store of potential topics
  • Selects next topic based on diversity + priority
  • Prevents duplicates via similarity search
  • Think of it as a research librarian

Producer Agent:

  • Generates the actual instruction/response pair
  • Includes Chain-of-Thought reasoning
  • Uses domain knowledge to create realistic scenarios
  • Think of it as the expert writer

Critic Agent:

  • Reviews Producer's output for:
    • Hallucinations
    • Logical errors
    • Incomplete reasoning
    • Generic template responses
  • Binary decision: Accept or Reject
  • Think of it as the peer reviewer
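
The Accept/Reject gate is easy to make concrete. Here is a minimal sketch of how a Critic's free-text review could be reduced to a binary verdict; the `VERDICT:` line convention and the `parse_verdict` helper are hypothetical illustrations, not part of CrewAI or this project's published code:

```python
import re

def parse_verdict(critic_output: str) -> bool:
    """Return True only if the Critic's free-text review ends in an accept.

    Assumes the Critic is prompted to close with a line like
    'VERDICT: ACCEPT' or 'VERDICT: REJECT' (a prompt convention, not an API).
    """
    match = re.search(r"VERDICT:\s*(ACCEPT|REJECT)", critic_output, re.IGNORECASE)
    if match is None:
        return False  # no explicit verdict: safest to reject
    return match.group(1).upper() == "ACCEPT"

assert parse_verdict("Reasoning is sound.\nVERDICT: ACCEPT") is True
assert parse_verdict("Hallucinated API call.\nVERDICT: REJECT") is False
assert parse_verdict("model rambled with no verdict") is False
```

Defaulting to reject when no explicit verdict is found keeps ambiguous model output from leaking into the dataset.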

Tech Stack

Core Components

# 1. Ollama (Local LLM Engine)
# Custom Modelfiles with 8k context:
qwen2.5-coder-8k:7b    # 4.7GB, main generator
deepseek-r1-8k:1.5b     # 1.1GB, validator

# 2. CrewAI (Agent Orchestration)
from crewai import Agent, Crew, Task

# 3. ChromaDB (Deduplication + Memory)
from chromadb import Client

# 4. Flask (Real-time Dashboard)
from flask import Flask, render_template

Hardware

  • AMD Ryzen AI 9 HX 370 (12 cores, 32GB RAM shared as VRAM)
  • Mini PC (Geekom A9 Max, ~$899)
  • SSD NVMe (for fast model loading)

Total investment: $899 + ~$3.60 electricity for 72 hours


Implementation Details

Challenge 1: Memory Leaks in Long Runs

Problem: Agent instances accumulate state. After ~100 cycles, performance degrades → crashes.

Solution: Recreate all agents every cycle.

from crewai import Agent, Crew, Task

for cycle in range(1000):
    # Recreate agents from scratch (prevents memory leaks)
    curator = Agent(
        role="Topic Curator",
        goal="Select next topic to generate",
        llm=ollama_qwen
    )

    producer = Agent(
        role="Content Producer", 
        goal="Generate high-quality instruction pair",
        llm=ollama_qwen
    )

    critic = Agent(
        role="Quality Critic",
        goal="Validate logic and catch hallucinations",
        llm=ollama_deepseek
    )

    # Define sequential workflow
    tasks = [
        Task(description="Select topic", agent=curator),
        Task(description="Generate content", agent=producer),
        Task(description="Validate quality", agent=critic)
    ]

    crew = Crew(agents=[curator, producer, critic], tasks=tasks)
    result = crew.kickoff()

    # Explicit cleanup
    del crew, curator, producer, critic

Result: Zero crashes in 72 hours. Stable RAM at 24.2 GB.
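
The leak pattern is reproducible in miniature with nothing but the standard library. This sketch simulates an orchestrator that retains references to per-cycle agent state versus one that recreates everything each cycle; the dict-based "agents" are stand-ins, not CrewAI objects:

```python
import gc
import tracemalloc

def run_cycles(n_cycles, leak=False):
    """Simulate agent cycles; return peak traced allocation in KB."""
    retained = []  # plays the role of a long-lived orchestrator reference
    tracemalloc.start()
    for _ in range(n_cycles):
        # stand-in for curator/producer/critic instances carrying history state
        agents = [{"role": r, "history": ["entry"] * 1000}
                  for r in ("curator", "producer", "critic")]
        if leak:
            retained.append(agents)  # reference survives the cycle: memory grows
        else:
            del agents               # recreate-from-scratch pattern: each cycle is self-contained
            gc.collect()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1024

print(f"leaky peak: {run_cycles(50, leak=True):.0f} KB")
print(f"clean peak: {run_cycles(50, leak=False):.0f} KB")
```

Taking the same two readings on the real pipeline (one tracemalloc snapshot per cycle) is a cheap way to confirm the recreate-per-cycle fix is actually holding over a multi-day run.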


Challenge 2: CrewAI + Ollama Compatibility

Problem: CrewAI v1.12+ requires string-only responses, but Ollama returns objects by default.

Solution: Embed configuration in Modelfiles (permanent fix).

# Modelfile for qwen2.5-coder-8k:7b
FROM qwen2.5-coder:7b

PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM """You are an expert programmer..."""

Run:

ollama create qwen2.5-coder-8k:7b -f Modelfile_qwen
ollama create deepseek-r1-8k:1.5b -f Modelfile_deepseek

This makes num_ctx: 8192 permanent at the model level. No runtime config needed.


Challenge 3: Duplicate Questions

Problem: Random topic generation creates duplicates.

Solution: ChromaDB similarity search + rejection.

import chromadb

client = chromadb.Client()
collection = client.create_collection(
    "generated_questions",
    metadata={"hnsw:space": "cosine"}  # so distance = 1 - cosine similarity
)

def is_duplicate(new_question, threshold=0.85):
    """Check if question already exists via embedding similarity"""
    results = collection.query(
        query_texts=[new_question],
        n_results=1
    )

    # query() returns one list of hits per query text
    if not results['ids'][0]:
        return False  # collection is still empty

    # ChromaDB reports *distances* (smaller = more similar)
    similarity = 1 - results['distances'][0][0]
    return similarity > threshold

# Before saving
if not is_duplicate(instruction):
    save_to_dataset(entry)
    collection.add(
        documents=[instruction],
        ids=[unique_id]
    )

Result: 452 unique topics across 1,065 entries (zero duplicates).
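
One detail worth pinning down, because it's an easy bug to ship: `collection.query` returns *distances*, not similarities, so a near-duplicate comes back close to 0, not close to 1. A self-contained sketch of the relationship, assuming the collection uses the cosine space (`hnsw:space: cosine`); the vectors here are toy stand-ins for real embeddings:

```python
import math

def cosine_distance(a, b):
    """Cosine distance as ChromaDB reports it in cosine space:
    0.0 for identical direction, up to 2.0 for opposite vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def is_near_duplicate(distance, similarity_threshold=0.85):
    # convert the reported distance back to a similarity before comparing
    return (1.0 - distance) > similarity_threshold

same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])  # 0.0: duplicate
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1.0: unrelated
assert is_near_duplicate(same_direction)
assert not is_near_duplicate(orthogonal)
```

Comparing the raw distance against a similarity threshold inverts the test and silently accepts duplicates, which is why it pays to do the conversion explicitly.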

🔥 Update (April 20, 2026)

The system kept running.

After the initial 72-hour run, I left the autonomous system active. Six days later, it had generated 2,772 total entries — a 160% increase from the original dataset.

New Metrics

| Metric | Original (72h) | Now (9 days) | Growth |
|--------|----------------|--------------|--------|
| Total Entries | 1,065 | 2,772 | +160% 🚀 |
| Python | 346 | 925 | +167% |
| AI/ML | 287 | 725 | +153% |
| Database | 99 | 310 | +213% |
| Backend/API | 126 | 296 | +135% |
| DevOps/Cloud | 99 | 225 | +127% |
| Other | 108 | 291 | +169% |
| System Stability | 0 crashes | 0 crashes | unchanged |

Key insight: The autonomous architecture proved its long-term viability. Zero crashes, consistent quality (acceptance rate remained ~50%), and continuous 24/7 generation while I worked on other projects.

Updated free datasets: All 5 HuggingFace LITE repos now contain the expanded data (462, 362, 155, 148, 112 entries respectively). Check the links below for the latest versions.


Results After 72 Hours

Metrics

| Metric | Value |
|--------|-------|
| Total Entries | 1,065 |
| Success Rate | 50.9% (605 accepted / 460 rejected) |
| Unique Topics | 452 |
| Velocity | 14.4 entries/hour |
| RAM Usage | 24.2 GB (stable) |
| CPU Usage | 31.1% average |
| Crashes | 0 |

Distribution by Category

  • Python: 346 entries (33%)
  • AI/ML: 287 entries (27%)
  • Backend/API: 126 entries (12%)
  • Database/SQL: 99 entries (9%)
  • DevOps/Cloud: 99 entries (9%)
  • Other: 108 entries (10%)

Quality Samples

Example 1: Python Development

{
  "instruction": "How to create a Dockerfile for a Python application?",
  "input": "",
  "output": "# Step-by-step guide to create a production-ready Dockerfile:\n\n1. **Choose Base Image**\n   - Use official Python image: `python:3.11-slim`\n   - For smaller size: `python:3.11-alpine`\n\n2. **Set Working Directory**\n   ```dockerfile\n   WORKDIR /app\n   ```\n\n3. **Copy Requirements First** (leverage Docker cache)\n   ```dockerfile\n   COPY requirements.txt .\n   RUN pip install --no-cache-dir -r requirements.txt\n   ```\n\n4. **Copy Application Code**\n   ```dockerfile\n   COPY . .\n   ```\n\n5. **Set Environment Variables**\n   ```dockerfile\n   ENV PYTHONUNBUFFERED=1\n   ENV PORT=8000\n   ```\n\n6. **Expose Port**\n   ```dockerfile\n   EXPOSE $PORT\n   ```\n\n7. **Define Entrypoint**\n   ```dockerfile\n   CMD [\"python\", \"app.py\"]\n   ```\n\n**Complete Example:**\n```dockerfile\nFROM python:3.11-slim\nWORKDIR /app\nCOPY requirements.txt .\nRUN pip install --no-cache-dir -r requirements.txt\nCOPY . .\nENV PYTHONUNBUFFERED=1\nEXPOSE 8000\nCMD [\"uvicorn\", \"main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\"]\n```\n\n**Best Practices:**\n- Use multi-stage builds for smaller images\n- Don't run as root (create user)\n- Use .dockerignore to exclude unnecessary files\n- Pin dependency versions in requirements.txt",
  "thought": "User needs containerization setup. Key considerations: 1) Layer caching optimization (copy requirements first), 2) Security (non-root user), 3) Image size (slim vs alpine tradeoff), 4) Production readiness (health checks, signals). Template should be practical and immediately usable."
}

Character counts:

  • Output: 928 chars
  • Thought: 266 chars

Example 2: AI/ML Engineering

{
  "instruction": "Explain the difference between EC2 instances and Lambda functions for ML model deployment",
  "input": "",
  "output": "**EC2 Instances vs Lambda for ML Deployment:**\n\n**EC2 Instances (Virtual Machines)**\n\n*Pros:*\n- Full control over environment (custom libraries, CUDA, etc.)\n- No execution time limits\n- Better for large models (multi-GB)\n- GPU support (P3, G4 instances)\n- Persistent connections (WebSockets, streaming)\n- Predictable latency\n\n*Cons:*\n- Always running = always paying ($50-500+/month)\n- Manual scaling configuration\n- Need to manage OS, security patches\n- Cold start when scaling from 0\n\n*Best for:*\n- Real-time inference with steady traffic\n- Models >1GB\n- GPU-intensive workloads\n- Long-running predictions (>15 min)\n\n**Lambda Functions (Serverless)**\n\n*Pros:*\n- Pay per invocation ($0.20 per 1M requests)\n- Auto-scaling (0 to thousands)\n- No server management\n- Built-in HA/fault tolerance\n\n*Cons:*\n- 15-minute timeout\n- Limited to 10GB RAM\n- No GPU support\n- Cold starts (1-5 seconds)\n- Package size limit (250MB unzipped)\n\n*Best for:*\n- Batch predictions\n- Small models (<100MB)\n- Sporadic/unpredictable traffic\n- Budget constraints\n\n**Hybrid Approach:**\nUse SageMaker Serverless Inference:\n- Lambda-like pricing\n- Support for larger models\n- Automatic scaling\n- GPU available\n\n**Decision Tree:**\n```\nModel size >1GB? → EC2\nGPU required? → EC2\nConstant traffic? → EC2\nBudget <$50/mo? → Lambda\nUnpredictable spikes? → Lambda\n```",
  "thought": "User comparing infrastructure options for ML deployment. Critical factors: cost structure (fixed vs pay-per-use), model size constraints, GPU requirements, latency tolerance, traffic patterns. Should provide concrete decision criteria and mention SageMaker Serverless as middle ground. Real-world cost examples help contextualize."
}

Character counts:

  • Output: 1,876 chars
  • Thought: 261 chars

Cost Comparison

My Approach (Local)

  • Hardware: $899 one-time (reusable forever)
  • Electricity: $3.60 for 72 hours
  • Per entry: $0.85 amortized (or $0.003 electricity only)
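
The per-entry figures fall straight out of the run totals; a quick check of the arithmetic:

```python
hardware = 899.00      # one-time cost; the hardware is reusable
electricity = 3.60     # measured for the 72-hour run
entries = 1065

amortized = (hardware + electricity) / entries   # first-run cost per entry
marginal = electricity / entries                 # steady-state cost per entry

print(f"amortized: ${amortized:.2f}/entry")      # ~$0.85
print(f"marginal:  ${marginal:.4f}/entry")       # ~$0.0034
```

Every subsequent run only pays the marginal rate, which is where the comparison against per-token API pricing gets lopsided.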

Commercial APIs (for 1,065 entries)

  • GPT-4 API: ~$850 (800 tokens avg × $0.001/token × 1,065)
  • Claude API: ~$1,200 (longer responses × higher price)
  • OpenRouter DeepSeek: ~$150 (cheapest API option)

Savings: 99%+ (after hardware pays for itself in 2-3 runs)


Lessons Learned

1. Quality Over Quantity

Rejecting nearly half of all generated entries seems harsh, but it's a feature, not a bug.

The Critic agent catches:

  • ❌ Hallucinated facts ("Python 4.0 was released...")
  • ❌ Incomplete reasoning ("Just use library X" without explanation)
  • ❌ Generic templates ("Here are 5 steps...")
  • ❌ Logic errors (contradictory statements)

Only entries that pass rigorous validation make it through.

Result: Every entry is production-quality.


2. Hardware Performs Better Than Expected

I was skeptical about running two LLMs simultaneously on 32GB RAM. Turns out AMD's shared VRAM architecture works great:

  • During generation: 24.2 GB RAM (87% utilization)
  • CPU: 31.1% average (not maxed, room for optimization)
  • Cycle time: ~6 minutes per entry (2 models sequentially)
  • Stability: Zero crashes in 72 hours

For comparison, this would cost $40-100/month in cloud GPU time.


3. Automation Compounds

The system ran while I:

  • Slept (8 hours × 3 nights = ~400 entries)
  • Worked (8 hours × 3 days = ~400 entries)
  • Coded other projects (8 hours × 3 days = ~265 entries)

Total: 1,065 entries with ~10 minutes of manual intervention (checking dashboard once/day).

That's the power of autonomous systems.


Free Datasets Released

I just uploaded 5 LITE versions (50% of data, 100% free) to HuggingFace:

🐍 Python Developer Dataset

173 entries covering:

  • Docker/containerization
  • Testing patterns
  • API development
  • Package management
  • Best practices

🤖 AI/ML Engineering Dataset

143 entries covering:

  • Model deployment (EC2, Lambda, SageMaker)
  • MLOps patterns
  • Training optimization
  • Cloud ML services
  • LLM fine-tuning

☁️ DevOps & Cloud Dataset

49 entries covering:

  • CI/CD pipelines
  • Infrastructure as Code
  • Kubernetes/Docker
  • Monitoring & logging
  • AWS/GCP/Azure

🔌 Backend & API Dataset

63 entries covering:

  • REST/GraphQL design
  • Authentication patterns
  • Database optimization
  • Caching strategies
  • Microservices

💾 Database & SQL Dataset

49 entries covering:

  • Query optimization
  • Schema design
  • Transactions & ACID
  • NoSQL patterns
  • Performance tuning

Format: JSONL with instruction, input, output, thought fields

License: MIT (use however you want)

Ready for: HuggingFace Transformers, Axolotl, LLaMA Factory
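
Since every entry carries the Alpaca-style instruction/input/output fields, turning one into a single training string is straightforward. A minimal sketch; the exact prompt template is an assumption here, and each framework (Axolotl, LLaMA Factory) has its own configurable variant:

```python
def to_prompt(entry: dict) -> str:
    """Render one dataset entry as an Alpaca-style training string."""
    if entry.get("input"):
        return (
            f"### Instruction:\n{entry['instruction']}\n\n"
            f"### Input:\n{entry['input']}\n\n"
            f"### Response:\n{entry['output']}"
        )
    # entries in this dataset typically leave "input" empty
    return (
        f"### Instruction:\n{entry['instruction']}\n\n"
        f"### Response:\n{entry['output']}"
    )

# illustrative entry, not taken from the dataset
entry = {"instruction": "Reverse a list in Python",
         "input": "",
         "output": "Use reversed(), list.reverse(), or slicing with [::-1]."}

prompt = to_prompt(entry)
assert prompt.startswith("### Instruction:")
assert "### Input:" not in prompt  # the empty input section is skipped
```

The `thought` field is deliberately left out of the prompt above; whether to train on the Chain-of-Thought as part of the response is a per-experiment choice.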


What's Next?

V2 Features (in progress)

  1. Multilingual datasets (Spanish, French, German)
  2. Difficulty levels (Beginner/Intermediate/Advanced)
  3. Custom domains (on-demand generation for specific niches)
  4. Video tutorials (system walkthrough + setup guide)

Open-Source Plans

If there's enough interest, I'll release:

  • Complete agent definitions
  • Modelfile configurations
  • ChromaDB setup scripts
  • Dashboard code (Flask + real-time metrics)
  • Consolidation scripts (JSONL → categorized products)

Update: Based on comments, I'll publish a setup guide by the end of the week.


Try It Yourself

Quick Start

  1. Download a dataset:
   wget https://huggingface.co/datasets/bernabepuente/python-instruction-dataset/resolve/main/dataset.jsonl
  2. Load in Python:
   import json

   with open('dataset.jsonl', 'r') as f:
       data = [json.loads(line) for line in f]

   print(f"Loaded {len(data)} entries")
   print(f"Sample: {data[0]}")
  3. Fine-tune with Transformers:
   from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer

   # Load your base model
   model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

   # Prepare dataset
   def format_prompt(entry):
       return f"Instruction: {entry['instruction']}\n\nResponse: {entry['output']}"

   # Train!
   # (full training code in dataset README)

Questions I'm Happy to Answer

Drop a comment if you want to know about:

  • Optimizing Ollama for long runs
  • CrewAI agent patterns for production
  • ChromaDB deduplication strategies
  • Dataset quality validation
  • Hardware recommendations for local LLM workflows
  • Monetization experiments (Scale AI, Gumroad, HF Premium)

Conclusion

Building autonomous systems that work while you sleep is incredibly satisfying.

This project taught me:

  • Local LLMs can match commercial quality (with good prompts)
  • Multi-agent validation catches errors humans would miss
  • Hardware investment pays for itself quickly
  • Automation compounds exponentially

Total time invested: ~20 hours (setup + monitoring)

Total output: 1,065 professional dataset entries

Time per entry: 1.1 minutes of human time

Try the datasets and let me know what you think!


Links:

Tech Stack:
Ollama • CrewAI • ChromaDB • Flask • Python 3.11 • AMD Ryzen AI 9

License: MIT (all datasets)

Top comments (10)

Archit Mittal

1,065 entries over 72 hours is a useful data point for anyone planning CrewAI + local model runs. The bit I'd love to see quantified: what percentage passed your quality filter? In my experience, unsupervised generation with small local models hits a long tail of near-duplicate entries past ~200 unless you add semantic dedup in the loop. A trick that helped me: embed each generated entry as you go and reject anything with cosine similarity >0.9 to existing entries. Kills the duplicate spiral and keeps the distribution wider.

Bernabé Puente Moure

Thank you so much!! I'm taking note of your tip...

Survivor Forge

The memory leak problem you solved with explicit cleanup is the same class of issue we hit running n8n automation workflows over 72+ hour windows — agent frameworks that reuse object instances across many tasks tend to accumulate state references that the GC never collects because the orchestrator holds a live reference to the agent object. Your solution (recreate agent instances per batch) is the right one; the alternative is a subprocess-per-task model which adds overhead but gives you a clean slate from the OS. One non-obvious risk with the Critic-reject loop at scale: if your Critic is too conservative it creates a feedback signal that biases the Producer toward safe, generic outputs over time because those pass more reliably — you may want to track your accept rate per topic category and tune the Critic threshold separately for domains where generality is actually fine vs. domains where precision matters.

Bernabé Puente Moure

Exactly! That's right, and that's how we've corrected it...

Mykola Kondratiuk

72 hours continuous run is the real flex here. one thing I kept hitting with similar long-running agent setups - quality drift after the first 24h. curious if you saw variance in the dataset entries as the run extended.

Bernabé Puente Moure

Hi! I've continued running the system and the data quality has remained consistent over time...

Mykola Kondratiuk

Good data point — consistent quality past 24h usually means your state management is solid. The drift pattern I've seen kicks in when context accumulates without pruning. What's your checkpoint frequency looking like?

mote

The three-agent Curator-Evaluator-Generator pipeline is a solid architectural choice — separating selection from generation gives you a natural quality gate before storage.

For the 72-hour continuous run, how did you handle context window saturation? As the conversation history grows, do you reset the agent states periodically, or does Ollama handle long-context reasoning effectively at that scale?

Also curious about the evaluation criteria — did you find that the Generator improved in quality over time within a single run, or was each cycle essentially independent?

Curtis Reker

72-hour autonomous run is impressive. I've been doing something similar with a persistent agent — the key insight is checkpointing. If your pipeline dies at hour 48, you don't want to restart from zero. Did you build in resume logic, or did it just run clean the whole way through?

SleepyQuant

The "autonomous while I sleep" framing stuck with me. My own stack has a paper-trading loop running 24/7 and the cost math is wild once you stop paying per call. Different domain, same thesis.
How did you decide when the 72-hour run was "done"? Entry count, time cap, or some quality signal from the Critic telling you to stop?