The Ultimate Fine-Tuning Dataset Builder
Create, manage, and export high-quality datasets for AI model training - no coding required
Quick Start | Features | Installation | User Guide | Use Cases
DataForge Studio is a visual tool that helps you prepare training data for AI language models (like ChatGPT, Llama, or Mistral).
Think of it as a "dataset kitchen" where you can:
- Import your data from various sources (files, websites, HuggingFace)
- Clean & Improve your data with AI-powered assistance
- Validate that your data meets quality standards
- Export ready-to-use training packages for popular frameworks
No programming skills required — everything happens through an intuitive web interface.
| You Are... | DataForge Helps You... |
|---|---|
| AI Researcher | Quickly prepare and validate datasets for fine-tuning experiments |
| ML Engineer | Convert between formats and generate training configs automatically |
| Data Scientist | Analyze dataset quality, find issues, and fix them in bulk |
| Content Creator | Turn your documents (PDFs, Word files) into training data |
| Hobbyist/Learner | Get started with fine-tuning without complex data pipelines |
| Source | What You Can Import |
|---|---|
| Files | JSONL, JSON, CSV, Parquet, Excel, PDF, Word documents |
| Clipboard | Paste JSON/JSONL directly from your clipboard |
| Web | Import from URLs or HuggingFace Hub datasets |
| Documents | Convert PDFs, Word docs, and PowerPoints into Q&A pairs |
DataForge automatically recognizes your data format:
- Alpaca — `instruction`, `input`, `output` fields
- ShareGPT — `conversations` array format
- ChatML/OpenAI — `messages` array with roles
No manual configuration needed — just upload and go!
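For reference, the three formats differ mainly in their top-level fields. The sketch below is illustrative only (it is not DataForge's actual detection code) and shows a minimal example of each plus a naive way to tell them apart:

```python
# Illustrative examples of the three formats and a naive detection heuristic.
# This is NOT DataForge's internal logic, just a sketch of the idea.

alpaca = {
    "instruction": "Summarize the text.",
    "input": "DataForge prepares fine-tuning datasets.",
    "output": "It is a tool for building training data.",
}

sharegpt = {
    "conversations": [
        {"from": "human", "value": "What is DataForge?"},
        {"from": "gpt", "value": "A visual dataset preparation tool."},
    ]
}

chatml = {
    "messages": [
        {"role": "user", "content": "What is DataForge?"},
        {"role": "assistant", "content": "A visual dataset preparation tool."},
    ]
}

def guess_format(record: dict) -> str:
    """Guess the format from which top-level keys are present."""
    if "messages" in record:
        return "chatml"
    if "conversations" in record:
        return "sharegpt"
    if {"instruction", "output"} <= record.keys():
        return "alpaca"
    return "unknown"

for name, record in [("alpaca", alpaca), ("sharegpt", sharegpt), ("chatml", chatml)]:
    print(name, "->", guess_format(record))
```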
Improve your dataset quality with one click:
- Improve Quality — Enhance response clarity and accuracy
- Add Reasoning — Include step-by-step thinking in responses
- Expand Responses — Make answers more detailed
- Add Code Examples — Insert relevant code snippets
- Simplify — Make complex responses easier to understand
Catch problems before training:
- Empty or missing messages
- Unbalanced response lengths
- Encoding issues
- Duplicate entries
- Sensitive data (PII) detection
Auto-fix common issues with a single click!
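To make the checks concrete, here is a minimal sketch of two of them (empty messages and exact duplicates) over ChatML-style records. DataForge's own validation covers far more cases; this is only to show the idea:

```python
# Minimal sketch of two checks from the list above: empty messages and exact
# duplicates. Illustrative only -- not DataForge's internal validation code.
import json

def find_issues(examples: list[dict]) -> list[str]:
    issues = []
    seen = set()
    for i, ex in enumerate(examples):
        messages = ex.get("messages", [])
        if any(not m.get("content", "").strip() for m in messages):
            issues.append(f"example {i}: empty message")
        key = json.dumps(messages, sort_keys=True)
        if key in seen:
            issues.append(f"example {i}: duplicate entry")
        seen.add(key)
    return issues

examples = [
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello!"}]},
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": ""}]},        # empty reply
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello!"}]},  # duplicate
]
print(find_issues(examples))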
Generate ready-to-train packages for:
| Framework | What You Get |
|---|---|
| Axolotl | YAML config + formatted dataset |
| Unsloth | Python training script + dataset |
| LLaMA-Factory | dataset_info.json + dataset |
| Torchtune | YAML config + dataset |
Each export includes a README with training instructions!
Pre-configured for the latest models:
- Meta: Llama 4, Llama 3.3, Llama 3.2, Llama 3.1
- Alibaba: Qwen 3, Qwen 2.5, Qwen 2.5 Coder
- Google: Gemma 3, Gemma 2
- Mistral: Mistral Large, Small, Nemo
- Microsoft: Phi-4, Phi-3.5
- DeepSeek: DeepSeek V3, DeepSeek R1
- Others: Command R, OLMo 2, SmolLM2
Get DataForge running in under 5 minutes:
```bash
# Clone the repository
git clone https://github.com/natekali/dataforge.git
cd dataforge

# Start with Docker
docker compose up
```

Open http://localhost:3000 in your browser. Done!
Or, to run without Docker:

```bash
# Clone the repository
git clone https://github.com/natekali/dataforge.git
cd dataforge

# Install dependencies
pnpm install

# Start the application
pnpm dev
```

Open http://localhost:3000 in your browser.
| Software | Version | Download Link |
|---|---|---|
| Node.js | 22 or newer | nodejs.org |
| Python | 3.12 or newer | python.org |
| pnpm | 9 or newer | Run: npm install -g pnpm |
| uv | Latest | Run: pip install uv |
```bash
# Clone the repository
git clone https://github.com/natekali/dataforge.git
cd dataforge

# Install Node dependencies
pnpm install

# Install Python dependencies for the API
cd apps/api
uv sync
cd ../..
```

Run everything together:

```bash
pnpm dev
```

Or run separately (recommended for development):

Terminal 1 - Start the web interface:

```bash
pnpm dev:web
```

Terminal 2 - Start the API server:

```bash
cd apps/api
uv run uvicorn dataforge_api.main:app --reload --port 8000
```

| Service | URL | Description |
|---|---|---|
| Web App | http://localhost:3000 | Main interface |
| API Docs | http://localhost:8000/docs | Interactive API documentation |
For a containerized setup with no local dependencies:
```bash
# Development mode (with live code reloading)
docker compose --profile dev up

# Production mode
docker compose up -d
```

To create your first project:

- Open DataForge at http://localhost:3000
- Click "New Project"
- Give your project a name (e.g., "Customer Support Bot")
- Optionally select your target model (e.g., "Llama 3.3")
- Click "Create"
You have several options for importing your data:
- Click the "Upload" area or drag-and-drop your file
- Supported formats: `.jsonl`, `.json`, `.csv`, `.parquet`, `.xlsx`, `.pdf`, `.docx`
- DataForge will auto-detect the format
- Review the preview and click "Import"
- Click the "Paste" tab
- Paste your JSON or JSONL data
- Click "Import"
- Click "HuggingFace" tab
- Search for a dataset (e.g., "tatsu-lab/alpaca")
- Select the split (train/test/validation)
- Click "Import"
- Click "Documents" tab
- Upload a PDF, Word doc, or PowerPoint
- DataForge will generate Q&A pairs from the content
- Review and import the generated examples
After importing, you'll see your dataset in a table view:
- Browse: Scroll through all your examples
- Search: Find specific content using the search bar
- Edit: Click any example to open the conversation editor
- Delete: Remove unwanted examples individually or in bulk
- Click on any example row
- The Conversation Editor opens
- You can:
- Edit the text of any message
- Add new messages (system, user, or assistant)
- Reorder messages
- Delete messages
- Click "Save" when done
- Go to the "Quality" section
- Click "Analyze Quality"
- Review the quality score and any issues found:
| Issue Type | What It Means |
|---|---|
| Empty Messages | Some messages have no content |
| Missing Roles | Conversations missing user or assistant messages |
| Length Imbalance | Response is too short/long compared to the question |
| Duplicates | Same content appears multiple times |
| PII Detected | Personal information found (emails, phone numbers) |
- Click "Auto-Fix" to automatically resolve common issues
Make your dataset better with AI assistance:
- Go to the "Enhance" section
- Choose an enhancement type:
- Improve Quality: Better grammar, clarity, accuracy
- Add Reasoning: Include step-by-step thinking
- Expand: Make responses more detailed
- Add Code: Include code examples
- Simplify: Make responses easier to understand
- Select which examples to enhance
- Click "Enhance" and wait for processing
- Go to the "Export" section
- Choose your target framework:
- Axolotl — Popular, flexible training framework
- Unsloth — Fast 4-bit training
- LLaMA-Factory — Easy-to-use trainer
- Torchtune — PyTorch native training
- Configure options:
- Include/exclude system prompts
- Create train/test split
- Click "Download"
- You'll get a ZIP file containing:
- Your formatted dataset
- Training configuration file
- README with instructions
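After downloading, a quick way to sanity-check the package is to open the ZIP and count the examples. The file name used below is an assumption for illustration; list the archive contents to see the actual layout of your export:

```python
# Sanity-check an exported package. File names inside the ZIP are assumed
# for illustration; list the archive contents to see the real layout.
import json
import zipfile

with zipfile.ZipFile("dataset.zip") as zf:
    print(zf.namelist())  # inspect what the export actually contains
    # "train.jsonl" is a hypothetical name; substitute the file you see above
    with zf.open("train.jsonl") as f:
        rows = [json.loads(line) for line in f if line.strip()]

print(f"{len(rows)} training examples")
print(rows[0])
```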
Goal: Create a bot that answers questions about your product
- Import your existing FAQ document (PDF or Word)
- DataForge generates Q&A pairs automatically
- Review and edit the generated conversations
- Enhance responses with AI to add more detail
- Export for Axolotl and train your model
Goal: Train a model to help with programming tasks
- Import coding examples from HuggingFace
- Convert to ChatML format for your target model
- Validate against your model's context length
- Add reasoning to show step-by-step problem solving
- Export with code-optimized settings
Goal: Improve a messy dataset you found online
- Import the dataset from HuggingFace or upload a file
- Run quality checks to find issues
- Auto-fix common problems (duplicates, encoding)
- Remove low-quality examples
- Export the cleaned dataset
Goal: Use an Alpaca dataset with a ShareGPT-expecting tool
- Import your Alpaca-format dataset
- Go to Format Conversion
- Select ShareGPT as target format
- Preview the conversion
- Apply and export in new format
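The mapping behind this conversion is straightforward. A minimal sketch of it, for reference only (DataForge performs the conversion for you through the UI):

```python
# Sketch of the Alpaca -> ShareGPT mapping; illustrative only.
def alpaca_to_sharegpt(record: dict) -> dict:
    user_turn = record["instruction"]
    if record.get("input"):
        user_turn += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": user_turn},
            {"from": "gpt", "value": record["output"]},
        ]
    }

alpaca = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
print(alpaca_to_sharegpt(alpaca))
```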
Goal: Create training data with chain-of-thought reasoning
- Import your instruction-response pairs
- Use "Add Reasoning" enhancement
- AI adds `<think>` tags with step-by-step reasoning
- Validate the enhanced responses
- Export for DeepSeek R1 or similar reasoning models
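For a sense of the output, an enhanced record might look roughly like this (the exact reasoning text DataForge generates will differ):

```python
# Rough shape of a record after the "Add Reasoning" enhancement; the actual
# generated reasoning will differ.
enhanced = {
    "messages": [
        {"role": "user", "content": "What is 17 * 24?"},
        {
            "role": "assistant",
            "content": (
                "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>\n"
                "17 * 24 = 408."
            ),
        },
    ]
}
print(enhanced["messages"][1]["content"])
```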
| Feature | Description |
|---|---|
| Multi-Format Support | JSONL, JSON, CSV, Parquet, Excel, PDF, Word, PowerPoint, HTML, Markdown |
| Auto-Detection | Automatically identifies Alpaca, ShareGPT, or ChatML format |
| Field Mapping | Smart suggestions for mapping your columns to standard fields |
| Large File Handling | Efficiently processes files with millions of examples |
| Validation | Checks data structure before import |
| Feature | Description |
|---|---|
| Table View | Browse all examples in a sortable, searchable table |
| Conversation Editor | Visual editor for multi-turn conversations |
| Bulk Operations | Select and modify multiple examples at once |
| Undo Support | Revert changes if something goes wrong |
| Live Preview | See formatted output as you edit |
| Feature | Description |
|---|---|
| Quality Scoring | 0-100 score for each example and overall dataset |
| Issue Detection | Finds 15+ types of common problems |
| Auto-Fix | One-click fixes for encoding, whitespace, duplicates |
| PII Masking | Automatically redact sensitive information |
| Model Validation | Check compatibility with your target model |
| Feature | Description |
|---|---|
| Quality Improvement | Better grammar, clarity, factual accuracy |
| Reasoning Addition | Chain-of-thought with `<think>` tags |
| Response Expansion | More detailed, comprehensive answers |
| Code Examples | Relevant code snippets added where appropriate |
| Simplification | Complex responses made accessible |
| Synthetic Generation | Create new examples from existing ones |
| Feature | Description |
|---|---|
| Framework Configs | Auto-generated YAML/JSON for training tools |
| Format Conversion | Convert between Alpaca, ShareGPT, ChatML |
| Train/Test Split | Automatically split data for evaluation |
| HuggingFace Push | Upload directly to HuggingFace Hub |
| Batch Export | Export multiple projects at once |
| Feature | Description |
|---|---|
| Dataset Statistics | Example count, token counts, message distribution |
| Quality Metrics | Score distribution, issue breakdown |
| Token Analysis | Estimate training costs and time |
| System Logs | Real-time activity monitoring for debugging |
| Job Tracking | Monitor background processing tasks |
| Provider | Models | Context Length | Best For |
|---|---|---|---|
| Meta | Llama 4 Scout/Maverick | 10M tokens | General purpose, production |
| Meta | Llama 3.3 70B | 128K tokens | Fine-tuning, production |
| Meta | Llama 3.2 1B-90B | 128K tokens | Edge deployment, vision |
| Alibaba | Qwen 3 (0.6B-235B) | 32K tokens | Multilingual, fine-tuning |
| Alibaba | Qwen 2.5 Coder | 32K tokens | Code generation |
| Google | Gemma 3 (1B-27B) | 128K tokens | Vision, fine-tuning |
| Mistral | Mistral Large/Small/Nemo | 32-128K tokens | Fine-tuning, production |
| Microsoft | Phi-4 14B | 16K tokens | Edge deployment |
| DeepSeek | DeepSeek R1 | 64K tokens | Reasoning, fine-tuning |
| Others | OLMo 2, SmolLM2 | 4-8K tokens | Research, edge deployment |
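When checking whether your examples fit a model's context window, a rough character-based estimate is usually enough for a first pass; use the target model's tokenizer for exact counts. A minimal sketch, using the Phi-4 context length from the table above:

```python
# Rough token estimate (~4 characters per token for English text) against a
# target context length. Heuristic only; use the model's tokenizer for
# exact counts.
def estimate_tokens(example: dict) -> int:
    text = " ".join(m["content"] for m in example["messages"])
    return len(text) // 4

CONTEXT_LENGTH = 16_384  # e.g. Phi-4 from the table above

example = {
    "messages": [
        {"role": "user", "content": "Explain gradient descent in one paragraph."},
        {"role": "assistant", "content": "Gradient descent iteratively updates parameters in the direction that reduces the loss."},
    ]
}
tokens = estimate_tokens(example)
print(f"~{tokens} tokens; fits in context: {tokens <= CONTEXT_LENGTH}")
```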
Create a `.env` file in the project root:
```env
# API Server
DEBUG=true
HOST=0.0.0.0
PORT=8000

# Database (default: SQLite, no setup needed)
DATABASE_URL=sqlite+aiosqlite:///./dataforge.db

# AI Providers (optional - for enhancement features)
OLLAMA_BASE_URL=http://localhost:11434
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key

# Frontend
NEXT_PUBLIC_API_URL=http://localhost:8000
```

To use AI enhancement features, configure at least one provider:
- Install Ollama from ollama.ai
- Run: `ollama pull llama3.2`
- DataForge will auto-detect Ollama
- Get API key from platform.openai.com
- Add to `.env`: `OPENAI_API_KEY=sk-...`
- Get API key from console.anthropic.com
- Add to `.env`: `ANTHROPIC_API_KEY=sk-ant-...`
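Outside of DataForge, a quick way to confirm a provider actually responds is a one-off call through LiteLLM (listed in the Acknowledgments as the project's multi-provider integration layer). This is an optional sanity check, not part of the setup; adjust the model name to whatever you pulled or configured:

```python
# Optional sanity check (run outside DataForge) that a provider responds.
# Uses LiteLLM; adjust the model name to match what you pulled or configured.
from litellm import completion

response = completion(
    model="ollama/llama3.2",  # or e.g. "gpt-4o-mini" with OPENAI_API_KEY set
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    api_base="http://localhost:11434",  # default Ollama endpoint
)
print(response.choices[0].message.content)
```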
If a port is already in use:

```bash
# Kill process on port 3000
npx kill-port 3000

# Kill process on port 8000
npx kill-port 8000
```

If Python dependencies are out of sync:

```bash
cd apps/api
uv sync --refresh
```

If Node modules are broken:

```bash
rm -rf node_modules
pnpm install
```

To reset the database:

```bash
# Delete database and restart (data will be lost)
rm dataforge.db dataforge_analytics.duckdb
# Restart the application
```

If AI enhancement isn't working:

- Check your AI provider is configured in `.env`
- Verify the provider is running (e.g., `ollama list`)
- Check the System Logs tab for error messages
- Check the System Logs tab in the app for detailed error messages
- Open an issue on GitHub
- Review the API Documentation for technical details
```
dataforge/
├── apps/
│ ├── web/ # Web interface (Next.js)
│ │ ├── src/
│ │ │ ├── app/ # Pages and routes
│ │ │ ├── components/ # UI components
│ │ │ └── lib/ # Utilities and hooks
│ │ └── package.json
│ │
│ └── api/ # Backend API (FastAPI)
│ ├── src/dataforge_api/
│ │ ├── main.py # Application entry
│ │ ├── routers/ # API endpoints
│ │ └── database.py # Data storage
│ └── tests/ # API tests
│
├── packages/
│ └── core/ # Data processing library
│ ├── src/dataforge_core/
│ │ ├── detection.py # Format detection
│ │ ├── importers.py # File importers
│ │ ├── formatters.py # Data formatting
│ │ ├── exporters.py # Export generation
│ │ ├── quality.py # Quality validation
│ │ └── models.py # Model registry
│ └── tests/
│
├── docker/ # Docker configuration
├── docker-compose.yml # Container orchestration
└── README.md # This file
```
Full API documentation is available at http://localhost:8000/docs when running the application.
Create a Project:
```bash
curl -X POST http://localhost:8000/api/v1/projects \
  -H "Content-Type: application/json" \
  -d '{"name": "My Dataset", "description": "Training data for my bot"}'
```

Add Examples:
```bash
curl -X POST http://localhost:8000/api/v1/datasets/{project_id}/examples \
  -H "Content-Type: application/json" \
  -d '{
    "examples": [{
      "messages": [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi! How can I help you?"}
      ]
    }]
  }'
```

Export Dataset:
```bash
curl -X POST http://localhost:8000/api/v1/export/{project_id} \
  -H "Content-Type: application/json" \
  -d '{"format": "axolotl"}' \
  --output dataset.zip
```

We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes
- Test your changes: `pnpm test`
- Commit: `git commit -m 'Add amazing feature'`
- Push: `git push origin feature/amazing-feature`
- Open a Pull Request
- Run `pnpm lint` before committing
- Add tests for new features
- Update documentation for user-facing changes
MIT License — see LICENSE for details.
Built with:
- FastAPI — Python web framework
- Next.js — React framework
- shadcn/ui — UI components
- Polars — Fast data processing
- DuckDB — Analytical database
- LiteLLM — Multi-provider AI integration
Made with care for the AI fine-tuning community