- Introduction
- Schema
- The Pipeline
- The Data Collection Pipeline
- The Feature Pipeline: Dataset Generation
- The Training Pipeline: Supervised Fine-Tuning (SFT)
- The Preference Alignment Pipeline (DPO)
- The Evaluation Pipeline
- The Inference Pipeline
- The Monitoring Flow
- Accessing the Services
- Cloudflare Tunnel Deployment
- Tracing with LangSmith
- Business Feedback Events
- Inspecting Metrics and Feedback Data
- GitHub Secrets
- Trigger Deployment
- Environment and Tools
- Data and Model Registry
- Hugging Face Artifacts
- Tasks and Invoke
- Quick Run (End-to-End)
- Special Notes
- Pending work
This project is not meant to be a perfect reference for one single thing, but a practical end-to-end walkthrough of many things at once: turning text into emojis while covering the full ML lifecycle.
We focus on the following capabilities:
- Data lifecycle
- Collect, generate, ingest, clean, and validate data.
- Apply quality controls such as filtering, deduplication, and consistency checks.
- Training and alignment
- Train models with Supervised Fine-Tuning (SFT).
- Explore preference alignment techniques such as Direct Preference Optimization (DPO).
- Evaluation and analysis
- Evaluate models with relevant metrics.
- Analyze outcomes to compare approaches and identify trade-offs.
- Inference and serving
- Deploy and consume the application.
- Optimize inference for CPU-efficient production usage.
- Observability and user feedback
- Track core runtime signals such as inference time, latency, and throughput.
- Collect user feedback events to support continuous improvement.
- MLOps and delivery
- Version, track, and share datasets and models.
- Automate deployment, evaluation, reporting, and training workflows.
Repository layout (high-level):
.
├── pyproject.toml <- Project dependencies and tool config
├── Dockerfile <- API container image build
├── docker-compose.yml <- Local stack (API, monitoring, services)
├── .github <- CI/CD workflows
│
├── app
│ ├── settings.py <- Central runtime configuration
│ │
│ ├── pipelines <- ZenML pipeline entrypoints
│ │ ├── data_collection.py <- ETL pipeline (HF, Unsloth, OpenAI)
│ │ ├── generate_dataset.py <- Dataset creation + publication pipeline
│ │ ├── sft.py <- Supervised Fine-Tuning pipeline
│ │ ├── dpo.py <- Preference alignment (DPO) pipeline
│ │ └── evaluation.py <- Evaluation pipeline
│ │
│ ├── steps <- Reusable pipeline building blocks
│ │ ├── etl <- Collection, normalization, ingestion steps
│ │ ├── dataset <- Dataset split/readme/publish steps
│ │ ├── sft <- SFT load/prepare/train/save/model-card steps
│ │ ├── dpo <- DPO dataset generation + training steps
│ │ └── evaluation <- Evaluators, datasets, and scoring logic
│ │
│ ├── tasks <- `invoke` commands to run workflows
│ │ ├── data_collection.py
│ │ ├── generate_dataset.py
│ │ ├── sft.py
│ │ ├── dpo.py
│ │ ├── evaluation.py
│ │ └── serve.py
│ │
│ ├── configs <- YAML configs injected into pipelines/tasks
│ │ ├── etl_hf.yaml
│ │ ├── etl_unsloth.yaml
│ │ ├── etl_openai.yaml
│ │ ├── generate_dataset.yaml
│ │ ├── sft.yaml
│ │ ├── dpo_generate_batch.yaml
│ │ ├── dpo_collect_batch.yaml
│ │ ├── dpo_generate_dataset.yaml
│ │ ├── dpo_train.yaml
│ │ ├── evaluation.yaml
│ │ └── agent_query.yaml
│ │
│ ├── api <- FastAPI serving layer
│ │ ├── main.py <- API bootstrap
│ │ ├── schemas.py <- Request/response models
│ │ └── routes/emoji.py <- Emoji inference + feedback endpoints
│ │
│ ├── agent <- Inference agent graph and orchestration
│ │ ├── factory.py
│ │ ├── graph.py
│ │ ├── nodes.py
│ │ ├── state.py
│ │ └── callbacks.py
│ │
│ ├── observability <- Metrics and business feedback handling
│ │ ├── metrics.py
│ │ └── feedback_store.py
│ │
│ ├── infrastructure
│ │ └── db/mongo.py <- MongoDB integration
│ │
│ ├── domain <- Core domain models/exceptions
│ │ ├── data_jobs.py
│ │ ├── documents.py
│ │ ├── nosql.py
│ │ ├── dpo.py
│ │ └── exceptions.py
│ │
│ ├── materializers
│ │ └── peft_model.py <- Custom ZenML model materializer
│ │
│ ├── utils <- Shared utilities (training/inference/gguf)
│ │ ├── training.py
│ │ ├── inference.py
│ │ ├── gguf.py
│ │ ├── prompts.py
│ │ ├── emojis.py
│ │ ├── config_loader.py
│ │ └── serialization.py
│ │
│ └── notebooks <- Experiments and local validation notebooks
│
└── misc <- Plots/assets used in docs and analysis
The data collection pipeline integrates examples from multiple sources, including public datasets and synthetically generated data. In practice, this layer serves as the project’s data warehouse.
We implement an Extract, Transform, Load (ETL) pipeline that:
- Extracts (and synthetically generates) data from multiple sources. We parse datasets from Hugging Face, generate synthetic samples with open-source LLMs via Unsloth, and generate additional synthetic data with proprietary models such as OpenAI.
- Transforms the data by cleaning and standardizing it into a consistent schema suitable for storage and downstream analysis.
- Loads the transformed data into a warehouse/database. We use MongoDB as the project’s NoSQL data warehouse.
- Pipeline definition (ZenML): `app/pipelines/data_collection.py`
  - Implements a single ZenML pipeline `data_collection(source: str)` with dynamic routing based on `source`:
    - `huggingface` → download from HF Hub
    - `unsloth` → local synthetic generation using an open model
    - `openai` → synthetic generation via OpenAI API
- How it’s executed
  - We run the same ZenML pipeline with different ZenML YAML configs via `invoke` tasks in `app/tasks/data_collection.py`:
    - `uv run invoke data-collection.run-hf` → uses `app/configs/etl_hf.yaml`
    - `uv run invoke data-collection.run-unsloth` → uses `app/configs/etl_unsloth.yaml`
    - `uv run invoke data-collection.run-openai` → uses `app/configs/etl_openai.yaml`
  - Each task calls `data_collection.with_options(run_name=..., config_path=...)()` so ZenML injects step parameters from the YAML (see the sketch below).
- What it implements (and where)
  - Job tracking + lineage: `app/steps/etl/common/create_data_job.py`
    - Creates a `DataDownloadJob` (HF) or `DataGenerationJob` (Unsloth/OpenAI).
    - Uses the ZenML pipeline run id as `job_id`, so every produced `Document` can be linked back to a specific pipeline run.
  - Data quality / cleaning: `app/steps/etl/common/clean_normalize.py`
    - Filters out documents with empty `text`/`emojification`.
    - Filters out rows where `emojification` is not “emoji-only” (via `app/utils/emojis.py`).
  - Deduplication at ingestion: `app/steps/etl/common/save_to_db.py`
    - Filters out incoming rows whose `text` already exists in MongoDB (prevents duplicates across runs/sources).
  - Job status finalization: `app/steps/etl/common/update_job_status.py`
    - Marks the job as `COMPLETED`/`FAILED` and records `documents_added` and any error message.
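For orientation, here is a minimal sketch of how such an `invoke` task could wire a YAML config into the ZenML pipeline. The real task bodies live in `app/tasks/data_collection.py`; the run-name format below is illustrative.

```python
# Hypothetical sketch of an invoke task that runs the ZenML pipeline with one config.
from datetime import datetime

from invoke import task

from app.pipelines.data_collection import data_collection


@task
def run_hf(ctx):
    """Run data collection for the Hugging Face source using etl_hf.yaml."""
    run_name = f"data_collection_hf_{datetime.now():%Y%m%d_%H%M%S}"
    data_collection.with_options(
        run_name=run_name,
        config_path="app/configs/etl_hf.yaml",  # ZenML injects step parameters from this YAML
    )(source="huggingface")
```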
The feature pipeline ingests raw documents, processes them, and produces the features/targets consumed by training and inference workflows. Instead of sending these artifacts directly to the model stack, we persist them in a feature-store-like layer so they can be versioned, tracked, and shared reliably across runs.
Estimating the right number of samples is context-dependent and rarely straightforward. For very large models (70B+ params), a relatively small set of high-quality examples can be enough (for example, ~1k in LIMA). That does not usually transfer to smaller models (for example ~7B), which often need more data just to internalize the expected chat template, especially when starting from base checkpoints. A practical approach is to bootstrap from related open-source datasets and adapt them for your fine-tuning objective.
- General-purpose models: Because they must generalize across many topics and intents, they usually need much broader coverage. In practice, strong general-purpose instruct tuning often starts around 1M instruction samples.
- Task-specific models: These are optimized for one objective, so focused corpora are often enough; depending on task difficulty, useful dataset sizes can range from ~100 to ~100k samples.
- Domain-specific models: These adapt the model to a field’s concepts and vocabulary. Data needs depend on domain breadth/complexity and on how well that domain is represented in pre-training.
Data curation usually combines repurposed examples from existing datasets with newly generated samples. In broad or specialized domains, curation is harder and often needs domain experts to source and validate relevant material (papers, technical docs, and other domain-native text).
Rule-based filtering enforces quality through explicit deterministic checks. It is fast, scalable, consistent across samples, and transparent to audit, which reduces manual review load. The trade-off is that poorly designed rules can introduce or amplify dataset bias.
- Length filtering: Enforce minimum/maximum response lengths to remove under-informative short outputs and overly long, noisy outputs. Good thresholds are task- and domain-dependent.
- Keyword exclusion: Filter samples containing terms associated with low-quality, unsafe, spammy, or off-topic content, including domain-specific indicators of irrelevance.
- Format checking: Validate structural compliance with the expected schema/format, especially for code, JSON, or other strongly structured outputs.
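As a concrete illustration, here is a minimal sketch of such rule-based filters for this project's records, assuming each sample is a dict with `text` and `emojification` fields; the thresholds, banned terms, and emoji check below are illustrative, not the repo's exact rules.

```python
# Illustrative rule-based filters: length filtering, keyword exclusion, format checking.
import emoji  # third-party "emoji" package, used here as an example emoji detector

BANNED_TERMS = {"lorem ipsum", "click here", "subscribe"}  # example keyword exclusions


def passes_rules(sample: dict, min_len: int = 3, max_len: int = 200) -> bool:
    text = sample.get("text", "")
    emojification = sample.get("emojification", "")
    if not (min_len <= len(text) <= max_len):               # length filtering
        return False
    if any(term in text.lower() for term in BANNED_TERMS):  # keyword exclusion
        return False
    stripped = emojification.replace(" ", "")
    return bool(stripped) and emoji.purely_emoji(stripped)  # format check: emoji-only target


samples = [
    {"text": "A beautiful starry night sky", "emojification": "🌌✨🌠"},
    {"text": "click here!!!", "emojification": "🌌 plus text"},
]
filtered = [s for s in samples if passes_rules(s)]  # keeps only the first sample
```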
Data deduplication is foundational. Duplicates and near-duplicates can cause overfitting (memorization over generalization), biased performance (over-represented patterns), inefficient training (wasted compute), and inflated evaluation metrics (leakage-like overlap effects). Common approaches include:
- Exact deduplication: Removes byte/string-identical samples after normalization, usually by hashing each sample (for example with MD5 or SHA-256) and dropping repeated hashes.
- Fuzzy deduplication: Targets near-duplicates rather than exact string matches. A common method is MinHash, where compact signatures are compared with metrics like Jaccard similarity.
- Semantic similarity: Deduplicates by meaning instead of lexical overlap. Samples are embedded with methods/models such as FastText, BERT, or sentence transformers, and then compared in vector space. At larger scale, clustering approaches can group similar vectors so one representative is kept and the rest are marked as duplicates.
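For exact deduplication, a small self-contained sketch of the hashing approach described above (the normalization step is an assumption, not this repo's exact rule):

```python
# Exact deduplication: hash a normalized form of each text and drop repeated hashes.
import hashlib


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def exact_dedup(texts: list[str]) -> list[str]:
    seen, unique = set(), []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


print(exact_dedup(["A starry night", "a  STARRY night", "A rainy day"]))
# -> ['A starry night', 'A rainy day']
```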
A practical decontamination strategy is to include your evaluation set during the deduplication phase and remove any overlapping samples from the instruction dataset. You should also filter samples likely derived from the same sources as the evaluation data.
Data quality evaluation is critical. Traditional human annotation is usually high quality but expensive and slow at scale, so teams commonly combine it with automated scoring approaches:
- LLM-as-a-judge: Prompts one or more LLMs to score sample quality using ratings, custom rubrics, or pairwise comparisons. Known biases include position bias (mitigate by randomizing order), verbosity bias (mitigate with length controls), and same-model favoritism (mitigate with model diversity). In practice, a jury of models tends to improve consistency.
- Reward models: Models that score an instruction-response pair. A common setup is a decoder-only backbone (for example Gemma/Llama) with a linear scoring head, optionally producing multi-dimensional scores such as helpfulness, correctness, or verbosity; see reward models.
- Classifiers or encoder-only models: Add a classification head on top of an embedding/encoder model to predict quality classes or labels. Encoder-only architectures are typically smaller and well-suited for this classification workload.
Data exploration is how we build intuition for the training corpus before committing to training decisions. It combines manual inspection and automated analysis.
- Manual dataset exploration: Although time-intensive, direct sample review is still the best way to catch formatting defects, annotation mistakes, incoherent reasoning, and factual issues.
- Statistical analysis: Complements manual review by quantifying vocabulary diversity, potential bias, and concept coverage. Tooling such as NLTK or spaCy helps tokenize and profile large corpora, surfacing composition patterns and cultural/contextual skews that can affect model behavior.
- Topic clustering: Groups semantically related texts to reveal themes and blind spots in coverage. Hugging Face's text-clustering offers a practical pipeline: embed text, reduce dimensionality with UMAP, cluster with DBSCAN, auto-label clusters via an LLM, and visualize the results.
Data generation becomes necessary when available instruction data is insufficient, especially for underrepresented slices. LLM-based synthetic generation is an efficient, scalable way to expand coverage, and with solid prompt design it can produce high-quality data at a scale that manual authoring cannot match.
Synthetic pipelines usually start from a curated prompt set (often called a taxonomy) designed to induce diverse examples. Prompts typically include explicit instructions, examples, and constraints so outputs match target format and intent. Mature pipelines add multi-stage quality controls (validation and rule checks) for accuracy and relevance. You can also steer generation attributes such as instruction complexity, response length, writing style/tone, use of structured generation, and topic/domain focus. Since synthetic outputs can inherit model bias and model errors, mitigation commonly includes human review, prompt diversification, and extra filtering stages.
Data augmentation aims to increase both dataset size and sample quality. A classic approach is Evol-Instruct, where LLMs iteratively evolve simpler prompts into stronger instructions. It uses two complementary strategies:
- In-depth evolving: Increases the complexity of existing instructions.
  - Constraints: Add extra requirements or limitations so the task becomes harder and more specific.
  - Deepening: Reformulate prompts into deeper questions that require more complete, higher-effort answers.
  - Concretizing: Replace abstract concepts with specific, detailed variants to reduce ambiguity.
  - Reasoning steps: Explicitly require multi-step reasoning to elicit more structured problem solving.
  - Complicating input: Introduce more complex input modalities or structures (for example XML, JSON, or code snippets).
- In-breadth evolving: Expands instruction diversity by generating new prompts inspired by existing ones, with emphasis on rarer and long-tail examples within the same domain.
- Pipeline definition (ZenML): `app/pipelines/generate_dataset.py`
- How it’s executed
  - `uv run invoke generate-dataset.run` (task: `app/tasks/generate_dataset.py`)
  - Uses ZenML YAML config: `app/configs/generate_dataset.yaml`
- What it implements (and where)
  - Dataset “data quality gate” at export time: `app/steps/dataset/load_documents.py`
    - Drops documents marked `deprecated`.
    - Drops documents whose job is deprecated or whose job status is not `COMPLETED`.
    - This ensures the published HF dataset is sourced only from “successful” pipeline runs.
  - Split + canonical columns: `app/steps/dataset/create_train_test_split.py` (see the sketch below)
    - Shuffles with a fixed seed, then creates `train`/`test`.
    - Converts to HF `DatasetDict` with columns: `text`, `emojification`.
  - Dataset card generation: `app/steps/dataset/generate_readme.py`
    - Builds the dataset card content used as `README.md` in the dataset repo.
  - Publishing: `app/steps/dataset/push_to_huggingface.py`
    - Pushes the dataset to HF Hub using `settings.HUGGINGFACE_ACCESS_TOKEN`.
    - Uploads the generated dataset card as `README.md` in the dataset repo.
- Main config options (`app/configs/generate_dataset.yaml`)
  - `parameters.dataset_id`: HF dataset repo id to publish to (e.g. `username/name`).
  - `create_train_test_split.parameters.test_split_size`: fraction used for `test`.
  - `create_train_test_split.parameters.shuffle_seed`: ensures deterministic splits.
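For illustration, here is a minimal sketch of the split step, assuming documents arrive as dicts with `text` and `emojification`; the actual logic lives in `app/steps/dataset/create_train_test_split.py` and may differ in details.

```python
# Deterministic shuffle + train/test split into a Hugging Face DatasetDict.
import random

from datasets import Dataset, DatasetDict


def create_train_test_split(documents, test_split_size=0.1, shuffle_seed=42):
    rows = [{"text": d["text"], "emojification": d["emojification"]} for d in documents]
    random.Random(shuffle_seed).shuffle(rows)          # fixed seed -> reproducible splits
    n_test = max(1, int(len(rows) * test_split_size))
    return DatasetDict({
        "train": Dataset.from_list(rows[n_test:]),
        "test": Dataset.from_list(rows[:n_test]),
    })
```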
The training pipeline consumes the prepared features/labels and produces a trained model. This model is then stored in a model registry so it can be versioned, tracked, and shared with the inference stack.
It is recommended to keep a manual approval step before promoting a new model to production.
In this stage, we apply Supervised Fine-Tuning (SFT). SFT refines model behavior using instruction-response pairs produced by previous pipelines. The goal is to teach the model the expected conversational format and adapt a general base to perform well on targeted tasks or specific domains.
During fine-tuning, we can optimize either on the full prompt-response text or on responses only (see train_on_responses_only in train_model.py). Instruction datasets also depend on template conventions. For example, Alpaca-style records may include input and system fields, which can be treated as instruction context: input carries task-specific data, while system acts as a control prompt for desired behavior.
`train_on_responses_only` does not make the `system` prompt useless. The model still attends to the `system` + `user` text as context when predicting the assistant tokens; only the loss is masked so optimization focuses on the assistant response. This means we do not penalize the model for reproducing prompt/template text, which often leads to cleaner output-focused optimization.
- System: Translate this text to emoji:
- Instruction: A beautiful starry night sky
- Output: 🌌✨🌠
The instruction field contains the task input (here, the text to translate). The output field is the target response; it is not necessarily the only valid answer, but it should be a high-quality one. When curating instruction data, the dataset should reflect real production usage. We can then filter examples using quality dimensions such as:
- Accuracy: Responses should be factually correct and aligned with the instruction intent.
- Diversity: Data should cover the range of expected user requests across topics, contexts, lengths, and writing styles.
- Complexity: Include challenging and multi-step tasks so the model learns beyond trivial patterns.
In many cases, it is better to start with prompt engineering before investing in SFT. Techniques like few-shot prompting or retrieval-augmented generation (RAG) can solve many problems with lower cost and risk, while also helping establish a solid evaluation baseline. Fine-tuning should still be approached carefully: studies show that injecting new knowledge via SFT can increase hallucinations, and some approaches can erase previously learned capabilities (catastrophic forgetting).
Instruction datasets use structured schemas to organize prompts and responses. In practice, each sample is typically a JSON/Python dictionary with fields such as system, instruction, input, output, or conversation turns.
| Name | JSONL format |
|---|---|
| Alpaca | `{"instruction": "...", "input": "...", "output": "..."}` or, without input, `{"instruction": "...", "output": "..."}` |
| ShareGPT | `{"conversations": [{"from": "...", "value": "..."}, ...]}` |
| OpenAI | `{"conversations": [{"role": "...", "content": "..."}, ...]}` |
| OASST | `{"INSTRUCTION": "...", "RESPONSE": "..."}` |
| Raw text | `{"text": "..."}` |
Raw text format is mainly used when doing continual pre-training rather than instruction tuning.
Alpaca is enough for single-turn instruction-response data (one prompt, one answer). For multi-turn conversations, formats like ShareGPT or OpenAI-style chat records are generally a better fit.
After parsing dataset records, we format them with a chat template. Chat templates provide a consistent way to serialize messages for the model.
Templates also include special tokens to mark message boundaries and speaker roles (system, user, assistant, tool, etc.). Since base models are not inherently instruction-formatted, you can choose the template during fine-tuning. For already instruction-tuned models, you should usually keep the original template; changing it can degrade performance.
As with dataset schemas, multiple chat template families exist (ChatML, Llama, Mistral, Gemma, etc.). In open-source workflows, ChatML is common: it uses <|im_start|> and <|im_end|> to mark each turn.
<|im_start|>system
Translate this text to emoji:<|im_end|>
<|im_start|>user
A beautiful starry night sky<|im_end|>
<|im_start|>assistant
🌌✨🌠<|im_end|>
At inference time, the target answer is not provided. We pass only the system and user turns and append the assistant prefix (for example <|im_start|>assistant\n) to trigger generation. Because the model was trained on this template, it learns to continue with an answer that matches both user intent and system guidance.
A common failure mode comes from whitespace and line breaks, which are significant to the tokenizer. Even tiny formatting changes can alter tokenization and hurt quality. For this reason, robust template systems like Jinja are recommended.
| Name | Jinja template (rendered example) |
|---|---|
| Alpaca | `### Instruction: What is the capital of France?`<br>`### Response: The capital of France is Paris.<EOS>` |
| ChatML | `<\|im_start\|>user`<br>`What is the capital of France?<\|im_end\|>`<br>`<\|im_start\|>assistant`<br>`The capital of France is Paris.<\|im_end\|>` |
| Gemma | `<bos><start_of_turn>user`<br>`What is the capital of France?<end_of_turn>`<br>`<start_of_turn>model`<br>`The capital of France is Paris.<end_of_turn>` |
Jinja supports loops and conditionals, enabling one template definition for both training and inference (via add_generation_prompt).
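To make the training/inference difference concrete, here is a small example using the `transformers` chat-template API; the checkpoint name is just an arbitrary chat-templated model for illustration.

```python
# Same chat template, two renderings: full conversation for training,
# prompt + assistant prefix (add_generation_prompt=True) for inference.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

messages = [
    {"role": "system", "content": "Translate this text to emoji:"},
    {"role": "user", "content": "A beautiful starry night sky"},
]

train_text = tokenizer.apply_chat_template(
    messages + [{"role": "assistant", "content": "🌌✨🌠"}],
    tokenize=False,
    add_generation_prompt=False,   # training text includes the target answer
)

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,    # appends the assistant prefix to trigger generation
)
print(prompt_text)
```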
Although the literature is broad, practical SFT workflows usually center on three approaches:
- Full fine-tuning: Updates all base-model parameters. It can deliver strong quality but requires substantial compute/memory and is destructive because it overwrites pre-trained weights.
- LoRA: Fine-tunes efficiently by adding trainable low-rank adapters while keeping base weights frozen. This drastically reduces trainable parameters, improves memory/runtime efficiency, and often reaches quality close to (or occasionally better than) full fine-tuning. Adapter sets can also be swapped per task/domain without retraining the full model.
- QLoRA: Combines LoRA with quantized base weights (typically 4-bit NF4), enabling fine-tuning on smaller GPUs. It keeps LoRA’s adapter mechanism but trades extra compute time for lower memory usage; in many settings quality remains close to LoRA.
When fine-tuning LLMs, a small set of hyperparameters drives convergence, stability, and generalization:
- Learning rate and scheduler: Usually the highest-impact knob. Too low leads to slow/underpowered learning; too high causes instability or divergence. Schedulers typically warm up/decay LR to balance early progress and late-stage refinement.
- Batch size: Controls samples per optimizer step. Larger effective batches stabilize gradients and can speed training. When memory is limited, gradient accumulation approximates larger batches by accumulating gradients across multiple mini-batches before an update.
- Maximum length: Upper bound on sequence length, constrained by task needs and GPU memory. Longer inputs are truncated (left or right, depending on strategy).
- Packing: Improves token efficiency by concatenating short samples into longer packed sequences (for example, ~5x 200-token samples in a 1k-token slot). Correct attention masking is required to prevent cross-sample attention leakage.
- Number of epochs: Full passes over the dataset (often ~1-10 for LLM fine-tuning). Too few can underfit; too many can overfit. Validation monitoring plus early stopping helps pick the right stopping point.
- Optimizers: Update parameters to minimize loss. AdamW is a common and reliable default for SFT.
- Weight decay: Regularizes by penalizing large weights, often improving generalization. Values that are too high can suppress learning; too low may under-regularize. With AdamW, `0.01` is a common baseline.
- Gradient checkpointing: Saves memory by storing fewer forward activations and recomputing them during backpropagation. This trades extra compute time for lower memory usage.
Several specialized stacks can run SFT, including TRL, Axolotl, and Unsloth.
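For orientation, the knobs above map roughly onto a TRL `SFTConfig` as sketched below; the values are illustrative (not this repo's settings), and some argument names vary slightly across TRL versions.

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="outputs/sft",
    learning_rate=2e-4,                # usually the highest-impact knob
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch = 4 x 4 = 16
    max_seq_length=1024,               # longer inputs are truncated
    packing=True,                      # concatenate short samples into packed sequences
    num_train_epochs=3,
    optim="adamw_torch",
    weight_decay=0.01,
    gradient_checkpointing=True,       # trade extra compute for lower memory
)
```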
When selecting a base model for fine-tuning, key factors include:
- License: Some checkpoints are restricted to non-commercial use.
- Budget: Smaller models are generally cheaper to train and serve.
- Performance: Benchmark base models on relevant tasks, ideally domain/use-case-specific rather than generic only.
During training, three monitoring signals are especially important:
- Training loss: Tracks fit on the training objective. It should trend downward overall (typically sharp early drop, then slower plateau). Repeated spikes or upward drift can indicate instability from data quality, tokenizer issues, or hyperparameter mismatch (for example LR/batch size).
- Validation loss: Measures generalization on held-out data. Healthy runs usually show both training and validation loss decreasing and then stabilizing with a small train/val gap. Falling training loss with rising validation loss indicates overfitting; both staying high/flat suggests underfitting.
- Gradient norm: Measures update magnitude. Very large norms can signal instability, especially when paired with train/val divergence. Stable or decreasing norms generally indicate convergence; gradient clipping helps cap unstable updates.
SFT training dashboard example (loss and optimization signals over steps).
SFT qualitative predictions sample table.
- Pipeline definition (ZenML): `app/pipelines/sft.py`
- How it’s executed
  - `uv run invoke sft.train` (task: `app/tasks/sft.py`)
  - Uses ZenML YAML config: `app/configs/sft.yaml`
- What it implements (and where)
  - Load dataset from HF Hub: `app/steps/sft/load_dataset.py`
    - Loads `train`/`test` splits from `dataset_id`.
  - Initialize base model + PEFT/LoRA + chat template: `app/steps/sft/initialize_model.py`
    - Loads the model with optional 4-bit/8-bit quantization.
    - Applies LoRA (target modules, rank `lora_r`, etc.).
    - Crucial Step: Chat Template Application. Because base models (like Gemma 3) are not instruction-tuned, we must use `get_chat_template` to inject a formatting structure (e.g., `gemma3`) so the model learns how to recognize system, user, and assistant turns.
  - Prepare training text exactly as the model sees it: `app/steps/sft/prepare_dataset.py`
    - Builds structured `messages` (`system`, `user`, `assistant`).
    - Uses `tokenizer.apply_chat_template(..., add_generation_prompt=False)` and strips `<bos>` to avoid duplicated BOS issues.
  - Training: `app/steps/sft/train_model.py` (a sketch follows below)
    - Uses TRL `SFTTrainer` and then applies `train_on_responses_only(...)` so the loss is computed only on the assistant turn.
  - Model card generation: `app/steps/sft/generate_model_card.py`
    - Builds a training-aware model card used when publishing artifacts.
  - Model persistence and serving-friendly exports: `app/steps/sft/save_model.py`
    - Supports saving:
      - `lora_only` (small, requires the base model at inference time)
      - `merged_16bit` (LoRA merged into base weights; convenient for serving frameworks like vLLM)
      - `merged_4bit_forced` (merged + quantized)
- Special implementations
  - Custom ZenML materializer for PEFT models: `app/materializers/peft_model.py`
    - ZenML artifacts normally rely on pickling; PEFT/Accelerate mixed-precision objects can be problematic to pickle.
    - `PeftModelMaterializer` saves models via `save_pretrained()` and restores them via `FastLanguageModel.from_pretrained(...)`, plus writes a `metadata.json` for reconstruction (quantization + LoRA details).
    - This is wired into training via `@step(output_materializers={"trained_model": PeftModelMaterializer})` in `app/steps/sft/train_model.py`.
  - More realistic eval metrics by simulating inference (instead of relying only on teacher-forced loss):
    - Implemented in `app/utils/training.py` as `GenerateEvalCallback`.
    - During `on_evaluate`, it:
      - Crops the eval batch to the generation prompt boundary (`process_batch_for_generation(...)`),
      - Runs `model.generate(...)`,
      - Extracts only the assistant response (`app/utils/inference.py:extract_response`),
      - Computes emoji-only accuracy (via `app/utils/emojis.py:is_full_emoji_text`).
    - Metrics are injected into the trainer logs as `eval_emoji_only_accuracy`, `eval_num_emoji_only_preds`, `eval_total_preds`.
    - `app/steps/sft/train_model.py` additionally logs these metrics + a qualitative examples table to Weights & Biases when a run is active.
- Main config options (`app/configs/sft.yaml`)
  - `steps.load_sft_dataset.parameters.dataset_id`: HF dataset id to train on.
  - `steps.initialize_model.parameters.*`:
    - `model_name`, `max_seq_length`
    - `load_in_4bit` / `load_in_8bit`
    - LoRA: `lora_r`, `lora_alpha`, `lora_dropout`, `target_modules`
    - `chat_template`: tokenizer formatting (must match the base model’s expected template).
  - `steps.prepare_dataset.parameters.system_prompt_file`: system prompt used as the first message.
  - `steps.train_model.parameters.*`:
    - `instruction_part` / `response_part`: delimiters used by `train_on_responses_only` and by the generation-based evaluation callback.
    - Trainer knobs: batch size, grad accumulation, LR/scheduler, `eval_steps`, etc.
  - `steps.generate_model_card.parameters.*`: metadata used to build the model card (dataset/model/training fields, language/license, `hf_repo_id`).
  - `steps.save_model.parameters.*`:
    - `hf_repo_id` and `hf_token`: optional push to the HF model hub
    - `save_method`: `lora_only` vs merged formats (important for downstream serving).
    - `export_gguf`, `gguf_quantization_method`, `push_gguf_to_hub`: optional GGUF export + publishing settings.
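To make the responses-only loss concrete, here is a hedged sketch of how the trainer and `train_on_responses_only` fit together. The delimiters are Gemma-style examples; the repo's actual values come from `app/configs/sft.yaml`, `model`/`tokenizer`/datasets are assumed to exist from earlier steps, and argument names differ slightly across TRL versions.

```python
from trl import SFTConfig, SFTTrainer
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    args=SFTConfig(output_dir="outputs/sft", num_train_epochs=3),
)

# Mask everything before the assistant turn so only response tokens contribute to the loss.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
trainer.train()
```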
Supervised Fine-Tuning (SFT) is strong for imitation, but it can miss the subtle preference signals humans care about, especially across long-tail interactions. This is why modern post-SFT workflows add a preference-alignment stage.
Preference alignment augments training with direct human or model-based feedback. In this project, we focus on Direct Preference Optimization because it is practical and efficient.
Preference datasets are less standardized than instruction datasets because each alignment algorithm has different requirements. In general, preference data is a set of candidate responses for the same instruction, ranked by humans or judge models. For DPO, the format is simple: each prompt has one chosen response and one rejected response. The training objective pushes the model toward chosen behavior and away from rejected behavior. For example:
{
"instruction": "A beautiful starry night sky",
"chosen": "🌌✨🌠",
"rejected": "🦠🚀💤"
}

The rejected sample is as important as the chosen one. Rejected responses encode behaviors we explicitly want to suppress. Preference datasets are especially useful in settings like:
- Chatbots: Conversational quality depends on subjective factors like naturalness, engagement, and contextual fit. SFT alone may miss fine-grained response preferences.
- Content moderation: Policy decisions are nuanced; preference pairs help teach acceptable vs unacceptable responses more explicitly.
- Summarization: Multiple technically correct summaries can differ in conciseness, coherence, and usefulness.
- Code generation: There are often many valid implementations, but humans prefer solutions that are cleaner, safer, or more efficient.
- Creative writing: Style and emotional impact are highly subjective, making pairwise preference supervision particularly valuable.
- Translation: Multiple translations can be correct, but preference data helps optimize for fluency and native-speaker naturalness.
DPO datasets usually need fewer examples than instruction-tuning datasets. DPO is generally less destructive than SFT, so model behavior shifts are often more controlled. As with instruction data, required volume depends on model size and task complexity: larger models are usually more sample-efficient, while harder tasks need more preference pairs. Data quality remains critical, and larger high-quality preference sets are typically beneficial.
Task-specific alignment targets a narrow behavior (for example style transfer or refusal behavior), and can often be done with smaller datasets, roughly 100 to 10k preference pairs depending on difficulty. For very narrow constraints, even ~200-500 pairs can be enough (for example teaching a specific policy statement such as not claiming the model was trained by a specific provider).
The usual workflow is: generate multiple candidate answers, then rank them to form chosen/rejected pairs. There are several data-creation strategies, each with different quality/cost/scalability trade-offs:
- Human-generated, human-evaluated datasets: Highest control and often highest quality, especially for complex tasks, but expensive and hard to scale.
- Human-generated, LLM-evaluated datasets: Rare in practice because it still requires significant human generation effort while adding judge-model overhead.
- LLM-generated, human-evaluated datasets: A strong quality/efficiency balance. LLMs scale candidate generation, humans provide preference labels. This is common in production pipelines.
- LLM-generated, LLM-evaluated datasets: Fully synthetic and highly scalable, but requires careful prompt design and can propagate model bias/limitations.
Applications with active user traffic can collect preferences in-product (for example like/dislike or richer textual feedback). Another practical approach is to use a stronger model to generate likely chosen outputs and a weaker or intentionally constrained model to generate likely rejected outputs, which creates clearer supervision signals.
You can also compare model outputs against human-written references to identify style and quality gaps, then use those gaps to build preference pairs that steer tone and behavior.
When generating preference data, prompts should encourage diversity and complexity. Explicitly requesting different styles/approaches broadens the output distribution. Output variability can be increased through temperature and sampling strategy choices. Using multiple generator models can further improve diversity because different models have different strengths.
Preference evaluation can be done by human raters or automated with LLM judges. For LLM-based judging, you define explicit criteria, encode them in the judge prompt, and ask the judge to select preferred/rejected outputs. Evaluation quality depends strongly on both the judge model and rubric quality. As judge models improve, synthetic preference labeling quality can improve as well.
LLM-based evaluation can be implemented as absolute scoring or pairwise ranking. In absolute scoring, each response gets a score/label from a rubric; this is simple but can be inconsistent across prompts/sessions. In pairwise ranking, the judge compares two responses directly; this often matches human labeling patterns better and can be more stable.
LLM judges can still be biased:
- Position bias: In pairwise comparisons, the first answer may be favored disproportionately.
- Length bias: Longer responses may be overrated even when they are less useful.
- Family bias: Judges can prefer outputs from their own model family.
Mitigations include randomizing answer order (position bias), providing few-shot calibration examples with balanced scoring (length/family bias), and using multiple judge models as a jury instead of relying on a single judge.
We use LLM-generated and LLM-evaluated datasets for simplicity and efficiency, as this project is intended as a practical exercise rather than a production-ready system. For real-world applications, LLM-generated but human-evaluated datasets are generally recommended to reduce bias and evaluation leakage.
In our emojify setup, rejected responses are produced by a previous SFT version of our model, while preferred responses are generated by a stronger model (OpenAI’s suite). This introduces an implicit teacher–student distillation effect, encouraging the smaller Gemma model to learn preference patterns and stylistic behavior closer to a more capable model.
- Pipeline definitions (ZenML): `app/pipelines/dpo.py`
  - `dpo_generate_batch`: Create and submit an OpenAI batch job for preference data
  - `dpo_collect_batch`: Automatically find pending batch jobs and collect results
  - `dpo_generate_dataset`: Generate rejected responses (requires a running `serve.vllm` instance) and push the DPO dataset to HF
  - `dpo_train`: Train the model using DPO
- How it's executed
  - `uv run invoke dpo.generate-batch` → uses `app/configs/dpo_generate_batch.yaml`
  - `uv run invoke dpo.collect-batch` → uses `app/configs/dpo_collect_batch.yaml` (automatically finds pending jobs)
  - `uv run invoke dpo.generate-dataset` → uses `app/configs/dpo_generate_dataset.yaml`
  - `uv run invoke dpo.train` → uses `app/configs/dpo_train.yaml`
  - `uv run invoke dpo.list-batch-jobs` → list all DPO batch jobs from the database
- What it implements (and where)
  - Batch input generation: `app/steps/dpo/generate_batch/generate_batch_input.py`
    - Creates a JSONL file with topic-based prompts for the OpenAI batch API
    - Uses structured outputs for a consistent emoji generation format
  - Batch job submission: `app/steps/dpo/generate_batch/submit_batch_job.py`
    - Uploads the batch file and creates the OpenAI batch job
    - Tracks the job in MongoDB via the `DPOBatchJob` model
  - Batch results collection: `app/steps/dpo/collect_batch/collect_all_pending_batches.py`
    - Finds all pending batch jobs (non-terminal status) in the database
    - Polls batch status and downloads results when completed
    - Creates `DPODocument` records with chosen (preferred) responses
  - Rejected response generation: `app/steps/dpo/generate_dataset/generate_rejected_responses.py`
    - Calls a running vLLM server to generate rejected responses
    - Updates `DPODocument` records with the rejected field
  - DPO dataset creation: `app/steps/dpo/generate_dataset/create_dpo_dataset.py`
    - Creates train/test splits from complete DPO documents
    - Converts to the Hugging Face `DatasetDict` format (prompt, chosen, rejected)
  - Initialize model for DPO: `app/steps/dpo/train/initialize_model.py`
    - Loads the previously trained SFT model as the base.
    - Note on Chat Templates: Unlike SFT, we do not call `get_chat_template` here. The SFT model already has the template baked into its tokenizer. Re-applying it could lead to inconsistent tokenization or "broken" turn structures. We must maintain the exact structure the model learned during SFT to ensure the preference optimization remains aligned.
  - Dataset load + formatting for DPO loss: `app/steps/dpo/train/load_dataset.py`, `app/steps/dpo/train/prepare_dataset.py`
    - Loads `prompt`/`chosen`/`rejected` splits from HF and formats prompts with the system + chat template expected by the model.
  - DPO training: `app/steps/dpo/train/train_model.py`
    - Uses Unsloth's `PatchDPOTrainer` and TRL's `DPOTrainer` for efficient preference alignment.
  - Model card + persistence: `app/steps/dpo/train/generate_model_card.py`, `app/steps/dpo/train/save_model.py`
    - Generates training metadata and saves/pushes the aligned model.
  - Dataset publishing: `app/steps/dpo/generate_dataset/push_dpo_dataset.py`
DPO training dashboard example with preference-optimization metrics.
- Domain models: `app/domain/dpo.py`
  - `DPOBatchJob`: Tracks OpenAI batch job status and metadata
  - `DPODocument`: Stores preference pairs (text, chosen, rejected)
- Main config options
  - `dpo_generate_batch.yaml`:
    - `target_samples`: Number of preference samples to generate
    - `topics_path`: Path to topics.json for diverse prompts
  - `dpo_generate_dataset.yaml`:
    - `sft_model_path`: HF repo or path to the SFT model used for rejected responses
    - `vllm_url`: vLLM server URL (default: `http://localhost:8787`)
    - `dataset_id`: Target Hugging Face dataset ID for DPO data
  - `dpo_train.yaml`:
    - `steps.load_dpo_dataset.parameters.dataset_id`: HF DPO dataset id
    - `steps.train_model.parameters.*`: DPO knobs like `beta`, batch size, epochs, learning rate, and scheduler (see the sketch below)
    - `steps.save_model.parameters.*`: model hub target + save mode for deployment
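A hedged sketch of the DPO training step with Unsloth + TRL is shown below. The model and dataset ids, LoRA settings, and hyperparameters are illustrative (the real ones come from `app/configs/dpo_train.yaml`), and some argument names differ across TRL versions.

```python
from unsloth import FastLanguageModel, PatchDPOTrainer

PatchDPOTrainer()  # patch TRL's DPOTrainer with Unsloth's memory-efficient kernels

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "marioparreno/emojify-sft",    # start from the previously trained SFT model
    max_seq_length=1024,
    load_in_4bit=True,
)
# Add fresh LoRA adapters for DPO (skip if the checkpoint already ships adapters).
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("your-user/emojify-dpo-dataset", split="train")  # hypothetical id

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="outputs/dpo", beta=0.1, num_train_epochs=1),
    train_dataset=dataset,          # expects prompt / chosen / rejected columns
    processing_class=tokenizer,
)
trainer.train()
```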
The `generate_rejected_responses` step uses a vLLM server for inference. You must start the server first:
# Terminal 1: Start vLLM server with the SFT model
uv run invoke serve.vllm --model-name=marioparreno/emojify-sft
# Terminal 2: Run the pipeline (once server is ready)
uv run invoke dpo.generate-dataset

The pipeline checks that the vLLM server is healthy before proceeding. If it is not running, you'll see a clear error message with instructions.
Evaluation is critical for understanding how an LLM behaves in real usage, not just how it scores on isolated benchmarks. For domain- or task-specific systems, evaluation should be narrow enough to reflect production behavior, but broad enough to catch edge cases. In practice, this means evaluating the whole application path, including components around the model (for example retrievers, validators, and post-processing).
Compared with traditional ML, LLM evaluation is less purely numeric and more behavior-driven. We still track objective metrics, but we also need qualitative judgment for coherence, relevance, and contextual fit. Three practical differences are:
- Metrics are multi-objective: LLMs handle varied tasks, so no single scalar metric usually captures quality.
- Feature engineering shifts to prompt/system design: LLMs ingest raw text directly, so evaluation focuses more on prompt behavior and output quality than handcrafted features.
- Interpretability is indirect: We usually cannot inspect model internals in a straightforward way, so we rely on traces, explanations, and targeted test sets.
A robust evaluation strategy usually combines general benchmarks with domain- and task-specific suites. Good suites should be:
- Challenging: Include cases that separate strong outputs from weak ones.
- Diverse: Cover broad topics, tones, and difficulty levels.
- Operationally practical: Easy to run repeatedly in development and CI/CD.
When a task maps cleanly to classic supervised formats, custom benchmarks (including multiple-choice style tasks) are still useful. For open-ended generation, LLM-as-a-judge is often a better fit. If ground truth is available, passing it as context improves judge reliability; otherwise, scoring by explicit dimensions (for example relevance, toxicity, or style adherence) keeps evaluation interpretable.
Judge models have known weaknesses: verbosity bias, assertiveness bias, domain gaps, and scoring inconsistency. Mitigations include better rubric design, multiple judges, and hybrid evaluation (judge scores + deterministic checks + task-specific metrics).
To improve parsing and reproducibility, require structured outputs in evaluation prompts (or use structured generation).
We define two different evaluation pipelines in ZenML, allowing for both full systematic assessments and quick developer feedback.
- Prerequisite (required): start an OpenAI-compatible LLM endpoint first
  - The evaluation agent uses `EVAL_BASE_URL` and `EVAL_MODEL` from settings/env.
  - If no server is running at `EVAL_BASE_URL`, you'll see `Agent query failed: Connection error`.
  - Quick setup with llama.cpp (DPO model):

        docker run --rm -it \
          -p 8080:8080 \
          ghcr.io/ggml-org/llama.cpp:server \
          --hf-repo marioparreno/emojify-dpo \
          --hf-file emojify-dpo.Q4_K_M.gguf \
          --host 0.0.0.0 --port 8080 \
          -t 2

  - Then configure env (for local machine):

        EVAL_BASE_URL=http://127.0.0.1:8080/v1
        EVAL_MODEL=marioparreno_emojify-dpo_emojify-dpo.Q4_K_M.gguf

  - Optional quick health test:

        curl -s http://127.0.0.1:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
            "model": "marioparreno_emojify-dpo_emojify-dpo.Q4_K_M.gguf",
            "messages": [{"role": "user", "content": "A beautiful starry night sky"}],
            "temperature": 0.2,
            "max_tokens": 16
          }'

  - After that, run: `uv run invoke evaluation.run-agent-query --query="A beautiful starry night sky"`
  - To run the full LangSmith evaluation suite (dataset + evaluators), run: `uv run invoke evaluation.run`
- Pipeline definitions (ZenML): `app/pipelines/evaluation.py`
  - `evaluation_pipeline`: The full-scale evaluation process. It automates loading/creating the dataset in LangSmith, running the agent against every example, and triggering the multi-dimensional evaluator suite.
  - `agent_query_pipeline`: A lightweight pipeline to run a single query through the agent. This is ideal for testing the agent's reasoning, retry logic, and tool use in a traceable environment without running a full suite.
- How it's executed
  - `uv run invoke evaluation.run` - Executes the full `evaluation_pipeline`.
  - `uv run invoke evaluation.run-agent-query --query="Your text here"` - Executes the `agent_query_pipeline` for a single query.
- What it implements (and where)
  - Agent workflow (LangGraph):
    - `app/agent/state.py`: Defines `AgentState` and `ErrorType` for the workflow.
    - `nodes.py`: Generation and validation nodes using an OpenAI-compatible LLM endpoint.
    - `graph.py`: LangGraph workflow with retry logic that allows the agent to self-correct if it fails format validation.
  - Dataset management:
    - `app/steps/evaluation/dataset/manager.py`: LangSmith dataset creation/retrieval utilities.
    - `eval-dataset.json`: Default evaluation examples with expected context.
  - Evaluators: `app/steps/evaluation/evaluators/`
    - Format checkers: Validate the output is emoji-only, not empty, and has no excessive repetitions (see the sketch below).
    - LLM-as-judge: Uses the configured judge model (`settings.EVAL_JUDGE_MODEL`) to score relevance, appropriateness, and expressiveness.
- LangSmith Integration:
  - Full experiment tracking and trace analysis. Each run creates a new experiment in LangSmith where you can drill down into individual agent turns and evaluator decisions.
LangSmith dashboard visualization after an evaluation.run invocation, showing metrics across the dataset.
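As a reference for the format checkers, here is a hedged sketch of a rule-based LangSmith evaluator; the output key and the exact function signatures in `app/steps/evaluation/evaluators/` may differ.

```python
# Rule-based evaluator: score 1 when the agent output is non-empty and emoji-only.
from app.utils.emojis import is_full_emoji_text  # helper referenced elsewhere in this repo


def emoji_only_evaluator(run, example):
    """LangSmith-style evaluator taking a run/example pair and returning a score dict."""
    output = (run.outputs or {}).get("output", "")  # the "output" key is an assumption
    return {
        "key": "emoji_only",
        "score": int(bool(output) and is_full_emoji_text(output)),
    }
```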
The inference pipeline consumes features (and labels, when available for validation) from the feature store plus a trained artifact from the model registry. With those inputs, we can serve predictions in either batch mode or real-time mode.
Inference optimization is mainly about improving time-to-first-token (latency), tokens/sec (throughput), and memory efficiency. Naive serving setups usually cause poor hardware utilization, which translates into weak latency and throughput. Modern serving stacks apply a set of techniques that can substantially speed up generation.
Autoregressive decoding is inherently sequential: to produce token t+1, the model depends on tokens 1..t. This creates an iterative token-by-token loop that underuses parallel hardware if left unoptimized. Most inference optimizations are designed to reduce this bottleneck.
[Model optimization]
- KV Cache: To predict token 100, the model needs tokens 1-99. To predict token 101, it again needs tokens 1-99 plus token 100. Recomputing this context every step is expensive. The KV cache stores key/value tensors from self-attention, so each new step reuses prior work and only computes KV for the newest token. One caveat: a dynamically growing KV cache limits compatibility with `torch.compile`, which relies on static shapes for kernel fusion. A static KV cache pre-allocates the maximum cache size, enabling `torch.compile` and yielding up to ~4x forward-pass speedups.
- Continuous batching: Larger batches amortize model-weight memory costs and better saturate GPU parallel compute. Continuous batching (a.k.a. in-flight batching) keeps utilization high by immediately admitting a new request when another request in the batch finishes. You still start by filling a batch, but completed requests are continuously evicted and replaced. This keeps the accelerator processing a full batch as often as possible, improving end-to-end utilization.
- Speculative decoding: Generate multiple candidate tokens at once using a smaller draft model (for example, a distilled or pruned proxy), then verify those candidates with the full model. In practice, the draft model proposes ~5-10 tokens; the main model validates them, keeps the longest matching prefix, and discards mismatches. If the draft model tracks the target model well, several tokens can be accepted per step and generation speeds up. Speedup depends on draft quality and draft size. Both models must share the same tokenizer; otherwise token boundaries will not align and verification fails.
- Paged Attention: An autoregressive decoding optimization that stores KV cache in fixed-size memory blocks (pages), similar to virtual memory. Instead of requiring contiguous KV memory per sequence, pages can be allocated non-contiguously, which reduces fragmentation and improves memory efficiency. In serving systems, this typically enables longer context windows and higher request concurrency, which often improves throughput.
- Flash Attention: Flash Attention tiles attention computation into small blocks that fit GPU on-chip SRAM (much faster than HBM). This minimizes memory traffic between main memory and compute units. It also uses online softmax (running max + running exp sum per block), so attention probabilities can be computed without materializing large intermediate matrices.
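As a concrete example of the static KV cache point above, this is roughly how it is enabled together with `torch.compile` in `transformers` (the model id is just an example; exact API details can vary by version):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

model.generation_config.cache_implementation = "static"   # pre-allocate the KV cache
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("A beautiful starry night sky", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```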
[Model parallelism]
- Data parallelism: The simplest parallelism strategy. You replicate the full model across GPUs and shard the data/requests. In training, gradients are synchronized (typically averaged) to keep replicas aligned. In inference, this is useful for handling concurrent traffic by routing different requests to different GPUs. It only works when the full model fits on a single GPU.
- Pipeline parallelism: Split the model by layers across GPUs so each device hosts a stage. During forward pass, activations flow stage-by-stage; backward pass mirrors this in reverse during training. The number of stages/GPU partitions defines pipeline degree. The main drawback is "pipeline bubbles" (stages waiting idle). Micro-batching helps reduce bubble time and improve utilization.
- Tensor parallelism: Split large tensors (for example, MLP weight matrices or attention projections/heads) across GPUs. Each device computes on its shard, and partial outputs are merged (typically with all-reduce/all-gather collectives). In self-attention, this works especially well because attention heads are naturally parallelizable, so each GPU can process a subset of heads independently. Limitations: not every layer parallelizes cleanly (e.g., ops with full-input dependencies such as LayerNorm or Dropout), and good performance depends on high-bandwidth, low-latency interconnects.
[Model quantization]
Quantization means representing model weights/activations with lower-precision datatypes instead of standard FP16/FP32. For LLM inference, this is a primary lever to reduce memory footprint and often improve speed. The challenge is that naive quantization struggles with outlier features (extreme values). Simply discarding outliers hurts quality, so practical methods often mix quantization strategies across layers or use outlier-aware schemes. The two main weight-quantization approaches are:
- Post-training quantization (PTQ): Convert a pretrained model to lower precision without retraining. It is simple and fast to apply, but can reduce output quality.
- Quantization-aware training (QAT): Inject quantization effects during training/fine-tuning so the model learns to be robust to lower precision. Usually higher quality than PTQ, but it needs more compute and representative training data.

Additionally, there are multiple libraries and formats that can help with quantization, such as:
- llama.cpp and GGUF: `llama.cpp` is an open-source C++ inference stack for many LLMs. Compared with CUDA-centric stacks, it runs on a broader hardware surface. Its native GGUF format stores tensors plus metadata and supports multiple bit-widths (roughly 1-8 bit quantization options).
- GPTQ and EXL2: While GGUF/`llama.cpp` is strong for CPU inference (with optional GPU offload), GPTQ and EXL2 are quantization formats focused on GPU serving. They are typically faster than `llama.cpp` in pure GPU inference. EXL2, especially with ExLlamaV2, often delivers very high throughput and supports mixed/custom precision. GPTQ is usually fixed at 4-bit. Trade-off: GPTQ/EXL2 are generally less universally supported than GGUF.
- AWQ: A quantization method that protects important weights selected by activation magnitude (rather than raw weight magnitude). It is well supported by modern inference engines.
- QuIP# and HQQ: A growing trend is ultra-low-bit quantization (1-2 bit). QuIP# and HQQ target this regime and aim to preserve model quality better than naive ultra-low-bit methods. This can be especially compelling for very large models (>30B), which may end up smaller than 7B/13B models while still producing stronger outputs.
1. What are the different ways to combine training and quantization? There are three common strategies. In practice, there is no universally "perfect" alignment: the right choice depends on your hardware budget and required quality.
| Method | What happens? | When is it quantized? | Accuracy |
|---|---|---|---|
| PTQ (Post-Training) | Train in full 16-bit, then squash it to 4-bit at the very end. | After training | Good |
| QLoRA (The Standard) | Load a 4-bit "base," train 16-bit "adapters" on top, then merge. | During & After | Very Good |
| QAT (Awareness) | Simulates 4-bit errors during training so the model adapts. | During | Best |
2. Why not just always use QAT if it can be better? Because "best quality" is not always "best overall trade-off."
- More training time: QAT usually adds overhead to the forward/backward path due to fake-quant simulation and often needs extra epochs to converge well.
- More tuning: You may need to retune learning rate, warmup, quantization granularity (per-channel/group-wise), and clipping/calibration behavior for stable results.
- More failure modes: Training can become unstable (loss spikes, gradient noise, or underfitting) when quantization noise is introduced too early or too aggressively.
- Less portability: QAT can overfit to one quantization setup/backend; if you change runtime kernels or bit-width strategy, gains may shrink.
In practice, teams usually start with PTQ/QLoRA, benchmark quality-latency-memory, and only take on QAT complexity when needed.
3. Why use load_in_4bit = True if the final model is stored in 16-bit?
We use 4-bit loading (QLoRA) during training mainly to reduce VRAM usage. Even for smaller models like Gemma-3 270M, 4-bit loading makes training feasible on consumer GPUs while leaving headroom for longer context or larger batches. The "16-bit" artifact later pushed to Hugging Face is the merged output, which preserves high-precision adapter updates instead of immediately re-quantizing them.
4. If I'm training a "4-bit model," why are the updates in 16-bit? Neural networks learn via very small gradient updates. A 4-bit grid is too coarse to represent these updates reliably. Keeping LoRA adapters in 16-bit on top of a 4-bit base preserves learning precision while still keeping most model memory compressed.
5. Why does my Hugging Face config say bfloat16 after training?
When saving with save_method: "merged_16bit", Unsloth merges the 4-bit base and 16-bit trained adapters into one high-precision master checkpoint. Hugging Face then reports bfloat16 because the merged weights are high precision. This is typically the best artifact to share, since it serves as a near-lossless source for future conversions.
6. Why use merged_16bit instead of merged_4bit_forced?
In short: To protect your model's "IQ."
- `merged_16bit` (Best Quality): Restores a high-resolution checkpoint so your training updates are merged cleanly. Think of it as a high-fidelity master copy. It is also required for reliable GGUF export for CPU deployment, since the quantizer expects a clean high-precision source.
- `merged_4bit_forced` (Worst Quality): Forces those 16-bit training gains back into a coarse 4-bit grid, which can erase nuance and introduce "double quantization" artifacts.
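For reference, this is roughly how the export step looks with Unsloth's save helpers (method names follow Unsloth's documented API; the repo's actual logic lives in `app/steps/sft/save_model.py` and `app/utils/gguf.py`, and `model`/`tokenizer` are assumed to come from a finished training run):

```python
# 1) Merge LoRA adapters into a high-precision checkpoint (clean source for later conversions).
model.save_pretrained_merged("models/emojify-sft", tokenizer, save_method="merged_16bit")

# 2) Export a llama.cpp-ready GGUF, quantized to Q4_K_M for CPU serving.
model.save_pretrained_gguf("models/emojify-sft", tokenizer, quantization_method="q4_k_m")
```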
- Serving default in this repo: Use
llama.cppvia Docker for inference serving (simpler and reproducible). - Local llama.cpp build: Only needed if you want local GGUF tooling (for example, quantization/export workflows). See llama.cpp Local Build (optional, for GGUF tools).
- Serving entrypoint (vLLM):
app/tasks/serve.py- Run:
uv run invoke serve.vllm --model-name=marioparreno/emojify-sft- or serve a local path:
uv run invoke serve.vllm --model-name=./models/emojify-sft-YYYYMMDD_HHMMSS
- This wraps
vllm serve ...and exposes an OpenAI-compatible HTTP API.
- Run:
- Why SFT export format matters for inference
- If you train with LoRA, many serving stacks expect a single set of base weights.
- Use
save_model_step.parameters.save_method: merged_16bitinapp/configs/sft.yamlto produce a merged model artifact that is straightforward to serve with vLLM.
- GGUF artifacts for CPU-friendly inference
  - SFT and DPO save steps can now export GGUF artifacts using Unsloth.
  - In this repo, both `app/configs/sft.yaml` and `app/configs/dpo_train.yaml` are configured with `export_gguf: true` and `gguf_quantization_method: "q4_k_m"`.
  - This keeps a llama.cpp-ready artifact aligned with our intended low-CPU deployment path.
- Chat Templates and vLLM
  - vLLM automatically loads the chat template (Jinja2 format) defined in a model's `tokenizer_config.json` on Hugging Face, so we don't need to handle it manually.
  - We can use the OpenAI-compatible API with standard chat completions (see the client sketch after this list):

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
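For completeness, here is a client-side sketch (not part of the repo) that sends that message structure to the vLLM server started with `serve.vllm`. The base URL assumes vLLM's default port 8000, and the served model name is assumed to match the Hugging Face repo id; adjust both to your setup.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; port 8000 is vLLM's default (adjust if needed).
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dummy")

system_prompt = "Convert user text to emojis only."
user_prompt = "A beautiful starry night sky"

response = client.chat.completions.create(
    model="marioparreno/emojify-sft",  # assumed served model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    temperature=0.2,
    max_tokens=32,
)
print(response.choices[0].message.content)
```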
For low-CPU servers, llama.cpp + GGUF is often more stable than vLLM CPU.
- Run OpenAI-compatible server with your Q4_K_M GGUF
Note: you need to have the GGUF file in the `models` directory.
For the initial supervised fine-tuning, we can use:
docker run --rm -it \
-p 8080:8080 \
ghcr.io/ggml-org/llama.cpp:server \
--hf-repo marioparreno/emojify-sft \
--hf-file emojify-sft.Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
  -t 2

For DPO, we can use:
docker run --rm -it \
-p 8080:8080 \
ghcr.io/ggml-org/llama.cpp:server \
--hf-repo marioparreno/emojify-dpo \
--hf-file emojify-dpo.Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
  -t 2

- Test
For the initial supervised fine-tuning, we can use:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "marioparreno_emojify-sft_emojify-sft.Q4_K_M.gguf",
"messages": [
{"role": "system", "content": "Convert user text to emojis only."},
{"role": "user", "content": "A beautiful starry night sky"}
],
"temperature": 0.2,
"max_tokens": 32
}'

For DPO, we can use:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "marioparreno_emojify-dpo_emojify-dpo.Q4_K_M.gguf",
"messages": [
{"role": "system", "content": "Convert user text to emojis only."},
{"role": "user", "content": "A beautiful starry night sky"}
],
"temperature": 0.2,
"max_tokens": 32
}'

Note: if outputs degrade after export, verify chat template/EOS alignment between training and inference.
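As a quick sanity check (a sketch, not a repo utility), you can render the chat template from the published SFT tokenizer and inspect its EOS token, then compare against what your GGUF server produces for the same messages:

```python
from transformers import AutoTokenizer

# Assumes this project's public SFT repo; swap in your own repo id if it differs.
tok = AutoTokenizer.from_pretrained("marioparreno/emojify-sft")

messages = [
    {"role": "system", "content": "Convert user text to emojis only."},
    {"role": "user", "content": "A beautiful starry night sky"},
]

# The rendered prompt should match what the serving stack feeds the model,
# and the EOS token should be the one the model was trained to emit.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("eos:", tok.eos_token, tok.eos_token_id)
```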
We now use one shared LangGraph agent implementation for both evaluation and production API:
- Shared implementation: `app/agent/`
- FastAPI app: `app/api/`
Environment variables used by deployment:
- `API_CORS_ORIGINS`: comma-separated list or JSON array of allowed origins
- `API_BASE_URL`: OpenAI-compatible base URL (`http://<llama-host>:<port>/v1`)
- `API_MODEL`: model identifier to send in ChatOpenAI requests
- `API_TEMPERATURE`, `API_MAX_COMPLETION_TOKENS`, `API_MAX_RETRIES`
- `LLAMA_CPP_HF_REPO`, `LLAMA_CPP_HF_FILE`, `LLAMA_CPP_THREADS` (for the compose llama service)
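For illustration only (the real configuration logic lives in `app/settings.py` and may differ), these variables could be read roughly like this, including the two accepted shapes of `API_CORS_ORIGINS`:

```python
import json
import os

# Defaults here are illustrative, not the repo's actual defaults.
API_BASE_URL = os.getenv("API_BASE_URL", "http://127.0.0.1:8080/v1")
API_MODEL = os.getenv("API_MODEL", "marioparreno/emojify-dpo")
API_TEMPERATURE = float(os.getenv("API_TEMPERATURE", "0.2"))
API_MAX_COMPLETION_TOKENS = int(os.getenv("API_MAX_COMPLETION_TOKENS", "32"))
API_MAX_RETRIES = int(os.getenv("API_MAX_RETRIES", "2"))

# API_CORS_ORIGINS may be a JSON array ('["https://emojify.maparla.es"]')
# or a comma-separated list ("https://a.com,https://b.com").
raw_origins = os.getenv("API_CORS_ORIGINS", "")
try:
    cors_origins = json.loads(raw_origins)
except json.JSONDecodeError:
    cors_origins = [origin.strip() for origin in raw_origins.split(",") if origin.strip()]
```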
Choose one deployment path:
- FastAPI on host + llama.cpp in Docker
Use this when you want to run the API process locally, but keep model serving in a container.
Start an OpenAI-compatible llama.cpp server:
docker run --rm -it \
-p 8080:8080 \
ghcr.io/ggml-org/llama.cpp:server \
--hf-repo marioparreno/emojify-dpo \
--hf-file emojify-dpo.Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
  -t 2

In another terminal, run FastAPI locally and point it to llama.cpp:
uv sync
export API_BASE_URL=http://127.0.0.1:8080/v1
uv run uvicorn app.api.main:app --host 0.0.0.0 --port 8424

- Full stack with Docker Compose (recommended for end-to-end setup)
Use this when you want api, llama-server, and observability services together.
# optional: initialize and edit .env first
# cp .env.example .env
docker compose up --build -d

Stop services:

docker compose down

- API usage
curl -s http://127.0.0.1:8424/api/v1/emoji \
-H "Content-Type: application/json" \
-d '{"query":"A beautiful starry night sky"}'Response shape:
{
"output": "🌌✨🌙",
"success": true,
"error": "none",
"request_id": "7ec5b669-9bb5-458f-b721-5b22bf4ea1f9"
}

Healthcheck endpoint:

curl -s http://127.0.0.1:8424/health

- Monitoring stack (Prometheus + Grafana + business events)
We've set up a complete monitoring stack to track both technical performance and user behavior. If you are new to monitoring, here is a breakdown of how it works:
- The Application (`api`): Our FastAPI application (`app/api/main.py`) does the actual work. It uses two mechanisms to track data:
  - Prometheus Fastapi Instrumentator: Automatically measures basic HTTP metrics like request counts, latency, and status codes. By calling `.expose()`, it dynamically creates a new GET route at `/metrics`.
  - Custom Metrics (`app/observability/metrics.py`): We've added custom counters and histograms to track specific details like token usage, inference duration, and business events (e.g., when a user clicks "copy" or "regenerate"). These are updated in our endpoints (`app/api/routes/emoji.py`).
  - How they connect: Both mechanisms use the same `prometheus_client` library, which maintains a hidden, global "default registry" in memory. When you define a custom metric, it automatically registers itself there. When Prometheus scrapes the `/metrics` endpoint, the Instrumentator reads this global registry, formatting both its automatic HTTP metrics and your custom metrics into plain text. You do not need to manually pass your metrics to the Instrumentator (see the sketch after this list).
- The Database (`postgres`): While Prometheus is great for numbers over time, we also want to keep the exact text of what users liked or disliked. When a user provides feedback, `app/observability/feedback_store.py` saves the raw query and response directly into a PostgreSQL database.
- The Scraper (`prometheus`): Prometheus is a time-series database. It is configured via `prometheus/prometheus.yml` to automatically "scrape" (download) the `/metrics` endpoints from our API and the `llama-server` every 15 seconds. It stores this data so we can query it over time using PromQL. (Note: `llama.cpp` exposes its own `/metrics` endpoint automatically when run with the `--metrics` flag, requiring no custom code.)
- The Dashboard (`grafana`): Grafana is our visualization UI. We use "Provisioning" to automatically configure it on startup:
  - Datasources (`grafana/provisioning/datasources/prometheus.yml`): Tells Grafana how to connect to Prometheus and PostgreSQL.
  - Dashboards (`grafana/provisioning/dashboards/dashboards.yml`): Tells Grafana to load the JSON dashboard files.
  - The UI (`grafana/dashboards/emojify-monitoring.json`): This JSON file defines all our graphs, charts, and the data table that queries PostgreSQL for recent feedback events.
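If you are new to `prometheus_client`, here is a minimal illustrative sketch of the pattern described above. It is not the exact contents of `app/observability/metrics.py`; the metric names mirror those used in the PromQL examples further down, and the endpoint bodies are placeholders.

```python
from fastapi import FastAPI
from prometheus_client import Counter, Histogram
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# Custom metrics register themselves in prometheus_client's global default registry.
# Counters are exported with a "_total" suffix (emoji_feedback_events_total).
FEEDBACK_EVENTS = Counter(
    "emoji_feedback_events", "User feedback events", ["event_type"]
)
INFERENCE_SECONDS = Histogram(
    "emoji_agent_inference_duration_seconds", "Agent inference duration in seconds"
)

# The instrumentator adds automatic HTTP metrics and exposes everything
# (including the custom metrics above) at GET /metrics.
Instrumentator().instrument(app).expose(app)

@app.post("/api/v1/emoji")
def emoji(query: str):
    with INFERENCE_SECONDS.time():  # records one histogram observation
        output = "🌌✨🌙"  # placeholder for the real agent call
    return {"output": output, "success": True}

@app.post("/api/v1/emoji/feedback")
def feedback(event_type: str):  # the real endpoint accepts a JSON payload
    FEEDBACK_EVENTS.labels(event_type=event_type).inc()
    return {"ok": True}
```

Because both mechanisms share the default registry, a single scrape of `/metrics` returns the automatic HTTP series and the custom ones together.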
The docker compose stack spins up the following services:
- FastAPI metrics endpoint: `http://127.0.0.1:8424/metrics`
- Prometheus UI: `http://127.0.0.1:9090`
- Grafana UI: `http://127.0.0.1:3001` (credentials: admin/admin unless changed in `.env`)
- PostgreSQL service: `postgres:5432` (internal compose network)
- llama.cpp metrics endpoint: `http://llama-server:8080/metrics` (scraped internally)
Environment variables (from .env):
- `FEEDBACK_DATABASE_URL`: PostgreSQL DSN used by the API to persist raw feedback events.
- `PROMETHEUS_HOST_PORT`, `PROMETHEUS_RETENTION`
- `GRAFANA_HOST_PORT`, `GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD`
- `FEEDBACK_DB_NAME`, `FEEDBACK_DB_USER`, `FEEDBACK_DB_PASSWORD`, `FEEDBACK_DB_HOST`, `FEEDBACK_DB_PORT`
For a public deployment behind Cloudflare Tunnel, this stack is typically exposed through 4 public hostnames:
| Public hostname | Tunnel target (origin) |
|---|---|
| `emojify.maparla.es` | `http://YOUR_MACHINE_IP:4785` |
| `emojify-api.maparla.es` | `http://YOUR_MACHINE_IP:8424` |
| `emojify-grafana.maparla.es` | `http://YOUR_MACHINE_IP:3001` |
| `emojify-prometheus.maparla.es` | `http://YOUR_MACHINE_IP:9090` |
Notes:
- The host ports above must match your server runtime (`API_HOST_PORT`, `GRAFANA_HOST_PORT`, `PROMETHEUS_HOST_PORT`).
- Keep `llama-server` (8080) and `postgres` (5432) private (no public tunnel).
- For security, protect `grafana` and especially `prometheus` with Cloudflare Access policies.
Example cloudflared ingress config:
tunnel: emojify
credentials-file: /etc/cloudflared/<tunnel-id>.json
ingress:
- hostname: emojify.maparla.es
service: http://YOUR_MACHINE_IP:4785
- hostname: emojify-api.maparla.es
service: http://YOUR_MACHINE_IP:8424
- hostname: emojify-grafana.maparla.es
service: http://YOUR_MACHINE_IP:3001
- hostname: emojify-prometheus.maparla.es
service: http://YOUR_MACHINE_IP:9090
  - service: http_status:404

Quick checks after creating/updating DNS + tunnel rules:
curl -I https://emojify.maparla.es
curl -I https://emojify-api.maparla.es/health
curl -I https://emojify-grafana.maparla.es/login
curl -I https://emojify-prometheus.maparla.es/-/ready

Tracing is complementary to Prometheus/Grafana. While Prometheus gives you aggregate trends (latency, throughput, error rates), LangSmith allows per-run/per-node drill-down (inputs, retries, checker outcomes, model calls).
Required env vars in .env:
LANGCHAIN_TRACING_V2=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGSMITH_API_KEY=<your_key>
LANGSMITH_PROJECT=<project_name>
Quick verification:
- Trigger a generation:

  curl -s http://127.0.0.1:8424/api/v1/emoji \
    -H "Content-Type: application/json" \
    -d '{"query":"a calm beach at sunset"}'

- Open LangSmith and select the configured `LANGSMITH_PROJECT`.
- Confirm new runs appear.
Notes on troubleshooting missing traces:
- Verify the running `api` container has tracing vars:

  docker compose exec api /bin/sh -lc "env | grep -E 'LANGSMITH|LANGCHAIN_TRACING_V2'"

- Rebuild/restart the API after env updates:

  docker compose up -d --build api
Raw feedback events are persisted in PostgreSQL with data like event_type (copy or regenerate), query, response, and request_id.
To submit a test event:
curl -s http://127.0.0.1:8424/api/v1/emoji/feedback \
-H "Content-Type: application/json" \
-d '{
"event_type": "copy",
"query": "a calm beach at sunset",
"response": "🏖️🌅😌",
"request_id": "7ec5b669-9bb5-458f-b721-5b22bf4ea1f9",
"session_id": "browser-session-id",
"source": "ui",
"created_at": "2026-02-18T16:18:00Z"
}'

Signal interpretation:

- `copy` is a positive intent proxy.
- `regenerate` is a negative intent proxy.
- No feedback event for a generation is neutral/unknown.
1) In Grafana:
- Open the `Emojify Monitoring` dashboard at `http://127.0.0.1:3001`.
- View high-level metrics via Prometheus queries.
- View the raw feedback in the `Feedback Events (Latest 100 Rows)` table panel (powered by the PostgreSQL datasource).
2) Useful PromQL examples:
- API throughput: `sum(rate(http_requests_total{handler!="/metrics"}[5m]))`
- Emoji generation p95 latency: `histogram_quantile(0.95, sum(rate(emoji_agent_inference_duration_seconds_bucket[5m])) by (le))`
- Token throughput: `sum(rate(emoji_agent_total_tokens_total[5m]))`
- Copy rate proxy: `sum(rate(emoji_feedback_events_total{event_type="copy"}[5m])) / clamp_min(sum(rate(emoji_agent_requests_total{result="success"}[5m])), 1e-9)`
3) Accessing PostgreSQL records directly: API endpoint for recent records:
curl -s "http://127.0.0.1:8424/api/v1/emoji/feedback/events?limit=50&offset=0"Terminal table view from PostgreSQL container:
docker compose exec postgres psql -U "${FEEDBACK_DB_USER:-text_to_emoji}" -d "${FEEDBACK_DB_NAME:-text_to_emoji}" \
-c "SELECT id, created_at, event_type, source, request_id, session_id, \"query\", response FROM feedback_events ORDER BY created_at DESC LIMIT 50;"- GitHub Actions deployment
To enable deployment, configure these repository secrets in: Settings > Secrets and variables > Actions
Strictly required (without these, deploy fails):
- `SERVER_HOST`: The IP address or domain of your server.
- `SERVER_USER`: The SSH username (for example, `root` or `ubuntu`).
- `SERVER_SSH_KEY`: The private SSH key used for server authentication.
Optional runtime secrets (the workflow has defaults, but set them for production control):
- `API_CORS_ORIGINS`: Allowed browser origins for CORS in FastAPI (for example, frontend domains).
- `API_HOST_PORT`: Host port mapped to the FastAPI container (`8424` by default).
- `OPENAI_API_KEY`: API key used by the LangChain OpenAI-compatible client; for local `llama.cpp` endpoints a dummy value is acceptable.
- `LANGCHAIN_TRACING_V2`: Set to `true` to enable LangSmith tracing (default: `true`). Set to `false` to disable.
- `LANGSMITH_ENDPOINT`: LangSmith API base URL (default: `https://api.smith.langchain.com`).
- `LANGSMITH_API_KEY`: LangSmith key used to upload traces.
- `LANGSMITH_PROJECT`: LangSmith project name where API traces are stored.
- `LLAMA_CPP_HOST_PORT`: Host port mapped to the `llama-server` container (`8080` by default).
- `LLAMA_CPP_HF_REPO`: Hugging Face repo used by `llama.cpp` to fetch the GGUF model.
- `LLAMA_CPP_HF_FILE`: GGUF filename pulled from the repo above.
- `LLAMA_CPP_THREADS`: CPU threads used by `llama.cpp` during inference.
- `FEEDBACK_DB_NAME`: PostgreSQL database name for feedback events.
- `FEEDBACK_DB_USER`: PostgreSQL username for feedback events.
- `FEEDBACK_DB_PASSWORD`: PostgreSQL password for feedback events.
- `FEEDBACK_DB_PORT`: PostgreSQL service port inside the compose network (`5432` by default).
- `PROMETHEUS_HOST_PORT`: Host port mapped to Prometheus (`9090` by default).
- `PROMETHEUS_RETENTION`: Prometheus retention duration (for example, `15d`).
- `GRAFANA_HOST_PORT`: Host port mapped to Grafana (`3001` by default).
- `GRAFANA_ADMIN_USER`: Grafana admin username.
- `GRAFANA_ADMIN_PASSWORD`: Grafana admin password.
Notes:
- `GITHUB_TOKEN` is provided automatically by GitHub Actions; you do not create it manually.
- If you do not set an optional secret, the deploy workflow writes a default value into the generated server `.env`.
Deployment is automatically triggered when pushing a tag that starts with v (for example, v1.0.0).
- Create a new tag:
git tag v1.0.0

- Push the tag:

git push origin v1.0.0

This triggers the backend GitHub Actions workflow, which connects to your server and deploys with docker compose.
To list tags:
git tag --sort=version:refname

Benchmark results for gemma-3-270m-it.Q4_K_M.gguf using ghcr.io/ggml-org/llama.cpp:full (build ff4affb4c, CPU backend libggml-cpu-haswell.so).
System info (from lscpu)
| Field | Value |
|---|---|
| Architecture | x86_64 |
| CPU model | AMD EPYC-Milan Processor |
| vCPU / topology | 2 vCPU (1 socket x 1 core x 2 threads) |
| Hypervisor | KVM (full virtualization, QEMU BIOS) |
| NUMA | 1 node |
| Cache | L1d 32 KiB, L1i 32 KiB, L2 512 KiB, L3 32 MiB |
Commands used
docker run --rm -it \
-v /models:/models:ro \
--entrypoint /app/llama-bench \
ghcr.io/ggml-org/llama.cpp:full \
  -m /models/gemma-3-270m-it.Q4_K_M.gguf -t 1

Use -t to set CPU threads (for example, -t 1 for single-thread and -t 2 for two threads).
Results
- `pp512`: prompt processing throughput (tokens/sec while ingesting a 512-token prompt). Higher is better for long-input latency.
- `tg128`: token generation throughput (tokens/sec while generating 128 output tokens). Higher is better for response speed.
| Threads | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| 1 | 171.29 ± 5.86 | 45.89 ± 0.30 |
| 2 | 292.28 ± 6.28 | 66.29 ± 1.80 |
From -t 1 to -t 2, throughput improves by about 1.71x for pp512 and 1.44x for tg128.
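A quick arithmetic check of those scaling factors against the table values:

```python
# Benchmark numbers from the table above.
pp512 = {1: 171.29, 2: 292.28}  # prompt processing, tokens/sec
tg128 = {1: 45.89, 2: 66.29}    # token generation, tokens/sec

print(f"pp512 speedup 1 -> 2 threads: {pp512[2] / pp512[1]:.2f}x")  # ~1.71x
print(f"tg128 speedup 1 -> 2 threads: {tg128[2] / tg128[1]:.2f}x")  # ~1.44x
```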
Create a .env file in the project root with the following configuration:
OPENAI_API_KEY=your_api_key_here
OPENAI_MODEL=gpt-5.2-2025-12-11

We specifically use gpt-5.2-2025-12-11 for synthetic data generation. You can find pricing information and a complete list of available models at https://openai.com/api/pricing/.
This project uses uv for dependency management. Dependencies are organized into groups so API deployment and ML workflows can be installed independently:
| Group | Description | Platforms |
|---|---|---|
| Default | Lean API/runtime dependencies (FastAPI, LangGraph agent runtime, etc.) | macOS, Linux |
| `pipelines` | Data/ETL/evaluation stack (ZenML, MongoDB client, Redis, OpenAI, LangSmith, datasets) | macOS, Linux |
| `training` | Model training stack (Torch, Transformers, TRL, W&B) | macOS, Linux |
| `cuda` | GPU-accelerated packages (vLLM, FlashInfer, Unsloth, etc.) | Linux only |
| `dev` | Development tools (pytest, ruff, jupyter, etc.) | All |
Separating API, pipelines, training, and CUDA dependencies keeps Docker/API images lightweight and avoids pulling large ML/GPU wheels unless they are explicitly needed.
- Run only the API with a small dependency footprint
- Install data/training dependencies only in environments that need them
- Add CUDA-specific packages only on Linux GPU machines
# API/runtime only (default)
uv sync
# API + data pipelines
uv sync --group pipelines
# API + pipelines + training
uv sync --group pipelines --group training
# Linux GPU machine (add CUDA packages)
uv sync --group pipelines --group training --group cuda
# Install everything (default + all dependency groups)
uv sync --all-groups
# API/runtime + development tools (any platform)
uv sync --group dev

Note: The FlashInfer packages require a custom index URL (https://flashinfer.ai/whl/cu130), which is configured in `pyproject.toml` under `[tool.uv]`.
Use this only when you need local llama.cpp binaries (for example, llama-quantize) during GGUF export/packaging workflows. For serving, prefer the Docker path documented in Serving with llama.cpp (GGUF).
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j 12 --clean-first --target llama-quantize llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp build/bin/llama-* .

We need to install the hooks for pre-commit. You can do this by running the following command:

uv run pre-commit install

Installing ZenML and getting started:
https://docs.zenml.io/getting-started/installation
uv run zenml login --local

https://docs.wandb.ai/guides/registry/
Published artifacts from this repository:
- SFT dataset: https://huggingface.co/datasets/marioparreno/emojify-sft
- DPO dataset: https://huggingface.co/datasets/marioparreno/emojify-dpo
- SFT model: https://huggingface.co/marioparreno/emojify-sft
- DPO model: https://huggingface.co/marioparreno/emojify-dpo
We will be using invoke for calling tasks...
uv run invoke --list

Table with available tasks and docs....
ZenML YAML configuration for pipeline and steps: https://docs.zenml.io/concepts/steps_and_pipelines/yaml_configuration
To suppress verbosity from debug logs:
LOGURU_LEVEL=INFO uv run invoke data-collection.run-hf

Minimal command flow (assuming uv is already installed and MongoDB is running):
# 1) Install all dependency groups
uv sync --all-groups
# 2) Login to local ZenML
uv run zenml login --local
# 3) Collect raw data into MongoDB (required before dataset generation)
uv run invoke data-collection.run-hf
# 4) Create instruction dataset (will create a HuggingFace dataset)
uv run invoke generate-dataset.run
# 5) Train SFT model (will create a HuggingFace model)
uv run invoke sft.train
# 6) Create DPO preference data (chosen responses) (will create a HuggingFace dataset)
uv run invoke dpo.generate-batch
uv run invoke dpo.collect-batch # This will take a while...
# 7) Create full DPO dataset (generates rejected responses; requires running vLLM)
uv run invoke serve.vllm --model-name=marioparreno/emojify-sft
uv run invoke dpo.generate-dataset
# 8) Train DPO model
uv run invoke dpo.train
# 9) Serve final model
uv run invoke serve.vllm --model-name=marioparreno/emojify-dpo
# 10) Quick test
uv run invoke evaluation.run-agent-query --query="A beautiful starry night sky"

Notice how the `<bos>` token is removed right after using `apply_chat_template`. If we take an example:
<pad><bos><start_of_turn>user
Translate this text to emoji:
A beautiful starry night sky<end_of_turn>
<start_of_turn>model
🌌✨🌠<end_of_turn>
This is a full conversation, including both user and model turns — basically what appears in a training sample. During supervised fine-tuning (SFT), the model learns from this full context:
- `<bos>`: beginning of sequence (special token).
- `<start_of_turn>user ... <end_of_turn>`: marks the user's message.
- `<start_of_turn>model ... <end_of_turn>`: marks the model's response.
- `<pad>`: only used for padding to align sequences in a batch.
Why strip the leading `<bos>`? Because for many chat-based LLMs (e.g., Llama 3, Mistral, Gemma):

- `<bos>` is automatically added internally when calling `generate()` if missing.
- Keeping it twice can lead to token duplication or weird generation starts.
Some Unsloth models use a custom tokenizer wrapper that adds <bos> automatically whenever you tokenize an input for training or inference.
You can verify this by checking:
tokenizer.bos_token_id # = 1
tokenizer.add_bos_token # = True
tokenizer.add_eos_token # = FalseYou’ll want two datasets derived from your original data:
- Train dataset: with input + output → used for supervised fine-tuning (SFT).
- Eval dataset: with only the input/prompt → used for generation-based evaluation.
The preparation starts with formatting the examples:
def convert_to_chatml(example):
    return {
        "conversations": [
            {"role": "system", "content": "Translate this text to emoji: "},
            {"role": "user", "content": example["text"]},
            {"role": "assistant", "content": example["emoji"]},
        ]
    }

dataset = dataset.map(convert_to_chatml)

For training, we want the full conversation:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        ).removeprefix("<bos>")
        for convo in convos
    ]
    return {"text": texts}

train_dataset = dataset.map(formatting_prompts_func, batched=True)

For evaluation, you want the input only, not the assistant's turn.
def formatting_eval_prompts_func(examples):
    convos = [
        convo[:-1]  # remove the assistant turn
        for convo in examples["conversations"]
    ]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=True
        ).removeprefix("<bos>")
        for convo in convos
    ]
    # keep ground-truth labels for metric comparison
    labels = [convo[-1]["content"] for convo in examples["conversations"]]
    return {"text": texts, "labels": labels}

eval_dataset = dataset.map(formatting_eval_prompts_func, batched=True)

Notice how we set `add_generation_prompt=True`, which appends `<start_of_turn>model` to the text.
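To see the difference, here is a small sketch (the model id is an example, and it assumes the tokenizer's chat template accepts a system turn, as in our training data) that renders the same conversation with and without the generation prompt:

```python
from transformers import AutoTokenizer

# Example tokenizer id; use the base model you actually fine-tune.
tok = AutoTokenizer.from_pretrained("unsloth/gemma-3-270m-it")

convo = [
    {"role": "system", "content": "Translate this text to emoji: "},
    {"role": "user", "content": "A beautiful starry night sky"},
]

with_prompt = tok.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
without_prompt = tok.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)

# with_prompt ends with "<start_of_turn>model\n", cueing the model to answer;
# without_prompt stops right after the user's "<end_of_turn>".
print(with_prompt)
print(without_prompt)
```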
-> FlashInfer: https://github.com/flashinfer-ai/flashinfer
pip install flashinfer-python flashinfer-cubin
# JIT cache package (replace cu129 with your CUDA version: cu128, cu129, or cu130)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129

We install the packages with uv but need to specify the custom index URL for the JIT-cache package.