- Introduction
- Schema
- The Pipeline
- The Data Collection Pipeline
- The Feature Pipeline: Dataset Generation
- The Training Pipeline: Supervised Fine-Tuning (SFT)
- The Preference Alignment Pipeline (DPO)
- The Evaluation Pipeline
- The Inference Pipeline
- The Monitoring Flow
- Accessing the Services
- Cloudflare Tunnel Deployment
- Tracing with LangSmith
- Business Feedback Events
- Inspecting Metrics and Feedback Data
- GitHub Secrets
- Trigger Deployment
- Environment and Tools
- Data and Model Registry
- Hugging Face Artifacts
- Tasks and Invoke
- Quick Run (End-to-End)
- Special Notes
- Pending work
This project is not meant to be a perfect reference for one single thing, but a practical end-to-end walkthrough of many things at once: turning text into emojis while covering the full ML lifecycle.
We focus on the following capabilities:
- Data lifecycle
- Collect, generate, ingest, clean, and validate data.
- Apply quality controls such as filtering, deduplication, and consistency checks.
- Training and alignment
- Train models with Supervised Fine-Tuning (SFT).
- Explore preference alignment techniques such as Direct Preference Optimization (DPO).
- Evaluation and analysis
- Evaluate models with relevant metrics.
- Analyze outcomes to compare approaches and identify trade-offs.
- Inference and serving
- Deploy and consume the application.
- Optimize inference for CPU-efficient production usage.
- Observability and user feedback
- Track core runtime signals such as inference time, latency, and throughput.
- Collect user feedback events to support continuous improvement.
- MLOps and delivery
- Version, track, and share datasets and models.
- Automate deployment, evaluation, reporting, and training workflows.
Repository layout (high-level):
.
├── pyproject.toml <- Project dependencies and tool config
├── Dockerfile <- API container image build
├── docker-compose.yml <- Local stack (API, monitoring, services)
├── .github <- CI/CD workflows
│
├── app
│ ├── settings.py <- Central runtime configuration
│ │
│ ├── pipelines <- ZenML pipeline entrypoints
│ │ ├── data_collection.py <- ETL pipeline (HF, Unsloth, OpenAI)
│ │ ├── generate_dataset.py <- Dataset creation + publication pipeline
│ │ ├── sft.py <- Supervised Fine-Tuning pipeline
│ │ ├── dpo.py <- Preference alignment (DPO) pipeline
│ │ └── evaluation.py <- Evaluation pipeline
│ │
│ ├── steps <- Reusable pipeline building blocks
│ │ ├── etl <- Collection, normalization, ingestion steps
│ │ ├── dataset <- Dataset split/readme/publish steps
│ │ ├── sft <- SFT load/prepare/train/save/model-card steps
│ │ ├── dpo <- DPO dataset generation + training steps
│ │ └── evaluation <- Evaluators, datasets, and scoring logic
│ │
│ ├── tasks <- `invoke` commands to run workflows
│ │ ├── data_collection.py
│ │ ├── generate_dataset.py
│ │ ├── sft.py
│ │ ├── dpo.py
│ │ ├── evaluation.py
│ │ └── serve.py
│ │
│ ├── configs <- YAML configs injected into pipelines/tasks
│ │ ├── etl_hf.yaml
│ │ ├── etl_unsloth.yaml
│ │ ├── etl_openai.yaml
│ │ ├── generate_dataset.yaml
│ │ ├── sft.yaml
│ │ ├── dpo_generate_batch.yaml
│ │ ├── dpo_collect_batch.yaml
│ │ ├── dpo_generate_dataset.yaml
│ │ ├── dpo_train.yaml
│ │ ├── evaluation.yaml
│ │ └── agent_query.yaml
│ │
│ ├── api <- FastAPI serving layer
│ │ ├── main.py <- API bootstrap
│ │ ├── schemas.py <- Request/response models
│ │ └── routes/emoji.py <- Emoji inference + feedback endpoints
│ │
│ ├── agent <- Inference agent graph and orchestration
│ │ ├── factory.py
│ │ ├── graph.py
│ │ ├── nodes.py
│ │ ├── state.py
│ │ └── callbacks.py
│ │
│ ├── observability <- Metrics and business feedback handling
│ │ ├── metrics.py
│ │ └── feedback_store.py
│ │
│ ├── infrastructure
│ │ └── db/mongo.py <- MongoDB integration
│ │
│ ├── domain <- Core domain models/exceptions
│ │ ├── data_jobs.py
│ │ ├── documents.py
│ │ ├── nosql.py
│ │ ├── dpo.py
│ │ └── exceptions.py
│ │
│ ├── materializers
│ │ └── peft_model.py <- Custom ZenML model materializer
│ │
│ ├── utils <- Shared utilities (training/inference/gguf)
│ │ ├── training.py
│ │ ├── inference.py
│ │ ├── gguf.py
│ │ ├── prompts.py
│ │ ├── emojis.py
│ │ ├── config_loader.py
│ │ └── serialization.py
│ │
│ └── notebooks <- Experiments and local validation notebooks
│
└── misc <- Plots/assets used in docs and analysis
The data collection pipeline integrates examples from multiple sources, including public datasets and synthetically generated data. In practice, this layer serves as the project’s data warehouse.
We implement an Extract, Transform, Load (ETL) pipeline that:
- Extracts (and synthetically generates) data from multiple sources. We parse datasets from Hugging Face, generate synthetic samples with open-source LLMs via Unsloth, and generate additional synthetic data with proprietary models such as OpenAI.
- Transforms the data by cleaning and standardizing it into a consistent schema suitable for storage and downstream analysis.
- Loads the transformed data into a warehouse/database. We use MongoDB as the project’s NoSQL data warehouse.
- Pipeline definition (ZenML): `app/pipelines/data_collection.py`
  - Implements a single ZenML pipeline `data_collection(source: str)` with dynamic routing based on `source`:
    - `huggingface` → download from HF Hub
    - `unsloth` → local synthetic generation using an open model
    - `openai` → synthetic generation via OpenAI API
- How it’s executed
  - We run the same ZenML pipeline with different ZenML YAML configs via `invoke` tasks in `app/tasks/data_collection.py`:
    - `uv run invoke data-collection.run-hf` → uses `app/configs/etl_hf.yaml`
    - `uv run invoke data-collection.run-unsloth` → uses `app/configs/etl_unsloth.yaml`
    - `uv run invoke data-collection.run-openai` → uses `app/configs/etl_openai.yaml`
  - Each task calls `data_collection.with_options(run_name=..., config_path=...)()` so ZenML injects step parameters from the YAML (see the sketch below).
- What it implements (and where)
  - Job tracking + lineage: `app/steps/etl/common/create_data_job.py`
    - Creates a `DataDownloadJob` (HF) or `DataGenerationJob` (Unsloth/OpenAI).
    - Uses the ZenML pipeline run id as `job_id`, so every produced `Document` can be linked back to a specific pipeline run.
  - Data quality / cleaning: `app/steps/etl/common/clean_normalize.py`
    - Filters out documents with empty `text`/`emojification`.
    - Filters out rows where `emojification` is not “emoji-only” (via `app/utils/emojis.py`).
  - Deduplication at ingestion: `app/steps/etl/common/save_to_db.py`
    - Filters out incoming rows whose `text` already exists in MongoDB (prevents duplicates across runs/sources).
  - Job status finalization: `app/steps/etl/common/update_job_status.py`
    - Marks the job as `COMPLETED`/`FAILED` and records `documents_added` and any error message.
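For orientation, here is a minimal sketch of how such an `invoke` task could wire a YAML config into the ZenML pipeline. The real task bodies live in `app/tasks/data_collection.py`; the run-name format below is illustrative.

```python
# Hypothetical sketch of an invoke task that runs the ZenML pipeline with one config.
from datetime import datetime

from invoke import task

from app.pipelines.data_collection import data_collection


@task
def run_hf(ctx):
    """Run data collection for the Hugging Face source using etl_hf.yaml."""
    run_name = f"data_collection_hf_{datetime.now():%Y%m%d_%H%M%S}"
    data_collection.with_options(
        run_name=run_name,
        config_path="app/configs/etl_hf.yaml",  # ZenML injects step parameters from this YAML
    )(source="huggingface")
```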
The feature pipeline ingests raw documents, processes them, and produces the features/targets consumed by training and inference workflows. Instead of sending these artifacts directly to the model stack, we persist them in a feature-store-like layer so they can be versioned, tracked, and shared reliably across runs.
Estimating the right number of samples is context-dependent and rarely straightforward. For very large models (70B+ params), a relatively small set of high-quality examples can be enough (for example, ~1k in LIMA). That does not usually transfer to smaller models (for example ~7B), which often need more data just to internalize the expected chat template, especially when starting from base checkpoints. A practical approach is to bootstrap from related open-source datasets and adapt them for your fine-tuning objective.
- General-purpose models: Because they must generalize across many topics and intents, they usually need much broader coverage. In practice, strong general-purpose instruct tuning often starts around 1M instruction samples.
- Task-specific models: These are optimized for one objective, so focused corpora are often enough; depending on task difficulty, useful dataset sizes can range from ~100 to ~100k samples.
- Domain-specific models: These adapt the model to a field’s concepts and vocabulary. Data needs depend on domain breadth/complexity and on how well that domain is represented in pre-training.
Data curation usually combines repurposed examples from existing datasets with newly generated samples. In broad or specialized domains, curation is harder and often needs domain experts to source and validate relevant material (papers, technical docs, and other domain-native text).
Rule-based filtering enforces quality through explicit deterministic checks. It is fast, scalable, consistent across samples, and transparent to audit, which reduces manual review load. The trade-off is that poorly designed rules can introduce or amplify dataset bias.
- Length filtering: Enforce minimum/maximum response lengths to remove under-informative short outputs and overly long, noisy outputs. Good thresholds are task- and domain-dependent.
- Keyword exclusion: Filter samples containing terms associated with low-quality, unsafe, spammy, or off-topic content, including domain-specific indicators of irrelevance.
- Format checking: Validate structural compliance with the expected schema/format, especially for code, JSON, or other strongly structured outputs.
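As a concrete illustration, here is a minimal sketch of such rule-based filters for this project's records, assuming each sample is a dict with `text` and `emojification` fields; the thresholds, banned terms, and emoji check below are illustrative, not the repo's exact rules.

```python
# Illustrative rule-based filters: length filtering, keyword exclusion, format checking.
import emoji  # third-party "emoji" package, used here as an example emoji detector

BANNED_TERMS = {"lorem ipsum", "click here", "subscribe"}  # example keyword exclusions


def passes_rules(sample: dict, min_len: int = 3, max_len: int = 200) -> bool:
    text = sample.get("text", "")
    emojification = sample.get("emojification", "")
    if not (min_len <= len(text) <= max_len):               # length filtering
        return False
    if any(term in text.lower() for term in BANNED_TERMS):  # keyword exclusion
        return False
    stripped = emojification.replace(" ", "")
    return bool(stripped) and emoji.purely_emoji(stripped)  # format check: emoji-only target


samples = [
    {"text": "A beautiful starry night sky", "emojification": "🌌✨🌠"},
    {"text": "click here!!!", "emojification": "🌌 plus text"},
]
filtered = [s for s in samples if passes_rules(s)]  # keeps only the first sample
```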
Data deduplication is foundational. Duplicates and near-duplicates can cause overfitting (memorization over generalization), biased performance (over-represented patterns), inefficient training (wasted compute), and inflated evaluation metrics (leakage-like overlap effects). Common approaches include:
- Exact deduplication: Removes byte/string-identical samples after normalization, usually by hashing each sample (for example with MD5 or SHA-256) and dropping repeated hashes.
- Fuzzy deduplication: Targets near-duplicates rather than exact string matches. A common method is MinHash, where compact signatures are compared with metrics like Jaccard similarity.
- Semantic similarity: Deduplicates by meaning instead of lexical overlap. Samples are embedded with methods/models such as FastText, BERT, or sentence transformers, and then compared in vector space. At larger scale, clustering approaches can group similar vectors so one representative is kept and the rest are marked as duplicates.
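For exact deduplication, a small self-contained sketch of the hashing approach described above (the normalization step is an assumption, not this repo's exact rule):

```python
# Exact deduplication: hash a normalized form of each text and drop repeated hashes.
import hashlib


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def exact_dedup(texts: list[str]) -> list[str]:
    seen, unique = set(), []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


print(exact_dedup(["A starry night", "a  STARRY night", "A rainy day"]))
# -> ['A starry night', 'A rainy day']
```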
A practical decontamination strategy is to include your evaluation set during the deduplication phase and remove any overlapping samples from the instruction dataset. You should also filter samples likely derived from the same sources as the evaluation data.
Data quality evaluation is critical. Traditional human annotation is usually high quality but expensive and slow at scale, so teams commonly combine it with automated scoring approaches:
- LLM-as-a-judge: Prompts one or more LLMs to score sample quality using ratings, custom rubrics, or pairwise comparisons. Known biases include position bias (mitigate by randomizing order), verbosity bias (mitigate with length controls), and same-model favoritism (mitigate with model diversity). In practice, a jury of models tends to improve consistency.
- Reward models: Models that score an instruction-response pair. A common setup is a decoder-only backbone (for example Gemma/Llama) with a linear scoring head, optionally producing multi-dimensional scores such as helpfulness, correctness, or verbosity; see reward models.
- Classifiers or encoder-only models: Add a classification head on top of an embedding/encoder model to predict quality classes or labels. Encoder-only architectures are typically smaller and well-suited for this classification workload.
Data exploration is how we build intuition for the training corpus before committing to training decisions. It combines manual inspection and automated analysis.
- Manual dataset exploration: Although time-intensive, direct sample review is still the best way to catch formatting defects, annotation mistakes, incoherent reasoning, and factual issues.
- Statistical analysis: Complements manual review by quantifying vocabulary diversity, potential bias, and concept coverage. Tooling such as NLTK or spaCy helps tokenize and profile large corpora, surfacing composition patterns and cultural/contextual skews that can affect model behavior.
- Topic clustering: Groups semantically related texts to reveal themes and blind spots in coverage. Hugging Face's text-clustering offers a practical pipeline: embed text, reduce dimensionality with UMAP, cluster with DBSCAN, auto-label clusters via an LLM, and visualize the results.
Data generation becomes necessary when available instruction data is insufficient, especially for underrepresented slices. LLM-based synthetic generation is an efficient, scalable way to expand coverage, and with solid prompt design it can produce high-quality data at a scale that manual authoring cannot match.
Synthetic pipelines usually start from a curated prompt set (often called a taxonomy) designed to induce diverse examples. Prompts typically include explicit instructions, examples, and constraints so outputs match target format and intent. Mature pipelines add multi-stage quality controls (validation and rule checks) for accuracy and relevance. You can also steer generation attributes such as instruction complexity, response length, writing style/tone, use of structured generation, and topic/domain focus. Since synthetic outputs can inherit model bias and model errors, mitigation commonly includes human review, prompt diversification, and extra filtering stages.
Data augmentation aims to increase both dataset size and sample quality. A classic approach is Evol-Instruct, where LLMs iteratively evolve simpler prompts into stronger instructions. It uses two complementary strategies:
- In-depth evolving: Increases the complexity of existing instructions.
  - Constraints: Add extra requirements or limitations so the task becomes harder and more specific.
  - Deepening: Reformulate prompts into deeper questions that require more complete, higher-effort answers.
  - Concretizing: Replace abstract concepts with specific, detailed variants to reduce ambiguity.
  - Reasoning steps: Explicitly require multi-step reasoning to elicit more structured problem solving.
  - Complicating input: Introduce more complex input modalities or structures (for example XML, JSON, or code snippets).
- In-breadth evolving: Expands instruction diversity by generating new prompts inspired by existing ones, with emphasis on rarer and long-tail examples within the same domain.
- Pipeline definition (ZenML): `app/pipelines/generate_dataset.py`
- How it’s executed
  - `uv run invoke generate-dataset.run` (task: `app/tasks/generate_dataset.py`)
  - Uses ZenML YAML config: `app/configs/generate_dataset.yaml`
- What it implements (and where)
  - Dataset “data quality gate” at export time: `app/steps/dataset/load_documents.py`
    - Drops documents marked `deprecated`.
    - Drops documents whose job is deprecated or whose job status is not `COMPLETED`.
    - This ensures the published HF dataset is sourced only from “successful” pipeline runs.
  - Split + canonical columns: `app/steps/dataset/create_train_test_split.py` (see the sketch below)
    - Shuffles with a fixed seed, then creates `train`/`test`.
    - Converts to HF `DatasetDict` with columns: `text`, `emojification`.
  - Dataset card generation: `app/steps/dataset/generate_readme.py`
    - Builds the dataset card content used as `README.md` in the dataset repo.
  - Publishing: `app/steps/dataset/push_to_huggingface.py`
    - Pushes the dataset to HF Hub using `settings.HUGGINGFACE_ACCESS_TOKEN`.
    - Uploads the generated dataset card as `README.md` in the dataset repo.
- Main config options (`app/configs/generate_dataset.yaml`)
  - `parameters.dataset_id`: HF dataset repo id to publish to (e.g. `username/name`).
  - `create_train_test_split.parameters.test_split_size`: fraction used for `test`.
  - `create_train_test_split.parameters.shuffle_seed`: ensures deterministic splits.
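For illustration, here is a minimal sketch of the split step, assuming documents arrive as dicts with `text` and `emojification`; the actual logic lives in `app/steps/dataset/create_train_test_split.py` and may differ in details.

```python
# Deterministic shuffle + train/test split into a Hugging Face DatasetDict.
import random

from datasets import Dataset, DatasetDict


def create_train_test_split(documents, test_split_size=0.1, shuffle_seed=42):
    rows = [{"text": d["text"], "emojification": d["emojification"]} for d in documents]
    random.Random(shuffle_seed).shuffle(rows)          # fixed seed -> reproducible splits
    n_test = max(1, int(len(rows) * test_split_size))
    return DatasetDict({
        "train": Dataset.from_list(rows[n_test:]),
        "test": Dataset.from_list(rows[:n_test]),
    })
```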
The training pipeline consumes the prepared features/labels and produces a trained model. This model is then stored in a model registry so it can be versioned, tracked, and shared with the inference stack.
It is recommended to keep a manual approval step before promoting a new model to production.
In this stage, we apply Supervised Fine-Tuning (SFT). SFT refines model behavior using instruction-response pairs produced by previous pipelines. The goal is to teach the model the expected conversational format and adapt a general base to perform well on targeted tasks or specific domains.
During fine-tuning, we can optimize either on the full prompt-response text or on responses only (see train_on_responses_only in train_model.py). Instruction datasets also depend on template conventions. For example, Alpaca-style records may include input and system fields, which can be treated as instruction context: input carries task-specific data, while system acts as a control prompt for desired behavior.
`train_on_responses_only` does not make the `system` prompt useless. The model still attends to the `system` + `user` text as context when predicting the assistant tokens; only the loss is masked so optimization focuses on the assistant response. This means we do not penalize the model for reproducing prompt/template text, which often leads to cleaner output-focused optimization.
- System: Translate this text to emoji:
- Instruction: A beautiful starry night sky
- Output: 🌌✨🌠
The instruction field contains the task input (here, the text to translate). The output field is the target response; it is not necessarily the only valid answer, but it should be a high-quality one. When curating instruction data, the dataset should reflect real production usage. We can then filter examples using quality dimensions such as:
- Accuracy: Responses should be factually correct and aligned with the instruction intent.
- Diversity: Data should cover the range of expected user requests across topics, contexts, lengths, and writing styles.
- Complexity: Include challenging and multi-step tasks so the model learns beyond trivial patterns.
In many cases, it is better to start with prompt engineering before investing in SFT. Techniques like few-shot prompting or retrieval-augmented generation (RAG) can solve many problems with lower cost and risk, while also helping establish a solid evaluation baseline. Fine-tuning should still be approached carefully: studies show that injecting new knowledge via SFT can increase hallucinations, and some approaches can erase previously learned capabilities (catastrophic forgetting).
Instruction datasets use structured schemas to organize prompts and responses. In practice, each sample is typically a JSON/Python dictionary with fields such as system, instruction, input, output, or conversation turns.
| Name | JSONL format |
|---|---|
| Alpaca | `{"instruction": "...", "input": "...", "output": "..."}` or, without input, `{"instruction": "...", "output": "..."}` |
| ShareGPT | `{"conversations": [{"from": "...", "value": "..."}, ...]}` |
| OpenAI | `{"conversations": [{"role": "...", "content": "..."}, ...]}` |
| OASST | `{"INSTRUCTION": "...", "RESPONSE": "..."}` |
| Raw text | `{"text": "..."}` |
Raw text format is mainly used when doing continual pre-training rather than instruction tuning.
Alpaca is enough for single-turn instruction-response data (one prompt, one answer). For multi-turn conversations, formats like ShareGPT or OpenAI-style chat records are generally a better fit.
After parsing dataset records, we format them with a chat template. Chat templates provide a consistent way to serialize messages for the model.
Templates also include special tokens to mark message boundaries and speaker roles (system, user, assistant, tool, etc.). Since base models are not inherently instruction-formatted, you can choose the template during fine-tuning. For already instruction-tuned models, you should usually keep the original template; changing it can degrade performance.
As with dataset schemas, multiple chat template families exist (ChatML, Llama, Mistral, Gemma, etc.). In open-source workflows, ChatML is common: it uses <|im_start|> and <|im_end|> to mark each turn.
<|im_start|>system
Translate this text to emoji:<|im_end|>
<|im_start|>user
A beautiful starry night sky<|im_end|>
<|im_start|>assistant
🌌✨🌠<|im_end|>
At inference time, the target answer is not provided. We pass only the system and user turns and append the assistant prefix (for example <|im_start|>assistant\n) to trigger generation. Because the model was trained on this template, it learns to continue with an answer that matches both user intent and system guidance.
A common failure mode comes from whitespace and line breaks, which are significant to the tokenizer. Even tiny formatting changes can alter tokenization and hurt quality. For this reason, robust template systems like Jinja are recommended.
| Name | Jinja template (rendered example) |
|---|---|
| Alpaca | `### Instruction: What is the capital of France?`<br>`### Response: The capital of France is Paris.<EOS>` |
| ChatML | `<\|im_start\|>user`<br>`What is the capital of France?<\|im_end\|>`<br>`<\|im_start\|>assistant`<br>`The capital of France is Paris.<\|im_end\|>` |
| Gemma | `<bos><start_of_turn>user`<br>`What is the capital of France?<end_of_turn>`<br>`<start_of_turn>model`<br>`The capital of France is Paris.<end_of_turn>` |
Jinja supports loops and conditionals, enabling one template definition for both training and inference (via add_generation_prompt).
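To make the training/inference difference concrete, here is a small example using the `transformers` chat-template API; the checkpoint name is just an arbitrary chat-templated model for illustration.

```python
# Same chat template, two renderings: full conversation for training,
# prompt + assistant prefix (add_generation_prompt=True) for inference.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

messages = [
    {"role": "system", "content": "Translate this text to emoji:"},
    {"role": "user", "content": "A beautiful starry night sky"},
]

train_text = tokenizer.apply_chat_template(
    messages + [{"role": "assistant", "content": "🌌✨🌠"}],
    tokenize=False,
    add_generation_prompt=False,   # training text includes the target answer
)

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,    # appends the assistant prefix to trigger generation
)
print(prompt_text)
```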
Although the literature is broad, practical SFT workflows usually center on three approaches:
- Full fine-tuning: Updates all base-model parameters. It can deliver strong quality but requires substantial compute/memory and is destructive because it overwrites pre-trained weights.
- LoRA: Fine-tunes efficiently by adding trainable low-rank adapters while keeping base weights frozen. This drastically reduces trainable parameters, improves memory/runtime efficiency, and often reaches quality close to (or occasionally better than) full fine-tuning. Adapter sets can also be swapped per task/domain without retraining the full model.
- QLoRA: Combines LoRA with quantized base weights (typically 4-bit NF4), enabling fine-tuning on smaller GPUs. It keeps LoRA’s adapter mechanism but trades extra compute time for lower memory usage; in many settings quality remains close to LoRA.
When fine-tuning LLMs, a small set of hyperparameters drives convergence, stability, and generalization:
- Learning rate and scheduler: Usually the highest-impact knob. Too low leads to slow/underpowered learning; too high causes instability or divergence. Schedulers typically warm up/decay LR to balance early progress and late-stage refinement.
- Batch size: Controls samples per optimizer step. Larger effective batches stabilize gradients and can speed training. When memory is limited, gradient accumulation approximates larger batches by accumulating gradients across multiple mini-batches before an update.
- Maximum length: Upper bound on sequence length, constrained by task needs and GPU memory. Longer inputs are truncated (left or right, depending on strategy).
- Packing: Improves token efficiency by concatenating short samples into longer packed sequences (for example, ~5x 200-token samples in a 1k-token slot). Correct attention masking is required to prevent cross-sample attention leakage.
- Number of epochs: Full passes over the dataset (often ~1-10 for LLM fine-tuning). Too few can underfit; too many can overfit. Validation monitoring plus early stopping helps pick the right stopping point.
- Optimizers: Update parameters to minimize loss. AdamW is a common and reliable default for SFT.
- Weight decay: Regularizes by penalizing large weights, often improving generalization. Values that are too high can suppress learning; too low may under-regularize. With AdamW, `0.01` is a common baseline.
- Gradient checkpointing: Saves memory by storing fewer forward activations and recomputing them during backpropagation. This trades extra compute time for lower memory usage.
Several specialized stacks can run SFT, including TRL, Axolotl, and Unsloth.
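For orientation, the knobs above map roughly onto a TRL `SFTConfig` as sketched below; the values are illustrative (not this repo's settings), and some argument names vary slightly across TRL versions.

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="outputs/sft",
    learning_rate=2e-4,                # usually the highest-impact knob
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch = 4 x 4 = 16
    max_seq_length=1024,               # longer inputs are truncated
    packing=True,                      # concatenate short samples into packed sequences
    num_train_epochs=3,
    optim="adamw_torch",
    weight_decay=0.01,
    gradient_checkpointing=True,       # trade extra compute for lower memory
)
```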
When selecting a base model for fine-tuning, key factors include:
- License: Some checkpoints are restricted to non-commercial use.
- Budget: Smaller models are generally cheaper to train and serve.
- Performance: Benchmark base models on relevant tasks, ideally domain/use-case-specific rather than generic only.
During training, three monitoring signals are especially important:
- Training loss: Tracks fit on the training objective. It should trend downward overall (typically sharp early drop, then slower plateau). Repeated spikes or upward drift can indicate instability from data quality, tokenizer issues, or hyperparameter mismatch (for example LR/batch size).
- Validation loss: Measures generalization on held-out data. Healthy runs usually show both training and validation loss decreasing and then stabilizing with a small train/val gap. Falling training loss with rising validation loss indicates overfitting; both staying high/flat suggests underfitting.
- Gradient norm: Measures update magnitude. Very large norms can signal instability, especially when paired with train/val divergence. Stable or decreasing norms generally indicate convergence; gradient clipping helps cap unstable updates.
SFT training dashboard example (loss and optimization signals over steps).
SFT qualitative predictions sample table.
- Pipeline definition (ZenML): `app/pipelines/sft.py`
- How it’s executed
  - `uv run invoke sft.train` (task: `app/tasks/sft.py`)
  - Uses ZenML YAML config: `app/configs/sft.yaml`
- What it implements (and where)
  - Load dataset from HF Hub: `app/steps/sft/load_dataset.py`
    - Loads `train`/`test` splits from `dataset_id`.
  - Initialize base model + PEFT/LoRA + chat template: `app/steps/sft/initialize_model.py`
    - Loads the model with optional 4-bit/8-bit quantization.
    - Applies LoRA (target modules, rank `lora_r`, etc.).
    - Crucial Step: Chat Template Application. Because base models (like Gemma 3) are not instruction-tuned, we must use `get_chat_template` to inject a formatting structure (e.g., `gemma3`) so the model learns how to recognize system, user, and assistant turns.
  - Prepare training text exactly as the model sees it: `app/steps/sft/prepare_dataset.py`
    - Builds structured `messages` (`system`, `user`, `assistant`).
    - Uses `tokenizer.apply_chat_template(..., add_generation_prompt=False)` and strips `<bos>` to avoid duplicated BOS issues.
  - Training: `app/steps/sft/train_model.py` (a sketch follows below)
    - Uses TRL `SFTTrainer` and then applies `train_on_responses_only(...)` so the loss is computed only on the assistant turn.
  - Model card generation: `app/steps/sft/generate_model_card.py`
    - Builds a training-aware model card used when publishing artifacts.
  - Model persistence and serving-friendly exports: `app/steps/sft/save_model.py`
    - Supports saving:
      - `lora_only` (small, requires the base model at inference time)
      - `merged_16bit` (LoRA merged into base weights; convenient for serving frameworks like vLLM)
      - `merged_4bit_forced` (merged + quantized)
- Special implementations
  - Custom ZenML materializer for PEFT models: `app/materializers/peft_model.py`
    - ZenML artifacts normally rely on pickling; PEFT/Accelerate mixed-precision objects can be problematic to pickle.
    - `PeftModelMaterializer` saves models via `save_pretrained()` and restores them via `FastLanguageModel.from_pretrained(...)`, plus writes a `metadata.json` for reconstruction (quantization + LoRA details).
    - This is wired into training via `@step(output_materializers={"trained_model": PeftModelMaterializer})` in `app/steps/sft/train_model.py`.
  - More realistic eval metrics by simulating inference (instead of relying only on teacher-forced loss):
    - Implemented in `app/utils/training.py` as `GenerateEvalCallback`.
    - During `on_evaluate`, it:
      - Crops the eval batch to the generation prompt boundary (`process_batch_for_generation(...)`),
      - Runs `model.generate(...)`,
      - Extracts only the assistant response (`app/utils/inference.py:extract_response`),
      - Computes emoji-only accuracy (via `app/utils/emojis.py:is_full_emoji_text`).
    - Metrics are injected into the trainer logs as `eval_emoji_only_accuracy`, `eval_num_emoji_only_preds`, `eval_total_preds`.
    - `app/steps/sft/train_model.py` additionally logs these metrics + a qualitative examples table to Weights & Biases when a run is active.
- Main config options (`app/configs/sft.yaml`)
  - `steps.load_sft_dataset.parameters.dataset_id`: HF dataset id to train on.
  - `steps.initialize_model.parameters.*`:
    - `model_name`, `max_seq_length`
    - `load_in_4bit` / `load_in_8bit`
    - LoRA: `lora_r`, `lora_alpha`, `lora_dropout`, `target_modules`
    - `chat_template`: tokenizer formatting (must match the base model’s expected template).
  - `steps.prepare_dataset.parameters.system_prompt_file`: system prompt used as the first message.
  - `steps.train_model.parameters.*`:
    - `instruction_part` / `response_part`: delimiters used by `train_on_responses_only` and by the generation-based evaluation callback.
    - Trainer knobs: batch size, grad accumulation, LR/scheduler, `eval_steps`, etc.
  - `steps.generate_model_card.parameters.*`: metadata used to build the model card (dataset/model/training fields, language/license, `hf_repo_id`).
  - `steps.save_model.parameters.*`:
    - `hf_repo_id` and `hf_token`: optional push to the HF model hub
    - `save_method`: `lora_only` vs merged formats (important for downstream serving).
    - `export_gguf`, `gguf_quantization_method`, `push_gguf_to_hub`: optional GGUF export + publishing settings.
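To make the responses-only loss concrete, here is a hedged sketch of how the trainer and `train_on_responses_only` fit together. The delimiters are Gemma-style examples; the repo's actual values come from `app/configs/sft.yaml`, `model`/`tokenizer`/datasets are assumed to exist from earlier steps, and argument names differ slightly across TRL versions.

```python
from trl import SFTConfig, SFTTrainer
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    args=SFTConfig(output_dir="outputs/sft", num_train_epochs=3),
)

# Mask everything before the assistant turn so only response tokens contribute to the loss.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
trainer.train()
```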
Supervised Fine-Tuning (SFT) is strong for imitation, but it can miss the subtle preference signals humans care about, especially across long-tail interactions. This is why modern post-SFT workflows add a preference-alignment stage.
Preference alignment augments training with direct human or model-based feedback. In this project, we focus on Direct Preference Optimization because it is practical and efficient.
Preference datasets are less standardized than instruction datasets because each alignment algorithm has different requirements. In general, preference data is a set of candidate responses for the same instruction, ranked by humans or judge models. For DPO, the format is simple: each prompt has one chosen response and one rejected response. The training objective pushes the model toward chosen behavior and away from rejected behavior. For example:
{
"instruction": "A beautiful starry night sky",
"chosen": "🌌✨🌠",
"rejected": "🦠🚀💤"
}

The rejected sample is as important as the chosen one. Rejected responses encode behaviors we explicitly want to suppress. Preference datasets are especially useful in settings like:
- Chatbots: Conversational quality depends on subjective factors like naturalness, engagement, and contextual fit. SFT alone may miss fine-grained response preferences.
- Content moderation: Policy decisions are nuanced; preference pairs help teach acceptable vs unacceptable responses more explicitly.
- Summarization: Multiple technically correct summaries can differ in conciseness, coherence, and usefulness.
- Code generation: There are often many valid implementations, but humans prefer solutions that are cleaner, safer, or more efficient.
- Creative writing: Style and emotional impact are highly subjective, making pairwise preference supervision particularly valuable.
- Translation: Multiple translations can be correct, but preference data helps optimize for fluency and native-speaker naturalness.
DPO datasets usually need fewer examples than instruction-tuning datasets. DPO is generally less destructive than SFT, so model behavior shifts are often more controlled. As with instruction data, required volume depends on model size and task complexity: larger models are usually more sample-efficient, while harder tasks need more preference pairs. Data quality remains critical, and larger high-quality preference sets are typically beneficial.
Task-specific alignment targets a narrow behavior (for example style transfer or refusal behavior), and can often be done with smaller datasets, roughly 100 to 10k preference pairs depending on difficulty. For very narrow constraints, even ~200-500 pairs can be enough (for example teaching a specific policy statement such as not claiming the model was trained by a specific provider).
The usual workflow is: generate multiple candidate answers, then rank them to form chosen/rejected pairs. There are several data-creation strategies, each with different quality/cost/scalability trade-offs:
- Human-generated, human-evaluated datasets: Highest control and often highest quality, especially for complex tasks, but expensive and hard to scale.
- Human-generated, LLM-evaluated datasets: Rare in practice because it still requires significant human generation effort while adding judge-model overhead.
- LLM-generated, human-evaluated datasets: A strong quality/efficiency balance. LLMs scale candidate generation, humans provide preference labels. This is common in production pipelines.
- LLM-generated, LLM-evaluated datasets: Fully synthetic and highly scalable, but requires careful prompt design and can propagate model bias/limitations.
Applications with active user traffic can collect preferences in-product (for example like/dislike or richer textual feedback). Another practical approach is to use a stronger model to generate likely chosen outputs and a weaker or intentionally constrained model to generate likely rejected outputs, which creates clearer supervision signals.
You can also compare model outputs against human-written references to identify style and quality gaps, then use those gaps to build preference pairs that steer tone and behavior.
When generating preference data, prompts should encourage diversity and complexity. Explicitly requesting different styles/approaches broadens the output distribution. Output variability can be increased through temperature and sampling strategy choices. Using multiple generator models can further improve diversity because different models have different strengths.
Preference evaluation can be done by human raters or automated with LLM judges. For LLM-based judging, you define explicit criteria, encode them in the judge prompt, and ask the judge to select preferred/rejected outputs. Evaluation quality depends strongly on both the judge model and rubric quality. As judge models improve, synthetic preference labeling quality can improve as well.
LLM-based evaluation can be implemented as absolute scoring or pairwise ranking. In absolute scoring, each response gets a score/label from a rubric; this is simple but can be inconsistent across prompts/sessions. In pairwise ranking, the judge compares two responses directly; this often matches human labeling patterns better and can be more stable.
LLM judges can still be biased:
- Position bias: In pairwise comparisons, the first answer may be favored disproportionately.
- Length bias: Longer responses may be overrated even when they are less useful.
- Family bias: Judges can prefer outputs from their own model family.
Mitigations include randomizing answer order (position bias), providing few-shot calibration examples with balanced scoring (length/family bias), and using multiple judge models as a jury instead of relying on a single judge.
We use LLM-generated and LLM-evaluated datasets for simplicity and efficiency, as this project is intended as a practical exercise rather than a production-ready system. For real-world applications, LLM-generated but human-evaluated datasets are generally recommended to reduce bias and evaluation leakage.
In our emojify setup, rejected responses are produced by a previous SFT version of our model, while preferred responses are generated by a stronger model (OpenAI’s suite). This introduces an implicit teacher–student distillation effect, encouraging the smaller Gemma model to learn preference patterns and stylistic behavior closer to a more capable model.
- Pipeline definitions (ZenML): `app/pipelines/dpo.py`
  - `dpo_generate_batch`: Create and submit an OpenAI batch job for preference data
  - `dpo_collect_batch`: Automatically find pending batch jobs and collect results
  - `dpo_generate_dataset`: Generate rejected responses (requires a running `serve.vllm` instance) and push the DPO dataset to HF
  - `dpo_train`: Train the model using DPO
- How it's executed
  - `uv run invoke dpo.generate-batch` → uses `app/configs/dpo_generate_batch.yaml`
  - `uv run invoke dpo.collect-batch` → uses `app/configs/dpo_collect_batch.yaml` (automatically finds pending jobs)
  - `uv run invoke dpo.generate-dataset` → uses `app/configs/dpo_generate_dataset.yaml`
  - `uv run invoke dpo.train` → uses `app/configs/dpo_train.yaml`
  - `uv run invoke dpo.list-batch-jobs` → list all DPO batch jobs from the database
- What it implements (and where)
  - Batch input generation: `app/steps/dpo/generate_batch/generate_batch_input.py`
    - Creates a JSONL file with topic-based prompts for the OpenAI batch API
    - Uses structured outputs for a consistent emoji generation format
  - Batch job submission: `app/steps/dpo/generate_batch/submit_batch_job.py`
    - Uploads the batch file and creates the OpenAI batch job
    - Tracks the job in MongoDB via the `DPOBatchJob` model
  - Batch results collection: `app/steps/dpo/collect_batch/collect_all_pending_batches.py`
    - Finds all pending batch jobs (non-terminal status) in the database
    - Polls batch status and downloads results when completed
    - Creates `DPODocument` records with chosen (preferred) responses
  - Rejected response generation: `app/steps/dpo/generate_dataset/generate_rejected_responses.py`
    - Calls a running vLLM server to generate rejected responses
    - Updates `DPODocument` records with the rejected field
  - DPO dataset creation: `app/steps/dpo/generate_dataset/create_dpo_dataset.py`
    - Creates train/test splits from complete DPO documents
    - Converts to the Hugging Face `DatasetDict` format (prompt, chosen, rejected)
  - Initialize model for DPO: `app/steps/dpo/train/initialize_model.py`
    - Loads the previously trained SFT model as the base.
    - Note on Chat Templates: Unlike SFT, we do not call `get_chat_template` here. The SFT model already has the template baked into its tokenizer. Re-applying it could lead to inconsistent tokenization or "broken" turn structures. We must maintain the exact structure the model learned during SFT to ensure the preference optimization remains aligned.
  - Dataset load + formatting for DPO loss: `app/steps/dpo/train/load_dataset.py`, `app/steps/dpo/train/prepare_dataset.py`
    - Loads `prompt`/`chosen`/`rejected` splits from HF and formats prompts with the system + chat template expected by the model.
  - DPO training: `app/steps/dpo/train/train_model.py`
    - Uses Unsloth's `PatchDPOTrainer` and TRL's `DPOTrainer` for efficient preference alignment.
  - Model card + persistence: `app/steps/dpo/train/generate_model_card.py`, `app/steps/dpo/train/save_model.py`
    - Generates training metadata and saves/pushes the aligned model.
  - Dataset publishing: `app/steps/dpo/generate_dataset/push_dpo_dataset.py`
DPO training dashboard example with preference-optimization metrics.
- Domain models: `app/domain/dpo.py`
  - `DPOBatchJob`: Tracks OpenAI batch job status and metadata
  - `DPODocument`: Stores preference pairs (text, chosen, rejected)
- Main config options
  - `dpo_generate_batch.yaml`:
    - `target_samples`: Number of preference samples to generate
    - `topics_path`: Path to topics.json for diverse prompts
  - `dpo_generate_dataset.yaml`:
    - `sft_model_path`: HF repo or path to the SFT model used for rejected responses
    - `vllm_url`: vLLM server URL (default: `http://localhost:8787`)
    - `dataset_id`: Target Hugging Face dataset ID for DPO data
  - `dpo_train.yaml`:
    - `steps.load_dpo_dataset.parameters.dataset_id`: HF DPO dataset id
    - `steps.train_model.parameters.*`: DPO knobs like `beta`, batch size, epochs, learning rate, and scheduler (see the sketch below)
    - `steps.save_model.parameters.*`: model hub target + save mode for deployment
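A hedged sketch of the DPO training step with Unsloth + TRL is shown below. The model and dataset ids, LoRA settings, and hyperparameters are illustrative (the real ones come from `app/configs/dpo_train.yaml`), and some argument names differ across TRL versions.

```python
from unsloth import FastLanguageModel, PatchDPOTrainer

PatchDPOTrainer()  # patch TRL's DPOTrainer with Unsloth's memory-efficient kernels

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "marioparreno/emojify-sft",    # start from the previously trained SFT model
    max_seq_length=1024,
    load_in_4bit=True,
)
# Add fresh LoRA adapters for DPO (skip if the checkpoint already ships adapters).
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("your-user/emojify-dpo-dataset", split="train")  # hypothetical id

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="outputs/dpo", beta=0.1, num_train_epochs=1),
    train_dataset=dataset,          # expects prompt / chosen / rejected columns
    processing_class=tokenizer,
)
trainer.train()
```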
The `generate_rejected_responses` step uses a vLLM server for inference. You must start the server first:
# Terminal 1: Start vLLM server with the SFT model
uv run invoke serve.vllm --model-name=marioparreno/emojify-sft
# Terminal 2: Run the pipeline (once server is ready)
uv run invoke dpo.generate-dataset

The pipeline checks that the vLLM server is healthy before proceeding. If it is not running, you'll see a clear error message with instructions.
Evaluation is critical for understanding how an LLM behaves in real usage, not just how it scores on isolated benchmarks. For domain- or task-specific systems, evaluation should be narrow enough to reflect production behavior, but broad enough to catch edge cases. In practice, this means evaluating the whole application path, including components around the model (for example retrievers, validators, and post-processing).
Compared with traditional ML, LLM evaluation is less purely numeric and more behavior-driven. We still track objective metrics, but we also need qualitative judgment for coherence, relevance, and contextual fit. Three practical differences are:
- Metrics are multi-objective: LLMs handle varied tasks, so no single scalar metric usually captures quality.
- Feature engineering shifts to prompt/system design: LLMs ingest raw text directly, so evaluation focuses more on prompt behavior and output quality than handcrafted features.
- Interpretability is indirect: We usually cannot inspect model internals in a straightforward way, so we rely on traces, explanations, and targeted test sets.
A robust evaluation strategy usually combines general benchmarks with domain- and task-specific suites. Good suites should be:
- Challenging: Include cases that separate strong outputs from weak ones.
- Diverse: Cover broad topics, tones, and difficulty levels.
- Operationally practical: Easy to run repeatedly in development and CI/CD.
When a task maps cleanly to classic supervised formats, custom benchmarks (including multiple-choice style tasks) are still useful. For open-ended generation, LLM-as-a-judge is often a better fit. If ground truth is available, passing it as context improves judge reliability; otherwise, scoring by explicit dimensions (for example relevance, toxicity, or style adherence) keeps evaluation interpretable.
Judge models have known weaknesses: verbosity bias, assertiveness bias, domain gaps, and scoring inconsistency. Mitigations include better rubric design, multiple judges, and hybrid evaluation (judge scores + deterministic checks + task-specific metrics).
To improve parsing and reproducibility, require structured outputs in evaluation prompts (or use structured generation).
We define two different evaluation pipelines in ZenML, allowing for both full systematic assessments and quick developer feedback.
- Prerequisite (required): start an OpenAI-compatible LLM endpoint first
  - The evaluation agent uses `EVAL_BASE_URL` and `EVAL_MODEL` from settings/env.
  - If no server is running at `EVAL_BASE_URL`, you'll see `Agent query failed: Connection error`.
  - Quick setup with llama.cpp (DPO model):

        docker run --rm -it \
          -p 8080:8080 \
          ghcr.io/ggml-org/llama.cpp:server \
          --hf-repo marioparreno/emojify-dpo \
          --hf-file emojify-dpo.Q4_K_M.gguf \
          --host 0.0.0.0 --port 8080 \
          -t 2

  - Then configure env (for local machine):

        EVAL_BASE_URL=http://127.0.0.1:8080/v1
        EVAL_MODEL=marioparreno_emojify-dpo_emojify-dpo.Q4_K_M.gguf

  - Optional quick health test:

        curl -s http://127.0.0.1:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
            "model": "marioparreno_emojify-dpo_emojify-dpo.Q4_K_M.gguf",
            "messages": [{"role": "user", "content": "A beautiful starry night sky"}],
            "temperature": 0.2,
            "max_tokens": 16
          }'

  - After that, run: `uv run invoke evaluation.run-agent-query --query="A beautiful starry night sky"`
  - To run the full LangSmith evaluation suite (dataset + evaluators), run: `uv run invoke evaluation.run`
- Pipeline definitions (ZenML): `app/pipelines/evaluation.py`
  - `evaluation_pipeline`: The full-scale evaluation process. It automates loading/creating the dataset in LangSmith, running the agent against every example, and triggering the multi-dimensional evaluator suite.
  - `agent_query_pipeline`: A lightweight pipeline to run a single query through the agent. This is ideal for testing the agent's reasoning, retry logic, and tool use in a traceable environment without running a full suite.
- How it's executed
  - `uv run invoke evaluation.run` - Executes the full `evaluation_pipeline`.
  - `uv run invoke evaluation.run-agent-query --query="Your text here"` - Executes the `agent_query_pipeline` for a single query.
- What it implements (and where)
  - Agent workflow (LangGraph):
    - `app/agent/state.py`: Defines `AgentState` and `ErrorType` for the workflow.
    - `nodes.py`: Generation and validation nodes using an OpenAI-compatible LLM endpoint.
    - `graph.py`: LangGraph workflow with retry logic that allows the agent to self-correct if it fails format validation.
  - Dataset management:
    - `app/steps/evaluation/dataset/manager.py`: LangSmith dataset creation/retrieval utilities.
    - `eval-dataset.json`: Default evaluation examples with expected context.
  - Evaluators: `app/steps/evaluation/evaluators/`
    - Format checkers: Validate the output is emoji-only, not empty, and has no excessive repetitions (see the sketch below).
    - LLM-as-judge: Uses the configured judge model (`settings.EVAL_JUDGE_MODEL`) to score relevance, appropriateness, and expressiveness.
- LangSmith Integration:
  - Full experiment tracking and trace analysis. Each run creates a new experiment in LangSmith where you can drill down into individual agent turns and evaluator decisions.
LangSmith dashboard visualization after an evaluation.run invocation, showing metrics across the dataset.
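As a reference for the format checkers, here is a hedged sketch of a rule-based LangSmith evaluator; the output key and the exact function signatures in `app/steps/evaluation/evaluators/` may differ.

```python
# Rule-based evaluator: score 1 when the agent output is non-empty and emoji-only.
from app.utils.emojis import is_full_emoji_text  # helper referenced elsewhere in this repo


def emoji_only_evaluator(run, example):
    """LangSmith-style evaluator taking a run/example pair and returning a score dict."""
    output = (run.outputs or {}).get("output", "")  # the "output" key is an assumption
    return {
        "key": "emoji_only",
        "score": int(bool(output) and is_full_emoji_text(output)),
    }
```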
The inference pipeline consumes features (and labels, when available for validation) from the feature store plus a trained artifact from the model registry. With those inputs, we can serve predictions in either batch mode or real-time mode.
Inference optimization is mainly about improving time-to-first-token (latency), tokens/sec (throughput), and memory efficiency. Naive serving setups usually cause poor hardware utilization, which translates into weak latency and throughput. Modern serving stacks apply a set of techniques that can substantially speed up generation.
Autoregressive decoding is inherently sequential: to produce token t+1, the model depends on tokens 1..t. This creates an iterative token-by-token loop that underuses parallel hardware if left unoptimized. Most inference optimizations are designed to reduce this bottleneck.
[Model optimization]
- KV Cache: To predict token 100, the model needs tokens 1-99. To predict token 101, it again needs tokens 1-99 plus token 100. Recomputing this context every step is expensive. The KV cache stores key/value tensors from self-attention, so each new step reuses prior work and only computes KV for the newest token. One caveat: a dynamically growing KV cache limits compatibility with `torch.compile`, which relies on static shapes for kernel fusion. A static KV cache pre-allocates the maximum cache size, enabling `torch.compile` and yielding up to ~4x forward-pass speedups.
- Continuous batching: Larger batches amortize model-weight memory costs and better saturate GPU parallel compute. Continuous batching (a.k.a. in-flight batching) keeps utilization high by immediately admitting a new request when another request in the batch finishes. You still start by filling a batch, but completed requests are continuously evicted and replaced. This keeps the accelerator processing a full batch as often as possible, improving end-to-end utilization.
- Speculative decoding: Generate multiple candidate tokens at once using a smaller draft model (for example, a distilled or pruned proxy), then verify those candidates with the full model. In practice, the draft model proposes ~5-10 tokens; the main model validates them, keeps the longest matching prefix, and discards mismatches. If the draft model tracks the target model well, several tokens can be accepted per step and generation speeds up. Speedup depends on draft quality and draft size. Both models must share the same tokenizer; otherwise token boundaries will not align and verification fails.
- Paged Attention: An autoregressive decoding optimization that stores KV cache in fixed-size memory blocks (pages), similar to virtual memory. Instead of requiring contiguous KV memory per sequence, pages can be allocated non-contiguously, which reduces fragmentation and improves memory efficiency. In serving systems, this typically enables longer context windows and higher request concurrency, which often improves throughput.
- Flash Attention: Flash Attention tiles attention computation into small blocks that fit GPU on-chip SRAM (much faster than HBM). This minimizes memory traffic between main memory and compute units. It also uses online softmax (running max + running exp sum per block), so attention probabilities can be computed without materializing large intermediate matrices.
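As a concrete example of the static KV cache point above, this is roughly how it is enabled together with `torch.compile` in `transformers` (the model id is just an example; exact API details can vary by version):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

model.generation_config.cache_implementation = "static"   # pre-allocate the KV cache
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("A beautiful starry night sky", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```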
[Model parallelism]
- Data parallelism: The simplest parallelism strategy. You replicate the full model across GPUs and shard the data/requests. In training, gradients are synchronized (typically averaged) to keep replicas aligned. In inference, this is useful for handling concurrent traffic by routing different requests to different GPUs. It only works when the full model fits on a single GPU.
- Pipeline parallelism: Split the model by layers across GPUs so each device hosts a stage. During forward pass, activations flow stage-by-stage; backward pass mirrors this in reverse during training. The number of stages/GPU partitions defines pipeline degree. The main drawback is "pipeline bubbles" (stages waiting idle). Micro-batching helps reduce bubble time and improve utilization.
- Tensor parallelism: Split large tensors (for example, MLP weight matrices or attention projections/heads) across GPUs. Each device computes on its shard, and partial outputs are merged (typically with all-reduce/all-gather collectives). In self-attention, this works especially well because attention heads are naturally parallelizable, so each GPU can process a subset of heads independently. Limitations: not every layer parallelizes cleanly (e.g., ops with full-input dependencies such as LayerNorm or Dropout), and good performance depends on high-bandwidth, low-latency interconnects.
[Model quantization]
Quantization means representing model weights/activations with lower-precision datatypes instead of standard FP16/FP32. For LLM inference, this is a primary lever to reduce memory footprint and often improve speed. The challenge is that naive quantization struggles with outlier features (extreme values). Simply discarding outliers hurts quality, so practical methods often mix quantization strategies across layers or use outlier-aware schemes. The two main weight-quantization approaches are:
- Post-training quantization (PTQ): Convert a pretrained model to lower precision without retraining. It is simple and fast to apply, but can reduce output quality.
- Quantization-aware training (QAT): Inject quantization effects during training/fine-tuning so the model learns to be robust to lower precision. Usually higher quality than PTQ, but it needs more compute and representative training data.

Additionally, there are multiple libraries and formats that can help with quantization, such as:
- llama.cpp and GGUF: `llama.cpp` is an open-source C++ inference stack for many LLMs. Compared with CUDA-centric stacks, it runs on a broader hardware surface. Its native GGUF format stores tensors plus metadata and supports multiple bit-widths (roughly 1-8 bit quantization options).
- GPTQ and EXL2: While GGUF/`llama.cpp` is strong for CPU inference (with optional GPU offload), GPTQ and EXL2 are quantization formats focused on GPU serving. They are typically faster than `llama.cpp` in pure GPU inference. EXL2, especially with ExLlamaV2, often delivers very high throughput and supports mixed/custom precision. GPTQ is usually fixed at 4-bit. Trade-off: GPTQ/EXL2 are generally less universally supported than GGUF.
- AWQ: A quantization method that protects important weights selected by activation magnitude (rather than raw weight magnitude). It is well supported by modern inference engines.
- QuIP# and HQQ: A growing trend is ultra-low-bit quantization (1-2 bit). QuIP# and HQQ target this regime and aim to preserve model quality better than naive ultra-low-bit methods. This can be especially compelling for very large models (>30B), which may end up smaller than 7B/13B models while still producing stronger outputs.
1. What are the different ways to combine training and quantization? There are three common strategies. In practice, there is no universally "perfect" alignment: the right choice depends on your hardware budget and required quality.
| Method | What happens? | When is it quantized? | Accuracy |
|---|---|---|---|
| PTQ (Post-Training) | Train in full 16-bit, then squash it to 4-bit at the very end. | After training | Good |
| QLoRA (The Standard) | Load a 4-bit "base," train 16-bit "adapters" on top, then merge. | During & After | Very Good |
| QAT (Awareness) | Simulates 4-bit errors during training so the model adapts. | During | Best |
2. Why not just always use QAT if it can be better? Because "best quality" is not always "best overall trade-off."
- More training time: QAT usually adds overhead to the forward/backward path due to fake-quant simulation and often needs extra epochs to converge well.
- More tuning: You may need to retune learning rate, warmup, quantization granularity (per-channel/group-wise), and clipping/calibration behavior for stable results.
- More failure modes: Training can become unstable (loss spikes, gradient noise, or underfitting) when quantization noise is introduced too early or too aggressively.
- Less portability: QAT can overfit to one quantization setup/backend; if you change runtime kernels or bit-width strategy, gains may shrink.
In practice, teams usually start with PTQ/QLoRA, benchmark quality-latency-memory, and only take on QAT complexity when needed.
3. Why use load_in_4bit = True if the final model is stored in 16-bit?
We use 4-bit loading (QLoRA) during training mainly to reduce VRAM usage. Even for smaller models like Gemma-3 270M, 4-bit loading makes training feasible on consumer GPUs while leaving headroom for longer context or larger batches. The "16-bit" artifact later pushed to Hugging Face is the merged output, which preserves high-precision adapter updates instead of immediately re-quantizing them.
4. If I'm training a "4-bit model," why are the updates in 16-bit? Neural networks learn via very small gradient updates. A 4-bit grid is too coarse to represent these updates reliably. Keeping LoRA adapters in 16-bit on top of a 4-bit base preserves learning precision while still keeping most model memory compressed.
5. Why does my Hugging Face config say bfloat16 after training?
When saving with save_method: "merged_16bit", Unsloth merges the 4-bit base and 16-bit trained adapters into one high-precision master checkpoint. Hugging Face then reports bfloat16 because the merged weights are high precision. This is typically the best artifact to share, since it serves as a near-lossless source for future conversions.
6. Why use merged_16bit instead of merged_4bit_forced?
In short: To protect your model's "IQ."
- `merged_16bit` (Best Quality): Restores a high-resolution checkpoint so your training updates are merged cleanly. Think of it as a high-fidelity master copy. It is also required for reliable GGUF export for CPU deployment, since the quantizer expects a clean high-precision source.
- `merged_4bit_forced` (Worst Quality): Forces those 16-bit training gains back into a coarse 4-bit grid, which can erase nuance and introduce "double quantization" artifacts.
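For reference, this is roughly how the export step looks with Unsloth's save helpers (method names follow Unsloth's documented API; the repo's actual logic lives in `app/steps/sft/save_model.py` and `app/utils/gguf.py`, and `model`/`tokenizer` are assumed to come from a finished training run):

```python
# 1) Merge LoRA adapters into a high-precision checkpoint (clean source for later conversions).
model.save_pretrained_merged("models/emojify-sft", tokenizer, save_method="merged_16bit")

# 2) Export a llama.cpp-ready GGUF, quantized to Q4_K_M for CPU serving.
model.save_pretrained_gguf("models/emojify-sft", tokenizer, quantization_method="q4_k_m")
```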
- Serving default in this repo: Use
llama.cppvia Docker for inference serving (simpler and reproducible). - Local llama.cpp build: Only needed if you want local GGUF tooling (for example, quantization/export workflows). See llama.cpp Local Build (optional, for GGUF tools).
- Serving entrypoint (vLLM):
app/tasks/serve.py- Run:
uv run invoke serve.vllm --model-name=marioparreno/emojify-sft- or serve a local path:
uv run invoke serve.vllm --model-name=./models/emojify-sft-YYYYMMDD_HHMMSS
- This wraps
vllm serve ...and exposes an OpenAI-compatible HTTP API.
- Run:
- Why SFT export format matters for inference
- If you train with LoRA, many serving stacks expect a single set of base weights.
- Use
save_model_step.parameters.save_method: merged_16bitinapp/configs/sft.yamlto produce a merged model artifact that is straightforward to serve with vLLM.
- GGUF artifacts for CPU-friendly inference
  - SFT and DPO save steps can now export GGUF artifacts using Unsloth.
  - In this repo, both `app/configs/sft.yaml` and `app/configs/dpo_train.yaml` are configured with `export_gguf: true` and `gguf_quantization_method: "q4_k_m"`.
  - This keeps a llama.cpp-ready artifact aligned with our intended low-CPU deployment path.
- Chat Templates and vLLM
  - vLLM automatically loads the chat template (Jinja2 format) defined in a model's `tokenizer_config.json` on Hugging Face, so we don't need to handle it manually.
  - We can use the OpenAI-compatible API with standard chat completions (see the client sketch after this list):

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
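For completeness, here is a client-side sketch (not part of the repo) that sends that message structure to the vLLM server started with `serve.vllm`. The base URL assumes vLLM's default port 8000, and the served model name is assumed to match the Hugging Face repo id; adjust both to your setup.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; port 8000 is vLLM's default (adjust if needed).
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dummy")

system_prompt = "Convert user text to emojis only."
user_prompt = "A beautiful starry night sky"

response = client.chat.completions.create(
    model="marioparreno/emojify-sft",  # assumed served model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    temperature=0.2,
    max_tokens=32,
)
print(response.choices[0].message.content)
```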
For low-CPU servers, llama.cpp + GGUF is often more stable than vLLM CPU.
- Run OpenAI-compatible server with your Q4_K_M GGUF
Note: you need to have the GGUF file in the `models` directory.
For the initial supervised fine-tuning, we can use:
docker run --rm -it \
-p 8080:8080 \
ghcr.io/ggml-org/llama.cpp:server \
--hf-repo marioparreno/emojify-sft \
--hf-file emojify-sft.Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
  -t 2

For DPO, we can use:
docker run --rm -it \
-p 8080:8080 \
ghcr.io/ggml-org/llama.cpp:server \
--hf-repo marioparreno/emojify-dpo \
--hf-file emojify-dpo.Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
  -t 2

- Test
For the initial supervised fine-tuning, we can use:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "marioparreno_emojify-sft_emojify-sft.Q4_K_M.gguf",
"messages": [
{"role": "system", "content": "Convert user text to emojis only."},
{"role": "user", "content": "A beautiful starry night sky"}
],
"temperature": 0.2,
"max_tokens": 32
}'

For DPO, we can use:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "marioparreno_emojify-dpo_emojify-dpo.Q4_K_M.gguf",
"messages": [
{"role": "system", "content": "Convert user text to emojis only."},
{"role": "user", "content": "A beautiful starry night sky"}
],
"temperature": 0.2,
"max_tokens": 32
}'

Note: if outputs degrade after export, verify chat template/EOS alignment between training and inference.
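As a quick sanity check (a sketch, not a repo utility), you can render the chat template from the published SFT tokenizer and inspect its EOS token, then compare against what your GGUF server produces for the same messages:

```python
from transformers import AutoTokenizer

# Assumes this project's public SFT repo; swap in your own repo id if it differs.
tok = AutoTokenizer.from_pretrained("marioparreno/emojify-sft")

messages = [
    {"role": "system", "content": "Convert user text to emojis only."},
    {"role": "user", "content": "A beautiful starry night sky"},
]

# The rendered prompt should match what the serving stack feeds the model,
# and the EOS token should be the one the model was trained to emit.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("eos:", tok.eos_token, tok.eos_token_id)
```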
We now use one shared LangGraph agent implementation for both evaluation and production API:
- Shared implementation: `app/agent/`
- FastAPI app: `app/api/`
Environment variables used by deployment:
- `API_CORS_ORIGINS`: comma-separated list or JSON array of allowed origins
- `API_BASE_URL`: OpenAI-compatible base URL (`http://<llama-host>:<port>/v1`)
- `API_MODEL`: model identifier to send in ChatOpenAI requests
- `API_TEMPERATURE`, `API_MAX_COMPLETION_TOKENS`, `API_MAX_RETRIES`
- `LLAMA_CPP_HF_REPO`, `LLAMA_CPP_HF_FILE`, `LLAMA_CPP_THREADS` (for the compose llama service)
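For illustration only (the real configuration logic lives in `app/settings.py` and may differ), these variables could be read roughly like this, including the two accepted shapes of `API_CORS_ORIGINS`:

```python
import json
import os

# Defaults here are illustrative, not the repo's actual defaults.
API_BASE_URL = os.getenv("API_BASE_URL", "http://127.0.0.1:8080/v1")
API_MODEL = os.getenv("API_MODEL", "marioparreno/emojify-dpo")
API_TEMPERATURE = float(os.getenv("API_TEMPERATURE", "0.2"))
API_MAX_COMPLETION_TOKENS = int(os.getenv("API_MAX_COMPLETION_TOKENS", "32"))
API_MAX_RETRIES = int(os.getenv("API_MAX_RETRIES", "2"))

# API_CORS_ORIGINS may be a JSON array ('["https://emojify.maparla.es"]')
# or a comma-separated list ("https://a.com,https://b.com").
raw_origins = os.getenv("API_CORS_ORIGINS", "")
try:
    cors_origins = json.loads(raw_origins)
except json.JSONDecodeError:
    cors_origins = [origin.strip() for origin in raw_origins.split(",") if origin.strip()]
```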
Choose one deployment path:
- FastAPI on host + llama.cpp in Docker
Use this when you want to run the API process locally, but keep model serving in a container.
Start an OpenAI-compatible llama.cpp server:
docker run --rm -it \
-p 8080:8080 \
ghcr.io/ggml-org/llama.cpp:server \
--hf-repo marioparreno/emojify-dpo \
--hf-file emojify-dpo.Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
  -t 2

In another terminal, run FastAPI locally and point it to llama.cpp:
uv sync
export API_BASE_URL=http://127.0.0.1:8080/v1
uv run uvicorn app.api.main:app --host 0.0.0.0 --port 8424

- Full stack with Docker Compose (recommended for end-to-end setup)
Use this when you want api, llama-server, and observability services together.
# optional: initialize and edit .env first
# cp .env.example .env
docker compose up --build -d

Stop services:

docker compose down

- API usage
curl -s http://127.0.0.1:8424/api/v1/emoji \
-H "Content-Type: application/json" \
-d '{"query":"A beautiful starry night sky"}'Response shape:
{
"output": "🌌✨🌙",
"success": true,
"error": "none",
"request_id": "7ec5b669-9bb5-458f-b721-5b22bf4ea1f9"
}

Healthcheck endpoint:

curl -s http://127.0.0.1:8424/health

- Monitoring stack (Prometheus + Grafana + business events)
We've set up a complete monitoring stack to track both technical performance and user behavior. If you are new to monitoring, here is a breakdown of how it works:
- The Application (`api`): Our FastAPI application (`app/api/main.py`) does the actual work. It uses two mechanisms to track data:
  - Prometheus Fastapi Instrumentator: Automatically measures basic HTTP metrics like request counts, latency, and status codes. By calling `.expose()`, it dynamically creates a new GET route at `/metrics`.
  - Custom Metrics (`app/observability/metrics.py`): We've added custom counters and histograms to track specific details like token usage, inference duration, and business events (e.g., when a user clicks "copy" or "regenerate"). These are updated in our endpoints (`app/api/routes/emoji.py`).
  - How they connect: Both mechanisms use the same `prometheus_client` library, which maintains a hidden, global "default registry" in memory. When you define a custom metric, it automatically registers itself there. When Prometheus scrapes the `/metrics` endpoint, the Instrumentator reads this global registry, formatting both its automatic HTTP metrics and your custom metrics into plain text. You do not need to manually pass your metrics to the Instrumentator (see the sketch after this list).
- The Database (`postgres`): While Prometheus is great for numbers over time, we also want to keep the exact text of what users liked or disliked. When a user provides feedback, `app/observability/feedback_store.py` saves the raw query and response directly into a PostgreSQL database.
- The Scraper (`prometheus`): Prometheus is a time-series database. It is configured via `prometheus/prometheus.yml` to automatically "scrape" (download) the `/metrics` endpoints from our API and the `llama-server` every 15 seconds. It stores this data so we can query it over time using PromQL. (Note: `llama.cpp` exposes its own `/metrics` endpoint automatically when run with the `--metrics` flag, requiring no custom code.)
- The Dashboard (`grafana`): Grafana is our visualization UI. We use "Provisioning" to automatically configure it on startup:
  - Datasources (`grafana/provisioning/datasources/prometheus.yml`): Tells Grafana how to connect to Prometheus and PostgreSQL.
  - Dashboards (`grafana/provisioning/dashboards/dashboards.yml`): Tells Grafana to load the JSON dashboard files.
  - The UI (`grafana/dashboards/emojify-monitoring.json`): This JSON file defines all our graphs, charts, and the data table that queries PostgreSQL for recent feedback events.
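If you are new to `prometheus_client`, here is a minimal illustrative sketch of the pattern described above. It is not the exact contents of `app/observability/metrics.py`; the metric names mirror those used in the PromQL examples further down, and the endpoint bodies are placeholders.

```python
from fastapi import FastAPI
from prometheus_client import Counter, Histogram
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# Custom metrics register themselves in prometheus_client's global default registry.
# Counters are exported with a "_total" suffix (emoji_feedback_events_total).
FEEDBACK_EVENTS = Counter(
    "emoji_feedback_events", "User feedback events", ["event_type"]
)
INFERENCE_SECONDS = Histogram(
    "emoji_agent_inference_duration_seconds", "Agent inference duration in seconds"
)

# The instrumentator adds automatic HTTP metrics and exposes everything
# (including the custom metrics above) at GET /metrics.
Instrumentator().instrument(app).expose(app)

@app.post("/api/v1/emoji")
def emoji(query: str):
    with INFERENCE_SECONDS.time():  # records one histogram observation
        output = "🌌✨🌙"  # placeholder for the real agent call
    return {"output": output, "success": True}

@app.post("/api/v1/emoji/feedback")
def feedback(event_type: str):  # the real endpoint accepts a JSON payload
    FEEDBACK_EVENTS.labels(event_type=event_type).inc()
    return {"ok": True}
```

Because both mechanisms share the default registry, a single scrape of `/metrics` returns the automatic HTTP series and the custom ones together.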
The docker compose stack spins up the following services:
- FastAPI metrics endpoint: `http://127.0.0.1:8424/metrics`
- Prometheus UI: `http://127.0.0.1:9090`
- Grafana UI: `http://127.0.0.1:3001` (credentials: admin/admin unless changed in `.env`)
- PostgreSQL service: `postgres:5432` (internal compose network)
- llama.cpp metrics endpoint: `http://llama-server:8080/metrics` (scraped internally)
Environment variables (from .env):
- `FEEDBACK_DATABASE_URL`: PostgreSQL DSN used by the API to persist raw feedback events.
- `PROMETHEUS_HOST_PORT`, `PROMETHEUS_RETENTION`
- `GRAFANA_HOST_PORT`, `GRAFANA_ADMIN_USER`, `GRAFANA_ADMIN_PASSWORD`
- `FEEDBACK_DB_NAME`, `FEEDBACK_DB_USER`, `FEEDBACK_DB_PASSWORD`, `FEEDBACK_DB_HOST`, `FEEDBACK_DB_PORT`
For a public deployment behind Cloudflare Tunnel, this stack is typically exposed through 4 public hostnames:
| Public hostname | Tunnel target (origin) |
|---|---|
| `emojify.maparla.es` | `http://YOUR_MACHINE_IP:4785` |
| `emojify-api.maparla.es` | `http://YOUR_MACHINE_IP:8424` |
| `emojify-grafana.maparla.es` | `http://YOUR_MACHINE_IP:3001` |
| `emojify-prometheus.maparla.es` | `http://YOUR_MACHINE_IP:9090` |
Notes:
- The host ports above must match your server runtime (`API_HOST_PORT`, `GRAFANA_HOST_PORT`, `PROMETHEUS_HOST_PORT`).
- Keep `llama-server` (8080) and `postgres` (5432) private (no public tunnel).
- For security, protect `grafana` and especially `prometheus` with Cloudflare Access policies.
Example cloudflared ingress config:
tunnel: emojify
credentials-file: /etc/cloudflared/<tunnel-id>.json
ingress:
- hostname: emojify.maparla.es
service: http://YOUR_MACHINE_IP:4785
- hostname: emojify-api.maparla.es
service: http://YOUR_MACHINE_IP:8424
- hostname: emojify-grafana.maparla.es
service: http://YOUR_MACHINE_IP:3001
- hostname: emojify-prometheus.maparla.es
service: http://YOUR_MACHINE_IP:9090
  - service: http_status:404

Quick checks after creating/updating DNS + tunnel rules:
curl -I https://emojify.maparla.es
curl -I https://emojify-api.maparla.es/health
curl -I https://emojify-grafana.maparla.es/login
curl -I https://emojify-prometheus.maparla.es/-/ready

Tracing is complementary to Prometheus/Grafana. While Prometheus gives you aggregate trends (latency, throughput, error rates), LangSmith allows per-run/per-node drill-down (inputs, retries, checker outcomes, model calls).
Required env vars in .env:
LANGCHAIN_TRACING_V2=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGSMITH_API_KEY=<your_key>
LANGSMITH_PROJECT=<project_name>
Quick verification:
- Trigger a generation:

  curl -s http://127.0.0.1:8424/api/v1/emoji \
    -H "Content-Type: application/json" \
    -d '{"query":"a calm beach at sunset"}'

- Open LangSmith and select the configured `LANGSMITH_PROJECT`.
- Confirm new runs appear.
Notes on troubleshooting missing traces:
- Verify the running `api` container has tracing vars:

  docker compose exec api /bin/sh -lc "env | grep -E 'LANGSMITH|LANGCHAIN_TRACING_V2'"

- Rebuild/restart the API after env updates:

  docker compose up -d --build api
Raw feedback events are persisted in PostgreSQL with data like event_type (copy or regenerate), query, response, and request_id.
To submit a test event:
curl -s http://127.0.0.1:8424/api/v1/emoji/feedback \
-H "Content-Type: application/json" \
-d '{
"event_type": "copy",
"query": "a calm beach at sunset",
"response": "🏖️🌅😌",
"request_id": "7ec5b669-9bb5-458f-b721-5b22bf4ea1f9",
"session_id": "browser-session-id",
"source": "ui",
"created_at": "2026-02-18T16:18:00Z"
}'

Signal interpretation:

- `copy` is a positive intent proxy.
- `regenerate` is a negative intent proxy.
- No feedback event for a generation is neutral/unknown.
1) In Grafana:
- Open the `Emojify Monitoring` dashboard at `http://127.0.0.1:3001`.
- View high-level metrics via Prometheus queries.
- View the raw feedback in the `Feedback Events (Latest 100 Rows)` table panel (powered by the PostgreSQL datasource).
2) Useful PromQL examples:
- API throughput: `sum(rate(http_requests_total{handler!="/metrics"}[5m]))`
- Emoji generation p95 latency: `histogram_quantile(0.95, sum(rate(emoji_agent_inference_duration_seconds_bucket[5m])) by (le))`
- Token throughput: `sum(rate(emoji_agent_total_tokens_total[5m]))`
- Copy rate proxy: `sum(rate(emoji_feedback_events_total{event_type="copy"}[5m])) / clamp_min(sum(rate(emoji_agent_requests_total{result="success"}[5m])), 1e-9)`
3) Accessing PostgreSQL records directly: API endpoint for recent records:
curl -s "http://127.0.0.1:8424/api/v1/emoji/feedback/events?limit=50&offset=0"Terminal table view from PostgreSQL container:
docker compose exec postgres psql -U "${FEEDBACK_DB_USER:-text_to_emoji}" -d "${FEEDBACK_DB_NAME:-text_to_emoji}" \
-c "SELECT id, created_at, event_type, source, request_id, session_id, \"query\", response FROM feedback_events ORDER BY created_at DESC LIMIT 50;"- GitHub Actions deployment
To enable deployment, configure these repository secrets in: Settings > Secrets and variables > Actions
Strictly required (without these, deploy fails):
- `SERVER_HOST`: The IP address or domain of your server.
- `SERVER_USER`: The SSH username (for example, `root` or `ubuntu`).
- `SERVER_SSH_KEY`: The private SSH key used for server authentication.
Optional runtime secrets (the workflow has defaults, but set them for production control):
- `API_CORS_ORIGINS`: Allowed browser origins for CORS in FastAPI (for example, frontend domains).
- `API_HOST_PORT`: Host port mapped to the FastAPI container (`8424` by default).
- `OPENAI_API_KEY`: API key used by the LangChain OpenAI-compatible client; for local `llama.cpp` endpoints a dummy value is acceptable.
- `LANGCHAIN_TRACING_V2`: Set to `true` to enable LangSmith tracing (default: `true`). Set to `false` to disable.
- `LANGSMITH_ENDPOINT`: LangSmith API base URL (default: `https://api.smith.langchain.com`).
- `LANGSMITH_API_KEY`: LangSmith key used to upload traces.
- `LANGSMITH_PROJECT`: LangSmith project name where API traces are stored.
- `LLAMA_CPP_HOST_PORT`: Host port mapped to the `llama-server` container (`8080` by default).
- `LLAMA_CPP_HF_REPO`: Hugging Face repo used by `llama.cpp` to fetch the GGUF model.
- `LLAMA_CPP_HF_FILE`: GGUF filename pulled from the repo above.
- `LLAMA_CPP_THREADS`: CPU threads used by `llama.cpp` during inference.
- `FEEDBACK_DB_NAME`: PostgreSQL database name for feedback events.
- `FEEDBACK_DB_USER`: PostgreSQL username for feedback events.
- `FEEDBACK_DB_PASSWORD`: PostgreSQL password for feedback events.
- `FEEDBACK_DB_PORT`: PostgreSQL service port inside the compose network (`5432` by default).
- `PROMETHEUS_HOST_PORT`: Host port mapped to Prometheus (`9090` by default).
- `PROMETHEUS_RETENTION`: Prometheus retention duration (for example, `15d`).
- `GRAFANA_HOST_PORT`: Host port mapped to Grafana (`3001` by default).
- `GRAFANA_ADMIN_USER`: Grafana admin username.
- `GRAFANA_ADMIN_PASSWORD`: Grafana admin password.
Notes:
- `GITHUB_TOKEN` is provided automatically by GitHub Actions; you do not create it manually.
- If you do not set an optional secret, the deploy workflow writes a default value into the generated server `.env`.
Deployment is automatically triggered when pushing a tag that starts with v (for example, v1.0.0).
- Create a new tag:
git tag v1.0.0

- Push the tag:

git push origin v1.0.0

This triggers the backend GitHub Actions workflow, which connects to your server and deploys with docker compose.
To list tags:
git tag --sort=version:refname

Benchmark results for gemma-3-270m-it.Q4_K_M.gguf using ghcr.io/ggml-org/llama.cpp:full (build ff4affb4c, CPU backend libggml-cpu-haswell.so).
System info (from lscpu)
| Field | Value |
|---|---|
| Architecture | x86_64 |
| CPU model | AMD EPYC-Milan Processor |
| vCPU / topology | 2 vCPU (1 socket x 1 core x 2 threads) |
| Hypervisor | KVM (full virtualization, QEMU BIOS) |
| NUMA | 1 node |
| Cache | L1d 32 KiB, L1i 32 KiB, L2 512 KiB, L3 32 MiB |
Commands used
docker run --rm -it \
-v /models:/models:ro \
--entrypoint /app/llama-bench \
ghcr.io/ggml-org/llama.cpp:full \
  -m /models/gemma-3-270m-it.Q4_K_M.gguf -t 1

Use -t to set CPU threads (for example, -t 1 for single-thread and -t 2 for two threads).
Results
- `pp512`: prompt processing throughput (tokens/sec while ingesting a 512-token prompt). Higher is better for long-input latency.
- `tg128`: token generation throughput (tokens/sec while generating 128 output tokens). Higher is better for response speed.
| Threads | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| 1 | 171.29 ± 5.86 | 45.89 ± 0.30 |
| 2 | 292.28 ± 6.28 | 66.29 ± 1.80 |
From -t 1 to -t 2, throughput improves by about 1.71x for pp512 and 1.44x for tg128.
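A quick arithmetic check of those scaling factors against the table values:

```python
# Benchmark numbers from the table above.
pp512 = {1: 171.29, 2: 292.28}  # prompt processing, tokens/sec
tg128 = {1: 45.89, 2: 66.29}    # token generation, tokens/sec

print(f"pp512 speedup 1 -> 2 threads: {pp512[2] / pp512[1]:.2f}x")  # ~1.71x
print(f"tg128 speedup 1 -> 2 threads: {tg128[2] / tg128[1]:.2f}x")  # ~1.44x
```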
Create a .env file in the project root with the following configuration:
OPENAI_API_KEY=your_api_key_here
OPENAI_MODEL=gpt-5.2-2025-12-11

We specifically use gpt-5.2-2025-12-11 for synthetic data generation. You can find pricing information and a complete list of available models at https://openai.com/api/pricing/.
This project uses uv for dependency management. Dependencies are organized into groups so API deployment and ML workflows can be installed independently:
| Group | Description | Platforms |
|---|---|---|
| Default | Lean API/runtime dependencies (FastAPI, LangGraph agent runtime, etc.) | macOS, Linux |
| `pipelines` | Data/ETL/evaluation stack (ZenML, MongoDB client, Redis, OpenAI, LangSmith, datasets) | macOS, Linux |
| `training` | Model training stack (Torch, Transformers, TRL, W&B) | macOS, Linux |
| `cuda` | GPU-accelerated packages (vLLM, FlashInfer, Unsloth, etc.) | Linux only |
| `dev` | Development tools (pytest, ruff, jupyter, etc.) | All |
Separating API, pipelines, training, and CUDA dependencies keeps Docker/API images lightweight and avoids pulling large ML/GPU wheels unless they are explicitly needed.
- Run only the API with a small dependency footprint
- Install data/training dependencies only in environments that need them
- Add CUDA-specific packages only on Linux GPU machines
# API/runtime only (default)
uv sync
# API + data pipelines
uv sync --group pipelines
# API + pipelines + training
uv sync --group pipelines --group training
# Linux GPU machine (add CUDA packages)
uv sync --group pipelines --group training --group cuda
# Install everything (default + all dependency groups)
uv sync --all-groups
# API/runtime + development tools (any platform)
uv sync --group dev

Note: The FlashInfer packages require a custom index URL (https://flashinfer.ai/whl/cu130), which is configured in `pyproject.toml` under `[tool.uv]`.
Use this only when you need local llama.cpp binaries (for example, llama-quantize) during GGUF export/packaging workflows. For serving, prefer the Docker path documented in Serving with llama.cpp (GGUF).
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j 12 --clean-first --target llama-quantize llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp build/bin/llama-* .

We need to install the hooks for pre-commit. You can do this by running the following command:

uv run pre-commit install

Installing ZenML and getting started:
https://docs.zenml.io/getting-started/installation
uv run zenml login --local

https://docs.wandb.ai/guides/registry/
Published artifacts from this repository:
- SFT dataset: https://huggingface.co/datasets/marioparreno/emojify-sft
- DPO dataset: https://huggingface.co/datasets/marioparreno/emojify-dpo
- SFT model: https://huggingface.co/marioparreno/emojify-sft
- DPO model: https://huggingface.co/marioparreno/emojify-dpo
We will be using invoke for calling tasks...
uv run invoke --list

Table with available tasks and docs....
ZenML YAML configuration for pipeline and steps: https://docs.zenml.io/concepts/steps_and_pipelines/yaml_configuration
To suppress verbosity from debug logs:
LOGURU_LEVEL=INFO uv run invoke data-collection.run-hf

Minimal command flow (assuming uv is already installed and MongoDB is running):
# 1) Install all dependency groups
uv sync --all-groups
# 2) Login to local ZenML
uv run zenml login --local
# 3) Collect raw data into MongoDB (required before dataset generation)
uv run invoke data-collection.run-hf
# 4) Create instruction dataset (will create a HuggingFace dataset)
uv run invoke generate-dataset.run
# 5) Train SFT model (will create a HuggingFace model)
uv run invoke sft.train
# 6) Create DPO preference data (chosen responses) (will create a HuggingFace dataset)
uv run invoke dpo.generate-batch
uv run invoke dpo.collect-batch # This will take a while...
# 7) Create full DPO dataset (generates rejected responses; requires running vLLM)
uv run invoke serve.vllm --model-name=marioparreno/emojify-sft
uv run invoke dpo.generate-dataset
# 8) Train DPO model
uv run invoke dpo.train
# 9) Serve final model
uv run invoke serve.vllm --model-name=marioparreno/emojify-dpo
# 10) Quick test
uv run invoke evaluation.run-agent-query --query="A beautiful starry night sky"

Notice how the `<bos>` token is removed right after using `apply_chat_template`. If we take an example:
<pad><bos><start_of_turn>user
Translate this text to emoji:
A beautiful starry night sky<end_of_turn>
<start_of_turn>model
🌌✨🌠<end_of_turn>
This is a full conversation, including both user and model turns — basically what appears in a training sample. During supervised fine-tuning (SFT), the model learns from this full context:
- `<bos>`: beginning of sequence (special token).
- `<start_of_turn>user ... <end_of_turn>`: marks the user's message.
- `<start_of_turn>model ... <end_of_turn>`: marks the model's response.
- `<pad>`: only used for padding to align sequences in a batch.
Why strip the leading `<bos>`? Because for many chat-based LLMs (e.g., Llama 3, Mistral, Gemma):

- `<bos>` is automatically added internally when calling `generate()` if missing.
- Keeping it twice can lead to token duplication or weird generation starts.
Some Unsloth models use a custom tokenizer wrapper that adds <bos> automatically whenever you tokenize an input for training or inference.
You can verify this by checking:
tokenizer.bos_token_id # = 1
tokenizer.add_bos_token # = True
tokenizer.add_eos_token # = FalseYou’ll want two datasets derived from your original data:
- Train dataset: with input + output → used for supervised fine-tuning (SFT).
- Eval dataset: with only the input/prompt → used for generation-based evaluation.
The preparation starts with formatting the examples:
def convert_to_chatml(example):
    return {
        "conversations": [
            {"role": "system", "content": "Translate this text to emoji: "},
            {"role": "user", "content": example["text"]},
            {"role": "assistant", "content": example["emoji"]},
        ]
    }

dataset = dataset.map(convert_to_chatml)

For training, we want the full conversation:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        ).removeprefix("<bos>")
        for convo in convos
    ]
    return {"text": texts}

train_dataset = dataset.map(formatting_prompts_func, batched=True)

For evaluation, you want the input only, not the assistant's turn.
def formatting_eval_prompts_func(examples):
    convos = [
        convo[:-1]  # remove the assistant turn
        for convo in examples["conversations"]
    ]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=True
        ).removeprefix("<bos>")
        for convo in convos
    ]
    # keep ground-truth labels for metric comparison
    labels = [convo[-1]["content"] for convo in examples["conversations"]]
    return {"text": texts, "labels": labels}

eval_dataset = dataset.map(formatting_eval_prompts_func, batched=True)

Notice how we set `add_generation_prompt=True`, which appends `<start_of_turn>model` to the text.
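To see the difference, here is a small sketch (the model id is an example, and it assumes the tokenizer's chat template accepts a system turn, as in our training data) that renders the same conversation with and without the generation prompt:

```python
from transformers import AutoTokenizer

# Example tokenizer id; use the base model you actually fine-tune.
tok = AutoTokenizer.from_pretrained("unsloth/gemma-3-270m-it")

convo = [
    {"role": "system", "content": "Translate this text to emoji: "},
    {"role": "user", "content": "A beautiful starry night sky"},
]

with_prompt = tok.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
without_prompt = tok.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)

# with_prompt ends with "<start_of_turn>model\n", cueing the model to answer;
# without_prompt stops right after the user's "<end_of_turn>".
print(with_prompt)
print(without_prompt)
```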
-> FlashInfer: https://github.com/flashinfer-ai/flashinfer
pip install flashinfer-python flashinfer-cubin
# JIT cache package (replace cu129 with your CUDA version: cu128, cu129, or cu130)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129

We install the packages with uv but need to specify the custom index URL for the JIT-cache package.