# Loading Local Models

## Update Summary

### Changes Made

- Updated the Memory Mapping and Efficient Loading section to reflect the new `build_varbuilder_with_precision` function and the centralized precision policy
- Enhanced the Device Assignment and Management section with details on precision policy configurations and their impact on dtype selection
- Added comprehensive documentation for the `PrecisionPolicy` system and its three configurations (Default, MemoryEfficient, MaximumPrecision)
- Updated error handling to include precision policy-related error scenarios
- Added a new section on Precision Policy Configuration explaining the centralized dtype selection system
- Revised Best Practices for Model Organization to include guidance on precision policy selection
- Updated code examples to show the integration of the precision policy in model loading workflows

## Table of Contents

1. Introduction
2. Model Loading Process
3. File Selection and Validation
4. Memory Mapping and Efficient Loading
5. ModelBackend Trait and Abstraction
6. Device Assignment and Management
7. Precision Policy Configuration
8. Error Handling
9. Best Practices for Model Organization
10. Conclusion

## Introduction

This document provides a comprehensive guide to loading local GGUF and safetensors models from disk in the Oxide-Lab project. It details the file selection process, memory mapping, model validation, device assignment, and error handling mechanisms. The implementation leverages Tauri's dialog plugin for file selection and uses the candle crate for efficient model loading and inference. Recent architectural enhancements have introduced a unified ModelFactory system that provides consistent model creation across different formats, with standardized support for both GGUF and safetensors models through the ModelBuilder pattern. A key enhancement is the centralized precision policy system that determines the appropriate data type (dtype) based on the target device and user preferences, ensuring optimal performance and memory usage across different hardware platforms.

## Model Loading Process

The model loading process in Oxide-Lab is designed to support both GGUF and safetensors formats, with distinct workflows for local files and Hugging Face Hub repositories. The core functionality is implemented in dedicated modules within the api/model_loading directory. The system has been enhanced with a unified ModelFactory architecture that provides consistent model creation across different formats and sources.

```mermaid
flowchart TD
A[Load Request] --> B{Request Type}
B --> |Gguf| C[Load Local GGUF]
B --> |HubGguf| D[Download GGUF from Hub]
B --> |HubSafetensors| E[Download Safetensors from Hub]
B --> |LocalSafetensors| F[Load Local Safetensors]
C --> G[Open File]
D --> H[Download File]
E --> I[Download Index/Config]
F --> J[Validate Path]
G --> K[Read GGUF Content]
H --> K
I --> L[Cache Weight Files]
J --> M[Discover Safetensors Files]
K --> N[Extract Tokenizer]
L --> N
M --> N
N --> O[Detect Architecture]
O --> P[Create ModelBuilder]
P --> Q[Build Model via Factory]
Q --> R[Initialize Model]
R --> S[Store in State]
```

**Diagram sources**
- [api/model_loading/gguf.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/gguf.rs#L10-L60)
- [api/model_loading/safetensors.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/safetensors.rs#L16-L132)
- [models/registry.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/registry.rs#L26-L41)

**Section sources**
- [api/model_loading/gguf.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/gguf.rs#L10-L60)
- [api/model_loading/safetensors.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/safetensors.rs#L16-L132)

## File Selection and Validation

### Local File Selection
The system supports loading both GGUF and safetensors models from local disk through the `LoadRequest::Gguf` and `LoadRequest::LocalSafetensors` variants. For safetensors models, users can specify either a directory path or a file path:

```rust
LoadRequest::LocalSafetensors { model_path, context_length, device }
```

The path is validated to ensure it exists and is accessible:

```rust
let model_path = Path::new(&model_path);
if !model_path.exists() {
    return Err(format!("Model path does not exist: {}", model_path.display()));
}
```

### Format Validation
Model integrity is validated during the loading process by parsing the GGUF header and metadata:

```rust
let content = gguf_file::Content::read(&mut file)
    .map_err(|e| format!("{}", e.with_path(PathBuf::from(model_path.clone()))))?;
```

Architecture detection ensures compatibility using the registry system:

```rust
let arch = detect_arch(&content.metadata)
    .ok_or_else(|| "Unsupported GGUF architecture".to_string())?;
```

For safetensors models, the system uses a universal weights loader to discover and validate files:

```rust
let safetensors_files = weights::local_list_safetensors(model_dir)?;
```

The `local_list_safetensors` function first looks for `model.safetensors.index.json` (indicating a sharded model) and falls back to checking for a single `model.safetensors` file. It validates that all referenced files exist before returning the list. This unified approach ensures consistent file discovery across different model configurations.
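
For illustration, the discovery strategy might look like the following sketch (a hypothetical `discover_safetensors` helper, not the project's actual `local_list_safetensors` implementation):

```rust
use std::path::Path;

fn discover_safetensors(model_dir: &Path) -> Result<Vec<String>, String> {
    // Prefer the index file, which indicates a sharded model.
    let index = model_dir.join("model.safetensors.index.json");
    if index.exists() {
        let text = std::fs::read_to_string(&index).map_err(|e| e.to_string())?;
        let json: serde_json::Value = serde_json::from_str(&text).map_err(|e| e.to_string())?;
        // Collect the unique shard names referenced by the weight_map.
        let mut files: Vec<String> = json["weight_map"]
            .as_object()
            .ok_or("index.json has no weight_map")?
            .values()
            .filter_map(|v| v.as_str().map(String::from))
            .collect();
        files.sort();
        files.dedup();
        // Validate that every referenced shard actually exists.
        for f in &files {
            if !model_dir.join(f).exists() {
                return Err(format!("Safetensors file not found: {f}"));
            }
        }
        return Ok(files
            .into_iter()
            .map(|f| model_dir.join(&f).display().to_string())
            .collect());
    }
    // Fall back to a single consolidated file.
    let single = model_dir.join("model.safetensors");
    if single.exists() {
        Ok(vec![single.display().to_string()])
    } else {
        Err("No safetensors files found (model.safetensors[.index.json])".into())
    }
}
```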

**Section sources**
- [api/model_loading/gguf.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/gguf.rs#L10-L60)
- [api/model_loading/safetensors.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/safetensors.rs#L50-L75)
- [core/weights.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/weights.rs#L58-L110)

## Memory Mapping and Efficient Loading

### GGUF Memory Mapping
GGUF files are loaded efficiently using memory mapping techniques provided by the candle crate. The `from_gguf` method handles the memory mapping process:

```rust
let m = Qwen3Gguf::from_gguf(content, &mut file, &guard.device, context_length, false)
    .map_err(|e| e.to_string())?;
```

This approach allows for efficient loading of large model files by mapping them directly into memory without requiring complete in-memory copies.
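
As a minimal illustration of reading a GGUF header with candle's `gguf_file` API (the `inspect_gguf` helper is hypothetical; `candle` refers to the candle-core crate as used elsewhere in this project):

```rust
use std::fs::File;
use candle::quantized::gguf_file;

fn inspect_gguf(path: &str) -> Result<(), String> {
    let mut file = File::open(path).map_err(|e| e.to_string())?;
    // Reads only the header, metadata, and tensor directory; tensor data
    // stays on disk until individual tensors are requested later.
    let content = gguf_file::Content::read(&mut file).map_err(|e| e.to_string())?;
    println!(
        "{} tensors, {} metadata entries",
        content.tensor_infos.len(),
        content.metadata.len()
    );
    Ok(())
}
```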

### Safetensors Memory Mapping
For safetensors models, the system uses a unified VarBuilder approach with memory-mapped file loading that respects the precision policy:

```rust
let vb = weights::build_varbuilder_with_precision(&safetensors_files, &device, Some(&guard.precision_policy))?;
```

The `build_varbuilder_with_precision` function creates a memory-mapped view of the weight files, enabling efficient access to model parameters without loading the entire model into RAM. This implementation applies a precision policy-based dtype selection, where the data type is determined by both the target device and the user's precision policy preference.

```rust
let vb = unsafe {
    VarBuilder::from_mmaped_safetensors(&paths, dtype, device)
        .map_err(|e| format!("Failed to create VarBuilder: {}", e))?
};
```

The unsafe block is necessary for memory mapping but is safe when file paths are properly validated. The dtype is determined by the precision policy system, which selects the appropriate data type based on the device and policy configuration.

**Section sources**
- [api/model_loading/safetensors.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/safetensors.rs#L100-L120)
- [core/weights.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/weights.rs#L187-L234)
- [core/precision.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/precision.rs#L50-L150)

## ModelBackend Trait and Abstraction

### Trait Definition
The `ModelBackend` trait provides a unified interface for different model implementations:

```rust
pub trait ModelBackend: Send {
    fn forward_layered(&mut self, input: &Tensor, position: usize) -> Result<Tensor, String>;
}
```

This trait abstracts the forward pass operation, allowing the system to work with different model architectures through a common interface.
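
To make the contract concrete, here is a minimal, hypothetical backend implementing the trait (a stand-in for illustration only, not part of the codebase; assumes `candle::Tensor` and the trait are in scope):

```rust
use candle::Tensor;

/// Stand-in backend illustrating how a new architecture plugs into the
/// `ModelBackend` abstraction (hypothetical).
struct EchoModel;

impl ModelBackend for EchoModel {
    fn forward_layered(&mut self, input: &Tensor, _position: usize) -> Result<Tensor, String> {
        // A real backend would run the transformer forward pass at the given
        // position; this stub just returns the input unchanged.
        Ok(input.clone())
    }
}
```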

### AnyModel Wrapper
The `AnyModel` struct wraps concrete model implementations and provides a uniform interface:

```rust
pub struct AnyModel {
    inner: Box<dyn ModelBackend + Send>,
}
```

It implements the `ModelBackend` trait by delegating to the inner model. The unified ModelFactory enables consistent model construction across formats by using the ModelBuilder pattern to instantiate models based on architecture.
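
The delegation itself is a one-liner per trait method. A minimal sketch of what the `AnyModel` implementation plausibly looks like (reconstructed from the description above rather than copied from the source):

```rust
impl ModelBackend for AnyModel {
    fn forward_layered(&mut self, input: &Tensor, position: usize) -> Result<Tensor, String> {
        // Forward the call to the boxed concrete backend.
        self.inner.forward_layered(input, position)
    }
}
```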

### Unified Model Building with ModelFactory
The system now uses a global `ModelFactory` to manage model creation for both GGUF and safetensors formats. The factory maintains a registry of `ModelBuilder` instances, each responsible for a specific architecture:

```rust
static MODEL_FACTORY: OnceLock<ModelFactory> = OnceLock::new();

pub fn get_model_factory() -> &'static ModelFactory {
    MODEL_FACTORY.get_or_init(|| {
        let mut factory = ModelFactory::new();

        // Register Qwen3 builder
        factory.register_builder(crate::models::common::builder::ModelBuilder::Qwen3(
            Qwen3ModelBuilder::new(),
        ));

        factory
    })
}
```

For safetensors models, the factory uses `build_from_safetensors` to create models:

```rust
pub fn build_from_safetensors(
    &self,
    arch: ArchKind,
    filenames: &[String],
    config: &serde_json::Value,
    device: &Device,
    dtype: DType,
) -> BuildResult<Box<dyn ModelBackend>> {
    let vb = crate::core::weights::build_varbuilder_with_precision(filenames, device, Some(&guard.precision_policy))
        .map_err(|e| format!("Failed to build VarBuilder: {}", e))?;

    self.builders
        .get(&arch)
        .ok_or_else(|| format!("No builder registered for architecture {:?}", arch))?
        .from_varbuilder(vb, config, device, dtype)
}
```

The `Qwen3ModelBuilder` implements `from_varbuilder` to create candle-transformers models:

```rust
pub fn from_varbuilder(
    &self,
    vb: VarBuilder,
    config: &serde_json::Value,
    _device: &Device,
    _dtype: DType,
) -> Result<Box<dyn ModelBackend>, String> {
    let config_str = config.to_string();
    let qwen_config: candle_transformers::models::qwen3::Config =
        serde_json::from_str(&config_str)
            .map_err(|e| format!("Failed to parse Qwen3 config: {}", e))?;

    let model = candle_transformers::models::qwen3::ModelForCausalLM::new(&qwen_config, vb)
        .map_err(|e| format!("Failed to load Qwen3 model: {}", e))?;

    let adapter = Qwen3CandleAdapter::new(model);

    Ok(Box::new(adapter))
}
```

```mermaid
classDiagram
class ModelBackend {
<<trait>>
+forward_layered(input : &Tensor, position : usize) Result<Tensor, String>
}
class AnyModel {
-inner : Box<dyn ModelBackend + Send>
+from_qwen3(m : ModelWeights) Self
+from_candle_qwen3(m : ModelForCausalLM) Self
}
class ModelWeights {
-inner : quantized_qwen3 : : ModelWeights
+from_gguf(content : Content, reader : &mut R, device : &Device, context_length : usize, flag : bool) Result<Self, String>
}
class ModelFactory {
+register_builder(builder : Box<dyn ModelBuilder>)
+build_from_safetensors(arch, filenames, config, device, dtype) Result<Box<dyn ModelBackend>, String>
+detect_gguf_arch(metadata : &HashMap) Option<ArchKind>
+detect_config_arch(config : &Value) Option<ArchKind>
}
class ModelBuilder {
<<enum>>
+from_gguf(content, reader, device, ctx_len, flag) Result<Box<dyn ModelBackend>, String>
+from_varbuilder(vb, config, device, dtype) Result<Box<dyn ModelBackend>, String>
+detect_gguf_arch(metadata : &HashMap) Option<ArchKind>
+detect_config_arch(config : &Value) Option<ArchKind>
+arch_kind() ArchKind
}
class Qwen3ModelBuilder {
+from_gguf() BuildResult
+from_varbuilder() BuildResult
+detect_gguf_arch() Option<ArchKind>
+detect_config_arch() Option<ArchKind>
+arch_kind() ArchKind
}
ModelBackend <|-- AnyModel
ModelBackend <|-- ModelWeights
AnyModel --> ModelWeights : contains
ModelFactory --> ModelBuilder : contains
ModelBuilder <|-- Qwen3ModelBuilder
```

**Diagram sources**
- [model.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/common/model.rs)
- [models/registry.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/registry.rs)
- [models/common/builder.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/common/builder.rs)
- [models/qwen3_builder.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/qwen3_builder.rs)
- [models/common/candle_llm.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/common/candle_llm.rs)

**Section sources**
- [model.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/common/model.rs)
- [models/registry.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/registry.rs)
- [models/common/builder.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/common/builder.rs)
- [models/qwen3_builder.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/qwen3_builder.rs)

## Device Assignment and Management

### Device Selection

The system supports CPU, CUDA, and Metal devices, with device selection handled by the `select_device` function:

```rust
let dev = select_device(device);
guard.device = dev;
```

The `DevicePreference` enum defines the available device options:

```rust
enum DevicePreference {
    Cuda { index: usize },
    Cpu,
    Auto,
    Metal,
}
```

The auto-selection follows CUDA → Metal → CPU priority.
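
As a rough sketch of that fallback order (a hypothetical helper; the project's actual `select_device` lives in `core/device.rs`, and the GPU constructors require the corresponding candle backend features):

```rust
use candle::Device;

fn auto_device() -> Device {
    // Try CUDA first, then Metal, then fall back to CPU.
    if let Ok(dev) = Device::new_cuda(0) {
        return dev;
    }
    if let Ok(dev) = Device::new_metal(0) {
        return dev;
    }
    Device::Cpu
}
```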

### Unified Dtype Policy with Precision Configuration

The universal weights loader implements a unified dtype policy based on the target device and precision policy:

```rust
let dtype = build_varbuilder_with_precision(&filenames, &dev, Some(&guard.precision_policy))
    .map_err(|e| format!("Failed to determine dtype: {}", e))?
    .dtype();
```

This policy ensures optimal performance on GPU devices while maintaining compatibility on CPU. The policy is applied consistently through the `build_varbuilder_with_precision` function, which takes into account both the device and the user's precision policy preference. The three precision policy configurations are:

- **Default**: CPU=F32, GPU=BF16 (recommended for most users)
- **MemoryEfficient**: CPU=F32, GPU=F16 (reduces memory usage at the cost of some precision)
- **MaximumPrecision**: CPU=F32, GPU=F32 (maximum precision at the cost of memory and performance)

When switching devices, the model is reloaded to ensure compatibility:

```rust
if guard.gguf_model.is_some() {
    // Reload model on new device
    let model = match arch {
        ArchKind::Qwen3 => Qwen3Gguf::from_gguf(content, &mut file, &guard.device, ctx_len, false)
            .map_err(|e| e.to_string())?,
    };
    // ... update state
}
```

```mermaid
sequenceDiagram
participant User
participant API as api : : mod.rs
participant Device as device.rs
participant Model as model.rs
User->>API : set_device(pref)
API->>Device : select_device(pref)
Device-->>API : Device
API->>API : Update state.device
alt Model Loaded
API->>API : Reload model from disk
API->>Model : from_gguf(content, file, device, ctx_len, false)
Model-->>API : ModelWeights
API->>API : Update state with new model
end
API-->>User : Success
```

**Diagram sources**
- [api/model_loading/gguf.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/gguf.rs#L10-L60)
- [core/device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/device.rs)
- [core/weights.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/weights.rs#L187-L234)
- [core/precision.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/precision.rs)

**Section sources**
- [api/model_loading/gguf.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/gguf.rs#L10-L60)
- [core/device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/device.rs)
- [core/weights.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/weights.rs#L187-L234)
- [core/precision.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/precision.rs)

## Precision Policy Configuration

### Centralized Precision Management
The system implements a centralized precision policy system that determines the appropriate data type (dtype) for model loading based on the target device and user preferences. This policy is managed through the `PrecisionPolicy` enum and `PrecisionConfig` struct in the `core::precision` module.

```rust
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
pub enum PrecisionPolicy {
    Default,
    MemoryEfficient,
    MaximumPrecision,
}
```

The precision policy is stored in the application state and can be configured by the user through the UI:

```rust
pub struct ModelState {
    // ... other fields
    pub(crate) precision_policy: PrecisionPolicy,
}

impl ModelState {
    pub fn new(device: Device) -> Self {
        Self {
            // ... other initializations
            precision_policy: PrecisionPolicy::Default,
        }
    }
}
```

### Policy Configuration Options
The system provides three predefined precision policy configurations:

**Default Policy** (Recommended)
- CPU: F32 for maximum compatibility
- GPU: BF16 for better performance and memory usage
- Ideal for most users seeking a balance between performance and precision

**Memory Efficient Policy**
- CPU: F32 (unchanged for compatibility)
- GPU: F16 (reduced memory usage)
- Ideal for systems with limited VRAM or when running multiple models

**Maximum Precision Policy**
- CPU: F32
- GPU: F32 (highest precision)
- Ideal for scientific computing or when maximum numerical precision is required

The dtype selection is handled by the `select_dtype_by_policy` function:

```rust
pub fn select_dtype_by_policy(device: &Device, policy: &PrecisionPolicy) -> DType {
    let config = policy_to_config(policy);
    select_dtype(device, &config)
}
```

This centralized approach ensures consistent dtype selection across all model loading operations and provides users with clear options for balancing performance, memory usage, and precision.
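
An illustrative call site (hypothetical; `select_dtype_by_policy` and `PrecisionPolicy` come from the project's `core::precision` module):

```rust
use candle::{DType, Device};

fn example_dtype() -> DType {
    let dtype = select_dtype_by_policy(&Device::Cpu, &PrecisionPolicy::MemoryEfficient);
    debug_assert_eq!(dtype, DType::F32); // CPU resolves to F32 under every policy
    // On a CUDA or Metal device, Default would select BF16 and
    // MemoryEfficient would select F16, per the table above.
    dtype
}
```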

**Section sources**
- [core/precision.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/precision.rs)
- [core/state.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/state.rs)
- [core/weights.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/weights.rs#L187-L234)

## Error Handling

### File System Errors
The system handles file permission and access issues gracefully:

```rust
let mut file = File::open(&model_path).map_err(|e| e.to_string())?;
```

### Format and Compatibility Errors
Unsupported formats and corrupted files are validated during loading:

```rust
let content = gguf_file::Content::read(&mut file)
    .map_err(|e| format!("{}", e.with_path(PathBuf::from(model_path.clone()))))?;
```

The unified weights loader provides comprehensive error handling for safetensors file discovery:

```rust
if safetensors_files.is_empty() {
    return Err("No safetensors files found (model.safetensors[.index.json])".into());
}
```

Additional validation ensures all referenced files exist:

```rust
for path in safetensors_paths {
    if !std::path::Path::new(path).exists() {
        return Err(format!("Safetensors file not found: {}", path));
    }
}
```

### ModelBuilder and Precision Policy Errors
The new ModelFactory system and precision policy introduce additional error scenarios:

```rust
match get_model_factory().build_from_safetensors(arch, &filenames, &config, &dev, dtype) {
    Ok(model_backend) => {
        built_model_opt = Some(model_backend);
    }
    Err(e) => println!("[local] ModelBuilder failed: {}", e),
}
```

Precision policy-related errors are handled during dtype determination:

```rust
let dtype = build_varbuilder_with_precision(&filenames, &dev, Some(&guard.precision_policy))
    .map_err(|e| format!("Failed to determine dtype: {}", e))?
    .dtype();
```

CUDA initialization failures are explicitly handled:

```rust
match candle::Device::cuda_if_available(index) {
    Ok(dev) => {
        guard.device = dev;
    }
    Err(e) => {
        return Err(format!("CUDA init failed (index={}): {}", index, e));
    }
}
```

The error handling strategy follows a consistent pattern of converting low-level errors to user-friendly string messages while preserving the underlying cause for debugging.

**Section sources**
- [api/model_loading/gguf.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/gguf.rs#L10-L60)
- [core/weights.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/weights.rs#L58-L110)
- [api/model_loading/safetensors.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/safetensors.rs#L100-L115)
- [core/precision.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/precision.rs)

## Best Practices for Model Organization

### Directory Structure
Organize local model files in a dedicated directory with clear naming:

```
models/
├── qwen3/
│   ├── qwen3-8b-gguf/
│   │   ├── qwen3-8b-Q4_K_M.gguf
│   │   └── tokenizer.json
│   └── qwen3-14b-gguf/
│       ├── qwen3-14b-Q4_K_M.gguf
│       └── tokenizer.json
└── safetensors/
    ├── qwen3-8b-safetensors/
    │   ├── model.safetensors.index.json
    │   ├── pytorch_model-00001-of-00002.safetensors
    │   ├── pytorch_model-00002-of-00002.safetensors
    │   ├── config.json
    │   └── tokenizer.json
    └── qwen3-14b-safetensors/
        ├── model.safetensors
        ├── config.json
        └── tokenizer.json
```

### Naming Conventions
Follow consistent naming patterns:

- Use descriptive names that include model size and quantization
- Include quantization level in GGUF filenames (e.g., Q4_K_M, Q5_K_S)
- Use standard Hugging Face naming for safetensors files
- Keep filenames lowercase with hyphens as separators
- Include both sharded (with index.json) and consolidated (single file) options for flexibility
- Include config.json for architecture detection
- Ensure `config.json` contains accurate `model_type` or `architectures` fields for proper ModelBuilder detection (see the sketch after this list)
- Consider including precision policy recommendations in model documentation
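
For reference, a minimal `config.json` sketch showing the fields the ModelFactory inspects for architecture detection (the values below are illustrative, not taken from a real checkpoint):

```json
{
  "model_type": "qwen3",
  "architectures": ["Qwen3ForCausalLM"],
  "hidden_size": 4096,
  "num_hidden_layers": 36,
  "vocab_size": 151936
}
```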

### File Management
- Store tokenizer files alongside model weights
- Keep configuration files (config.json) with the model
- Use versioned directories for different model variants
- Maintain a README with model source and licensing information
- Organize by model family and size for easy navigation
- Ensure all files referenced in index.json exist in the directory
- For safetensors models, ensure config.json is present for proper architecture detection by the ModelFactory
- Document the recommended precision policy for each model to help users achieve optimal performance

## Conclusion

The Oxide-Lab model loading system provides a robust framework for loading GGUF and safetensors models from local disk and remote repositories. By combining memory mapping, device abstraction, and comprehensive error handling, it ensures efficient and reliable model initialization. The unified ModelFactory architecture and ModelBuilder pattern keep model construction consistent across both formats, while the `ModelBackend` trait allows additional architectures to be supported behind a common inference interface. The registry system's architecture detection instantiates models automatically from metadata, and the centralized precision policy system gives users clear options for balancing performance, memory usage, and precision, making the system adaptable to a wide range of hardware configurations and use cases.

**Referenced Files in This Document**   
- [model.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/common/model.rs) - *Core model abstraction definitions*
- [api/model_loading/gguf.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/gguf.rs) - *GGUF model loading implementation*
- [api/model_loading/safetensors.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/model_loading/safetensors.rs) - *Local and Hub safetensors model loading implementation with precision policy integration*
- [core/weights.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/weights.rs) - *Universal weights loader and VarBuilder utilities with precision-aware dtype selection*
- [core/device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/device.rs) - *Device selection and management*
- [core/types.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/types.rs) - *Device preference definitions*
- [models/registry.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/registry.rs) - *Model architecture detection and factory*
- [models/qwen3_builder.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/qwen3_builder.rs) - *Qwen3 model builder implementation*
- [models/common/builder.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/common/builder.rs) - *ModelBuilder trait and ModelFactory implementation*
- [models/common/candle_llm.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/models/common/candle_llm.rs) - *Adapters for candle-transformers models*
- [core/precision.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/precision.rs) - *Centralized precision policy management*
- [core/state.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/state.rs) - *Application state with precision policy storage*
