0xideas · 0xideas · May 22, 2026 · May 20, 2026 · May 20, 2026 · May 20, 2026
diff --git a/README.md b/README.md
@@ -203,7 +203,7 @@ Please cite with:
   title = {sequifier - causal transformer models for multivariate sequence modelling},
   year = {2025},
   publisher = {GitHub},
-  version = {v1.1.2.0},
+  version = {v1.1.2.1},
   url = {https://github.com/0xideas/sequifier}
 }
 ```
diff --git a/dev/directions.txt b/dev/directions.txt
@@ -1,3 +1,4 @@
+distributed inference
 Inference scope:
  - Add "infinite" autoregressive inference via output_to_folder (configurable number of rows per file)
  - implement kv_cache (adapt specifically for learned positional embeddings)

diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -15,7 +15,7 @@
 project = 'sequifier'
 copyright = '2025, Leon Luithlen'
 author = 'Leon Luithlen'
-release = 'v1.1.2.0'
+release = 'v1.1.2.1'
 html_baseurl = 'https://www.sequifier.com/'
 
 # -- General configuration ---------------------------------------------------

diff --git a/documentation/configs/hyperparameter-search.md b/documentation/configs/hyperparameter-search.md
@@ -105,8 +105,8 @@ Most fields here are lists for sampling, but some are scalar values fixed for al
 | `backend` | str | No | `nccl` | The distributed training backend to use (e.g., `nccl` for GPUs). Only relevant if `distributed: true`. |
 | `device_max_concat_length` | `int` | No | `12` | Controls recursive tensor concatenation to prevent CUDA kernel limits. |
 | `max_ram_gb` | `int` or `float`| No | `16` | RAM limit (GB) for the cache when using lazy loading. |
-| `load_full_data_to_ram` | `bool` | No | `true` | If `false`, uses lazy loading (requires `read_format: pt`). |
-| `distributed` | `bool` | No | `false`| Enable multi-GPU training (DDP or FSDP). Requires `read_format: pt`. |
+| `load_full_data_to_ram` | `bool` | No | `true` | If `false`, uses lazy loading (requires `read_format: pt` or `read_format: parquet`). |
+| `distributed` | `bool` | No | `false`| Enable multi-GPU training (DDP or FSDP). Requires `read_format: pt` or `read_format: parquet`. |
 | `layer_type_dtypes` | `dict` | No | `null` | Map of layer types to dtypes (e.g., `{'linear': 'bfloat16'}`). |
 | `layer_autocast` | `bool` | No | `true` | Enable `torch.autocast`. |
 | `sampling_strategy` | `str` | No | `exact` | How to address input file imbalance for multi-GPU training. |

diff --git a/documentation/configs/infer.md b/documentation/configs/infer.md
@@ -17,7 +17,7 @@ The configuration is defined in a YAML file (e.g., `infer.yaml`).
 | Field | Type | Mandatory | Default | Description |
 | :--- | :--- | :--- | :--- | :--- |
 | `project_root` | `str` | **Yes** | - | The root directory of your Sequifier project. Usually `.` |
-| `data_path` | `str` | **Yes** | - | Path to the input data file (csv/parquet) or folder (if `read_format: pt`). |
+| `data_path` | `str` | **Yes** | - | Path to the input data file (`csv` or `parquet`) or folder (`pt` or `parquet`). |
 | `model_path` | `str` or `list[str]` | **Yes** | - | Path to a specific model file, or a list of paths to process sequentially. (e.g., `models/sequifier-[NAME]-best-[EPOCH].pt`). |
 | `training_config_path`| `str` | No | `configs/train.yaml`| Path to the config used to train the model. Required to reconstruct the model architecture. |
 | `metadata_config_path`| `str` | **Yes** | - | Path to the JSON metadata file generated during preprocessing. Used for ID mapping and normalization. |
@@ -51,17 +51,12 @@ These fields tell the inference engine which columns to extract from the new dat
 | `infer_with_dropout` | `bool` | No | `false` | If `true`, keeps dropout active during inference (useful for uncertainty estimation/Monte Carlo Dropout). |
 | `seed` | `int` | No | `1010` | Random seed for reproducibility. |
 
-### 4\. System & Distributed
+### 4\. System
 
 | Field | Type | Mandatory | Default | Description |
 | :--- | :--- | :--- | :--- | :--- |
 | `device` | `str` | **Yes** | - | `cuda`, `cpu`, or `mps`. |
-| `distributed` | `bool` | No | `false`| Enable multi-GPU inference. Requires `read_format: pt`. |
-| `world_size` | `int` | No | `1` | Number of GPUs/processes for distributed inference. |
-| `num_workers` | `int` | No | `0` | Number of subprocesses for data loading. |
 | `enforce_determinism` | `bool` | No | `false` | Forces PyTorch to use deterministic algorithms. |
-| `load_full_data_to_ram`| `bool` | No | `true` | If `false`, uses lazy loading (requires `read_format: pt`). |
-
 -----
 
 ## Key Trade-offs and Decisions
@@ -89,8 +84,9 @@ Standard inference predicts the next step ($t+1$) based on history ($t-n \dots t
 
 ### 4\. Input Format (`read_format`)
 
-  * **`parquet` / `csv`:** Best for standard inference on new data files. The inferer will filter the data to `input_columns` automatically.
-  * **`pt` (PyTorch Tensors):** Required for **Distributed Inference** or **Lazy Loading**. If your inference dataset is massive (terabytes), preprocess it into `.pt` chunks first, then run inference with `read_format: pt` and `distributed: true`.
+  * **`csv`:** Best for standard inference on small data. The inferer will filter the data to `input_columns` automatically.
+  * **`parquet`** Best for most use cases. Can be used with lazy loading, will use less disk space but more CPU than `pt`
+  * **`pt`** Optimized for lazy loading, uses more disk space but less CPU than `parquet`
 
 -----
 
@@ -116,12 +112,12 @@ Results are saved in the `outputs/` folder within your project root.
 
 ### Directory Output Mode (Sharded Inference)
 
-When using PyTorch tensors (`read_format: pt`), sequifier creates a directory containing multiple sharded outputs.
+When using a folder of files as input, sequifier creates a directory containing multiple sharded outputs.
 
 **File Structure**
-* **`pt` inputs:** `outputs/predictions/[MODEL_NAME]-predictions/[MODEL_NAME]-[CHUNK_ID]-predictions.[format]` *(Directory of files)*
-* **`pt` inputs:** `outputs/probabilities/[MODEL_NAME]-[TARGET_COLUMN]-probabilities/[MODEL_NAME]-[CHUNK_ID]-probabilities.[format]` *(Directory of files)*
-* **`pt` inputs:** `outputs/embeddings/[MODEL_NAME]-embeddings/[MODEL_NAME]-[CHUNK_ID]-embeddings.[format]` *(Directory of files)*
+* **folder inputs:** `outputs/predictions/[MODEL_NAME]-predictions/[MODEL_NAME]-[CHUNK_ID]-predictions.[format]` *(Directory of files)*
+* **folder inputs:** `outputs/probabilities/[MODEL_NAME]-[TARGET_COLUMN]-probabilities/[MODEL_NAME]-[CHUNK_ID]-probabilities.[format]` *(Directory of files)*
+* **folder inputs:** `outputs/embeddings/[MODEL_NAME]-embeddings/[MODEL_NAME]-[CHUNK_ID]-embeddings.[format]` *(Directory of files)*
 
 
 **Pipeline Note:** If you switch to `.pt` inputs, ensure your downstream scripts are configured to read from a directory of files rather than a single file. This behavior applies to predictions, probabilities, and embeddings.
diff --git a/documentation/configs/preprocess.md b/documentation/configs/preprocess.md
@@ -26,8 +26,10 @@ The configuration is defined in a YAML file (e.g., `preprocess.yaml`). Below are
 
 > **Important Constraint on `write_format`:**
 >
->   * If `write_format` is **`pt`** (PyTorch tensors), `merge_output` must be **`false`**. This sharded format is **required** for distributed training on large datasets.
->   * If `write_format` is **`csv`** or **`parquet`**, `merge_output` must be **`true`**.
+>   * If `write_format` is **`pt`** (PyTorch tensors), `merge_output` must be **`false`**.
+>   * If `write_format` is **`parquet`**, `merge_output` can be **`false`** or **`true`**.
+>   * If `write_format` is **`csv`**, `merge_output` must be **`true`**.
+> For distributed training, `merge_output` must be set to **`false`**.
 
 ### 2\. Column Selection & Filtering
 
@@ -62,8 +64,8 @@ The configuration is defined in a YAML file (e.g., `preprocess.yaml`). Below are
 
 ### 1\. `write_format`: `parquet` vs. `pt`
 
-  * **Choose `parquet` (default):** If your dataset is small to medium (fits in RAM) and you want to inspect the preprocessed data easily using standard tools like Pandas or Polars. This produces one file per split (e.g., `data-split0.parquet`).
-  * **Choose `pt`:** If your dataset is massive (larger than RAM) or you intend to use **Distributed Training** (multi-GPU). This format saves data as thousands of small PyTorch tensor files. It allows the `SequifierDatasetFromFolderLazy` to load data on demand without clogging memory.
+  * **Choose `parquet` (default):** Unless you have a specific reason, use `parquet`. *Note: If you are doing distributed training, Parquet support is currently in **Beta**.
+  * **Choose `pt`:** Use `pt` data loading if speed and CPU overhead are your primary bottlenecks, **or if you are running multi-GPU distributed training.** This format is the most stable choice for high-throughput scaling.
 
 ### 2\. `stride_by_split` configuration
 

diff --git a/documentation/configs/train.md b/documentation/configs/train.md
@@ -84,8 +84,8 @@ These fields determine the size and complexity of the Transformer.
 | `backend` | `str` | No | `nccl` | The distributed training backend to use (e.g., `nccl` for GPUs, `gloo` for CPUs). Only relevant if `distributed: true`. |
 | `device_max_concat_length`| `int` | No | `12` | Controls recursive tensor concatenation to prevent CUDA kernel limits on specific hardware. Lower this if you encounter "CUDA error: too many resources requested for launch". |
 | `continue_training` | `bool` | No | `true` | Load model weights and optimizer state from laste checkpoint and continue training |
-| `distributed` | `bool` | No | `false`| Enable multi-GPU training (DDP). Requires `read_format: pt`. |
-| `load_full_data_to_ram`| `bool` | No | `true` | If `false`, uses lazy loading (requires `read_format: pt`). |
+| `distributed` | `bool` | No | `false`| Enable multi-GPU training (DDP). Requires `read_format: pt` or `read_format: parquet`. |
+| `load_full_data_to_ram`| `bool` | No | `true` | If `false`, uses lazy loading (requires `read_format: pt` or `read_format: parquet`). |
 | `layer_type_dtypes` | `dict` | No | `null` | Map of layer types (`linear`, `embedding`, `norm`, `decoder`) to dtypes (`float32`, `float16`, `bfloat16`, `float8_e4m3fn`, `float8_e5m2`). Used for mixed-precision/quantization. |
 | `layer_autocast` | `bool` | No | `true` | If `true`, enables `torch.autocast` for automatic mixed precision training. |
 | `sampling_strategy` | `str` | No | `exact` | How to address input file imbalance: `exact` requires exact divisibility of n_files by the number of GPUs (`world_size`), alternatively `oversampling` and `undersampling` equalise the number of samples seen
@@ -118,10 +118,10 @@ These fields determine the size and complexity of the Transformer.
       * *Pros*: Fastest training speed.
       * *Cons*: Limited by physical RAM. If the dataset is 64GB and you have 32GB RAM, this will crash.
   * **`false` (Lazy Loading):** Loads individual files on-demand during training.
-      * *Requirements:* `read_format` must be `pt`.
+      * *Requirements:* `read_format` must be `parquet` or `pt`.
       * *Mechanism:* Uses an `IterableDataset` with cross-file buffering to stream pre-processed chunked files sequentially, automatically calculating exact sample boundaries across GPU ranks and workers.
       * *Pros:* Can train on datasets much larger than RAM, safely supporting DDP/FSDP synchronization.
-      * *Cons:* Slight I/O overhead depending on disk speed. Increase `num_workers` to mitigate this.
+      * *Cons:* Slight I/O overhead depending on disk speed. Increase `num_workers` to mitigate this. **Note for Parquet users:** Lazy loading distributed Parquet files is currently in **Beta** and may cause high CPU overhead or deadlocks on large multi-GPU nodes. For distributed lazy loading, `read_format: pt` is strongly recommended.
 
 ### 2\. Attention Mechanism (`attention_type` & `n_kv_heads`)
 
@@ -139,7 +139,7 @@ These fields determine the size and complexity of the Transformer.
 If you have multiple GPUs:
 
 1.  Set `distributed: true` in `training_spec`.
-2.  **Crucial:** You must have run `preprocess` with `write_format: pt` and `merge_output: false`.
+2.  **Crucial:** You must have run `preprocess` with `merge_output: false`.
 3.  Set `world_size` to the number of GPUs.
 4.  Set `data_parallelism` to `DDP` for `DistributedDataParallel`training or `FSDP` for `FullyShardedDataParallel` training
 5.  Set `torch_compile` to `inner` when training with `FSDP` and to `outer` when training with `DDP`