The ALICE experiment at CERN is dedicated to studying the physics of strongly interacting matter at extreme energy densities, where a state of matter known as quark-gluon plasma is formed. One of the key challenges in this experiment is the accurate identification of particle species produced during high-energy collisions. This process, known as Particle Identification (PID), involves classifying particles such as (anti)pions, (anti)kaons, and (anti)protons based on the features of their tracks measured by the detectors.
Traditionally, PID has been performed using n-σ methods, which rely on the deviation of observed detector signals from the expected distributions of specific particle species generated in Monte Carlo (MC) simulations. If the deviation is below a fixed threshold, the particle is identified as belonging to that species. However, this approach has limitations, particularly in terms of accuracy and its ability to generalize to real experimental data, where distribution drifts between MC simulations and real data can occur.
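To make the idea concrete, here is a minimal sketch of an n-σ selection (a hypothetical function; in practice the expected signal and resolution come from detector-specific parameterizations):

```python
def n_sigma_select(measured: float, expected: float, sigma: float,
                   threshold: float = 3.0) -> bool:
    # Accept the species hypothesis if the measured detector signal lies
    # within `threshold` standard deviations of the expected value.
    n_sigma = (measured - expected) / sigma
    return abs(n_sigma) < threshold
```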
Our project seeks to address these limitations by leveraging machine learning (ML) models trained on MC data to improve the accuracy and reliability of PID. Specifically, we aim to:
- Develop ML-based methods that outperform traditional n-σ approaches in identifying particle species.
- Ensure the models are robust and reliable when applied to real experimental data, accounting for potential distribution drifts.
- Integrate the models into the O2Physics framework for production use and store trained models for all suitable LHC periods in the CCDB.
It is important to note that our focus is not on identifying rare particle species, which requires more sophisticated methods. Instead, we aim to provide an alternative, more accurate method for filtering out background particles, specifically (anti)pions, (anti)kaons, and (anti)protons. This will enhance the overall quality of the data used in subsequent analyses.
Due to the highly imbalanced nature of the dataset (the majority of particles produced in collisions are pions), we employ a binary classifier for each targeted particle species. This approach allows us to handle the imbalance more effectively. Additionally, the dataset often contains missing data, as certain detectors (TOF and TRD) are not always available. For instance, in Run 3, TOF data is rarely available, even though it is crucial for classification. To address this, we train either:
- Six models, one binary classifier per (anti)particle species, that can handle missing data directly, or
- Twenty-four models, one per species and per combination of missing detectors.
Our data preprocessing pipeline is implemented using the O2Physics `PIDMLProducer` task, which extracts the necessary features from raw data for training and inference. This ensures that the input to our ML models is consistent and optimized for the task at hand.
The training datasets for this project are pre-produced and stored on CERNBox. To begin working with the data, download it to the `data/raw` directory in your local project structure. These datasets are derived from Monte Carlo (MC) simulations and experimental data from various Run 3 LHC periods.
For additional metadata about the datasets, such as finding corresponding experimental data for a specific LHC period (e.g., LHC24b1b), you can use the MonALISA tool. Simply search for the desired period to retrieve relevant information.
The project is organized into the following main directories and files:
- `data/`: Stores raw and processed datasets used for training and evaluation.
  - `raw/`: Contains unprocessed ROOT data files.
  - `processed/`: Contains preprocessed data ready for use in training and testing.
- `experiments/`: Contains configurations for various experiments, such as hyperparameter sweeps.
- `results/`: Stores logs generated during training, testing, and hyperparameter tuning. Organized by experiment type.
- `notebooks/`: Contains Jupyter notebooks for exploratory data analysis (EDA), visualization, and model comparison.
- `scripts/`: Includes Python scripts for training, testing, and sweeping models. Examples include `train_one_particle.py` and `sweep_one_particle.py`.
- `src/`: The main source code directory for the project. Contains modules for data preparation, engine implementations, model definitions, and configuration handling.
- `default_config.json`: The default configuration file for training and testing runs.
Our project includes the following machine learning models, each tailored to specific requirements of particle identification:
- **Multi-Layer Perceptron (MLP)**: A basic neural network architecture that requires complete data. Missing values must be imputed (e.g., using mean or linear regression imputation).
  - Architecture: Fully connected layers with configurable hidden layers, activation functions (e.g., ReLU), and dropout.
  - Use Case: Suitable for datasets with no missing detector data.
- **Ensemble of MLPs**: A collection of MLPs, each trained for a specific combination of missing detectors.
  - Architecture: Each MLP in the ensemble has its own hidden layers, activation functions, and dropout settings.
  - Use Case: Designed to handle missing data by training separate models for each missing detector combination.
- **Attention-Based Model**: A neural network leveraging attention mechanisms to process incomplete data. Missing detectors are encoded using one-hot encoding.
  - Architecture: Includes embedding layers, multi-head attention blocks, feed-forward layers, and pooling layers. Configurable parameters include the number of attention heads, blocks, and hidden dimensions.
  - Use Case: Effective for datasets with missing data, providing robust performance across various detector combinations.
- **Attention-Based Domain Adversarial Neural Network (Attention-DANN)**: An extension of the attention-based model that incorporates domain adaptation to address distribution drifts between simulated and experimental data.
  - Architecture: Combines the attention-based model with a domain classifier trained adversarially to minimize domain-specific biases.
  - Use Case: Ideal for scenarios where the training data (simulated) differs significantly from the target data (experimental).
These models are implemented in `src/pdi/models.py` and configured via `src/pdi/config.py`.
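For illustration, here is a minimal PyTorch sketch of the MLP variant (the layer widths, activation, and dropout below are placeholder values; the actual definitions live in `src/pdi/models.py`):

```python
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    # Illustrative fully connected binary classifier with configurable
    # hidden layers, ReLU activations, and dropout.
    def __init__(self, in_features: int, hidden=(128, 64), dropout: float = 0.2):
        super().__init__()
        layers = []
        prev = in_features
        for width in hidden:
            layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(dropout)]
            prev = width
        layers.append(nn.Linear(prev, 1))  # single logit for binary PID
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```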
The configuration for training runs is defined in `src/pdi/config.py`. It provides a flexible and comprehensive structure to set up various aspects of the training process. The configuration is organized into several sections, including:
- **Data Configuration**: Defines how the data is preprocessed, split into training/validation/test sets, and filtered for outliers.
- **Model Configuration**: Specifies the architecture of the machine learning model to use (e.g., MLP, Ensemble, Attention, or Attention-DANN). Each architecture has its own set of parameters, such as the number of layers, hidden dimensions, activation functions, and dropout rates.
- **Training Configuration**: Includes settings for the optimizer (e.g., AdamW, SGD), learning rate schedulers, batch size, number of epochs, and mixed precision training. It also contains two undersampling methods: undersampling missing-detector groups and undersampling pions to the next majority class.
- **Sweep Configuration**: Allows for hyperparameter sweeps using tools like `wandb`. This includes defining the sweep name, project name, and parameter ranges.
- **Validation Configuration**: Contains parameters for evaluating the model during training, such as validation frequency and metrics to monitor.
- **Paths and Directories**: Specifies paths to simulated and experimental datasets, as well as directories for saving results, logs, and trained models.
A default configuration is provided in the `default_config.json` file located in the root directory. This file can be used as a starting point for creating custom configurations. Example experiments using this configuration can be found in the `experiments` directory, where various setups and hyperparameter sweeps are demonstrated. New experiment configurations should also be placed in the `experiments` directory.
The configuration includes a `seed` parameter, which ensures reproducibility across different runs of the project. This seed is used in various parts of the pipeline to control randomness, including:
- **Data Splitting**: The seed is used during the train/validation/test split to ensure that the same data subsets are created across runs. This is particularly important for maintaining consistent evaluation metrics when comparing models.
- **Batch Shuffling**: The seed is applied to shuffle data batches during training, ensuring that the order of data fed into the model remains consistent across runs.
This is critical for debugging, benchmarking, and comparing results across different configurations or models.
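For reference, a typical seeding routine looks like the following (a sketch only; the project's actual seeding code may differ):

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every RNG involved in splitting, shuffling, and weight
    # initialization so that runs with the same config are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```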
The training and testing flow in this project is built around the `DataPreparation` class and the engine classes (`ClassicEngine` and `DomainAdaptationEngine`), which are orchestrated by the `build_engine(cfg)` function. Each engine knows how to prepare its data and uses the `DataPreparation` class with the appropriate parameters to do so.
- **Data Preparation**: The `DataPreparation` class is responsible for handling all stages of data preprocessing, including:
  - Loading data from ROOT files.
  - Preprocessing data (e.g., setting NaNs, applying cuts on features like `fP` and `fTPCSignal`).
  - Splitting data into train/validation/test sets.
  - Standardizing data using statistics (mean and standard deviation) calculated on the training set.
  - Grouping data by missing detector combinations.
  - Creating data loaders for efficient batch processing during training and testing.

  The `DataPreparation` class also calculates a checksum based on the configuration, seed, and input data paths. This checksum is used to cache preprocessed data in a directory named after the checksum, ensuring that the same preprocessing steps do not need to be repeated for identical configurations (see the sketch after this list).
- **Engine Selection**: The `build_engine(cfg)` function determines the appropriate engine to use based on the model architecture specified in the configuration:
  - `ClassicEngine`: Used for standard PyTorch models like MLP, Ensemble, and Attention-based architectures.
  - `DomainAdaptationEngine`: Used for domain adversarial models like Attention-DANN, which require handling both simulated and experimental data.

  Each engine is responsible for:
  - Running data preparation (e.g., `DomainAdaptationEngine` runs it for both simulated and experimental data).
  - Initializing the model, optimizer, and learning rate scheduler.
  - Running the training loop, including logging metrics and saving the best model.
  - Evaluating the model on validation and test datasets.
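To make the caching behavior concrete, here is a minimal sketch of how such a checksum could be computed (the function name and arguments are hypothetical; the actual implementation lives in the `DataPreparation` class):

```python
import hashlib
import json

def config_checksum(cfg_dict: dict, seed: int, data_paths: list[str]) -> str:
    # Serialize the configuration, seed, and input paths deterministically,
    # then hash them; identical setups map to the same cache directory.
    payload = json.dumps(
        {"cfg": cfg_dict, "seed": seed, "paths": sorted(data_paths)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```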
Below is an example of how to execute training for a single particle species, adapted from the `train_one_particle.py` script:

```python
import tyro

from pdi.config import Config, OneParticleConfig
from pdi.constants import PART_NAME_TO_TARGET_CODE
from pdi.engines import build_engine

def main(config: Config, target_code: int):
    # Invert the name -> target-code mapping to recover the particle name.
    particle_name = {v: k for k, v in PART_NAME_TO_TARGET_CODE.items()}[target_code]
    print(f"Starting training for {particle_name}...")

    # Build the engine
    engine = build_engine(config, target_code)

    # Train the model
    engine.train()

    # Optionally, evaluate the model
    engine.test()

    print(f"Training and evaluation completed for {particle_name}.")

if __name__ == "__main__":
    # Parse CLI arguments using tyro
    cli_config = tyro.cli(OneParticleConfig)

    # Load the configuration and execute training
    main(cli_config.config, cli_config.particle)
```

The scripts in this project use `tyro` to parse command-line arguments. Below are the CLI options for each script:
- **`train_one_particle.py`**
  - `config`: Path to the JSON configuration file.
  - `particle`: Name of the particle species to train the model for (e.g., `pion`, `kaon`, `proton`).
  - `output_file`: Path to save the metadata of the training run (e.g., configuration checksum, results directory).
- **`train_all_particles.py`**
  - `config`: Path to the JSON configuration file.
  - `all`: Use the same configuration for all particle species.
  - `<particle_name>`: Optional overrides for specific particle species (e.g., `pion`, `kaon`, `proton`).
  - `output_file`: Path to save the metadata of the training runs for all particles.
- **`sweep_one_particle.py`**
  - `config`: Path to the JSON configuration file.
  - `particle`: Name of the particle species to perform the hyperparameter sweep for.
  - `sweep_config`: Path to the sweep configuration file (e.g., for `wandb`).
  - `output_file`: Path to save the metadata of the sweep run (e.g., configuration checksum, results directory).
- **`sweep_all_particles.py`**
  - `config`: Path to the JSON configuration file.
  - `all`: Use the same sweep configuration for all particle species.
  - `<particle_name>`: Optional overrides for specific particle species (e.g., `pion`, `kaon`, `proton`).
  - `sweep_config`: Path to the sweep configuration file (e.g., for `wandb`).
  - `output_file`: Path to save the metadata of the sweep runs for all particles.
These options allow users to customize the training and hyperparameter sweep processes for specific particle species or all species at once.
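As a concrete illustration of how `tyro` exposes such options, here is a minimal, self-contained example (the `DemoArgs` dataclass is hypothetical and only mirrors the option names above):

```python
import dataclasses

import tyro

@dataclasses.dataclass
class DemoArgs:
    config: str                     # path to the JSON configuration file
    particle: str = "pion"          # particle species to train for
    output_file: str = "meta.json"  # where to save run metadata

if __name__ == "__main__":
    # tyro generates --config, --particle, and --output-file flags
    # (plus --help) from the dataclass fields.
    args = tyro.cli(DemoArgs)
    print(args)
```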
In this project, entities such as learning rate schedulers, loss functions, and model architectures are designed to be modular and easily extendable. Each entity has a corresponding `build_<entity>(cfg)` function that integrates it into the pipeline. To add a new entity, follow these general steps:
- **Update the Configuration**: Extend the relevant configuration dataclass in `src/pdi/config.py` to include parameters for the new entity. For example, if adding a new learning rate scheduler, update the `LRSchedulersConfig` dataclass:

  ```python
  @dataclasses.dataclass
  class LRSchedulersConfig:
      # Existing schedulers
      exponential: Optional[ExponentialConfig] = None
      cosine_restarts: Optional[CosineRestartsConfig] = None
      # Add your new scheduler
      new_scheduler: Optional[NewSchedulerConfig] = None
  ```

- **Implement the Entity**: Add the implementation of the new entity in the appropriate module. For example, if adding a new learning rate scheduler, implement it in `src/pdi/lr_schedulers.py`:

  ```python
  from torch.optim.lr_scheduler import NewScheduler

  def build_lr_scheduler(cfg: TrainingConfig, optimizer: Optimizer):
      if cfg.lr_scheduler == "new_scheduler":
          return NewScheduler(optimizer, **cfg.lr_schedulers.new_scheduler.params)
      ...
  ```

- **Update Documentation**: Document the new entity in the README or relevant documentation files. Include details about its configuration parameters and use cases.
In Run 3, most of the data is TPC-only, which caused the model to underperform on data that does include TOF. Specifically, in higher transverse-momentum (pT) ranges, recall was worse because the model learned to ignore these regions in TPC-only data. This led to poor performance even on data with TOF, where these regions are well distinguishable. To address this, we undersample TPC-only data so that the model learns to utilize TOF effectively.
Pions constitute ~90% of the observations, leading to a severe class imbalance. To address this, we undersample pions to match the next majority class, as sketched below. While this approach has shown limited benefit in some cases, we retain the option for future use to explore its potential in specific scenarios.
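A minimal sketch of the pion-undersampling idea (the column name and helper are hypothetical; the actual logic is controlled via the training configuration):

```python
import pandas as pd

def undersample_pions(df: pd.DataFrame, label_col: str = "target",
                      seed: int = 42) -> pd.DataFrame:
    # Downsample the majority class (pions) to the size of the
    # next-largest class, then reshuffle the combined frame.
    counts = df[label_col].value_counts()
    majority_label = counts.index[0]
    next_majority_size = int(counts.iloc[1])
    majority = df[df[label_col] == majority_label].sample(
        n=next_majority_size, random_state=seed
    )
    rest = df[df[label_col] != majority_label]
    return pd.concat([majority, rest]).sample(frac=1.0, random_state=seed)
```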
Below are some references and resources relevant to PIDML and particle identification in the ALICE experiment:
- **Particle Identification with Machine Learning from Incomplete Data in the ALICE Experiment**
  - Karwowska, M., Graczykowski, Ł., Deja, K., Kasak, M., Janik, M., on behalf of the ALICE Collaboration.
  - Journal of Instrumentation, 2024.
  - DOI: 10.1088/1748-0221/19/07/C07013
- **Particle Identification with Machine Learning in ALICE Run 3**
  - Karwowska, M., Jakubowska, M., Graczykowski, Ł., Deja, K., Kasak, M.
  - EPJ Web of Conferences, 2024.
  - DOI: 10.1051/epjconf/202429509029
- **Using Machine Learning for Particle Identification in ALICE**
  - Graczykowski, Ł. K., Jakubowska, M., Deja, K. R., Kabus, M., on behalf of the ALICE Collaboration.
  - Journal of Instrumentation, 2022.
  - DOI: 10.1088/1748-0221/17/07/C07016
- **Using Random Forest Classifier for Particle Identification in the ALICE Experiment**
  - Trzciński, T., Graczykowski, Ł., Glinka, M.
  - In Information Technology, Systems Research, and Computational Physics, Springer, 2020.
  - DOI: 10.1007/978-3-030-18058-4_1
These references provide a foundation for understanding the techniques and methodologies used in this project, as well as the integration of machine learning into particle identification workflows in the ALICE experiment.