The ALICE experiment at CERN is dedicated to studying the physics of strongly interacting matter at extreme energy densities, where a state of matter known as quark-gluon plasma is formed. One of the key challenges in this experiment is the accurate identification of particle species produced during high-energy collisions. This process, known as Particle Identification (PID), involves classifying particles such as (anti)pions, (anti)kaons, and (anti)protons based on the features of their tracks measured by the detectors.
Traditionally, PID has been performed using n-σ methods, which rely on the deviation of observed detector signals from the expected distributions of specific particle species generated in Monte Carlo (MC) simulations. If the deviation is below a fixed threshold, the particle is identified as belonging to that species. However, this approach has limitations, particularly in terms of accuracy and its ability to generalize to real experimental data, where distribution drifts between MC simulations and real data can occur.
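To make the idea concrete, here is a minimal sketch of an n-σ selection (a hypothetical function; in practice the expected signal and resolution come from detector-specific parameterizations):

```python
def n_sigma_select(measured: float, expected: float, sigma: float,
                   threshold: float = 3.0) -> bool:
    # Accept the species hypothesis if the measured detector signal lies
    # within `threshold` standard deviations of the expected value.
    n_sigma = (measured - expected) / sigma
    return abs(n_sigma) < threshold
```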
Our project seeks to address these limitations by leveraging machine learning (ML) models trained on MC data to improve the accuracy and reliability of PID. Specifically, we aim to:
- Develop ML-based methods that outperform traditional n-σ approaches in identifying particle species.
- Ensure the models are robust and reliable when applied to real experimental data, accounting for potential distribution drifts.
- Integrate the models into the O2Physics framework for production use and store trained models for all suitable LHC periods in the CCDB.
It is important to note that our focus is not on identifying rare particle species, which requires more sophisticated methods. Instead, we aim to provide an alternative, more accurate method for filtering out background particles, specifically (anti)pions, (anti)kaons, and (anti)protons. This will enhance the overall quality of the data used in subsequent analyses.
Due to the highly imbalanced nature of the dataset (the majority of particles produced in collisions are pions), we employ a binary classifier for each targeted particle species. This approach allows us to handle the imbalance more effectively. Additionally, the dataset often contains missing data, as certain detectors (TOF and TRD) are not always available. For instance, in Run 3, TOF data is rarely available, even though it is crucial for classification. To address this, we train either:
- Six models, one binary classifier per (anti)particle species, that can handle missing data directly, or
- Twenty-four models, one per species and per combination of missing detectors.
Our data preprocessing pipeline is implemented using the O2Physics `PIDMLProducer` task, which extracts the necessary features from raw data for training and inference. This ensures that the input to our ML models is consistent and optimized for the task at hand.
The training datasets for this project are pre-produced and stored on CERNBox. To begin working with the data, download it to the `data/raw` directory in your local project structure. These datasets are derived from Monte Carlo (MC) simulations and experimental data from various Run 3 LHC periods.
For additional metadata about the datasets, such as finding corresponding experimental data for a specific LHC period (e.g., LHC24b1b), you can use the MonALISA tool. Simply search for the desired period to retrieve relevant information.
The project is organized into the following main directories and files:
- `data/`: Stores raw and processed datasets used for training and evaluation.
  - `raw/`: Contains unprocessed ROOT data files.
  - `processed/`: Contains preprocessed data ready for use in training and testing.
- `experiments/`: Contains configurations for various experiments, such as hyperparameter sweeps.
- `results/`: Stores logs generated during training, testing, and hyperparameter tuning. Organized by experiment type.
- `notebooks/`: Contains Jupyter notebooks for exploratory data analysis (EDA), visualization, and model comparison.
- `scripts/`: Includes Python scripts for training, testing, and sweeping models. Examples include `train_one_particle.py` and `sweep_one_particle.py`.
- `src/`: The main source code directory for the project. Contains modules for data preparation, engine implementations, model definitions, and configuration handling.
- `default_config.json`: The default configuration file for training and testing runs.
Our project includes the following machine learning models, each tailored to specific requirements of particle identification:
- **Multi-Layer Perceptron (MLP)**: A basic neural network architecture that requires complete data. Missing values must be imputed (e.g., using mean or linear regression imputation).
  - Architecture: Fully connected layers with configurable hidden layers, activation functions (e.g., ReLU), and dropout.
  - Use Case: Suitable for datasets with no missing detector data.
- **Ensemble of MLPs**: A collection of MLPs, each trained for a specific combination of missing detectors.
  - Architecture: Each MLP in the ensemble has its own hidden layers, activation functions, and dropout settings.
  - Use Case: Designed to handle missing data by training separate models for each missing detector combination.
- **Attention-Based Model**: A neural network leveraging attention mechanisms to process incomplete data. Missing detectors are encoded using one-hot encoding.
  - Architecture: Includes embedding layers, multi-head attention blocks, feed-forward layers, and pooling layers. Configurable parameters include the number of attention heads, blocks, and hidden dimensions.
  - Use Case: Effective for datasets with missing data, providing robust performance across various detector combinations.
- **Attention-Based Domain Adversarial Neural Network (Attention-DANN)**: An extension of the attention-based model that incorporates domain adaptation to address distribution drifts between simulated and experimental data.
  - Architecture: Combines the attention-based model with a domain classifier trained adversarially to minimize domain-specific biases.
  - Use Case: Ideal for scenarios where the training data (simulated) differs significantly from the target data (experimental).
These models are implemented in `src/pdi/models.py` and configured via `src/pdi/config.py`.
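For illustration, here is a minimal PyTorch sketch of the MLP variant (the layer widths, activation, and dropout below are placeholder values; the actual definitions live in `src/pdi/models.py`):

```python
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    # Illustrative fully connected binary classifier with configurable
    # hidden layers, ReLU activations, and dropout.
    def __init__(self, in_features: int, hidden=(128, 64), dropout: float = 0.2):
        super().__init__()
        layers = []
        prev = in_features
        for width in hidden:
            layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(dropout)]
            prev = width
        layers.append(nn.Linear(prev, 1))  # single logit for binary PID
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```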
The configuration for training runs is defined in `src/pdi/config.py`. It provides a flexible and comprehensive structure to set up various aspects of the training process. The configuration is organized into several sections, including:
- **Data Configuration**: Defines how the data is preprocessed, split into training/validation/test sets, and filtered for outliers.
- **Model Configuration**: Specifies the architecture of the machine learning model to use (e.g., MLP, Ensemble, Attention, or Attention-DANN). Each architecture has its own set of parameters, such as the number of layers, hidden dimensions, activation functions, and dropout rates.
- **Training Configuration**: Includes settings for the optimizer (e.g., AdamW, SGD), learning rate schedulers, batch size, number of epochs, and mixed precision training. It also contains two undersampling methods: undersampling missing-detector groups and undersampling pions to the next majority class.
- **Sweep Configuration**: Allows for hyperparameter sweeps using tools like `wandb`. This includes defining the sweep name, project name, and parameter ranges.
- **Validation Configuration**: Contains parameters for evaluating the model during training, such as validation frequency and metrics to monitor.
- **Paths and Directories**: Specifies paths to simulated and experimental datasets, as well as directories for saving results, logs, and trained models.
A default configuration is provided in the `default_config.json` file located in the root directory. This file can be used as a starting point for creating custom configurations. Example experiments using this configuration can be found in the `experiments` directory, where various setups and hyperparameter sweeps are demonstrated. New experiment configurations should also be placed in the `experiments` directory.
The configuration includes a `seed` parameter, which ensures reproducibility across different runs of the project. This seed is used in various parts of the pipeline to control randomness, including:
- **Data Splitting**: The seed is used during the train/validation/test split to ensure that the same data subsets are created across runs. This is particularly important for maintaining consistent evaluation metrics when comparing models.
- **Batch Shuffling**: The seed is applied to shuffle data batches during training, ensuring that the order of data fed into the model remains consistent across runs.
This is critical for debugging, benchmarking, and comparing results across different configurations or models.
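For reference, a typical seeding routine looks like the following (a sketch only; the project's actual seeding code may differ):

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every RNG involved in splitting, shuffling, and weight
    # initialization so that runs with the same config are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```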
The training and testing flow in this project is built around the `DataPreparation` class and the engine classes (`ClassicEngine` and `DomainAdaptationEngine`), which are orchestrated by the `build_engine(cfg)` function. Each engine knows how to prepare its data and uses the `DataPreparation` class with the appropriate parameters to do so.
- **Data Preparation**: The `DataPreparation` class is responsible for handling all stages of data preprocessing, including:
  - Loading data from ROOT files.
  - Preprocessing data (e.g., setting NaNs, applying cuts on features like `fP` and `fTPCSignal`).
  - Splitting data into train/validation/test sets.
  - Standardizing data using statistics (mean and standard deviation) calculated on the training set.
  - Grouping data by missing detector combinations.
  - Creating data loaders for efficient batch processing during training and testing.

  The `DataPreparation` class also calculates a checksum based on the configuration, seed, and input data paths. This checksum is used to cache preprocessed data in a directory named after the checksum, ensuring that the same preprocessing steps do not need to be repeated for identical configurations (see the sketch after this list).
- **Engine Selection**: The `build_engine(cfg)` function determines the appropriate engine to use based on the model architecture specified in the configuration:
  - `ClassicEngine`: Used for standard PyTorch models like MLP, Ensemble, and Attention-based architectures.
  - `DomainAdaptationEngine`: Used for domain adversarial models like Attention-DANN, which require handling both simulated and experimental data.

  Each engine is responsible for:
  - Running data preparation (e.g., `DomainAdaptationEngine` runs it for both simulated and experimental data).
  - Initializing the model, optimizer, and learning rate scheduler.
  - Running the training loop, including logging metrics and saving the best model.
  - Evaluating the model on validation and test datasets.
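To make the caching behavior concrete, here is a minimal sketch of how such a checksum could be computed (the function name and arguments are hypothetical; the actual implementation lives in the `DataPreparation` class):

```python
import hashlib
import json

def config_checksum(cfg_dict: dict, seed: int, data_paths: list[str]) -> str:
    # Serialize the configuration, seed, and input paths deterministically,
    # then hash them; identical setups map to the same cache directory.
    payload = json.dumps(
        {"cfg": cfg_dict, "seed": seed, "paths": sorted(data_paths)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```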
Below is an example of how to execute training for a single particle species, adapted from the `train_one_particle.py` script:

```python
import tyro

from pdi.config import Config, OneParticleConfig
from pdi.constants import PART_NAME_TO_TARGET_CODE
from pdi.engines import build_engine

def main(config: Config, target_code: int):
    # Invert the name -> target-code mapping to recover the particle name.
    particle_name = {v: k for k, v in PART_NAME_TO_TARGET_CODE.items()}[target_code]
    print(f"Starting training for {particle_name}...")

    # Build the engine
    engine = build_engine(config, target_code)

    # Train the model
    engine.train()

    # Optionally, evaluate the model
    engine.test()

    print(f"Training and evaluation completed for {particle_name}.")

if __name__ == "__main__":
    # Parse CLI arguments using tyro
    cli_config = tyro.cli(OneParticleConfig)

    # Load the configuration and execute training
    main(cli_config.config, cli_config.particle)
```

The scripts in this project use `tyro` to parse command-line arguments. Below are the CLI options for each script:
- **`train_one_particle.py`**
  - `config`: Path to the JSON configuration file.
  - `particle`: Name of the particle species to train the model for (e.g., `pion`, `kaon`, `proton`).
  - `output_file`: Path to save the metadata of the training run (e.g., configuration checksum, results directory).
- **`train_all_particles.py`**
  - `config`: Path to the JSON configuration file.
  - `all`: Use the same configuration for all particle species.
  - `<particle_name>`: Optional overrides for specific particle species (e.g., `pion`, `kaon`, `proton`).
  - `output_file`: Path to save the metadata of the training runs for all particles.
- **`sweep_one_particle.py`**
  - `config`: Path to the JSON configuration file.
  - `particle`: Name of the particle species to perform the hyperparameter sweep for.
  - `sweep_config`: Path to the sweep configuration file (e.g., for `wandb`).
  - `output_file`: Path to save the metadata of the sweep run (e.g., configuration checksum, results directory).
- **`sweep_all_particles.py`**
  - `config`: Path to the JSON configuration file.
  - `all`: Use the same sweep configuration for all particle species.
  - `<particle_name>`: Optional overrides for specific particle species (e.g., `pion`, `kaon`, `proton`).
  - `sweep_config`: Path to the sweep configuration file (e.g., for `wandb`).
  - `output_file`: Path to save the metadata of the sweep runs for all particles.
These options allow users to customize the training and hyperparameter sweep processes for specific particle species or all species at once.
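As a concrete illustration of how `tyro` exposes such options, here is a minimal, self-contained example (the `DemoArgs` dataclass is hypothetical and only mirrors the option names above):

```python
import dataclasses

import tyro

@dataclasses.dataclass
class DemoArgs:
    config: str                     # path to the JSON configuration file
    particle: str = "pion"          # particle species to train for
    output_file: str = "meta.json"  # where to save run metadata

if __name__ == "__main__":
    # tyro generates --config, --particle, and --output-file flags
    # (plus --help) from the dataclass fields.
    args = tyro.cli(DemoArgs)
    print(args)
```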
In this project, entities such as learning rate schedulers, loss functions, and model architectures are designed to be modular and easily extendable. Each entity has a corresponding `build_<entity>(cfg)` function that integrates it into the pipeline. To add a new entity, follow these general steps:
- **Update the Configuration**: Extend the relevant configuration dataclass in `src/pdi/config.py` to include parameters for the new entity. For example, if adding a new learning rate scheduler, update the `LRSchedulersConfig` dataclass:

  ```python
  @dataclasses.dataclass
  class LRSchedulersConfig:
      # Existing schedulers
      exponential: Optional[ExponentialConfig] = None
      cosine_restarts: Optional[CosineRestartsConfig] = None
      # Add your new scheduler
      new_scheduler: Optional[NewSchedulerConfig] = None
  ```

- **Implement the Entity**: Add the implementation of the new entity in the appropriate module. For example, if adding a new learning rate scheduler, implement it in `src/pdi/lr_schedulers.py`:

  ```python
  from torch.optim.lr_scheduler import NewScheduler

  def build_lr_scheduler(cfg: TrainingConfig, optimizer: Optimizer):
      if cfg.lr_scheduler == "new_scheduler":
          return NewScheduler(optimizer, **cfg.lr_schedulers.new_scheduler.params)
      ...
  ```

- **Update Documentation**: Document the new entity in the README or relevant documentation files. Include details about its configuration parameters and use cases.
In Run 3, most of the data is TPC-only, which caused the model to underperform on data that does include TOF. Specifically, in higher transverse-momentum (pT) ranges, recall was worse because the model learned to ignore these regions in TPC-only data. This led to poor performance even on data with TOF, where these regions are well distinguishable. To address this, we undersample TPC-only data so that the model learns to utilize TOF effectively.
Pions constitute ~90% of the observations, leading to a severe class imbalance. To address this, we undersample pions to match the next majority class, as sketched below. While this approach has shown limited benefit in some cases, we retain the option for future use to explore its potential in specific scenarios.
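A minimal sketch of the pion-undersampling idea (the column name and helper are hypothetical; the actual logic is controlled via the training configuration):

```python
import pandas as pd

def undersample_pions(df: pd.DataFrame, label_col: str = "target",
                      seed: int = 42) -> pd.DataFrame:
    # Downsample the majority class (pions) to the size of the
    # next-largest class, then reshuffle the combined frame.
    counts = df[label_col].value_counts()
    majority_label = counts.index[0]
    next_majority_size = int(counts.iloc[1])
    majority = df[df[label_col] == majority_label].sample(
        n=next_majority_size, random_state=seed
    )
    rest = df[df[label_col] != majority_label]
    return pd.concat([majority, rest]).sample(frac=1.0, random_state=seed)
```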
Below are some references and resources relevant to PIDML and particle identification in the ALICE experiment:
- **Particle Identification with Machine Learning from Incomplete Data in the ALICE Experiment**
  - Karwowska, M., Graczykowski, Ł., Deja, K., Kasak, M., Janik, M., on behalf of the ALICE Collaboration.
  - Journal of Instrumentation, 2024.
  - DOI: 10.1088/1748-0221/19/07/C07013
- **Particle Identification with Machine Learning in ALICE Run 3**
  - Karwowska, M., Jakubowska, M., Graczykowski, Ł., Deja, K., Kasak, M.
  - EPJ Web of Conferences, 2024.
  - DOI: 10.1051/epjconf/202429509029
- **Using Machine Learning for Particle Identification in ALICE**
  - Graczykowski, Ł. K., Jakubowska, M., Deja, K. R., Kabus, M., on behalf of the ALICE Collaboration.
  - Journal of Instrumentation, 2022.
  - DOI: 10.1088/1748-0221/17/07/C07016
- **Using Random Forest Classifier for Particle Identification in the ALICE Experiment**
  - Trzciński, T., Graczykowski, Ł., Glinka, M.
  - In Information Technology, Systems Research, and Computational Physics, Springer, 2020.
  - DOI: 10.1007/978-3-030-18058-4_1
These references provide a foundation for understanding the techniques and methodologies used in this project, as well as the integration of machine learning into particle identification workflows in the ALICE experiment.