Selphi is a tool for genotype imputation based on a weighted-PBWT (Positional Burrows-Wheeler Transform) algorithm. It provides efficient imputation of missing genotypes in a target sample dataset using a reference panel, processing entire chromosomes in a single pass to avoid edge effects from windowed approaches.
A tiny example dataset is included in the example/ directory (chr22, 100 reference samples, 2 target samples). After installing Selphi with any of the methods below, you can run:
selphi \
--target example/data/target.vcf.gz \
--refpanel example/selphi_ref/chr22 \
--map example/data/genetic_map.chr22.map \
--outvcf example/results/imputed \
--cores 4
See example/README.md for full details.
- Site Compatibility: The target dataset should only contain sites that are a subset of the reference panel.
- Chromosome Consistency: Both the target and reference panel must pertain to the same chromosome.
- Phased Genotypes: All target genotypes must be phased.
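The three requirements above can be checked before a run. The following is a minimal pre-flight sketch, not part of Selphi: the function names `parse_sites` and `check_target` are hypothetical, it assumes uncompressed VCF text lines (BCF is not handled), and it assumes sites are matched on chromosome, position, REF, and ALT. Selphi may apply its own, different checks.

```python
from typing import Iterable, List, Set, Tuple

Site = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def parse_sites(vcf_lines: Iterable[str]) -> Set[Site]:
    """Collect (CHROM, POS, REF, ALT) keys from VCF text lines."""
    sites = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.split("\t", 5)[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def check_target(target_lines: Iterable[str], ref_sites: Set[Site]) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    ref_chroms = {site[0] for site in ref_sites}
    target_chroms = set()
    for line in target_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        target_chroms.add(chrom)
        # Site compatibility: every target site must exist in the reference.
        if (chrom, pos, ref, alt) not in ref_sites:
            problems.append(f"{chrom}:{pos} {ref}>{alt} is not in the reference panel")
        # Phasing: GT is the first colon-separated subfield of each sample column.
        genotypes = [sample.split(":", 1)[0] for sample in fields[9:]]
        if any("/" in gt for gt in genotypes):
            problems.append(f"{chrom}:{pos} has an unphased genotype")
    # Chromosome consistency: both inputs must cover the same chromosome.
    if target_chroms != ref_chroms:
        problems.append(
            f"chromosome mismatch: target {sorted(target_chroms)}, "
            f"reference {sorted(ref_chroms)}"
        )
    return problems
```

For compressed input, the lines can be supplied with `gzip.open(path, "rt")`. For large panels, a tool such as bcftools is likely a better fit than loading every site into memory.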
The install script builds all dependencies (htslib, bcftools, pbwt, Python packages) into a self-contained directory. It supports Linux (Ubuntu/Debian, CentOS/RHEL/Fedora, Arch) and macOS.
# Install system prerequisites (Ubuntu/Debian)
sudo apt-get update && sudo apt-get install -y \
gcc g++ make autoconf automake git curl pkg-config \
zlib1g-dev libbz2-dev liblzma-dev libzip-dev libcurl4-openssl-dev \
python3 python3-pip python3-venv
# Install system prerequisites (macOS)
xcode-select --install
brew install autoconf automake git curl pkg-config xz libzip python@3.11
# Run the installer
./install.sh # installs to ./selphi_env
./install.sh /opt/selphi # or specify a custom prefix
# After installation
export PATH="/path/to/selphi_env/bin:$PATH"
selphi --help
The installer also supports these options:
- `--skip-xsqueezeit`: skip optional xSqueezeIt (only needed for `.xsi` reference panels)
- `--skip-python`: skip Python venv setup (use system Python instead)
- `--python /path/to/python3`: use a specific Python 3.10+ executable
- `-j N`: set parallel build jobs (default: auto-detect)
Build the Docker image from the included Dockerfile:
docker build -t selphi .
docker run selphi --help
For HPC environments where Docker is not available, build a Singularity image from the Dockerfile:
singularity build selphi.sif docker-daemon://selphi:latest
singularity run selphi.sif --help
This converts the local Docker image, so build that image first with the docker build command above. The same conversion also works with Apptainer:
apptainer build selphi.sif docker-daemon://selphi:latest
Convert a phased VCF/BCF reference panel to Selphi's internal formats (.pbwt, .samples, .sites, .srp):
# Standalone
selphi \
--prepare_reference \
--ref_source_vcf /path/to/refpanel.vcf.gz \
--refpanel /path/to/output_prefix \
--cores 16
# Docker
docker run -v /path/to/data:/data selphi \
--prepare_reference \
--ref_source_vcf /data/refpanel.vcf.gz \
--refpanel /data/output_prefix \
--cores 16
This generates four files: output_prefix.pbwt, output_prefix.samples, output_prefix.sites, and output_prefix.srp. These files can be reused across imputation runs. Adding cores decreases .srp creation time linearly but increases memory usage.
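Since later runs reuse these files, it can help to confirm that all four exist before starting an imputation. The helper below is a hypothetical convenience sketch, not part of Selphi. It assumes only the four suffixes listed above and that each file is named as the prefix plus its suffix.

```python
from pathlib import Path
from typing import List

# Suffixes of the files produced by --prepare_reference, as listed above.
REFPANEL_SUFFIXES = (".pbwt", ".samples", ".sites", ".srp")


def missing_refpanel_files(prefix: str) -> List[str]:
    """Return the expected reference panel files that do not exist for this prefix."""
    # Append each suffix to the full prefix string. Path.with_suffix would
    # replace part of a prefix that already contains a dot (e.g. "panel.chr22").
    return [
        prefix + suffix
        for suffix in REFPANEL_SUFFIXES
        if not Path(prefix + suffix).is_file()
    ]
```

An empty list means the prefix can be passed to `--refpanel`. Any returned paths suggest that the preparation step should be rerun.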
# Standalone
selphi \
--refpanel /path/to/refpanel_prefix \
--target /path/to/target.vcf.gz \
--map /path/to/genetic_map.chrN.GRCh38.map \
--outvcf /path/to/output \
--cores 16
# Docker
docker run -v /path/to/data:/data selphi \
--refpanel /data/refpanel_prefix \
--target /data/target.vcf.gz \
--map /data/genetic_map.chrN.GRCh38.map \
--outvcf /data/output \
--cores 16

| Option | Description |
|---|---|
| `--refpanel REFPANEL` | Location of reference panel files (required) |
| `--target TARGET` | Path to VCF/BCF containing target samples |
| `--map MAP` | Path to genetic map in plink format |
| `--outvcf OUTVCF` | Output path for imputed VCF (.vcf.gz added automatically) |
| `--cores CORES` | Number of cores for parallel processing (default: 1) |
| `--prepare_reference` | Convert reference panel to PBWT and SRP formats |
| `--ref_source_vcf` | VCF/BCF reference panel source (with --prepare_reference) |
| `--ref_source_xsi` | XSI reference panel source (with --prepare_reference) |
| `--pbwt_path` | Path to pbwt binary (auto-detected by install script) |
| `--tmp_path` | Location for temporary files |
| `--match_length` | Minimum PBWT match length |
| `--est_ne` | Estimated effective population size (default: 1000000) |
| `--no_core_reduction` | Disable automatic core reduction to limit HMM memory |
| `--chunk_size` | Chunk size for reference panel creation |
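
When Selphi is driven from a pipeline script, the options above can be assembled into an argument list. This is a minimal sketch that uses only flags from the table. The function name `build_impute_command` and its parameter names are hypothetical, and it assumes the `selphi` executable is on `PATH`.

```python
from typing import List, Optional


def build_impute_command(
    refpanel: str,
    target: str,
    genetic_map: str,
    outvcf: str,
    cores: int = 1,
    tmp_path: Optional[str] = None,
) -> List[str]:
    """Assemble an imputation command line from the documented options."""
    cmd = [
        "selphi",
        "--refpanel", refpanel,
        "--target", target,
        "--map", genetic_map,
        "--outvcf", outvcf,  # Selphi appends .vcf.gz to this path
        "--cores", str(cores),
    ]
    if tmp_path is not None:
        cmd += ["--tmp_path", tmp_path]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.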
- One chromosome per file, matching the reference panel
- All genotypes must be phased
- Variants not found in the reference panel are automatically appended to the output
- `example/`: Tiny example dataset for quick pipeline validation
- `docs/SRP_FORMAT.md`: Full specification of the `.srp` (Sparse Reference Panel) file format, including archive structure, chunk layout, sparse matrix encoding, and access patterns
Selphi requires a genetic map in plink format for recombination rate interpolation. GRCh38 maps are available from the Eagle repository.
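To illustrate what interpolating from a genetic map involves, here is a minimal sketch of linear interpolation between map entries. It is not Selphi's implementation. It assumes the four-column PLINK map layout (chromosome, variant identifier, genetic position in cM, base-pair position), no header line, and entries sorted by position. Clamping positions outside the map to the nearest entry is also an assumption. The function names are hypothetical.

```python
from bisect import bisect_right
from typing import Iterable, List, Sequence, Tuple


def read_plink_map(lines: Iterable[str]) -> Tuple[List[int], List[float]]:
    """Parse map lines into parallel lists of base-pair and cM positions."""
    bp, cm = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or short lines
        cm.append(float(fields[2]))  # column 3: genetic position (cM)
        bp.append(int(fields[3]))    # column 4: base-pair position
    return bp, cm


def interpolate_cm(pos: int, bp: Sequence[int], cm: Sequence[float]) -> float:
    """Linearly interpolate the genetic position (cM) at base-pair position pos."""
    if pos <= bp[0]:
        return cm[0]   # before the first entry: clamp (assumption)
    if pos >= bp[-1]:
        return cm[-1]  # after the last entry: clamp (assumption)
    i = bisect_right(bp, pos)  # now bp[i - 1] <= pos < bp[i]
    frac = (pos - bp[i - 1]) / (bp[i] - bp[i - 1])
    return cm[i - 1] + frac * (cm[i] - cm[i - 1])
```

For example, with map entries at 1000 bp (0.0 cM) and 2000 bp (1.0 cM), a variant at 1500 bp falls halfway between them and is assigned 0.5 cM.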
If you encounter any issues or have suggestions for improvements, please submit a pull request or create an issue in the GitHub repository.
If you use Selphi in your research, please cite:
Empowering GWAS Discovery through Enhanced Genotype Imputation
Adriano De Marino, Abdallah Amr Mahmoud, Sandra Bohn, Jon Lerga-Jaso,
Biljana Novković, Charlie Manson, Salvatore Loguercio, Andrew Terpolovsky,
Mykyta Matushyn, Ali Torkamani, Puya G. Yazdi
medRxiv 2023.12.18.23300143; doi: https://doi.org/10.1101/2023.12.18.23300143
This software is provided free of charge for academic research use only. Any use by commercial entities, for-profit organizations, or consultants is strictly prohibited without prior authorization. For inquiries about commercial licensing, contact pyazdi@gmail.com.