Package to process images and their features from .cyz files from the CytoSense and upload them to EcoTaxa.
NB: As for all things Python, you should preferably install CytoProcess within a Python venv/conda environment. The package is tested with Python 3.11 and should therefore work with this or a more recent version. To create a conda environment, use
conda create -n cytoprocess python=3.11
conda activate cytoprocess
Then install the stable version with
pip install cytoprocess
or the development version with
pip install git+https://github.com/jiho/cytoprocess.git
The Python package includes a command line tool, which should become available from within a terminal. To try it and output the help message, run
cytoprocess
CytoProcess depends on Cyz2Json. To install it, run
cytoprocess install
CytoProcess uses the concept of "project". A project corresponds conceptually to a cruise, a time series, etc. Practically, it is a directory with a specific set of subdirectories that contain all files related to the cruise/time series/etc. It corresponds to a single EcoTaxa project.
Each .cyz file is considered as a "sample" (and will correspond to an EcoTaxa sample).
A project is organised like so
my_project/
    config      configuration files
    raw         source .cyz files
    meta        file storing manually-provided metadata for each sample (lat, lon, etc.)
    work        data extracted by the various processing steps
        <sample_id_1>                     in one folder per sample
            converted_data.json           file converted from .cyz by Cyz2Json
            cytometric_features.parquet   average cytometric measurement per image
            image_features.parquet        features computed on each image (area, etc.)
            images                        images with scale bar and mask for the particle
            metadata.parquet              instrument metadata extracted from the .json file
            pulses_plots                  plot of the pulse shapes of imaged particles
            pulses_summaries.parquet      polynomial summaries of the pulse shapes
        <sample_id_2>
        ...
    ecotaxa     .zip files ready for upload in EcoTaxa
    logs        logs of all commands executed on this project, split per day
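To make the layout concrete, here is a minimal Python sketch that walks the work directory and reports which of the per-sample files listed above are still missing. It only illustrates the directory structure; `cytoprocess status` is the supported way to check progress, and the function name `missing_outputs` is made up for this example.

```python
from pathlib import Path

# Per-sample files named in the project layout above.
EXPECTED_OUTPUTS = [
    "converted_data.json",
    "metadata.parquet",
    "cytometric_features.parquet",
    "pulses_summaries.parquet",
    "image_features.parquet",
]


def missing_outputs(project_dir):
    """Map each sample folder in work/ to the expected files it still lacks."""
    work = Path(project_dir) / "work"
    if not work.is_dir():
        return {}
    report = {}
    for sample_dir in sorted(p for p in work.iterdir() if p.is_dir()):
        report[sample_dir.name] = [
            name for name in EXPECTED_OUTPUTS if not (sample_dir / name).exists()
        ]
    return report
```

Calling `missing_outputs("path/to/my_project")` returns a dictionary keyed by sample folder name; an empty list means all five files are present for that sample.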
A CytoProcess command line looks like
cytoprocess --global-option command --command-option project_directory
To know which global options and which commands are available, use
cytoprocess --help
To know which options are available for a given command, use
cytoprocess command --help
To create a project, use
cytoprocess create path/to/my_project
Then copy/move the .cyz files that are relevant for this project into my_project/raw. If you have an archive of .cyz files organised differently, you should be able to symlink them into my_project/raw instead of copying them.
List available raw samples and create the meta/samples.csv file with
cytoprocess list path/to/my_project
Manually enter the required metadata (such as lon, lat, etc.) in the .csv file. You can add or remove columns as you see fit; use the option --extra-fields (or -e) to add extra columns to the default ones. The conventions follow those of EcoTaxa.
Then, perform all processing steps, for all samples, with default options
cytoprocess all path/to/my_project
To check how far along the processing of each sample is, use
cytoprocess status path/to/project
When several samples have been prepared, you can upload them in batches to EcoTaxa with
cytoprocess upload path/to/project
If you want to know the details, or proceed manually, the steps behind all are:
# convert .cyz files into .json and create a placeholder for their metadata
cytoprocess convert path/to/project
# extract instrument provided metadata from each .json file
cytoprocess extract_meta path/to/project
# extract cytometric features for each imaged particle
cytoprocess extract_cyto path/to/project
# compute pulse shapes polynomial summaries for each imaged particle
cytoprocess summarise_pulses path/to/project
# extract images and their features
cytoprocess extract_images path/to/project
# prepare files for ecotaxa upload
cytoprocess prepare path/to/project
To process a subset of samples, use
cytoprocess --sample 'name_of_cyz_file' command path/to/project
which processes this single sample. Or
cytoprocess --sample '*foo*' command path/to/project
which processes all samples whose name contains foo.
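To get a feel for which samples a pattern would select before running a command, the matching can be previewed in Python. This sketch assumes shell-style matching as implemented by the standard fnmatch module; CytoProcess may match names slightly differently, and the sample names below are invented.

```python
from fnmatch import fnmatchcase


def select_samples(sample_names, pattern):
    """Keep the names matching a shell-style pattern (case-sensitive)."""
    return [name for name in sample_names if fnmatchcase(name, pattern)]


# Invented sample names, for illustration only.
samples = ["cruise_foo_01", "cruise_bar_01", "foo"]

print(select_samples(samples, "*foo*"))  # names containing "foo"
print(select_samples(samples, "foo"))    # the sample called exactly "foo"
```

A pattern without wildcards only matches a sample with exactly that name, which is why quoting the pattern on the command line matters: unquoted, the shell may expand `*foo*` itself.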
All commands will skip the processing of a given sample if the output is already present. To re-process and overwrite, use the --force option.
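If you prefer to drive the individual steps from a script, the placement of options matters: --sample is a global option and goes before the command, while --force is a command option and goes after it. The sketch below only builds the argument lists for the six steps behind all; `build_commands` is a helper invented for this example, not part of CytoProcess.

```python
import shlex

# The steps behind `cytoprocess all`, in order.
STEPS = [
    "convert",
    "extract_meta",
    "extract_cyto",
    "summarise_pulses",
    "extract_images",
    "prepare",
]


def build_commands(project, sample=None, force=False):
    """Return one argument list per processing step."""
    commands = []
    for step in STEPS:
        argv = ["cytoprocess"]
        if sample is not None:
            argv += ["--sample", sample]  # global option: before the command
        argv.append(step)
        if force:
            argv.append("--force")  # command option: after the command
        argv.append(project)
        commands.append(argv)
    return commands


# Print the command lines, quoted as they would be typed in a shell.
for argv in build_commands("path/to/project", sample="*foo*", force=True):
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` to actually execute the step and stop at the first failure.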
For metadata and cytometric features extraction (extract_meta and extract_cyto), information from the json file needs to be curated and translated into EcoTaxa metadata columns. This is defined in the configuration file my_project/config/config.yaml. It contains key: value pairs of the form json.fields.item.name: ecotaxa_name. To get the list of possible json fields, use the --list (or -l) option for extract_meta or extract_cyto; it will write a text file in config with all possibilities. You can then copy-paste them to config/config.yaml.
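As an illustration of what such a mapping expresses, here is a small sketch that follows dotted paths through a nested dictionary and renames the values found. The .json content, the field paths, and the EcoTaxa column names are all invented; how CytoProcess actually resolves the paths (lists, missing fields, etc.) may differ.

```python
def lookup(record, dotted_path):
    """Follow a dotted path such as 'a.b.c' through nested dictionaries."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def translate(record, mapping):
    """Build {ecotaxa_name: value} from {json.path: ecotaxa_name} pairs."""
    return {name: lookup(record, path) for path, name in mapping.items()}


# Invented content standing in for a converted .json file.
record = {"instrument": {"name": "CytoSense", "settings": {"flow_rate": 2.5}}}

# Invented mapping, in the same key: value form as config.yaml.
mapping = {
    "instrument.name": "acq_instrument",
    "instrument.settings.flow_rate": "acq_flow_rate",
}

print(translate(record, mapping))
```

A path that does not exist in the record raises a KeyError here, which is one reason to generate the list of valid paths with --list rather than typing them by hand.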
Even with all these fields available, the CytoSense does not record some relevant metadata such as latitude, longitude, and date of collection of each sample, which EcoTaxa needs to filter the data or export it to other data bases. You should provide such fields manually by editing the meta/samples.csv file.
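Since this file is filled in by hand, a quick check for cells left empty can save a failed import later. The sketch below uses only the standard csv module and makes no assumption about column names; not every column is necessarily mandatory, so treat its output as a list of things to review.

```python
import csv


def empty_cells(csv_path):
    """Return (line_number, column_name) for every empty cell in the file."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Line 1 is the header, so data rows start at line 2
        # (assuming no cell contains a line break).
        for line_number, row in enumerate(csv.DictReader(handle), start=2):
            for column, value in row.items():
                if column is None:
                    continue  # extra cells beyond the header, ignored here
                if value is None or not value.strip():
                    problems.append((line_number, column))
    return problems
```

For example, `empty_cells("path/to/my_project/meta/samples.csv")` returns an empty list when every cell has a value.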
If you change this metadata or the mapping of fields in config.yaml and want to reimport the modified .tsv files on EcoTaxa, you can do so with
# re-generate the .tsv files with the corrected metadata
cytoprocess prepare --force path/to/project
# re-upload the .tsv only and use "Update metadata" mode
cytoprocess upload --update path/to/project
To predict the classification of images outside of EcoTaxa, you need to define a function with the signature def my_fun(paths, features), either in a custom package that can be found by Python with import or, simply, in a file (e.g. my_model.py). Then you can call
cytoprocess predict --model my_model.py::my_fun /path/to/project
For each sample, this will pass two arguments to your function:
paths: a list of absolute paths to the images
features: a pandas DataFrame with as many rows as there are images and cytometric + image features as columns.
Your function should return a DataFrame (or a dictionary that can be converted to one) with rows matching the input images (same number, same order) and at least one column called 'annotation_category' containing the predicted EcoTaxa category name for each image.
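A minimal sketch of such a function, to show the expected input and output shapes rather than a useful model: it applies a fixed threshold to one feature and returns a dictionary. The column name "area", the threshold, and the two category names are placeholders; replace them with columns present in your features and categories that exist in EcoTaxa.

```python
# Placeholders: adapt to your own feature columns and EcoTaxa categories.
AREA_COLUMN = "area"
AREA_THRESHOLD = 500.0
SMALL_CATEGORY = "small_particle"
LARGE_CATEGORY = "large_particle"


def my_fun(paths, features):
    """Toy rule: one predicted category per image, in the input order."""
    # Works with a pandas DataFrame or any mapping of column name to values.
    areas = list(features[AREA_COLUMN])
    if len(areas) != len(paths):
        raise ValueError("features and paths should describe the same images")
    categories = [
        LARGE_CATEGORY if area >= AREA_THRESHOLD else SMALL_CATEGORY
        for area in areas
    ]
    # A dictionary of equal-length lists can be converted to a DataFrame.
    return {"annotation_category": categories}
```

Saved as my_model.py, this would be referenced as `my_model.py::my_fun`. A real model would typically load the images from paths and add further columns, such as a score, next to 'annotation_category'.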
NB: If you use deep learning approaches, remember that the images all have a 31 px high scale bar at the bottom which should be removed before processing the image (typically, at the very beginning of your data loader).
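Removing the scale bar amounts to dropping the last 31 rows of the image. The sketch below assumes the image is stored row by row with height as the first axis, which is the case for a NumPy array built from an image file and for a plain list of rows; `remove_scale_bar` is a name chosen for this example.

```python
SCALE_BAR_HEIGHT = 31  # pixels, as stated above


def remove_scale_bar(image, bar_height=SCALE_BAR_HEIGHT):
    """Drop the bottom rows of an image whose first axis is height."""
    if bar_height <= 0:
        raise ValueError("bar_height should be a positive number of pixels")
    if len(image) <= bar_height:
        raise ValueError("image is not taller than the scale bar")
    return image[:-bar_height]


# A 40-row by 4-column dummy image, as a list of rows.
dummy = [[0] * 4 for _ in range(40)]
print(len(remove_scale_bar(dummy)))  # number of rows left
```

With a NumPy array the same slicing applies unchanged, for greyscale (height, width) as well as colour (height, width, channels) images.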
Because everything is stored in the EcoTaxa .zip files and can be re-generated from the .cyz files, you may want to remove the intermediate files, in work, as well as old log files, to reclaim disk space. For example, to remove intermediate files and log files older than 20 days
cytoprocess clean --older-than 20 path/to/project
Here are all cytoprocess commands
Usage: cytoprocess [OPTIONS] COMMAND [ARGS]...
CytoProcess command line interface
CytoProcess is a tool to process CytoSense images and upload them to EcoTaxa.
It uses the concept of "project" to organise the data and metadata. It
provides commands to create a project, convert raw files, extract metadata and
features, summarise pulse shapes, extract images, optionally predict their
classification, prepare files for EcoTaxa, and upload them.
Options:
-d, --debug Show debugging messages.
-s, --sample TEXT Limit processing to the sample(s) matching the given
string, including globbing patterns (e.g. 'sample_123' to
process only the sample called exactly that or '*2025*' to
process all samples with '2025' in their name).
--help Show this message and exit.
Commands:
install Install dependency: Cyz2Json converter.
create Create a new CytoProcess project directory.
list List samples and create/update meta/samples.csv.
convert Convert .cyz files to .json format.
extract_meta Extract instrument metadata from .json files.
extract_cyto Extract cytometric features from .json files.
summarise_pulses Summarise pulse shapes.
extract_images Extract images from .json files.
prepare Prepare .tsv and images for EcoTaxa.
upload Upload files to EcoTaxa.
all Run all steps from convert to prepare in sequence.
status Show per-sample processing status.
clean Remove intermediate files in the project.
predict Run a user-provided model to predict classifications.
Usage: cytoprocess install [OPTIONS]
Install dependency: Cyz2Json converter.
This tool is required to convert .cyz files in a readable .json format. It is
distributed from https://github.com/OBAMANEXT/cyz2json. This command installs
the latest release automatically.
Options:
-f, --force Force (re)installation of the latest release even if Cyz2Json
already exists.
--help Show this message and exit.
Usage: cytoprocess create [OPTIONS] PROJECT
Create a new CytoProcess project directory.
Options:
--help Show this message and exit.
Usage: cytoprocess list [OPTIONS] PROJECT
List samples and create/update meta/samples.csv.
Run this after creating the project and after adding new .cyz files to it.
Once run, the file meta/samples.csv should be edited to add metadata for each
sample. The default metadata fields are very relevant for EcoTaxa (location,
time, etc.).
Options:
-e, --extra-fields TEXT Comma-separated list of extra fields to add as
columns in samples.csv.
--help Show this message and exit.
Usage: cytoprocess convert [OPTIONS] PROJECT
Convert .cyz files to .json format.
Options:
-f, --force Force conversion even if .json files already exist.
--help Show this message and exit.
Usage: cytoprocess extract_meta [OPTIONS] PROJECT
Extract instrument metadata from .json files.
These are metadata fields stored in the .json by the CytoSense itself. They
are useful to describe the acquisition of the samples.
The names of the fields can be found by using the `--list` option, and should
then be mapped to EcoTaxa metadata columns in config.yaml.
Options:
-l, --list List all metadata items found in the .json file(s) instead of
extracting some of them.
-f, --force Force extraction even if output files already exist.
--help Show this message and exit.
Usage: cytoprocess extract_cyto [OPTIONS] PROJECT
Extract cytometric features from .json files.
These correspond to what is traditionally called the "listmode" files: they
are summaries of the pulse shape per channel for each object (maximum value,
average value, etc.). Some can be directly informative biologically and all
can be used by machine learning algorithms to predict classifications.
Similarly to the metadata fields, the names of the cytometric features can be
found by using the `--list` option, and should then be mapped to EcoTaxa
feature columns in config.yaml.
Options:
-l, --list List all cytometric fields paths found in the .json file(s)
instead of extracting some of them.
-f, --force Force extraction even if output files already exist.
--help Show this message and exit.
Usage: cytoprocess summarise_pulses [OPTIONS] PROJECT
Summarise pulse shapes.
The pulse shapes for each particle are standardised between 0 and 1 and then
approximated by a polynomial of degree `n_poly` (default 10). The coefficients
of the polynomial are then used as features in EcoTaxa. This is a way to
summarise the pulse shapes while keeping their general form, which can be
informative for classification.
In addition, a plot of the standardised pulse shape is created. This plot is
uploaded to EcoTaxa as the third image of each object.
Options:
-n, --n-poly INTEGER Number of polynomial coefficients
-f, --force Force processing even if output files already exist.
-m, --max-cores INTEGER Maximum number of CPU cores to use for parallel
processing.
--help Show this message and exit.
Usage: cytoprocess extract_images [OPTIONS] PROJECT
Extract images from .json files.
Extract the images from the .json files and segment the main object in each.
It stores a file for the image and for the mask, which are both uploaded to
EcoTaxa.
Some usual features are measured on the segmented object (area, perimeter,
etc.), which are added to the features extracted from the pulse shapes and can
be used for classification or biological interpretation.
Options:
-f, --force Force extraction even if output files already exist.
-m, --max-cores INTEGER Maximum number of CPU cores to use for parallel
processing.
--help Show this message and exit.
Usage: cytoprocess prepare [OPTIONS] PROJECT
Prepare .tsv and images for EcoTaxa.
Create a .zip archive in the `ecotaxa` folder with: (1) the .tsv file with the
metadata and features for each object and (2) three images per object (image,
mask, pulse plot), for each sample.
Options:
-f, --force Force preparation even if output files already exist.
--help Show this message and exit.
Usage: cytoprocess upload [OPTIONS] PROJECT
Upload files to EcoTaxa.
The .zip files prepared are uploaded and then imported into an EcoTaxa
project, configured in config.yaml.
If the `--update` flag is used, only the metadata of existing samples is
updated, without re-uploading the images. This is useful to update the
metadata after editing samples.csv: it only requires to re-run the `prepare`
step and then `upload --update`.
Multiple samples can be aggregated into a single upload using `--batch`. The
samples are combined into one zip file, uploaded, and imported together. This
is usually faster than uploading sample per sample.
Options:
-u, --username TEXT EcoTaxa email address.
-p, --password TEXT EcoTaxa password.
--update Only update the metadata for existing samples.
-b, --batch INTEGER RANGE Number of samples to aggregate into a single upload
(default: 10). [x>=1]
--help Show this message and exit.
Usage: cytoprocess all [OPTIONS] PROJECT
Run all steps from convert to prepare in sequence.
Options:
-f, --force Force processing even if output already exists.
-n, --n-poly INTEGER Number of polynomial coefficients.
-m, --max-cores INTEGER Maximum number of CPU cores to use for parallel
processing.
--help Show this message and exit.
Usage: cytoprocess status [OPTIONS] PROJECT
Show per-sample processing status.
Options:
-w, --width INTEGER Width of the sample ID display (truncated with ellipsis
if too long).
--help Show this message and exit.
Usage: cytoprocess clean [OPTIONS] PROJECT
Remove intermediate files in the project.
Remove the `work` directory: everything in it can be re-generated by re-
running the commands and the relevant content is stored in the .zip files in
`ecotaxa`.
Optionally, log files older than a certain number of days can also be
removed.
Options:
-o, --older-than INTEGER Remove log files older than this many days (by
default, do not remove anything).
--help Show this message and exit.
Usage: cytoprocess predict [OPTIONS] PROJECT
Run a user-provided model to predict classifications.
The function should accept two arguments:
1. paths: a list of absolute paths to the images,
2. features: a DataFrame with one row per image and cytometric + image
features as columns (the cytometric features retained are
defined in config.yaml).
It should return a DataFrame (or a dictionary that can be converted to one)
with rows matching the input images (same number, same order) and at least one
column called 'annotation_category' containing the predicted EcoTaxa category
name for each image. Other columns can be added. All column names will be
prepended with 'object_' before their import into EcoTaxa.
NB: Images contain a 31 pixels-high scale bar at the bottom. It should be
cropped out before feeding the image to a deep learning model.
Options:
-m, --model TEXT A function encapsulating the prediction model, specified as
'path/to/model.py::func_name' or 'my_module.func_name'.
[required]
-f, --force Force re-prediction even if output already exists.
--help Show this message and exit.
RTFM ;-)
All this is accessible from cytoprocess by using cytoprocess command --help.
Fork this repository, clone your fork.
Prepare your development environment by installing the dependencies within a conda environment
conda create -n cytoprocess python=3.11
conda activate cytoprocess
pip install -e .
This creates a cytoprocess.egg-info directory at the root of the package's directory. It is safely ignored by git (and you should too).
Now, either run commands as you normally would
cytoprocess --help
or call the module explicitly
python -m cytoprocess --help
Any edits made to the files are immediately reflected in the output (because the package was installed in "editable" mode: pip install -e ... ; or is run directly as a module: python -m ...).