Interactive Jupyter notebooks accompanying the book "Data and Text Processing for Health and Life Sciences". Each notebook is a hands-on, step-by-step tutorial demonstrating how Unix shell scripting can be used to find, retrieve, and process biomedical data and text.

Note: The notebooks include a fix for the ChEBI 2.0 web interface, which currently lacks detailed cross-references on individual entry pages.

| Folder | Description |
|---|---|
| `notebooks/` | Jupyter notebooks (one per tutorial) |
| `data/` | Input and output data files used by the notebooks |
| `scripts/` | Shell scripts created during the tutorials |


| # | Notebook | Topics Covered |
|---|---|---|
| 01 | unix-shell | Unix basics: ls, pwd, head, cat, piping. Environment setup for ChEBI retrieval. |
| 02 | data-retrieval | curl with EBI APIs. Download UniProt cross-references (CSV/XML). Build getdata.sh. |
| 03 | data-extraction | grep filtering (HUMAN/RAT/MOUSE), cut for column selection. Build getproteins.sh. |
| 04 | task-repetition | Loops, xargs, and parallel for batch processing. |
| 05 | xml-processing | xmllint with XPath queries on UniProt XML. Extract PubMed IDs. |
| 06 | text-retrieval | RDF publication data (UniProt/NCBI). Extract titles and abstracts. |
| 07 | text-processing | Pattern matching, regular expressions, tokenization, and sentence splitting. |
| 08 | semantic-processing | OWL ontologies (ChEBI, DOID), URI/label conversion, synonyms, NER with the MER tool. |

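
As a rough illustration of the kind of filtering covered in notebook 03, the sketch below keeps rows for human, rat, and mouse entries with `grep` and then selects one column with `cut`. The file name `sample.csv`, its two-column layout, and the placeholder identifiers are assumptions made up for this example; they are not the data or the scripts used in the notebooks.

```shell
# Hypothetical sample data: the layout (accession,entry name) and the
# identifiers are placeholders, not real records from the notebooks.
cat > sample.csv <<'EOF'
ACC0001,PROTA_HUMAN
ACC0002,PROTB_MOUSE
ACC0003,PROTC_YEAST
ACC0004,PROTD_RAT
EOF

# Keep only rows whose entry name ends in _HUMAN, _RAT, or _MOUSE,
# then print the first comma-separated column.
grep -E '_(HUMAN|RAT|MOUSE)$' sample.csv | cut -d ',' -f 1
```

With this sample input, the pipeline prints the three placeholder accessions for the human, mouse, and rat rows and drops the yeast row.
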

You can open any notebook manually:
- Go to Google Colab
- File -> Open notebook -> GitHub tab
- Paste the repository URL: https://github.com/lasigeBioTM/data-text-processing-notebooks
- Select a notebook from the `notebooks/` folder and click Open

To run the notebooks locally:

    git clone https://github.com/lasigeBioTM/data-text-processing-notebooks
    cd data-text-processing-notebooks
    jupyter notebook notebooks/

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

