- Current Version: 0.3.3
This package provides utilities for data engineering on Ingenii's Azure Data Platform. It can be used both for local development and within the Ingenii Databricks Runtime.
Import the package to use the functions within:

```python
import ingenii_data_engineering
```

Part of this package validates dbt schemas to ensure they are compatible with Databricks and the wider Ingenii Data Platform. This validation runs when a data pipeline that ingests a file is triggered, to make sure the file is ingested correctly. Full details of how to set up your dbt schema files in your Data Engineering repository can be found in the Ingenii Data Engineering Example repository.
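To illustrate the kind of compatibility check such schema validation performs, here is a minimal sketch. It is not the package's actual code: the function name, the schema dictionary shape, and the exact character set are assumptions for illustration, based on the characters Databricks Delta tables reject in column names.

```python
# Illustrative sketch only: the real validation lives inside
# ingenii_data_engineering and covers far more than this check.
# Delta tables reject column names containing these characters.
INVALID_CHARS = set(" ,;{}()\n\t=")

def validate_column_names(table_schema: dict) -> list:
    """Return error messages for column names Databricks cannot accept.

    `table_schema` is assumed to look like a parsed dbt schema entry:
    {"name": "my_table", "columns": [{"name": "col_a"}, ...]}
    """
    errors = []
    for column in table_schema.get("columns", []):
        name = column["name"]
        bad_chars = INVALID_CHARS.intersection(name)
        if bad_chars:
            errors.append(
                f"Column '{name}' in table '{table_schema['name']}' "
                f"contains invalid character(s): {sorted(bad_chars)}"
            )
    return errors
```

A check like this runs before ingestion so that a bad schema fails fast, rather than part-way through creating Delta tables.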
This package contains code to facilitate the pre-processing of files before they are ingested by the data platform. This allows users to transform any data into a form that is compatible. See details of working with pre-processing functions in the Ingenii Data Engineering Example repository.
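As an illustration of what a pre-processing step might look like, the sketch below flattens a file of JSON objects (one per line) into a CSV. This is a hypothetical example, not the hook signature the platform requires; see the Ingenii Data Engineering Example repository for the real interface.

```python
import csv
import json

# Hypothetical pre-processing step, for illustration only.
def flatten_json_to_csv(input_path: str, output_path: str) -> None:
    """Rewrite a file of JSON objects (one per line) as a flat CSV."""
    # utf-8-sig transparently strips a UTF-8 BOM if one is present
    with open(input_path, encoding="utf-8-sig") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    # Use the union of all keys as the header, so sparse rows still fit
    fieldnames = sorted({key for row in rows for key in row})
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Missing keys are written as empty cells, so every output row matches the header.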
This package also contains the code to turn the pre-processing scripts into a package, ready to be uploaded and used by the Data Platform. Once this package is installed, the command

```
python -m <package name> <command> <folder with pre-processing code>
```

for example:

```
python -m ingenii_data_engineering pre_processing_package pre_process
```

will generate a `.whl` file in a folder called `dist/`. For more details, see the Ingenii Data Engineering Example repository.
- A working knowledge of git SCM
- Python 3.7.3 installed
- Complete the 'Getting Started > Prerequisites' section
- For Windows only:
  - Run `make setup` to copy the `.env` into place (`.env-dist` > `.env`)
- Complete the 'Getting Started > Set up' section
- From the root of the repository, in a terminal (preferably in your IDE), run the following commands to set up a virtual environment:

  ```
  python -m venv venv
  . venv/bin/activate
  pip install -r requirements-dev.txt
  pre-commit install
  ```

  or for Windows:

  ```
  python -m venv venv
  . venv/Scripts/activate
  pip install -r requirements-dev.txt
  pre-commit install
  ```

- Note: if you get a `permission denied` error when executing the `pre-commit install` command, you'll need to run `chmod -R 775 venv/bin/` to recursively update permissions in the `venv/bin/` directory
- The following checks are run as part of the pre-commit hooks: `flake8` (note that unit tests are not run as a hook)
- Complete the 'Getting Started > Set up' section
- Run `make build` to create the package in `./dist`
- Run `make clean` to remove dist files
- Complete the 'Getting Started > Set up' and 'Development' sections
- Run `make test` to run the unit tests using pytest
- Run `flake8` to run lint checks using flake8
- Run `make qa` to run the unit tests and linting in a single command
- Run `make clean` to remove pytest files
- 0.3.3: Deprecated path for dbt
- 0.3.2: Further bugfix for JSON UTF-8 BOM
- 0.3.1: Remove unnecessary functions specific to Databricks
- 0.3.0: Create pre-processing package using the module
- 0.2.1: Handle JSON read UTF-8 BOM
- 0.2.0: Pre-processing happens all in the 'archive' container
- 0.1.5: Better functionality for column names in .csv files
- 0.1.4: Handle JSON files
- 0.1.3: Adding pre-processing utilities
- 0.1.2: Rearrangement and better split of work with the Databricks Runtime. Better validation
- 0.1.1: Minor bug fixes
- 0.1.0: dbt schema validation, pre-processing class