# try-point-blank

Try Pointblank: https://posit-dev.github.io/pointblank/

## Status

Status: Working. License: Apache 2.0.

## Overview

This project demonstrates data validation in Python using Posit's pointblank, a new open-source library inspired by its popular R counterpart.

Ensuring data quality is essential for trustworthy analytics, decision-making, and data governance. pointblank makes it easy to define validation rules for Pandas, Polars, and DuckDB tables, generating clear, actionable reports.

Compared to other tools like Great Expectations, Pandera, and Deequ, pointblank stands out for its intuitive API, seamless integration with modern Python data stacks, and robust reporting features. The package enables you to check for missing values, validate ranges and categories, and extract failing records for further inspection. It also supports advanced features like threshold-based monitoring and custom actions on validation failures.
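
For orientation, here is a minimal sketch of the kind of validation plan described above. The column names are illustrative, and the exact keyword arguments (e.g. `set=` in `col_vals_in_set()`) should be checked against the current pointblank docs:

```python
import pandas as pd
import pointblank as pb

# Toy Titanic-style frame; the repo's notebook pulls the real data from DuckDB.
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "embarked": ["S", "C", "S", "S"],
})

validation = (
    pb.Validate(data=df, tbl_name="titanic", label="Titanic checks")
    .col_vals_not_null(columns="age")                          # no missing ages
    .col_vals_between(columns="fare", left=0, right=600)       # plausible fares
    .col_vals_in_set(columns="embarked", set=["C", "Q", "S"])  # known ports
    .interrogate()                                             # run all steps
)

validation                        # in a notebook, renders the HTML report
# validation.get_data_extracts()  # pull the failing rows for each step
```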

This repository includes example validations (using the classic Titanic dataset) and demonstrates how to leverage pointblank for practical data quality workflows. See the accompanying code and documentation for details on setup, rule definition, and report generation.

For a complete write-up, see this blog post.

## Key Artefacts

- `notebooks/titanic.ipynb`: The main notebook that demonstrates the functionality of pointblank with the Titanic dataset.
- `notebooks/titanic.duckdb`: The DuckDB database file containing the Titanic dataset.
- `LICENSE`: The Apache License 2.0 under which this project is distributed.

## Getting Started

### Prerequisites

- Python 3.10 or higher
- pointblank
- duckdb
- pandas
- notebook (including IPython) for displaying Markdown in notebooks

Install the required dependencies using uv:

```bash
uv add pointblank duckdb pandas ipython
```
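
As a quick smoke test, the following sketch reads the bundled Titanic data. The table name `titanic` inside the DuckDB file is an assumption; check the notebook for the actual name:

```python
import duckdb

# Open the bundled database; the path assumes you run from the repo root.
con = duckdb.connect("notebooks/titanic.duckdb")

# "titanic" is an assumed table name -- adjust to whatever the notebook uses.
df = con.execute("SELECT * FROM titanic").fetchdf()
print(df.shape)
print(df.head())
```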

## Appendix: Comparison of Data Validation Libraries

| Feature/Aspect | pointblank (Python) | Great Expectations (GX) | Pandera | Deequ |
|---|---|---|---|---|
| Ease of Use | Simple, user-friendly API; quick to get started; programmatic simplicity [2][3] | Steep learning curve; complex object model; can feel bloated for simple use cases [1][7] | Very simple, pandas-like API; shallow learning curve [7] | More complex; requires Scala/Spark |
| Supported Backends | Pandas, Polars, DuckDB, SQL databases [2] | Pandas, Spark, SQL, cloud data stores [6] | Pandas, Dask, PySpark, Polars (via extensions) [7] | Spark, Scala; some PyDeequ support |
| Reporting | Attractive, detailed HTML reports (gt-based) [3] | Comprehensive, customisable documentation and reports [6] | No built-in reporting; must build reporting manually [7] | Basic anomaly reports |
| Validation Features | Wide range, but still maturing; atomic test units; easy drill-down to failures [2][3] | Very comprehensive; supports complex checks, profiling, and autogenerated tests [5][6][7] | Focused on schema validation and runtime checks; less comprehensive than GX [7] | Focused on anomaly detection |
| Action Triggers | No built-in triggers for follow-up actions on failure; must code manually [2][3] | Supports post-validation actions (Checkpoints, integrations, notifications) [7] | No built-in triggers; manual handling required | Limited; mainly Spark integrations |
| Integration & Ecosystem | Early days; fewer integrations but growing [2] | Mature ecosystem; integrates with many data tools and cloud services [6][7] | Limited integrations; best for local data science workflows [7] | Good for Spark/Scala environments |
| Performance/Scalability | Suitable for moderate data sizes; not yet optimised for huge scale [2] | Scalable; designed for production and large projects [6][7] | Best for small to moderate data; not designed for big data or distributed validation [7] | Scales with Spark |
| Community/Support | Newer, smaller community; backed by Posit [2][3] | Large, active community; extensive documentation and support [6][7] | Smaller, data-science-focused community [7] | Supported by AWS; active in Spark |
| Best For | Quick, readable validations; attractive reports; individual or small-team use [2][3] | Production-grade, complex validation systems; integrations; enterprise use [6][7] | Lightweight, Pythonic validation for data science and ML workflows [7] | Spark/Scala pipelines; anomaly detection |

### Summary of Pros/Cons

- **pointblank**
  - Pros: Simple API, great reports, easy to use, supports multiple backends.
  - Cons: Fewer integrations, lacks automated action triggers, new and still maturing [2][3].
- **Great Expectations**
  - Pros: Feature-rich, highly configurable, strong integrations, production-ready, comprehensive reporting [6][7].
  - Cons: Steep learning curve, can feel bloated for simple tasks, complex object model [1][7].
- **Pandera**
  - Pros: Lightweight, familiar to pandas users, quick to write tests, good for data science workflows [7].
  - Cons: Limited integrations and reporting, not designed for large-scale or production systems [7].
- **Deequ**
  - Pros: Strong for Spark/Scala environments, good for anomaly detection and big data.
  - Cons: Less Python support, basic reporting, less user-friendly for non-Spark users [1].

## References

[1] https://www.reddit.com/r/dataengineering/comments/15a45gt/great_expectations_is_bloaty_what_are_the/
[2] https://aeturrell.com/blog/posts/the-data-validation-landscape-in-2025/
[3] https://www.linkedin.com/posts/richard-iannone-a5640017_i-started-a-new-blog-this-ones-all-about-activity-7314040067340046337-j6KD
[4] https://posit.co/blog/introducing-pointblank-for-python/
[5] https://docs.greatexpectations.io/docs/reference/learn/data_quality_use_cases/distribution/
[6] https://whylabs.ai/blog/posts/choosing-the-right-data-quality-monitoring-solution
[7] https://endjin.com/blog/2023/03/a-look-into-pandera-and-great-expectations-for-data-validation
[8] https://rstudio.github.io/pointblank/reference/tbl_match.html
