Try Pointblank - https://posit-dev.github.io/pointblank/
This project demonstrates data validation in Python using Posit's pointblank, a new open-source library inspired by its popular R counterpart.
Ensuring data quality is essential for trustworthy analytics, decision-making, and data governance. pointblank makes it easy to define validation rules for Pandas, Polars, and DuckDB tables, generating clear, actionable reports.
Compared to other tools like Great Expectations, Pandera, and Deequ, pointblank stands out for its intuitive API, seamless integration with modern Python data stacks, and robust reporting features. The package enables you to check for missing values, validate ranges and categories, and extract failing records for further inspection. It also supports advanced features like threshold-based monitoring and custom actions on validation failures.
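For instance, a handful of these checks chained together might look like the following sketch (column names assume the lowercase seaborn-style Titanic schema; the file path and thresholds are illustrative, not the repository's actual code):

```python
import pandas as pd
import pointblank as pb

titanic = pd.read_csv("titanic.csv")  # illustrative path

validation = (
    pb.Validate(
        data=titanic,
        tbl_name="titanic",
        label="Titanic data quality checks",
        # threshold-based monitoring: warn at 5% failing rows, error at 10%
        thresholds=pb.Thresholds(warning=0.05, error=0.10),
    )
    .col_vals_not_null(columns="survived")                             # no missing labels
    .col_vals_between(columns="age", left=0, right=100, na_pass=True)  # plausible age range
    .col_vals_in_set(columns="sex", set=["male", "female"])            # valid categories
    .interrogate()
)

validation                                                    # renders the HTML report in a notebook
failing_ages = validation.get_data_extracts(i=2, frame=True)  # failing rows for step 2
```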
This repository includes example validations (using the classic Titanic dataset) and demonstrates how to leverage pointblank for practical data quality workflows. See the accompanying code and documentation for details on setup, rule definition, and report generation.
For a complete write-up, see this blog post.
- `notebooks/titanic.ipynb`: The main notebook that demonstrates the functionality of pointblank with the Titanic dataset.
- `notebooks/titanic.duckdb`: The DuckDB database file containing the Titanic dataset (queried directly in the sketch below).
- `LICENSE`: The Apache License 2.0 under which this project is distributed.
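Because the dataset ships as a DuckDB file, the same checks can also run against the database via an Ibis table. A minimal sketch, assuming the file contains a table named `titanic` with `pclass` and `fare` columns (the table and column names are assumptions):

```python
import ibis
import pointblank as pb

# Open the bundled DuckDB file; the table name "titanic" is an assumption
con = ibis.duckdb.connect("notebooks/titanic.duckdb")
titanic = con.table("titanic")

validation = (
    pb.Validate(data=titanic, tbl_name="titanic")
    .col_exists(columns=["pclass", "fare"])              # schema check: columns present
    .col_vals_ge(columns="fare", value=0, na_pass=True)  # fares should be non-negative
    .interrogate()
)
```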
- Python 3.10 or higher
- pointblank
- duckdb
- pandas
- notebook (including IPython) for displaying Markdown in notebooks
Install the required dependencies using uv:
```bash
uv add pointblank duckdb pandas ipython
```

Here is how pointblank compares with other data validation tools:

| Feature/Aspect | pointblank (Python) | Great Expectations (GX) | Pandera | Deequ |
|---|---|---|---|---|
| Ease of Use | Simple, user-friendly API; quick to get started; programmatic simplicity[2][3]. | Steep learning curve; complex object model; can feel bloated for simple use cases[1][7]. | Very simple, pandas-like API; shallow learning curve[7]. | More complex, requires Scala/Spark. |
| Supported Backends | Pandas, Polars, DuckDB, SQL databases[2]. | Pandas, Spark, SQL, cloud data stores[6]. | Pandas, Dask, PySpark, Polars (via extensions)[7]. | Spark, Scala, some PyDeequ support. |
| Reporting | Attractive, detailed HTML reports (gt-based)[3]. | Comprehensive, customisable documentation and reports[6]. | No built-in reporting; must build reporting manually[7]. | Basic anomaly reports. |
| Validation Features | Wide range, but still maturing; atomic test units; easy drilldown to failures[2][3]. | Very comprehensive; supports complex checks, profiling, and autogenerated tests[5][6][7]. | Focused on schema validation and runtime checks; less comprehensive than GX[7]. | Focused on anomaly detection. |
| Action Triggers | No built-in triggers for follow-up actions on failure; must code manually[2][3]. | Supports post-validation actions (Checkpoints, integrations, notifications)[7]. | No built-in triggers; manual handling required. | Limited; mainly Spark integrations. |
| Integration & Ecosystem | Early days; fewer integrations but growing[2]. | Mature ecosystem; integrates with many data tools and cloud services[6][7]. | Limited integrations; best for local, data science workflows[7]. | Good for Spark/Scala environments. |
| Performance/Scalability | Suitable for moderate data sizes; not yet optimised for huge scale[2]. | Scalable; designed for production and large projects[6][7]. | Best for small to moderate data; not designed for big data or distributed validation[7]. | Scales with Spark. |
| Community/Support | Newer, smaller community; backed by Posit[2][3]. | Large, active community; extensive documentation and support[6][7]. | Smaller, data science-focused community[7]. | Supported by AWS, active in Spark. |
| Best For | Quick, readable validations; attractive reports; individual or small team use[2][3]. | Production-grade, complex validation systems; integrations; enterprise use[6][7]. | Lightweight, Pythonic validation for data science and ML workflows[7]. | Spark/Scala pipelines, anomaly detection. |
- pointblank:
  - Pros: Simple API, great reports, easy to use, supports multiple backends.
  - Cons: Fewer integrations, lacks automated action triggers (see the sketch after this list), new and still maturing[2][3].
- Great Expectations:
  - Pros: Feature-rich, highly configurable, strong integrations, production-ready, comprehensive reporting[6][7].
  - Cons: Steep learning curve, can feel bloated for simple tasks, complex object model[1][7].
- Pandera:
  - Pros: Lightweight, familiar to pandas users, quick to write tests, good for data science workflows[7].
  - Cons: Limited integrations and reporting, not designed for large-scale or production systems[7].
- Deequ:
  - Pros: Strong for Spark/Scala environments, good for anomaly detection and big data.
  - Cons: Less Python support, basic reporting, less user-friendly for non-Spark users[1].
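Since pointblank leaves follow-up actions on failure to the caller, a minimal hand-rolled trigger might look like the sketch below (the file path is illustrative; the check on `survived` mirrors the earlier example):

```python
import pandas as pd
import pointblank as pb

validation = (
    pb.Validate(data=pd.read_csv("titanic.csv"))  # illustrative path
    .col_vals_not_null(columns="survived")
    .interrogate()
)

# No built-in trigger fires here, so follow-up actions are wired by hand:
if not validation.all_passed():
    extracts = validation.get_data_extracts()  # dict of step index -> failing rows
    # e.g. persist the extracts, notify a channel, or abort the pipeline
    raise SystemExit("Data validation failed; inspect the failing-row extracts.")
```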
[1] https://www.reddit.com/r/dataengineering/comments/15a45gt/great_expectations_is_bloaty_what_are_the/
[2] https://aeturrell.com/blog/posts/the-data-validation-landscape-in-2025/
[3] https://www.linkedin.com/posts/richard-iannone-a5640017_i-started-a-new-blog-this-ones-all-about-activity-7314040067340046337-j6KD
[4] https://posit.co/blog/introducing-pointblank-for-python/
[5] https://docs.greatexpectations.io/docs/reference/learn/data_quality_use_cases/distribution/
[6] https://whylabs.ai/blog/posts/choosing-the-right-data-quality-monitoring-solution
[7] https://endjin.com/blog/2023/03/a-look-into-pandera-and-great-expectations-for-data-validation
[8] https://rstudio.github.io/pointblank/reference/tbl_match.html