14 changes: 14 additions & 0 deletions README.md
@@ -353,6 +353,20 @@ For all but the last option, you may optionally specify a `label_precision`, whi
</details>


## Benford's Law
Real-world numeric distributions (such as bank account balances) often follow [Benford's law](https://en.wikipedia.org/wiki/Benford%27s_law), where the leading digit follows a specific non-uniform distribution. To facilitate synthesis of such data, `dbt_synth_data` provides a convenience macro to "`benfordize()`" any distribution:

```sql
{{synth_column_distribution(name="account_balance",
distribution=synth_distribution_benfordize(
distribution=synth_distribution_continuous_uniform(min=0, max=200000)
)
)}}
```
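
For reference, Benford's law assigns leading digit `d` the probability `log10(1 + 1/d)`. The short Python sketch below (illustrative only, not part of the package; the variable name `benford` is made up for this example) computes those values rounded to three decimals, which is where the usual `0.301, 0.176, ...` table comes from:

```python
import math

# Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1..9.
# Keys are strings to mirror how a digit-to-probability mapping is
# typically written; rounding to 3 decimals is an illustrative choice.
benford = {str(d): round(math.log10(1 + 1 / d), 3) for d in range(1, 10)}

for digit, p in benford.items():
    print(digit, p)
```

The rounded probabilities still sum to 1, so they can be used directly as weights for a discrete distribution.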

The macro works by casting each value from the `distribution` to a scientific notation string (e.g., `1.2345E2`), replacing the leading digit with one drawn from the Benford distribution (`probabilities={"1":0.301, "2":0.176, "3":0.125, "4":0.097, "5":0.079, "6":0.067, "7":0.058, "8":0.051, "9":0.046}` by default), and casting the result back to a number (`type="double"` by default). Because each value round-trips through text, some precision may be lost. The macro is currently implemented for SQLite, DuckDB, Postgres, and Snowflake.
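
The digit-replacement idea can be sketched outside of SQL. The Python below is a minimal illustration under stated assumptions, not the package's implementation: the names `BENFORD` and `benfordize` are hypothetical, inputs are assumed to be positive, and the 12-digit mantissa is an arbitrary choice for this sketch.

```python
import random

# Default leading-digit probabilities, as listed in the README text.
BENFORD = {"1": 0.301, "2": 0.176, "3": 0.125, "4": 0.097, "5": 0.079,
           "6": 0.067, "7": 0.058, "8": 0.051, "9": 0.046}


def benfordize(value, probabilities=BENFORD, rng=random):
    """Replace the leading digit of a positive number with a weighted random digit."""
    # Scientific notation puts exactly one digit before the decimal point,
    # e.g. 123.45 -> "1.234500000000e+02".
    sci = f"{value:.12e}"
    # Draw the new leading digit according to the given probabilities.
    digit = rng.choices(list(probabilities), weights=list(probabilities.values()))[0]
    # Swap the first character and parse back to a float; the exponent and
    # the remaining mantissa digits are kept as they were.
    return float(digit + sci[1:])


random.seed(0)
samples = [benfordize(random.uniform(1, 200000)) for _ in range(20000)]
share_of_ones = sum(f"{s:e}"[0] == "1" for s in samples) / len(samples)
print(round(share_of_ones, 2))
```

Only the first significant digit changes, so each value keeps its order of magnitude. A sign character in front of the digit would break the string slicing, which is why this sketch assumes positive inputs.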


## Constructing Complex Distributions
This package provides the following mechanisms for composing several distributions:

38 changes: 38 additions & 0 deletions macros/distributions/benfordize.sql
@@ -0,0 +1,38 @@
{% macro synth_distribution_benfordize(distribution, type="double", probabilities={
"1":0.301, "2":0.176, "3":0.125, "4":0.097, "5":0.079, "6":0.067, "7":0.058, "8":0.051, "9":0.046
}) %}
{{ return(adapter.dispatch('synth_distribution_benfordize')(distribution, type, probabilities)) }}
{% endmacro %}

{% macro default__synth_distribution_benfordize(distribution, type, probabilities) -%}
{{ exceptions.raise_compiler_error("synth_distribution_benfordize is not yet implemented for this adapter") }}
{%- endmacro %}

{% macro sqlite__synth_distribution_benfordize(distribution, type, probabilities) %}
cast(concat(
{{synth_distribution_discrete_probabilities(probabilities=probabilities)}},
substr(printf('%.12e', {{distribution}}), 2)
) as {{type}})
{% endmacro %}

{% macro duckdb__synth_distribution_benfordize(distribution, type, probabilities) %}
concat(
{{synth_distribution_discrete_probabilities(probabilities=probabilities)}},
substring(format('{:E}', {{distribution}}), 2)
)::{{type}}
{% endmacro %}

{% macro postgres__synth_distribution_benfordize(distribution, type, probabilities) %}
concat(
{{synth_distribution_discrete_probabilities(probabilities=probabilities)}},
substring(ltrim(to_char({{distribution}}, '9.9999999999999999999EEEE')) from 2)
)::{{type}}
{% endmacro %}

{% macro snowflake__synth_distribution_benfordize(distribution, type, probabilities) %}
concat(
{{synth_distribution_discrete_probabilities(probabilities=probabilities)}},
substring(to_varchar({{distribution}}, 'TME')::varchar, 2)
{# see https://docs.snowflake.com/en/sql-reference/sql-format-models#text-minimal-format-elements #}
)::{{type}}
{% endmacro %}