Conversation
|
Cool. Thanks @ksolarski, just a quick reply from my phone... Don't do this for the synthetic control because I have an in progress PR that will change it. It won't have a formula input. But can I just get some clarification... does this change the API? Can we get the exact same functionality? If not, let's think again. Will try to look at the code properly when I can 👍🏻 |
|
I can't find where I saw it in the I'm not 100% sure that this is a problem, and apologies I can't find the relevant part in the docs. But does my concern make sense? |
|
You're right, Patsy has the power of preserving the transformation / encoding of variables through However, Patsy repo suggests migration to https://github.com/matthewwardrop/formulaic instead, which is capable of "reusing the encoding choices made during conversion of one data-set on other datasets." (see https://matthewwardrop.github.io/formulaic/latest/). There's also a migration guide from Patsy to Formulaic to switch would be easy. It also supports many operators: https://matthewwardrop.github.io/formulaic/latest/guides/grammar/ Did you check out this library before? What do you think about using this instead of formulae? |
|
@drbenvincent any strong opinions about using |
|
Sorry for the delayed response @ksolarski. So as far as I understand, Right now there are no use-cases for hierarchical modelling. That might change in the future, though I don't have any specific use cases in mind. So I guess the only choice at the moment is |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #463 +/- ##
==========================================
+ Coverage 94.44% 94.45% +0.01%
==========================================
Files 80 80
Lines 12707 12701 -6
Branches 770 770
==========================================
- Hits 12001 11997 -4
+ Misses 498 497 -1
+ Partials 208 207 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@drbenvincent Yes, from the docs it seems that no hierarchical models are allowed in the import pandas as pd
from formulaic import model_matrix
import formulaic
# Create a training dataset
train_data = pd.DataFrame(
{
"feature1": ["A", "B", "C", "D"],
"target": [0, 1, 0, 1],
}
)
# Create a test dataset
test_data = pd.DataFrame(
{
"feature1": [
"A", # In training
"D", # In training
"E", # Not in training
],
"target": [0, 1, 0],
}
)
# Generate the model matrix for the training data
train_matrix = model_matrix("target ~ 0 + feature1", train_data)
# Print the training matrix and spec
print("Training Matrix:")
print(train_matrix)
# Use the same spec to transform the test data
test_matrix = model_matrix(spec=train_matrix.model_spec, data=test_data)
# Print the test matrix - see that columns are properly aligned from the training data transformation
print("\nTest Matrix:")
print(test_matrix)Is that the problem you had in mind or something else? |
@ksolarski Yes that's pretty much it. Turns out the phrase I was looking for was "stateful transforms" which you pretty much said here. I'm actually wondering - if we don't get the additional functionality of hierarchical modeling, is there much benefit from moving from patsy to formulaic? Not saying it's not worth it, but we should be clear about the rationale. Is it because patsy is no longer maintained and formulaic might see more features, or does it have a richer formula API? Sorry for the extended conversation on this by the way - but given the formula aspect is a core part of the API it's worth thinking it through :) |
|
Hi @ksolarski and @drbenvincent, sorry for the delay in my response. I can add a few pieces of information.
With that said, unless you need hierarchical models supported via the |
23221c2 to
0734c8b
Compare
|
@drbenvincent I went with |
|
Sorry for being slow on this @ksolarski ! After #641 is merged, we should revisit this. That will see increased modularity, so all the action can happen in the new |
|
This comment was generated with LLM assistance and may have been edited by the commenter. Issue EvaluationSummaryThis PR aims to replace
Current RelevanceThis issue is still relevant and valid. The codebase continues to use
Code references checked:
Related PRs:
Discussion Summary
Recommendation
Proposed next steps:
Migration Scope (for reference)Once #641 is merged, the migration will involve:
The migration pattern (from patsy to formulaic) would be: # Before (patsy)
from patsy import dmatrices, build_design_matrices
y, X = dmatrices(formula, data)
# After (formulaic)
from formulaic import model_matrix
mm = model_matrix(formula, data)
# Then use mm.model_spec for new datacc @ksolarski - Heads up that this is blocked on #641. Once that merges, this PR should be straightforward to complete! |
|
Sounds good, I see it's merged already so I'll continue here |
PR SummaryMedium Risk Overview Updates the PyMC models layer to use Written by Cursor Bugbot for commit 2f7e57e. This will update automatically on new commits. Configure here. |
|
@drbenvincent now I believe this is ready to be looked at. I removed Patsy from all the places except for notebooks. Let me know if you think I should also regenerate those then. |
Documentation build overview
13 files changed ·
|
|
Thanks @ksolarski - I'm hoping to take a few looks at this because this is a major change :) At the moment I'll just flag up a few issues:
|
|
PS. I'm fine if you don't re-run or update notebooks in this PR. But given that it's a big change I think I would want to do that and push changes all in this one PR, so that this is done in one decisive PR. Reason is because I'd like to be able to be able to create a new release at any point without having a PR which breaks things until some future hypothetical PR fixes the notebooks :) PPS. I updated this branch from main by the way, so you'll need to pull changes to your local machine @ksolarski |
| "X": X.data, | ||
| "y": np.zeros( | ||
| (new_no_of_observations, n_treated_units), dtype=np.int32 | ||
| ), |
There was a problem hiding this comment.
Prediction data converted to buffer object
High Severity
_data_setter() now passes X.data into pm.set_data. When predict() is called with a NumPy array (common in several experiments), X.data is a raw buffer, not the feature matrix. This can corrupt X shape/content during posterior prediction and break out-of-sample predictions.
|
Apologies about letting this go stale @ksolarski. I'm going to get this to green and then I'll recap any remaining issues to be resolved. |
Resolve conflicts in regression_discontinuity.py (keep main's improved warning logic with formulaic's model_matrix) and environment.yml (keep formulaic + main's new dependencies). Made-with: Cursor
- _data_setter: query the existing y data node's dtype instead of hardcoding int32 or defaulting to float64, so pm.set_data accepts the placeholder regardless of how formulaic encoded the outcome. Also revert the X.data change (X is always xr.DataArray here). - IPW: store formula LHS as treatment_variable_name, not outcome_variable_name, since for IPW the formula models the treatment assignment, not the outcome. Fixes __maketables_depvar__ returning the wrong variable. Made-with: Cursor
PR #463 status update — sync with main + bug fixes@ksolarski I've merged What was done
What passes now
Review items from @drbenvincent (Jan 26) — status
Remaining blocker:
|
| File | patsy usage |
|---|---|
causalpy/transforms.py |
import patsy + patsy.stateful_transform(StepTransform/RampTransform) |
causalpy/experiments/piecewise_its.py |
from patsy import dmatrices |
causalpy/tests/test_piecewise_its.py |
from patsy import dmatrix, PatsyError |
Since the PR removes patsy from pyproject.toml, these files will break at import time if left unmigrated.
The good news: the step and ramp transforms are only truly stateful for datetime inputs (they remember the datetime origin). For numeric time — the common case — they're pure functions. formulaic can handle them as plain callables via its context parameter:
dm = model_matrix(formula, data, context={"step": step, "ramp": ramp})The patsy.stateful_transform() wrapper can be replaced with simple functions that instantiate StepTransform/RampTransform and call transform() directly.
Suggested approach:
- In
transforms.py: replacepatsy.stateful_transform(StepTransform)with a plain callable wrapper that instantiates the class and calls itstransform()method - In
piecewise_its.py: switchdmatricestomodel_matrixand passstep/rampviacontext - In
test_piecewise_its.py: replacedmatrixandPatsyErrorimports with formulaic equivalents
Let me know if you'd prefer me to tackle this, or if you'd rather do it yourself!
- Resolve merge conflict in inverse_propensity_weighting.py (keep formulaic implementation, incorporate improved docstring from main) - Update rd_pymc_drinking.ipynb: replace patsy-style coefficient names (treated[T.True]) with formulaic-style names (treated), fixing the notebook CI failure - Fix mypy type: ignore codes in panel_regression.py (attr-defined -> union-attr) Made-with: Cursor
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
Synced branch with
All local checks pass ( |




Solving issue #386
Starting with DiD, will continue with other methods if you with general design @drbenvincent
Seems like the key practical difference between
formulaeandpatsyis lack ofbuild_design_matricesmethod informulae. User has to then provide formula again.Edit: After discussion, it was decided that
formulaicsuits us best.📚 Documentation preview 📚: https://causalpy--463.org.readthedocs.build/en/463/