Description
When creating a Pipeline from a dict, a dummy fit is currently required in order to obtain feature_names_out. However, calling fit may trigger actual mathematical computations inside transformers. In some cases, these computations can fail, even though the intent is only to retrieve feature names.
Feature name resolution should not depend on executing a full or partial fitting procedure. Coupling feature name handling with numerical computation introduces unnecessary side effects and makes the pipeline construction fragile.
Proposed Approach
A cleaner design would separate:
- Feature name handling (schema / metadata resolution), and
- Fitting computation (parameter estimation, numerical processing).
This separation avoids dummy computations and prevents crashes during pipeline initialization.
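As an illustration of this separation, here is a minimal sketch in plain Python. The class `MeanCenter`, the method `resolve_feature_names`, and the helper `resolve_pipeline_names` are hypothetical names chosen for the example. They are not part of the existing code base or of scikit-learn. The point is only that name resolution can be pure string handling, while fit remains the one place where numerical work (and numerical failure) can happen.

```python
class MeanCenter:
    """Hypothetical transformer with separate metadata and numerical paths."""

    def __init__(self, suffix="centered"):
        self.suffix = suffix
        self.means_ = None  # populated by fit only

    def resolve_feature_names(self, input_features):
        # Metadata only: pure string manipulation, no data required.
        return [f"{name}__{self.suffix}" for name in input_features]

    def fit(self, rows):
        # Numerical step: can fail on bad input, e.g. an empty dataset.
        if not rows:
            raise ValueError("cannot fit on empty data")
        n = len(rows)
        self.means_ = [sum(col) / n for col in zip(*rows)]
        return self


def resolve_pipeline_names(steps, input_features):
    """Chain name resolution through the steps without calling fit."""
    names = list(input_features)
    for step in steps:
        names = step.resolve_feature_names(names)
    return names


steps = [MeanCenter("centered"), MeanCenter("twice")]
names = resolve_pipeline_names(steps, ["T_in", "T_out"])
print(names)  # ['T_in__centered__twice', 'T_out__centered__twice']
```

In this sketch the output names are available even though no step has been fitted, so a failure inside fit cannot break pipeline construction.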
Current Status
This separation was partially implemented in commit 405ba6d. However, similar feature name handling logic is still required in both ColumnTransformer and Pipeline.
Next Steps
- Extend the feature name handling abstraction to ColumnTransformer and Pipeline.
- Ensure that feature_names_out can be resolved without invoking fit.
- Clearly distinguish metadata-only operations from computational fitting steps across the preprocessing API.
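For the ColumnTransformer case, metadata-only resolution could look roughly like the sketch below. `AddSuffix` and `resolve_column_branches` are hypothetical names, and the `(step, columns)` branch layout is an assumption made for the example, not the existing API. Each branch resolves names for its own columns, and the results are concatenated without fitting anything.

```python
class AddSuffix:
    """Hypothetical step exposing a metadata-only name resolver."""

    def __init__(self, suffix):
        self.suffix = suffix

    def resolve_feature_names(self, input_features):
        return [f"{name}__{self.suffix}" for name in input_features]


def resolve_column_branches(branches, input_features):
    """Resolve output names for a list of (step, columns) branches.

    No data is touched: only the declared column names are checked.
    """
    available = set(input_features)
    names = []
    for step, columns in branches:
        missing = [c for c in columns if c not in available]
        if missing:
            raise KeyError(f"unknown input features: {missing}")
        names.extend(step.resolve_feature_names(columns))
    return names


branches = [
    (AddSuffix("filtered"), ["T_in", "T_out"]),
    (AddSuffix("resampled"), ["mass_flow"]),
]
out_names = resolve_column_branches(branches, ["T_in", "T_out", "mass_flow"])
print(out_names)
# ['T_in__filtered', 'T_out__filtered', 'mass_flow__resampled']
```

A side benefit of this design is that a misspelled column is reported at construction time rather than in the middle of a fit.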
Side notes
- Handling the metadata schema is also key to reconstructing a pipeline from its output features. For example, when a tide request asks for a feature in W that is computed from two temperatures and a mass flow rate, tide cannot identify the transformers responsible for the calculation. The whole Pipeline must be created and run through fit_transform. Knowing the relations between features would make it possible to build a pipeline containing only the required steps.
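A rough sketch of how such feature relations could be used, assuming each step records which input features produce each output feature. The `lineage` mapping, the feature names, the step names, and the function `required_steps` are all hypothetical and serve only to illustrate the idea:

```python
def required_steps(target, lineage):
    """Return the step names needed to compute `target`, in dependency order.

    `lineage` maps an output feature to (step_name, input_features).
    Features absent from `lineage` are treated as raw inputs.
    """
    ordered = []
    seen = set()

    def visit(feature):
        if feature in seen or feature not in lineage:
            return  # raw input, or already handled
        seen.add(feature)
        step, inputs = lineage[feature]
        for parent in inputs:
            visit(parent)  # parents first, so steps come out in run order
        if step not in ordered:
            ordered.append(step)

    visit(target)
    return ordered


# Hypothetical lineage: a power in W derived from two temperatures
# and a mass flow rate, plus an unrelated humidity branch.
lineage = {
    "delta_T": ("subtract_temperatures", ["T_supply", "T_return"]),
    "power_W": ("compute_power", ["delta_T", "mass_flow"]),
    "humidity_clean": ("drop_outliers", ["humidity"]),
}
print(required_steps("power_W", lineage))
# ['subtract_temperatures', 'compute_power']
```

The unrelated `drop_outliers` step is left out, which is the behavior described above: only the steps that contribute to the requested feature are kept.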