[feat]: Implement new features for KDP #32

@piotrlaczkowski

Description

Implement:

  1. Self‑Supervised Contrastive Pretraining
    Implement a built‑in contrastive learning stage inspired by ReConTab, where an asymmetric autoencoder with regularization selects salient features and a contrastive loss distills robust, invariant embeddings (arXiv).
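As a rough illustration of the contrastive objective such a stage could optimize, here is a minimal NumPy sketch of an NT‑Xent‑style loss over two embedded views of the same rows. The function name and shapes are illustrative, not KDP's actual API:

```python
import numpy as np

def nt_xent_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent contrastive loss over two views z1, z2 of shape (N, D)."""
    z = np.concatenate([z1, z2], axis=0)               # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize embeddings
    sim = z @ z.T / temperature                        # scaled cosine similarities
    n = z1.shape[0]
    # The positive pair for row i is the other view of the same sample.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

Minimizing this pulls the two views of each row together while pushing apart embeddings of different rows.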

  2. Masked Feature Prediction Pretraining
    Offer a masked‑attribute prediction task akin to TabTransformer’s masked language modelling: mask random features and train the model to reconstruct them, thereby contextualizing embeddings via intra‑row dependencies (ar5iv).
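A minimal sketch of the masking step and the masked‑only reconstruction loss, in NumPy for clarity (the zero mask token and function names are assumptions, not KDP's API):

```python
import numpy as np

def mask_features(x: np.ndarray, mask_ratio: float = 0.15, seed: int = 0):
    """Return (corrupted input, boolean mask) for a batch x of shape (N, F)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < mask_ratio   # True where a feature is hidden
    corrupted = np.where(mask, 0.0, x)        # replace masked entries with a mask token (0 here)
    return corrupted, mask

def reconstruction_loss(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """MSE computed only over the masked positions."""
    return float(((pred - target) ** 2)[mask].mean())
```

The model sees `corrupted` and is trained to predict the original values at the masked positions only.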

  3. Tree‑Regularized Embedding Layer
    Provide a supervised, tree‑regularized embedding layer (both Tree‑to‑Vector and Tree‑to‑Token) that binarizes inputs via pretrained tree‑ensemble splits and generates embeddings capturing hierarchical, rule‑based structure (arXiv).
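A hedged sketch of the binarization half of the Tree‑to‑Vector idea: each (feature, threshold) split harvested from a pretrained tree ensemble becomes one bit, and the resulting bit vector is what the embedding layer would consume. The split thresholds below are made up:

```python
import numpy as np

def tree_binarize(x: np.ndarray, splits: dict) -> np.ndarray:
    """Encode each row of x (N, F) as bits: one bit per (feature, threshold) split."""
    bits = []
    for feat, thresholds in sorted(splits.items()):
        for t in thresholds:
            bits.append((x[:, feat] > t).astype(np.float32))
    return np.stack(bits, axis=1)  # (N, total number of splits)
```

A dense embedding of this bit vector then carries the ensemble's rule‑based structure into the network.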

  4. Random Fourier Feature Preprocessor
    Incorporate a Random Fourier Feature transformation module that projects numeric inputs into a fixed high‑frequency basis (via sin/cos of random projections), improving conditioning and convergence without additional learned parameters (arXiv).
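The classic random Fourier feature map is short enough to sketch in full; the projection is drawn once and frozen, so nothing here is learned. Function and parameter names are illustrative:

```python
import numpy as np

def random_fourier_features(x: np.ndarray, dim: int = 16,
                            sigma: float = 1.0, seed: int = 0) -> np.ndarray:
    """Map x of shape (N, F) to (N, dim) via z(x) = sqrt(2/dim) * cos(xW + b)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / sigma, size=(x.shape[1], dim))  # fixed random projection
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim)                # fixed random phase
    return np.sqrt(2.0 / dim) * np.cos(x @ w + b)
```

The output is bounded and roughly decorrelated, which is what improves conditioning for the downstream layers.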

  5. Periodic and PLE Numeric Embeddings
    Add numeric embedding functions using periodic expansions (sin/cos) and piecewise linear encodings (PLE), which empirically close the gap between MLPs/Transformers and tree‑based baselines on tabular tasks (arXiv).
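The PLE half can be sketched directly: a scalar is encoded against a set of bin edges, producing 1 for bins it has passed, 0 for bins ahead of it, and a fractional position inside its own bin. Bin edges below are fabricated for illustration:

```python
import numpy as np

def ple_encode(x: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Piecewise linear encoding of scalars x (N,) against sorted edges (T+1,) into (N, T)."""
    left, right = edges[:-1], edges[1:]
    frac = (x[:, None] - left) / (right - left)  # fractional position per bin
    return np.clip(frac, 0.0, 1.0)               # saturate outside each bin
```

In practice the edges come from quantiles or tree splits of the training data.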

  6. Entity Embeddings for Categorical Variables
    Support an entity embedding API mapping category indices to dense vectors, leveraging the classic approach that clusters similar categories in latent space and reduces overfitting for high‑cardinality features (arXiv).
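At its core this is a learned lookup table; a minimal sketch follows, with random weights standing in for the trained ones (in KDP the table would be trained end to end as part of the model):

```python
import numpy as np

class EntityEmbedding:
    """Map integer category indices to dense vectors via a lookup table."""

    def __init__(self, cardinality: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(size=(cardinality, dim))  # one row per category

    def __call__(self, indices: np.ndarray) -> np.ndarray:
        return self.table[indices]  # (N,) int -> (N, dim) dense vectors
```

After training, categories with similar effects on the target end up close in this latent space.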

  7. Semantic Feature Enrichment via Pretrained Word Embeddings
    Enable optional semantic text embedding lookup for descriptive categorical fields, pulling in pretrained Word2Vec or GloVe vectors to infuse domain semantics into KDP’s categorical pipelines (MachineLearningMastery.com).
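A toy sketch of the lookup, with a fabricated two‑word vector table standing in for real Word2Vec/GloVe files; unknown words fall back to a zero vector and multi‑word values are averaged:

```python
import numpy as np

# Fabricated stand-in for a pretrained Word2Vec/GloVe vector table.
PRETRAINED = {"red": np.array([1.0, 0.0]), "blue": np.array([0.0, 1.0])}

def semantic_lookup(tokens, table=PRETRAINED, dim: int = 2) -> np.ndarray:
    """Average the pretrained vectors of known words; unknown words map to zeros."""
    vecs = [table.get(t, np.zeros(dim)) for t in tokens]
    return np.mean(vecs, axis=0)
```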

  8. Multi‑Grained Categorical Embeddings
    Implement an end‑to‑end multi‑grained embedding layer that hierarchically encodes category subsets (e.g., via decision‑forest splits) to capture rich, multi‑resolution feature granularity (ScienceDirect).
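One way to read "multi‑grained" is embedding a category at several granularities at once and concatenating the results; the sketch below does this for two levels (fine id plus a coarser group), with a fabricated fine‑to‑coarse mapping and identity matrices standing in for learned tables:

```python
import numpy as np

def multi_grained_embed(idx: np.ndarray, fine_table: np.ndarray,
                        coarse_table: np.ndarray, coarse_of: np.ndarray) -> np.ndarray:
    """Concatenate a fine-grained and a coarse-grained embedding per category index."""
    fine = fine_table[idx]                  # fine-grained embedding
    coarse = coarse_table[coarse_of[idx]]   # embedding of the category's coarser group
    return np.concatenate([fine, coarse], axis=-1)
```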

  9. Library of Tabular Self‑Supervised Learning (SSL) Tasks
    Bundle a suite of state‑of‑the‑art tabular self‑supervised objectives (SCARF, SAINT, SubTab, XTab, etc.) as configurable pretraining strategies directly within KDP (GitHub).
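As one concrete example from that suite, SCARF's corruption step replaces a random subset of each row's features with values resampled from other rows, i.e. from each feature's empirical marginal. A NumPy sketch (names are illustrative):

```python
import numpy as np

def scarf_corrupt(x: np.ndarray, ratio: float = 0.3, seed: int = 0) -> np.ndarray:
    """SCARF-style corruption: resample a random subset of cells from the feature marginals."""
    rng = np.random.default_rng(seed)
    n, f = x.shape
    mask = rng.random((n, f)) < ratio               # which cells to corrupt
    donor_rows = rng.integers(0, n, size=(n, f))    # a random donor row per cell
    donor = x[donor_rows, np.arange(f)]             # value from the same column of the donor
    return np.where(mask, donor, x)
```

The contrastive objective then treats the clean and corrupted versions of a row as a positive pair.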

  10. Multi‑Scale Fourier Feature Embedding
    Incorporate a multiscale FourierFeatureEmbedding layer supporting user‑specified sigma scales, enabling simultaneous capture of both low‑ and high‑frequency numeric patterns (mathlab.github.io).
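A minimal sketch of the multi‑scale variant: draw one fixed random projection per user‑specified sigma and concatenate the sin/cos features across scales, so small sigmas capture smooth trends and large sigmas capture fine oscillations. Function and parameter names are assumptions:

```python
import numpy as np

def multiscale_fourier(x: np.ndarray, dim: int = 8,
                       sigmas=(1.0, 10.0), seed: int = 0) -> np.ndarray:
    """Map x of shape (N, F) to (N, 2 * dim * len(sigmas)) Fourier features."""
    rng = np.random.default_rng(seed)
    outs = []
    for sigma in sigmas:
        w = rng.normal(scale=sigma, size=(x.shape[1], dim))  # one frequency scale per sigma
        proj = 2.0 * np.pi * (x @ w)
        outs.append(np.sin(proj))
        outs.append(np.cos(proj))
    return np.concatenate(outs, axis=1)
```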
