[feat]: Implement new features for KDP #32

@piotrlaczkowski

Description

Implement:

  1. Self‑Supervised Contrastive Pretraining
    Implement a built‑in contrastive learning stage inspired by ReConTab, where an asymmetric autoencoder with regularization selects salient features and a contrastive loss distills robust, invariant embeddings (arXiv).
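As a rough illustration of the contrastive objective such a stage could optimize, here is a minimal NumPy sketch of an NT‑Xent‑style loss over two embedded views of the same rows. The function name and shapes are illustrative, not KDP's actual API:

```python
import numpy as np

def nt_xent_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent contrastive loss over two views z1, z2 of shape (N, D)."""
    z = np.concatenate([z1, z2], axis=0)               # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize embeddings
    sim = z @ z.T / temperature                        # scaled cosine similarities
    n = z1.shape[0]
    # The positive pair for row i is the other view of the same sample.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())
```

Minimizing this pulls the two views of each row together while pushing apart embeddings of different rows.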

  2. Masked Feature Prediction Pretraining
    Offer a masked‑attribute prediction task akin to TabTransformer’s masked language modelling: mask random features and train the model to reconstruct them, thereby contextualizing embeddings via intra‑row dependencies (ar5iv).
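A minimal sketch of the masking step and the masked‑only reconstruction loss, in NumPy for clarity (the zero mask token and function names are assumptions, not KDP's API):

```python
import numpy as np

def mask_features(x: np.ndarray, mask_ratio: float = 0.15, seed: int = 0):
    """Return (corrupted input, boolean mask) for a batch x of shape (N, F)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < mask_ratio   # True where a feature is hidden
    corrupted = np.where(mask, 0.0, x)        # replace masked entries with a mask token (0 here)
    return corrupted, mask

def reconstruction_loss(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """MSE computed only over the masked positions."""
    return float(((pred - target) ** 2)[mask].mean())
```

The model sees `corrupted` and is trained to predict the original values at the masked positions only.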

  3. Tree‑Regularized Embedding Layer
    Provide a supervised, tree‑regularized embedding layer (both Tree‑to‑Vector and Tree‑to‑Token) that binarizes inputs via pretrained tree‑ensemble splits and generates embeddings capturing hierarchical, rule‑based structure (arXiv).
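A hedged sketch of the binarization half of the Tree‑to‑Vector idea: each (feature, threshold) split harvested from a pretrained tree ensemble becomes one bit, and the resulting bit vector is what the embedding layer would consume. The split thresholds below are made up:

```python
import numpy as np

def tree_binarize(x: np.ndarray, splits: dict) -> np.ndarray:
    """Encode each row of x (N, F) as bits: one bit per (feature, threshold) split."""
    bits = []
    for feat, thresholds in sorted(splits.items()):
        for t in thresholds:
            bits.append((x[:, feat] > t).astype(np.float32))
    return np.stack(bits, axis=1)  # (N, total number of splits)
```

A dense embedding of this bit vector then carries the ensemble's rule‑based structure into the network.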

  4. Random Fourier Feature Preprocessor
    Incorporate a Random Fourier Feature transformation module that projects numeric inputs into a fixed high‑frequency basis (via sin/cos of random projections), improving conditioning and convergence without additional learned parameters (arXiv).
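The classic random Fourier feature map is short enough to sketch in full; the projection is drawn once and frozen, so nothing here is learned. Function and parameter names are illustrative:

```python
import numpy as np

def random_fourier_features(x: np.ndarray, dim: int = 16,
                            sigma: float = 1.0, seed: int = 0) -> np.ndarray:
    """Map x of shape (N, F) to (N, dim) via z(x) = sqrt(2/dim) * cos(xW + b)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=1.0 / sigma, size=(x.shape[1], dim))  # fixed random projection
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim)                # fixed random phase
    return np.sqrt(2.0 / dim) * np.cos(x @ w + b)
```

The output is bounded and roughly decorrelated, which is what improves conditioning for the downstream layers.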

  5. Periodic and PLE Numeric Embeddings
    Add numeric embedding functions using periodic expansions (sin/cos) and piecewise linear encodings (PLE), which empirically close the gap between MLPs/Transformers and tree‑based baselines on tabular tasks (arXiv).
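The PLE half can be sketched directly: a scalar is encoded against a set of bin edges, producing 1 for bins it has passed, 0 for bins ahead of it, and a fractional position inside its own bin. Bin edges below are fabricated for illustration:

```python
import numpy as np

def ple_encode(x: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Piecewise linear encoding of scalars x (N,) against sorted edges (T+1,) into (N, T)."""
    left, right = edges[:-1], edges[1:]
    frac = (x[:, None] - left) / (right - left)  # fractional position per bin
    return np.clip(frac, 0.0, 1.0)               # saturate outside each bin
```

In practice the edges come from quantiles or tree splits of the training data.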

  6. Entity Embeddings for Categorical Variables
    Support an entity embedding API mapping category indices to dense vectors, leveraging the classic approach that clusters similar categories in latent space and reduces overfitting for high‑cardinality features (arXiv).
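At its core this is a learned lookup table; a minimal sketch follows, with random weights standing in for the trained ones (in KDP the table would be trained end to end as part of the model):

```python
import numpy as np

class EntityEmbedding:
    """Map integer category indices to dense vectors via a lookup table."""

    def __init__(self, cardinality: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(size=(cardinality, dim))  # one row per category

    def __call__(self, indices: np.ndarray) -> np.ndarray:
        return self.table[indices]  # (N,) int -> (N, dim) dense vectors
```

After training, categories with similar effects on the target end up close in this latent space.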

  7. Semantic Feature Enrichment via Pretrained Word Embeddings
    Enable optional semantic text embedding lookup for descriptive categorical fields, pulling in pretrained Word2Vec or GloVe vectors to infuse domain semantics into KDP’s categorical pipelines (MachineLearningMastery.com).
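A toy sketch of the lookup, with a fabricated two‑word vector table standing in for real Word2Vec/GloVe files; unknown words fall back to a zero vector and multi‑word values are averaged:

```python
import numpy as np

# Fabricated stand-in for a pretrained Word2Vec/GloVe vector table.
PRETRAINED = {"red": np.array([1.0, 0.0]), "blue": np.array([0.0, 1.0])}

def semantic_lookup(tokens, table=PRETRAINED, dim: int = 2) -> np.ndarray:
    """Average the pretrained vectors of known words; unknown words map to zeros."""
    vecs = [table.get(t, np.zeros(dim)) for t in tokens]
    return np.mean(vecs, axis=0)
```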

  8. Multi‑Grained Categorical Embeddings
    Implement an end‑to‑end multi‑grained embedding layer that hierarchically encodes category subsets (e.g., via decision‑forest splits) to capture rich, multi‑resolution feature granularity (ScienceDirect).
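One way to read "multi‑grained" is embedding a category at several granularities at once and concatenating the results; the sketch below does this for two levels (fine id plus a coarser group), with a fabricated fine‑to‑coarse mapping and identity matrices standing in for learned tables:

```python
import numpy as np

def multi_grained_embed(idx: np.ndarray, fine_table: np.ndarray,
                        coarse_table: np.ndarray, coarse_of: np.ndarray) -> np.ndarray:
    """Concatenate a fine-grained and a coarse-grained embedding per category index."""
    fine = fine_table[idx]                  # fine-grained embedding
    coarse = coarse_table[coarse_of[idx]]   # embedding of the category's coarser group
    return np.concatenate([fine, coarse], axis=-1)
```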

  9. Library of Tabular Self‑Supervised Learning (SSL) Tasks
    Bundle a suite of state‑of‑the‑art tabular self‑supervised objectives (SCARF, SAINT, SubTab, XTab, etc.) as configurable pretraining strategies directly within KDP (GitHub).
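As one concrete example from that suite, SCARF's corruption step replaces a random subset of each row's features with values resampled from other rows, i.e. from each feature's empirical marginal. A NumPy sketch (names are illustrative):

```python
import numpy as np

def scarf_corrupt(x: np.ndarray, ratio: float = 0.3, seed: int = 0) -> np.ndarray:
    """SCARF-style corruption: resample a random subset of cells from the feature marginals."""
    rng = np.random.default_rng(seed)
    n, f = x.shape
    mask = rng.random((n, f)) < ratio               # which cells to corrupt
    donor_rows = rng.integers(0, n, size=(n, f))    # a random donor row per cell
    donor = x[donor_rows, np.arange(f)]             # value from the same column of the donor
    return np.where(mask, donor, x)
```

The contrastive objective then treats the clean and corrupted versions of a row as a positive pair.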

  10. Multi‑Scale Fourier Feature Embedding
    Incorporate a multiscale FourierFeatureEmbedding layer supporting user‑specified sigma scales, enabling simultaneous capture of both low‑ and high‑frequency numeric patterns (mathlab.github.io).
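A minimal sketch of the multi‑scale variant: draw one fixed random projection per user‑specified sigma and concatenate the sin/cos features across scales, so small sigmas capture smooth trends and large sigmas capture fine oscillations. Function and parameter names are assumptions:

```python
import numpy as np

def multiscale_fourier(x: np.ndarray, dim: int = 8,
                       sigmas=(1.0, 10.0), seed: int = 0) -> np.ndarray:
    """Map x of shape (N, F) to (N, 2 * dim * len(sigmas)) Fourier features."""
    rng = np.random.default_rng(seed)
    outs = []
    for sigma in sigmas:
        w = rng.normal(scale=sigma, size=(x.shape[1], dim))  # one frequency scale per sigma
        proj = 2.0 * np.pi * (x @ w)
        outs.append(np.sin(proj))
        outs.append(np.cos(proj))
    return np.concatenate(outs, axis=1)
```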
