Using Tabnetics
This guide covers practical usage of tabnetics — from running the full pipeline on your own data to configuring individual components.
For the ready-to-run setup used across this guide, install:
pip install tabnetics
The base install covers the ready-to-run public runtime surface and includes every direct dependency currently shipped except TabPFN. For the full opt-in stack, including TabPFN, use pip install "tabnetics[full]". Optional integrations keep their own upstream licenses and terms; see Third-party integrations and licenses.
Table of contents
- Full pipeline
- Auto router
- Third-party integrations and licenses
- Your own CSV files
- Configuration
- Standalone feature selection
- Standalone distribution fitting
- Running benchmarks
- Validation campaigns
- Datasets
- Reproducibility and data source policy
- Uncertainty and conformal outputs
- Oracle presets
- Feature selection methods
- Method profiles (benchmark)
- Multi-omics
Full pipeline
The main entry point is DistributionFeatureSelectionPipeline. It handles train/test splitting, distribution fitting, feature selection, and classification in a single leakage-safe call.
from tabnetics.pipeline import DistributionFeatureSelectionPipeline, DFFSConfig
import numpy as np
# X: (n_samples, n_features) array
# y: (n_samples,) array of class labels
config = DFFSConfig(
random_seed=42,
test_size=0.20,
n_final_features=50,
n_jobs=4,
)
pipeline = DistributionFeatureSelectionPipeline(config)
result = pipeline.run(X, y, dataset_name="my_dataset")
print(f"Balanced accuracy: {result.balanced_accuracy:.3f}")
print(f"Model: {result.model_name}")
print(f"Features selected: {result.selected_features_count}")
print(f"Feature indices: {result.selected_feature_indices_original}")
This default config uses the packaged V25 auto-router. It computes a training-only dataset descriptor, selects a supported method/config candidate, then runs the delegated pipeline with the router disabled to avoid recursion.
Pre-split mode
If you manage your own splits (e.g., for nested cross-validation), use run_pre_split():
result = pipeline.run_pre_split(
X_train, y_train, X_test, y_test,
dataset_name="my_dataset",
seed=42,
)
Result object
PipelineRunResult contains:
| Field | Description |
|---|---|
accuracy | Test-set accuracy |
balanced_accuracy | Balanced accuracy (macro-averaged recall) |
macro_f1 | Macro F1 score |
hybrid_score | Weighted combination of balanced accuracy and macro F1 |
roc_auc | ROC AUC (binary or OvR multiclass) |
selected_features_count | Number of features selected |
selected_feature_indices_original | Indices into the original feature matrix |
model_name | Name of the classifier chosen by the oracle |
distribution_summaries | Per-feature distribution fit results |
fs_time_sec | Feature selection wall time |
dist_time_sec | Distribution fitting wall time |
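To make the metric definitions in the table concrete, here is a minimal plain-Python sketch of balanced accuracy (macro-averaged recall) and macro F1 on an imbalanced toy example. The helper names are illustrative and are not part of the tabnetics API; the package may compute these metrics differently internally.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def balanced_accuracy(y_true, y_pred):
    """Macro-averaged recall over the classes that occur in y_true."""
    counts = per_class_counts(y_true, y_pred)
    recalls = [tp / (tp + fn) for tp, fp, fn in counts.values() if tp + fn > 0]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    counts = per_class_counts(y_true, y_pred)
    f1_scores = []
    for tp, fp, fn in counts.values():
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced test split: eight samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} balanced_accuracy={bal_acc:.3f} macro_f1={f1:.3f}")
```

On this toy split, plain accuracy is 0.90 while balanced accuracy is 0.75, which shows how the two can diverge when classes are imbalanced.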
Current runtime defaults
The packaged defaults match the current benchmark/validation path:
- auto_router_enabled=True is the default, so the V25 calibrated score-router chooses the supported profile from the training split before distribution fitting, feature selection, and classification.
- df_stage_position="after_fs" is the promoted default, so the distribution-fitting stage operates on the selected feature space rather than the full raw matrix.
- Evidence-bearing benchmark and validation runs default to allow_synthetic_fallback=False and dataset_integrity_policy="error".
- Conformal prediction remains opt-in and is interpreted as uncertainty output, not as a balanced-accuracy gain mechanism.
Auto router
The auto-router is the recommended entry point for ordinary use. It avoids asking users to pick validation-campaign flags manually and instead selects among supported, already-tested candidates using descriptors computed directly from the training data.
from tabnetics.pipeline import DFFSConfig, DistributionFeatureSelectionPipeline
config = DFFSConfig(random_seed=42, n_jobs=4)
result = DistributionFeatureSelectionPipeline(config).run(X, y, dataset_name="my_dataset")
To opt out and keep explicit/manual configuration:
config = DFFSConfig(auto_router_enabled=False)
To inspect the router decision before running a full pipeline:
from tabnetics.auto_router import predict_auto_router
decision = predict_auto_router(X_train, y_train)
print(decision.metadata["selected_candidate_id"])
print(decision.enabled_methods)
The V25 router predicts balanced accuracy and macro-F1 for each profile in a fixed set of 12 candidates and applies a conservative calibrated policy. It uses only dataset-computable descriptors and candidate action encodings; it does not use holdout labels, validation tiers, or dataset identity.
Third-party integrations and licenses
Tabnetics itself is licensed under Apache 2.0, but it depends on several upstream projects with their own licenses or terms. Installing Tabnetics or enabling the full extra does not relicense those projects: you still need to comply with the upstream terms for each library you install or activate.
This table covers the direct runtime dependencies declared for the shipped public install surface (tabnetics plus opt-in tabnetics[full]). It is not a full transitive dependency manifest for every wheel dependency; for redistribution or commercial deployment, review the exact upstream license text for the versions you ship.
| Library | Used for in Tabnetics | Surface | Audited upstream license / terms | Notes |
|---|---|---|---|---|
NumPy (numpy) | Core ndarray/matrix operations across the pipeline, selectors, backends, and validation code | Base package | BSD-3-Clause AND 0BSD AND MIT AND Zlib AND CC0-1.0 | PyPI metadata reflects the main BSD-3-Clause project license plus bundled permissive notices. |
pandas (pandas) | Dataset loaders, metadata tables, reporting, and artifact assembly | Base package | BSD 3-Clause License | PyPI metadata leads with BSD-3-Clause text. |
SciPy (scipy) | Statistical tests, optimization, sparse helpers, and MAT-file IO | Base package | BSD License | PyPI metadata marks SciPy as BSD-licensed. |
scikit-learn (scikit-learn) | Estimator interface, preprocessing, CV, metrics, and many baseline models | Base package | BSD-3-Clause | Core runtime dependency. |
threadpoolctl (threadpoolctl) | Worker-side thread caps for benchmark/runtime process control | Base package | BSD-3-Clause | Used in the benchmark runner to keep BLAS/OpenMP workers bounded. |
tqdm (tqdm) | Progress bars in copula sampling and data-generation helpers exposed through shipped code paths | Base package | MPL-2.0 AND MIT | Mixed permissive/weak-copyleft metadata; preserve upstream notices if redistributing modified package files. |
Boruta (boruta) | Boruta feature selection | Base package; boruta method | BSD 3 clause | Included in plain tabnetics. |
SHAP (shap) | TreeSHAP values for the treeshap selector | Base package; treeshap method | MIT License | Included in plain tabnetics. |
pyvinecopulib (pyvinecopulib) | Vine-copula knockoff generation | Base package; copula_knockoff methods | MIT | Included in plain tabnetics. |
MAPIE (mapie) | Conformal prediction / uncertainty wrappers | Base package; conformal classifier outputs | BSD-3-Clause | Included in plain tabnetics. |
statsmodels (statsmodels) | BH / multiple-testing correction in prefiltering | Base package | BSD License | Included in plain tabnetics. |
datasets (datasets) | HuggingFace dataset bundle loading for benchmark/validation workflows | Base package | Apache 2.0 | Included in plain tabnetics. |
FLAML (flaml) | AutoML classifier backend/oracle candidates | Base package; classification_backend="flaml" | MIT License | Included in plain tabnetics. |
Optuna (optuna) | Hyperparameter-search classifier backend/oracle candidates | Base package; classification_backend="optuna" | MIT License | Included in plain tabnetics. |
pytabkit (pytabkit) | Official TabM / RealMLP-TD sklearn-compatible backends | Base package; tabm_official, realmlp_td | Apache-2.0 | Included in plain tabnetics; paired with PyTorch. |
PyTorch (torch) | Runtime for official pytabkit backends and benchmark worker/runtime paths | Base package | BSD-3-Clause | Included in plain tabnetics. |
LightGBM (lightgbm) | Gradient-boosted tree classifier candidate | Base package | MIT | Included in plain tabnetics. |
XGBoost (xgboost) | Gradient-boosted tree classifier candidate | Base package | Apache-2.0 | Included in plain tabnetics. |
CatBoost (catboost) | Gradient-boosted tree classifier candidate | Base package | Apache License, Version 2.0 | Included in plain tabnetics. |
TabPFN (tabpfn) | Foundation-model classifier candidate | full extra only; tabpfn classifier path | Prior Labs License (Apache 2.0 with additional attribution) | Kept opt-in in tabnetics[full]; upstream packaging terms add attribution obligations, and the default TabPFN-2.5 weights remain non-commercial. |
Your own CSV files
For ad-hoc datasets, the simplest path is to load your CSV with pandas, keep metadata columns separate, and pass only numeric features into the pipeline. The examples below assume a label column plus a few metadata fields such as sample IDs, study IDs, or batch IDs.
1. Default settings
import pandas as pd
from tabnetics.pipeline import DistributionFeatureSelectionPipeline, DFFSConfig
df = pd.read_csv("data/my_hdlss_dataset.csv")
label_col = "label"
metadata_cols = ["sample_id", "study_id", "batch_id"]
feature_cols = [c for c in df.columns if c not in metadata_cols + [label_col]]
X = df[feature_cols].apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float)
y = df[label_col].to_numpy()
pipeline = DistributionFeatureSelectionPipeline(DFFSConfig(random_seed=42))
result = pipeline.run(X, y, dataset_name="my_hdlss_dataset", seed=42)
print(f"Balanced accuracy: {result.balanced_accuracy:.3f}")
print(f"Selected features: {result.selected_features_count}")
This CSV example also uses the auto-router. Add auto_router_enabled=False to DFFSConfig if you want to force manual defaults.
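Because pd.to_numeric(..., errors="coerce") turns every unparseable cell into NaN without raising, it can help to count such cells before building X. The sketch below is a standard-library pre-check, not a tabnetics feature; count_non_numeric and the toy column names are hypothetical.

```python
import csv
import io


def count_non_numeric(csv_text, skip_cols):
    """Count, per feature column, the cells that float() cannot parse."""
    bad_counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col, raw in row.items():
            if col in skip_cols:
                continue  # label and metadata columns are not features
            try:
                float(raw)
            except (TypeError, ValueError):
                # Empty cells, missing cells, and tokens such as "NA" land here.
                bad_counts[col] = bad_counts.get(col, 0) + 1
    return bad_counts


# Small in-memory CSV standing in for data/my_hdlss_dataset.csv.
csv_text = "\n".join([
    "sample_id,label,gene_a,gene_b",
    "s1,tumor,1.5,0.2",
    "s2,normal,NA,0.4",
    "s3,tumor,2.0,",
])

bad = count_non_numeric(csv_text, skip_cols={"sample_id", "label"})
print(bad)
```

Columns with many unparseable cells are candidates for cleaning or removal before the matrix is handed to the pipeline.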
2. Different profile + additional flags
If you want one of the benchmark method profiles on your own CSV, import the profile from tabnetics.benchmarks.profiles and layer extra flags on top of it:
import pandas as pd
from tabnetics.benchmarks.profiles import FS_METHOD_SETS
from tabnetics.pipeline import DistributionFeatureSelectionPipeline, DFFSConfig
df = pd.read_csv("data/my_hdlss_dataset.csv")
label_col = "label"
metadata_cols = ["sample_id", "study_id", "batch_id"]
feature_cols = [c for c in df.columns if c not in metadata_cols + [label_col]]
X = df[feature_cols].apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float)
y = df[label_col].to_numpy()
profile_name = "mnpo_v14_core_plus_ipss"
config = DFFSConfig(
random_seed=42,
auto_router_enabled=False,
enabled_methods=FS_METHOD_SETS[profile_name],
n_final_features=75,
prefilter_top_k=800,
screening_enabled=True,
screening_method="evalue",
eval_models_enabled=True,
df_stage_position="after_fs",
max_dist_features=512,
)
pipeline = DistributionFeatureSelectionPipeline(config)
result = pipeline.run(X, y, dataset_name="my_hdlss_profiled_csv", seed=42)
print(f"Profile: {profile_name}")
print(f"Chosen model: {result.model_name}")
print(f"Balanced accuracy: {result.balanced_accuracy:.3f}")
3. Multi-omics correction on specific fields
For real multi-omics CSVs with explicit blocks such as transcriptomics, proteomics, or metabolomics, define the block columns yourself and fit the correction/integration step on the training split only. This keeps the workflow leakage-safe.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tabnetics.multiomics import MINTIntegrator
from tabnetics.pipeline import DistributionFeatureSelectionPipeline, DFFSConfig
df = pd.read_csv("data/my_multiomics_dataset.csv")
label_col = "label"
study_fields = ["study_id", "plate_id"]
transcript_cols = [c for c in df.columns if c.startswith("rna_")]
protein_cols = [c for c in df.columns if c.startswith("prot_")]
clinical_cols = ["age", "stage_code"]
y = df[label_col].to_numpy()
study_labels = (
df[study_fields[0]].astype(str)
+ "__"
+ df[study_fields[1]].astype(str)
).to_numpy()
idx = np.arange(len(df))
train_idx, test_idx = train_test_split(
idx,
test_size=0.20,
random_state=42,
stratify=y,
)
train_blocks = [
(
df.iloc[train_idx][transcript_cols]
.apply(pd.to_numeric, errors="coerce")
.to_numpy(dtype=float),
"transcriptomics",
),
(
df.iloc[train_idx][protein_cols]
.apply(pd.to_numeric, errors="coerce")
.to_numpy(dtype=float),
"proteomics",
),
]
test_blocks = [
(
df.iloc[test_idx][transcript_cols]
.apply(pd.to_numeric, errors="coerce")
.to_numpy(dtype=float),
"transcriptomics",
),
(
df.iloc[test_idx][protein_cols]
.apply(pd.to_numeric, errors="coerce")
.to_numpy(dtype=float),
"proteomics",
),
]
mint = MINTIntegrator(n_components=2)
mint.fit(train_blocks, y[train_idx], study_labels[train_idx])
Z_train = mint.transform(train_blocks, study_labels[train_idx])
Z_test = mint.transform(test_blocks, study_labels[test_idx])
X_train = np.hstack([
Z_train,
df.iloc[train_idx][clinical_cols]
.apply(pd.to_numeric, errors="coerce")
.to_numpy(dtype=float),
])
X_test = np.hstack([
Z_test,
df.iloc[test_idx][clinical_cols]
.apply(pd.to_numeric, errors="coerce")
.to_numpy(dtype=float),
])
pipeline = DistributionFeatureSelectionPipeline(DFFSConfig(random_seed=42))
result = pipeline.run_pre_split(
X_train,
y[train_idx],
X_test,
y[test_idx],
dataset_name="my_multiomics_dataset",
seed=42,
)
print(f"Balanced accuracy: {result.balanced_accuracy:.3f}")
If you want explicit multi-block integration without study/batch correction, replace MINTIntegrator with MultiBlockPLSDA. If those block matrices contain missing values, impute them on the training split before calling fit(...) and transform(...).
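The training-only imputation mentioned above can be sketched in plain Python as follows. Median imputation is an assumption chosen for illustration (the guide does not prescribe an imputation method), and the helper names are hypothetical. The key point is that the statistics are learned from training rows and then reused unchanged on test rows.

```python
import math
import statistics


def fit_column_medians(rows):
    """Compute one median per column from the training rows, ignoring NaN."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        # Assumption: fall back to 0.0 when a column has no observed training value.
        medians.append(statistics.median(observed) if observed else 0.0)
    return medians


def apply_imputation(rows, medians):
    """Replace NaN entries with the stored training medians."""
    return [
        [medians[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


nan = float("nan")
train_block = [[1.0, 10.0], [nan, 20.0], [3.0, nan], [5.0, 40.0]]
test_block = [[nan, nan], [2.0, 15.0]]

medians = fit_column_medians(train_block)             # learned from training rows only
train_filled = apply_imputation(train_block, medians)
test_filled = apply_imputation(test_block, medians)   # test rows reuse training medians
print(medians)
print(test_filled)
```

With NumPy block matrices the same pattern applies: compute the per-column statistic on the training rows, then fill both splits with it before calling fit(...) and transform(...).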
Configuration
DFFSConfig is a dataclass with ~120 fields. Most have sensible defaults. The key groups are:
Core
| Parameter | Default | Description |
|---|---|---|
random_seed | 42 | Global random seed |
test_size | 0.20 | Fraction held out for test |
n_final_features | 50 | Target number of features after selection |
n_jobs | 1 | Parallelism for FS methods and distribution fitting |
fs_fraction | 0.40 | Fraction of training data used for feature selection |
Distribution fitting
| Parameter | Default | Description |
|---|---|---|
dist_criterion | "simple" | Criterion: simple, cvm_p, ks_p, aic, bic, aicc, cv, cv_loglik, crps, mnpo_oracle |
apply_cdf_transform | True | Apply CDF transform after fitting |
df_stage_position | "after_fs" | before_fs or after_fs |
max_dist_features | 256 | Max features to distribution-fit (skip rest) |
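As a rough illustration of what a CDF transform does, the sketch below fits a normal distribution to one training column and maps values through the fitted CDF (the probability integral transform). The normal family is an assumption made for brevity; in tabnetics the fitted family comes from the distribution selector, and the internal implementation may differ.

```python
from statistics import NormalDist, fmean, stdev

train_values = [2.1, 2.9, 3.4, 3.8, 4.0, 4.6, 5.2, 6.0]
test_values = [1.0, 4.0, 7.5]

# Fit on the training column only; a normal family is assumed for illustration.
fitted = NormalDist(mu=fmean(train_values), sigma=stdev(train_values))

# Probability integral transform: each value becomes its fitted CDF value in (0, 1).
train_u = [fitted.cdf(v) for v in train_values]
test_u = [fitted.cdf(v) for v in test_values]

print([round(u, 3) for u in test_u])
```

The transform is monotone, so feature ordering is preserved, and test values are mapped with parameters estimated from training data only.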
Prefilter (Tier 1)
| Parameter | Default | Description |
|---|---|---|
use_rank_prefilter | True | Enable univariate prefilter before FS |
prefilter_top_k | 600 | Keep top-k features from prefilter |
prefilter_strategies | ("mi_ftest_blend","rf_importance","wsnr","bh_fdr") | Prefilter scoring strategies |
batch_correction | "none" | Batch correction: none, combat, combat_seq, cdf_center, center_scale |
Screening (Tier 2)
| Parameter | Default | Description |
|---|---|---|
screening_enabled | True | Enable interaction-aware screening |
screening_method | "evalue" | Screening method: stir, evalue, none |
Feature selection
| Parameter | Default | Description |
|---|---|---|
enabled_methods | 7-method default stack | Tuple of method keys to run |
fs_portfolio_size | 5 | MNPO portfolio max candidates |
fs_oracle_weighting_mode | "banzhaf" | Oracle weighting: tritrust, uniform, shapley, banzhaf |
Classification
| Parameter | Default | Description |
|---|---|---|
model_candidates | ("lr","svm_rbf","svm_linear","dlda","knn","rf","nb","elastic_net_lr") | Classifier pool (legacy default; regime pools override this when regime gating is enabled) |
folding_method | "pls_da" | Dimensionality reduction: none, rff, tensor_sketch, pls_da |
scaler_mode | "standard" | Input scaling: standard, robust, quantile |
Regime-aware classifier pools
When regime_gating_enabled=True, the pipeline ignores model_candidates and uses regime-appropriate pools defined in REGIME_POOLS:
| Regime | Pool size | Key families |
|---|---|---|
hdlss_extreme (n/p < 0.1) | 22 | LR, elastic-net LR, SVM-linear, BC-SVM, RP-ensemble, DLDA, shrinkage-LDA, NSC, PLS-DA, NB, DBDA, GQDA, SGLNN, RFF-LR, near-subspace, spatial-median-DA, copula-DA, CPDA, TabM, RealMLP, TabM-official*, RealMLP-TD* |
hdlss_moderate (0.1 ≤ n/p < 1) | 27 | All of extreme + SVM-RBF, GPC, KNN, vote-ensemble, TabPFN |
standard (n/p ≥ 1) | All available | Full pool including tree models (RF, XGBoost, LightGBM, CatBoost) |
*Included in both plain tabnetics and tabnetics[full]. If pytabkit is missing from a custom environment, these paths fail gracefully.
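The n/p thresholds in the regime table can be written as a small function. This is a sketch of the documented boundaries only; classify_regime is a hypothetical name, and the packaged gating logic may use additional inputs.

```python
def classify_regime(n_samples: int, n_features: int) -> str:
    """Map a dataset shape to the regime names used in the table above."""
    if n_samples <= 0 or n_features <= 0:
        raise ValueError("n_samples and n_features must be positive")
    ratio = n_samples / n_features
    if ratio < 0.1:
        return "hdlss_extreme"   # n/p < 0.1
    if ratio < 1.0:
        return "hdlss_moderate"  # 0.1 <= n/p < 1
    return "standard"            # n/p >= 1


for shape in [(72, 7129), (253, 1000), (5000, 40)]:
    print(shape, classify_regime(*shape))
```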
Dual implementations. TabM and RealMLP each have two backends:
- tabm / realmlp: numpy-only approximations (zero extra deps, millisecond fit times on HDLSS data)
- tabm_official / realmlp_td: official PyTorch implementations from pytabkit (higher fidelity, requires PyTorch)
Both variants are in the regime pools. The oracle evaluates whichever are available and selects the best combination.
Opt-in features
| Parameter | Default | Description |
|---|---|---|
enable_ratio_features | False | Construct log-ratio features (pairs of original features) |
regime_gating_enabled | False | Route datasets to regime-appropriate profiles |
eval_models_enabled | True | Multi-classifier evaluation proxy |
Standalone feature selection
Use FeatureSelector directly when you only need feature selection without the full pipeline:
from tabnetics.feature_selection import FeatureSelector
fs = FeatureSelector(
random_state=42,
selection_strategy="mnpo_portfolio",
portfolio_size=6,
n_folds=5,
n_bootstrap_iterations=10,
)
X_selected, fs_result = fs.fit_transform(
X_train, y_train, n_final_features=30, return_result_object=True
)
# Selected feature indices
print(fs_result.selected_feature_indices)
# Per-method results
for method, info in fs_result.method_results.items():
    print(f"{method}: {len(info.get('selected', []))} features")
Configuring the MNPO oracle
from tabnetics.feature_selection import OracleConfig
oracle = OracleConfig.from_preset("full")
# Presets: perf_only, perf_complexity, perf_complexity_stability, full, minimal_cvar
# Or customize:
oracle = OracleConfig(
weighting_mode="banzhaf",
diversity_mode="mi_redundancy",
use_cvar=True,
cvar_alpha=0.33,
)
Standalone distribution fitting
Select the best-fitting parametric distribution for a single feature:
from tabnetics.distribution.selector import UnifiedDistributionSelectorV6
selector = UnifiedDistributionSelectorV6(robust_mode=True, n_jobs=1)
best_name, best_fit, all_fits = selector.select_best_distribution(
data_column, # 1-D numpy array
criterion="simple", # or cvm_p, ks_p, aic, bic, cv, crps, mnpo_oracle
)
print(f"Best distribution: {best_name}")
print(f"Parameters: {best_fit.params}")
print(f"KS p-value: {best_fit.ks_pvalue:.4f}")
Available criteria:
| Criterion | Description |
|---|---|
simple | Fast prescreening with KS p-value and CVM ranking (default) |
cvm_p | Cramér–von Mises p-value |
ks_p | Kolmogorov–Smirnov p-value |
aic / bic / aicc | Information criteria |
cv / cv_loglik | Cross-validated log-likelihood |
crps | Continuous ranked probability score |
mnpo_oracle | Game-theoretic multi-criterion selection |
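For readers unfamiliar with the information criteria, the sketch below computes the textbook AIC, BIC, and AICc for a normal distribution fitted by maximum likelihood. It uses only the standard definitions and is not taken from the tabnetics source; the helper names are hypothetical. Lower values indicate a better trade-off between fit and parameter count.

```python
import math
from statistics import NormalDist, fmean


def normal_log_likelihood(values):
    """Log-likelihood of a normal fitted by maximum likelihood."""
    mu = fmean(values)
    variance = sum((v - mu) ** 2 for v in values) / len(values)  # MLE variance
    dist = NormalDist(mu, math.sqrt(variance))
    return sum(math.log(dist.pdf(v)) for v in values)


def information_criteria(log_lik, n_params, n_obs):
    """Textbook AIC, BIC, and small-sample corrected AICc."""
    aic = 2 * n_params - 2 * log_lik
    bic = n_params * math.log(n_obs) - 2 * log_lik
    # AICc is only defined when n_obs > n_params + 1.
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    return {"aic": aic, "bic": bic, "aicc": aicc}


data = [2.1, 2.9, 3.4, 3.8, 4.0, 4.6, 5.2, 6.0]
log_lik = normal_log_likelihood(data)
criteria = information_criteria(log_lik, n_params=2, n_obs=len(data))
print({name: round(value, 3) for name, value in criteria.items()})
```

With few observations the AICc correction term is large, which is one reason a small-sample variant is useful in HDLSS settings.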
Running benchmarks
Command line
# Run on a specific dataset
tabnetics-benchmark --datasets leukemia_golub --seeds 11 23 37
# Run on a named dataset group
tabnetics-benchmark --dataset-sets fs_easy --max-workers 8
# Run with a specific method profile
tabnetics-benchmark --datasets leukemia_golub dlbcl_shipp \
--fs-method-set mnpo_broad_stable \
--seeds 42
# Full benchmark with distribution fitting diagnostics
tabnetics-benchmark --dataset-sets core \
--dist-criterion simple \
--df-stage-position after_fs \
--compute-budget standard \
--max-workers 16 \
--seeds 11 23 37
Key CLI flags
| Flag | Default | Description |
|---|---|---|
--datasets | — | Space-separated dataset IDs |
--dataset-sets | — | Named groups: core, fs_easy, fs_medium, fs_hard, smoke |
--seeds | 11 23 37 | Random seeds for repeated runs |
--max-workers | 1 | Parallel dataset workers |
--test-size | 0.20 | Test split fraction |
--fs-method-set | — | Named method profile (see profiles) |
--dist-criterion | simple | Distribution fitting criterion |
--df-stage-position | after_fs | before_fs or after_fs |
--dataset-integrity-policy | error | Fail, skip, or fallback when class-diversity sanity checks fail |
--compute-budget | standard | fast, standard, thorough |
--prefilter-top-k | 600 | Prefilter feature count |
--screening-enabled | False | Enable Tier-2 screening |
--screening-method | none | stir, evalue, none |
--eval-models-enabled | False | Multi-classifier evaluation proxy |
--multiomics-adapter | none | Benchmark-time shortcut for synthetic block construction (split_halves) |
--enable-classifier-conformal | False | Turn on classifier-side conformal diagnostics |
--classifier-conformal-method | split | split, aps, raps, or cross |
--task-timeout-sec | 300 | Per-dataset timeout |
--quiet-worker-logs | False | Suppress worker output |
--enable-nestedcv-audit | False | Nested CV robustness audit |
Programmatic
import sys
from tabnetics.benchmarks.cli import main
# Equivalent to the CLI invocation: set the argument list, then call the entry point
sys.argv = ["tabnetics-benchmark", "--datasets", "leukemia_golub", "--seeds", "42"]
main()
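If you launch several programmatic runs, a small helper can assemble the argument list from Python values. The sketch below is hypothetical (build_benchmark_argv is not part of tabnetics) and only emits flags listed in the table above.

```python
def build_benchmark_argv(datasets, seeds, fs_method_set=None, max_workers=None):
    """Assemble an argument list using flags documented in the CLI table."""
    argv = ["tabnetics-benchmark", "--datasets", *datasets]
    argv += ["--seeds", *[str(seed) for seed in seeds]]
    if fs_method_set is not None:
        argv += ["--fs-method-set", fs_method_set]
    if max_workers is not None:
        argv += ["--max-workers", str(max_workers)]
    return argv


argv = build_benchmark_argv(
    ["leukemia_golub", "dlbcl_shipp"],
    seeds=[11, 23],
    fs_method_set="mnpo_broad_stable",
    max_workers=4,
)
print(" ".join(argv))
```

Assign the returned list to sys.argv before calling main(), as in the snippet above.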
Validation-catalog benchmark datasets are not allowed to silently fall back to synthetic proxies. When those datasets are selected, the benchmark CLI requires the authoritative HuggingFace bundle via TABNETICS_HF_ORG or TABNETICS_HF_REPO_ID. That bundle is an operational mirror of the public upstream datasets rather than a separate private corpus.
Validation campaigns
Tabnetics ships three packaged validation surfaces:
- tabnetics-validation-plan / python -m tabnetics.validation.generate_plan
- tabnetics-validation-shard / python -m tabnetics.validation.core.shard_runner
- tabnetics-validation-suite / python -m tabnetics.validation.suite
Typical workflow:
export TABNETICS_HF_ORG=klokedm
export TABNETICS_HF_REPO_ID=klokedm/tabnetics-validation
# 1. Build a sharded campaign plan.
tabnetics-validation-plan --plan-kind validation17 --num-pods 4
# 2. Run a small local slice with the unified suite.
tabnetics-validation-suite \
--dataset-sets fs_easy \
--seeds 11 23 37 \
--dataset-integrity-policy error \
--output-dir run_artifacts/validation/unified_suite
# 3. Execute one shard from a generated plan.
tabnetics-validation-shard \
--shard-id 1 \
--plan pod_validation/plan_4_val17.json \
--shards pod_validation/shards_4_val17.json
Use the suite for smaller slices or component ablations. Use the plan + shard flow when you want a reproducible multi-job campaign across a larger benchmark catalog.
Datasets
Tabnetics ships with a registry of 70+ HDLSS benchmark datasets. Operationally, many validation-catalog runs are loaded through the HuggingFace bundle, but the underlying data come from public upstream sources such as OpenML, GEO, CuMiDa, UCSC Xena, public HuggingFace datasets, and Orange / University of Ljubljana (biolab.si) mirrors where applicable.
Registry
from tabnetics.datasets import CATALOG, DATASET_SETS
# List all registered datasets
for ds_id, spec in CATALOG.items():
    print(f"{ds_id}: {spec.display_name} ({spec.n_samples}×{spec.n_features}, {spec.n_classes} classes)")
# List named dataset groups
print(list(DATASET_SETS.keys()))
# ['all', 'smoke', 'core', 'extended', 'fs_all', 'fs_easy', 'fs_medium', ...]
Dataset groups
| Group | Description |
|---|---|
smoke | 3 datasets for quick sanity checks |
core | All non-extended datasets |
extended | Full catalog including CuMiDa and TCGA |
fs_easy / fs_medium / fs_hard / fs_very_hard | FS pipeline datasets by difficulty tier |
Example datasets
| ID | Name | Samples | Features | Classes |
|---|---|---|---|---|
leukemia_golub | Leukemia (Golub) | 72 | 7,129 | 2 |
dlbcl_shipp | DLBCL (Shipp) | 77 | 5,469 | 2 |
ovarian_petricoin | Ovarian Cancer (Petricoin) | 253 | 15,154 | 2 |
srbct_khan | SRBCT (Khan) | 83 | 2,308 | 4 |
prostate_singh | Prostate Cancer (Singh) | 102 | 12,600 | 2 |
carcinom_11class | Carcinom 11-class | 174 | 9,182 | 11 |
nci9_60_9class | NCI9 | 60 | 9,712 | 9 |
gla_bra_180 | GLA-BRA-180 | 180 | 49,151 | 4 |
Reproducibility and data source policy
The public Pages site reflects the packaged implementation, not earlier draft workflows. The operational rules to keep in mind are:
- df_stage_position="after_fs" is the current default in the packaged pipeline and benchmark CLI.
- Validation-catalog and evidence-bearing benchmark runs treat the HuggingFace bundle as the authoritative reproducibility mirror of the public upstream sources.
- The benchmark CLI and validation suite default to --no-synthetic-fallback; if you explicitly re-enable fallback for non-evidence workflows, do so knowingly and record the source policy in your artifacts.
- dataset_integrity_policy="error" is the default so mislabeled or collapsed class structures fail fast instead of silently contaminating results.
- The built-in multiomics_adapter="split_halves" is intended for benchmark-time stress tests; use explicit omics blocks with MultiBlockPLSDA or MINTIntegrator on real data.
If you are preparing a public benchmark or a paper-facing validation run, keep the HF bundle, no-synthetic-fallback, and integrity-error settings unchanged.
Uncertainty and conformal outputs
Classifier-side conformal prediction is available as an opt-in diagnostic layer:
tabnetics-benchmark \
--datasets leukemia_golub \
--enable-classifier-conformal \
--classifier-conformal-method aps \
--classifier-conformal-output-sets
Interpret those outputs as uncertainty diagnostics:
- coverage and mean prediction-set size
- singleton rate / compactness of the prediction sets
- optional per-sample prediction sets when artifact size is acceptable
They are not expected to improve balanced accuracy. This follows the MAPIE interpretation in Taquet et al. 2022: conformal wrappers calibrate prediction sets around a fixed base classifier. For singleton-oriented efficiency and reject-option interpretations, see Wang, Sun & Dobriban 2025 and Hallberg Szabadváry et al. 2025.
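To show why a conformal wrapper changes set sizes and coverage rather than accuracy, here is a minimal split-conformal sketch in plain Python. It uses the simple nonconformity score 1 - p(true class) with the usual finite-sample rank correction; it is neither MAPIE's API nor necessarily what the split, aps, raps, or cross options do internally, and all function names are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Threshold on the score 1 - p(true class) from a held-out calibration set."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    return scores[rank - 1] if rank <= n else float("inf")


def prediction_sets(test_probs, threshold):
    """Keep every class whose score 1 - p does not exceed the threshold."""
    return [
        {k for k, p in enumerate(probs) if 1.0 - p <= threshold}
        for probs in test_probs
    ]


def summarize(sets, labels):
    """Coverage, mean set size, and singleton rate of the prediction sets."""
    n = len(sets)
    return {
        "coverage": sum(label in s for s, label in zip(sets, labels)) / n,
        "mean_set_size": sum(len(s) for s in sets) / n,
        "singleton_rate": sum(len(s) == 1 for s in sets) / n,
    }


# Class probabilities from a fixed base classifier (toy values).
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6], [0.5, 0.5],
             [0.45, 0.55], [0.6, 0.4], [0.35, 0.65], [0.8, 0.2]]
cal_labels = [0, 1, 0, 1, 0, 0, 1, 0, 1]
test_probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.5, 0.5]]
test_labels = [0, 1, 1, 0]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
sets = prediction_sets(test_probs, threshold)
summary = summarize(sets, test_labels)
print(sets)
print(summary)
```

The base classifier's probabilities are never changed here; only the sets built around them are. Ambiguous samples receive larger sets, which is what the coverage, set-size, and singleton-rate diagnostics measure.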
Oracle presets
The MNPO oracle can be configured with presets via OracleConfig.from_preset():
| Preset | Oracles | Use case |
|---|---|---|
perf_only | Performance | Fastest; single-criterion selection |
perf_complexity | Performance + Complexity | Prefer simpler feature sets |
perf_complexity_stability | Performance + Complexity + Stability | Add bootstrap stability |
full | Performance + Stability + Complexity + Robust + Diversity | Production default — all 5 oracles |
minimal_cvar | Performance + CVaR | Tail-risk focus for small datasets |
Feature selection methods
The registry contains 35+ methods, grouped here by paradigm:
Stability
| Key | Label |
|---|---|
stability_lasso | Stability Selection (Lasso) |
stability_subsample | Stability Selection (complementary subsampling) |
tigress_stability | TIGRESS-style Stability Selection |
subspace_stability | Subspace Stability Selection |
decorrelated_stability | Decorrelated Stability Selection |
ipss | Integrated Path Stability Selection (IPSS) |
cluster_stability | Cluster Stability Selection |
Wrapper
| Key | Label |
|---|---|
rfecv | Recursive Feature Elimination |
boruta | Boruta |
iterative_redundancy_pruning | Iterative redundancy-pruning wrapper |
iterative_redundancy_pruning_bounded | Iterative redundancy-pruning wrapper (runtime-bounded) |
Filter
| Key | Label |
|---|---|
mutual_information | Mutual Information |
anova_f | ANOVA F-test |
chi_square | Chi-Square univariate filter |
relieff | ReliefF instance-based filter |
fcbf | FCBF correlation-based filter |
cmim | CMIM conditional MI filter |
hsic_lasso | HSIC Lasso-style kernelized selection |
mrmr_jmi | mRMR/JMI redundancy-aware selection |
Embedded
| Key | Label |
|---|---|
gradient_boosting | Gradient Boosting |
linear_svm | Linear SVM |
treeshap | TreeSHAP embedded selector |
oaenet | OAENet adaptive elastic-net selector |
slce_centroid_encoder | SLCE centroid-encoder selection |
group_sparse_lasso | Group sparse lasso |
Pairwise / AUC
| Key | Label |
|---|---|
wmw_auc | WMW univariate AUC filter |
joint_auc_l1 | Joint AUC-aware L1 selector (binary only) |
ktsp | k-TSP pairwise rank selection |
Knockoff
| Key | Label |
|---|---|
copula_knockoff | Copula knock-off selection |
Multiclass
| Key | Label |
|---|---|
ova_ensemble | OVA multiclass ensemble selection |
ecoc_class_aware | ECOC class-aware decomposition selection |
joint_multiclass_support | Joint multiclass shared-support selection |
dove_class_specific | DOvE-style class-specific multiclass selection |
sparse_multinomial | Sparse multinomial multiclass selection |
nearest_shrunken_centroid | Nearest shrunken centroids multiclass selection |
class_pareto_front | Class-specific Pareto-front multiclass selection |
sir_sdr / save_sdr / pfc_sdr | Sufficient dimension reduction selectors |
Method profiles
Named method sets for benchmark runs. Use with --fs-method-set <profile> on the CLI.
| Profile | Methods | Notes |
|---|---|---|
strict_plus_mrmr | GB, SVM, MI, ANOVA, mRMR | 5-method baseline |
strict_plus_mrmr_auc | Baseline + WMW AUC | 6-method |
mnpo_copula_extended | Baseline + copula knockoff | Knockoff expansion |
mnpo_ipss_extended | Baseline + IPSS | Stability expansion |
mnpo_broad_stable | 14 production-safe methods | Adds Boruta, copula knockoff, decorrelated stability, ReliefF, stability lasso, RFECV, HSIC lasso |
mnpo_v14_core | 15 methods | broad_stable + joint multiclass support |
mnpo_v14_core_plus_ipss | 16 methods | v14_core + IPSS |
mnpo_broad_all | 36 methods | Exhaustive — all non-deprecated selectors |
TabArena pipeline profiles
Pipeline-level profiles for the TabArena general tabular benchmark. Use with --profile <name> via python -m experiments.benchmarking.tabarena_benchmark.
| Profile | Classifier pool | Key settings | Best for |
|---|---|---|---|
hdlss | LR, SVM-RBF, RF, KNN, elastic-net, NB | CDF transform on, HDLSS screening/folding active | HDLSS baseline (p » n) |
general | LR, SVM-RBF, RF, KNN, elastic-net, NB, XGBoost, LightGBM | CDF off, FLAML-backed MNPO hybrid selection, 10-candidate cap | Moderate general tabular |
general_full (default) | 16 classifiers (full surface excl. TabPFN) | CDF off, FLAML-backed MNPO, no runtime cap, tritrust oracle k=3, ensemble, post-Val-17 FS defaults | Broad general tabular |
general_tabular | 12 tree-weighted classifiers (RF, ExtraTrees, XGBoost, LightGBM, CatBoost, LR, elastic-net, SVM-linear, SVM-RBF, KNN, NB, copula-DA) | CDF off, legacy selection with FLAML tuning, no screening/folding, adaptive FS fraction by N/p ratio, no HDLSS machinery | N » p datasets (many samples, few features) |
The general_tabular profile adapts feature selection by the sample-to-feature ratio:
- N/p > 100 or p ≤ 20: fs_fraction=1.0 (keep all features through FS)
- N/p > 10 or p ≤ 50: fs_fraction=0.90
- Otherwise: fs_fraction=0.50
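The three rules above can be expressed as a short function. This is a sketch of the documented thresholds only; adaptive_fs_fraction is a hypothetical name and the packaged profile may apply further conditions.

```python
def adaptive_fs_fraction(n_samples: int, n_features: int) -> float:
    """Return the fs_fraction implied by the documented N/p and p thresholds."""
    ratio = n_samples / n_features
    if ratio > 100 or n_features <= 20:
        return 1.0
    if ratio > 10 or n_features <= 50:
        return 0.90
    return 0.50


for shape in [(10000, 50), (500, 15), (1000, 60), (300, 40), (300, 100)]:
    print(shape, adaptive_fs_fraction(*shape))
```

The rules are checked in order, so a dataset that satisfies the first condition never reaches the second.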
Prefilter is only activated when p > 200 (with BH correction disabled per Val-18 P03 evidence). FS still uses the post-Val-17 Banzhaf-weighted defaults, while classifier selection now stays on the legacy FLAML path because the strongest MNPO-collapse evidence so far is HDLSS/small-sample specific rather than general-tabular. Classifier-side conformal uses APS (per Val-18 C07 evidence). Screening and folding are always off (these are HDLSS-specific). HDLSS-oriented classifiers (GPC, PLS-DA, NSC, vote_ensemble, shrinkage_LDA) are excluded in favour of tree ensembles plus copula_da as an extra generative family.
Multi-omics
Tabnetics includes DIABLO-style multi-block PLS-DA and MINT batch correction for multi-omics integration:
from tabnetics.multiomics import MultiBlockPLSDA
# X_blocks: list of arrays, one per omics layer
# y: shared class labels
model = MultiBlockPLSDA(n_components=3)
model.fit(X_blocks, y)
scores = model.transform(X_blocks)
For the full pipeline, multiomics_adapter="split_halves" is the built-in benchmark-style shortcut: it derives two blocks from the first and second half of the feature matrix. For real CSVs with named omics fields, prefer the explicit field-based pattern in Your own CSV files and build the blocks yourself with MultiBlockPLSDA or MINTIntegrator.
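The split_halves idea can be pictured with a few lines of plain Python. This is a sketch, not the packaged adapter: the function below is hypothetical, and the handling of an odd feature count (the extra column goes to the second block) is an assumption.

```python
def split_halves(rows):
    """Split each row into a first-half block and a second-half block of features."""
    n_features = len(rows[0])
    mid = n_features // 2  # assumption: with odd p, the second block gets the extra column
    first_block = [row[:mid] for row in rows]
    second_block = [row[mid:] for row in rows]
    return first_block, second_block


X = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
]
block_a, block_b = split_halves(X)
print(block_a)
print(block_b)
```

Because the blocks follow column order rather than biology, this construction is only meaningful as a stress test, which is why explicit omics blocks are recommended for real data.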
For the broader modeling context, see Cai et al. 2022, which reviews why multi-omics integration remains a meaningful lever for cancer classification even when small-sample settings make the engineering path harder.
See tabnetics.multiomics for the full API.
License
Apache 2.0 — see LICENSE.
Documentation and webpages on this site are generated from authoritative internal sources using a combination of deterministic rules and generative AI. Errors are possible. Please report issues via GitHub Discussions or email [email protected].