Using Tabnetics

This guide covers practical usage of tabnetics — from running the full pipeline on your own data to configuring individual components.

For the ready-to-run setup used across this guide, install:

pip install tabnetics

The base install covers the ready-to-run public runtime and includes every direct dependency that currently ships, except TabPFN. To add TabPFN as well, install the opt-in extra with pip install "tabnetics[full]". Optional integrations keep their own upstream licenses and terms; see Third-party integrations and licenses.

Full pipeline
Auto router
Third-party integrations and licenses
Your own CSV files
Configuration
Standalone feature selection
Standalone distribution fitting
Running benchmarks
Validation campaigns
Datasets
Reproducibility and data source policy
Uncertainty and conformal outputs
Oracle presets
Feature selection methods
Method profiles (benchmark)
Multi-omics

Full pipeline

The main entry point is DistributionFeatureSelectionPipeline. It handles train/test splitting, distribution fitting, feature selection, and classification in a single leakage-safe call.

from tabnetics.pipeline import DistributionFeatureSelectionPipeline, DFFSConfig
import numpy as np

# X: (n_samples, n_features) array
# y: (n_samples,) array of class labels

config = DFFSConfig(
    random_seed=42,
    test_size=0.20,
    n_final_features=50,
    n_jobs=4,
)

pipeline = DistributionFeatureSelectionPipeline(config)
result = pipeline.run(X, y, dataset_name="my_dataset")

print(f"Balanced accuracy: {result.balanced_accuracy:.3f}")
print(f"Model:             {result.model_name}")
print(f"Features selected: {result.selected_features_count}")
print(f"Feature indices:   {result.selected_feature_indices_original}")

This default config uses the packaged V25 auto-router. It computes a training-only dataset descriptor, selects a supported method/config candidate, then runs the delegated pipeline with the router disabled to avoid recursion.

Pre-split mode

If you manage your own splits (e.g., for nested cross-validation), use run_pre_split():

result = pipeline.run_pre_split(
    X_train, y_train, X_test, y_test,
    dataset_name="my_dataset",
    seed=42,
)
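
For nested cross-validation, each outer fold supplies one train/test pair to run_pre_split(). The sketch below shows one dependency-free way to produce stratified outer folds. stratified_folds is a hypothetical helper, not part of Tabnetics; in practice sklearn.model_selection.StratifiedKFold does the same job.

```python
import random
from collections import defaultdict


def stratified_folds(y, n_folds=5, seed=42):
    """Yield (train_idx, test_idx) pairs with each class spread across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar class mix.
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)

    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


y_demo = [0] * 10 + [1] * 5
outer_pairs = list(stratified_folds(y_demo, n_folds=5))
for train_idx, test_idx in outer_pairs:
    # Each pair would feed one pipeline.run_pre_split(...) call, using
    # X[train_idx], y[train_idx], X[test_idx], y[test_idx].
    print(len(train_idx), len(test_idx))
```

Every sample lands in exactly one test fold, so the outer scores are computed on disjoint held-out data.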

Result object

PipelineRunResult contains:

Field	Description
`accuracy`	Test-set accuracy
`balanced_accuracy`	Balanced accuracy (macro-averaged recall)
`macro_f1`	Macro F1 score
`hybrid_score`	Weighted combination of balanced accuracy and macro F1
`roc_auc`	ROC AUC (binary or OvR multiclass)
`selected_features_count`	Number of features selected
`selected_feature_indices_original`	Indices into the original feature matrix
`model_name`	Name of the classifier chosen by the oracle
`distribution_summaries`	Per-feature distribution fit results
`fs_time_sec`	Feature selection wall time
`dist_time_sec`	Distribution fitting wall time
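
To make the metric definitions concrete, here is a small pure-Python sketch of balanced accuracy (macro-averaged recall) and macro F1. It is an illustration of the standard definitions, not Tabnetics' own code. The table does not state the weights behind hybrid_score, so the 0.5 weight below is a hypothetical placeholder.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in y_true or y_pred."""
    counts = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn)
    return counts


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes that occur in y_true."""
    counts = per_class_counts(y_true, y_pred)
    recalls = [tp / (tp + fn) for tp, fp, fn in counts.values() if tp + fn > 0]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    counts = per_class_counts(y_true, y_pred)
    f1_scores = []
    for tp, fp, fn in counts.values():
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

bal_acc = balanced_accuracy(y_true, y_pred)   # recalls are 3/4 and 1/2
f1 = macro_f1(y_true, y_pred)                 # per-class F1 is 0.75 and 0.5

HYBRID_WEIGHT = 0.5  # hypothetical weight; the real value is not documented here
hybrid = HYBRID_WEIGHT * bal_acc + (1 - HYBRID_WEIGHT) * f1
print(bal_acc, f1, hybrid)
```

On imbalanced data, plain accuracy (4/6 here) can differ noticeably from balanced accuracy, which is why the guide reports balanced_accuracy first.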

Current runtime defaults

The packaged defaults match the current benchmark/validation path:

auto_router_enabled=True is the default, so the V25 calibrated score-router chooses the supported profile from the training split before distribution fitting, feature selection, and classification.
df_stage_position="after_fs" is the promoted default, so the distribution-fitting stage operates on the selected feature space rather than the full raw matrix.
Evidence-bearing benchmark and validation runs default to allow_synthetic_fallback=False and dataset_integrity_policy="error".
Conformal prediction remains opt-in and is interpreted as uncertainty output, not as a balanced-accuracy gain mechanism.

Auto router

The auto-router is the recommended entry point for ordinary use. It avoids asking users to pick validation-campaign flags manually and instead selects among supported, already-tested candidates using descriptors computed directly from the training data.

from tabnetics.pipeline import DFFSConfig, DistributionFeatureSelectionPipeline

config = DFFSConfig(random_seed=42, n_jobs=4)
result = DistributionFeatureSelectionPipeline(config).run(X, y, dataset_name="my_dataset")

To opt out and keep explicit/manual configuration:

config = DFFSConfig(auto_router_enabled=False)

To inspect the router decision before running a full pipeline:

from tabnetics.auto_router import predict_auto_router

decision = predict_auto_router(X_train, y_train)
print(decision.metadata["selected_candidate_id"])
print(decision.enabled_methods)

The V25 router predicts balanced accuracy and macro-F1 for each of a fixed set of 12 candidate profiles and applies a conservative calibrated policy to choose among them. It uses only dataset-computable descriptors and candidate action encodings; it does not use holdout labels, validation tiers, or dataset identity.

Third-party integrations and licenses

Tabnetics itself is licensed under Apache 2.0, but it depends on several upstream projects with their own licenses or terms. Installing Tabnetics or enabling the full extra does not relicense those projects: you still need to comply with the upstream terms for each library you install or activate.

This table covers the direct runtime dependencies declared for the shipped public install surface (tabnetics plus opt-in tabnetics[full]). It is not a full transitive dependency manifest for every wheel dependency; for redistribution or commercial deployment, review the exact upstream license text for the versions you ship.

Library	Used for in Tabnetics	Surface	Audited upstream license / terms	Notes
NumPy (`numpy`)	Core ndarray/matrix operations across the pipeline, selectors, backends, and validation code	Base package	BSD-3-Clause AND 0BSD AND MIT AND Zlib AND CC0-1.0	PyPI metadata reflects the main BSD-3-Clause project license plus bundled permissive notices.
pandas (`pandas`)	Dataset loaders, metadata tables, reporting, and artifact assembly	Base package	BSD 3-Clause License	PyPI metadata leads with BSD-3-Clause text.
SciPy (`scipy`)	Statistical tests, optimization, sparse helpers, and MAT-file IO	Base package	BSD License	PyPI metadata marks SciPy as BSD-licensed.
scikit-learn (`scikit-learn`)	Estimator interface, preprocessing, CV, metrics, and many baseline models	Base package	BSD-3-Clause	Core runtime dependency.
threadpoolctl (`threadpoolctl`)	Worker-side thread caps for benchmark/runtime process control	Base package	BSD-3-Clause	Used in the benchmark runner to keep BLAS/OpenMP workers bounded.
tqdm (`tqdm`)	Progress bars in copula sampling and data-generation helpers exposed through shipped code paths	Base package	MPL-2.0 AND MIT	Mixed permissive/weak-copyleft metadata; preserve upstream notices if redistributing modified package files.
Boruta (`boruta`)	Boruta feature selection	Base package; `boruta` method	BSD 3 clause	Included in plain `tabnetics`.
SHAP (`shap`)	TreeSHAP values for the `treeshap` selector	Base package; `treeshap` method	MIT License	Included in plain `tabnetics`.
pyvinecopulib (`pyvinecopulib`)	Vine-copula knockoff generation	Base package; `copula_knockoff` methods	MIT	Included in plain `tabnetics`.
MAPIE (`mapie`)	Conformal prediction / uncertainty wrappers	Base package; conformal classifier outputs	BSD-3-Clause	Included in plain `tabnetics`.
statsmodels (`statsmodels`)	BH / multiple-testing correction in prefiltering	Base package	BSD License	Included in plain `tabnetics`.
datasets (`datasets`)	HuggingFace dataset bundle loading for benchmark/validation workflows	Base package	Apache 2.0	Included in plain `tabnetics`.
FLAML (`flaml`)	AutoML classifier backend/oracle candidates	Base package; `classification_backend="flaml"`	MIT License	Included in plain `tabnetics`.
Optuna (`optuna`)	Hyperparameter-search classifier backend/oracle candidates	Base package; `classification_backend="optuna"`	MIT License	Included in plain `tabnetics`.
pytabkit (`pytabkit`)	Official TabM / RealMLP-TD sklearn-compatible backends	Base package; `tabm_official`, `realmlp_td`	Apache-2.0	Included in plain `tabnetics`; paired with PyTorch.
PyTorch (`torch`)	Runtime for official `pytabkit` backends and benchmark worker/runtime paths	Base package	BSD-3-Clause	Included in plain `tabnetics`.
LightGBM (`lightgbm`)	Gradient-boosted tree classifier candidate	Base package	MIT	Included in plain `tabnetics`.
XGBoost (`xgboost`)	Gradient-boosted tree classifier candidate	Base package	Apache-2.0	Included in plain `tabnetics`.
CatBoost (`catboost`)	Gradient-boosted tree classifier candidate	Base package	Apache License, Version 2.0	Included in plain `tabnetics`.
TabPFN (`tabpfn`)	Foundation-model classifier candidate	`full` extra only; `tabpfn` classifier path	Prior Labs License (Apache 2.0 with additional attribution)	Kept opt-in in `tabnetics[full]`; upstream packaging terms add attribution obligations, and the default TabPFN-2.5 weights remain non-commercial.

Your own CSV files

For ad-hoc datasets, the simplest path is to load your CSV with pandas, keep metadata columns separate, and pass only numeric features into the pipeline. The examples below assume a label column plus a few metadata fields such as sample IDs, study IDs, or batch IDs.

1. Default settings

import pandas as pd

from tabnetics.pipeline import DistributionFeatureSelectionPipeline, DFFSConfig

df = pd.read_csv("data/my_hdlss_dataset.csv")

label_col = "label"
metadata_cols = ["sample_id", "study_id", "batch_id"]
feature_cols = [c for c in df.columns if c not in metadata_cols + [label_col]]

X = df[feature_cols].apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float)
y = df[label_col].to_numpy()

pipeline = DistributionFeatureSelectionPipeline(DFFSConfig(random_seed=42))
result = pipeline.run(X, y, dataset_name="my_hdlss_dataset", seed=42)

print(f"Balanced accuracy: {result.balanced_accuracy:.3f}")
print(f"Selected features: {result.selected_features_count}")

This CSV example also uses the auto-router. Add auto_router_enabled=False to DFFSConfig if you want to force manual defaults.
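
Because X is built from df[feature_cols], positions in selected_feature_indices_original should line up with positions in feature_cols. A minimal sketch of mapping the indices back to column names follows; the column names and index values are stand-ins, and the alignment itself is an assumption based on the "indices into the original feature matrix" description.

```python
# Stand-ins for the CSV feature columns and the pipeline result.
feature_cols = ["gene_a", "gene_b", "gene_c", "gene_d"]
selected_idx = [2, 0]  # plays the role of result.selected_feature_indices_original

# Translate positional indices back into readable column names.
selected_names = [feature_cols[i] for i in selected_idx]
print(selected_names)
```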

2. Different profile + additional flags

If you want one of the benchmark method profiles on your own CSV, import the profile from tabnetics.benchmarks.profiles and layer extra flags on top of it:

import pandas as pd

from tabnetics.benchmarks.profiles import FS_METHOD_SETS
from tabnetics.pipeline import DistributionFeatureSelectionPipeline, DFFSConfig

df = pd.read_csv("data/my_hdlss_dataset.csv")

label_col = "label"
metadata_cols = ["sample_id", "study_id", "batch_id"]
feature_cols = [c for c in df.columns if c not in metadata_cols + [label_col]]

X = df[feature_cols].apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float)
y = df[label_col].to_numpy()

profile_name = "mnpo_v14_core_plus_ipss"
config = DFFSConfig(
    random_seed=42,
    auto_router_enabled=False,
    enabled_methods=FS_METHOD_SETS[profile_name],
    n_final_features=75,
    prefilter_top_k=800,
    screening_enabled=True,
    screening_method="evalue",
    eval_models_enabled=True,
    df_stage_position="after_fs",
    max_dist_features=512,
)

pipeline = DistributionFeatureSelectionPipeline(config)
result = pipeline.run(X, y, dataset_name="my_hdlss_profiled_csv", seed=42)

print(f"Profile:           {profile_name}")
print(f"Chosen model:      {result.model_name}")
print(f"Balanced accuracy: {result.balanced_accuracy:.3f}")

3. Multi-omics correction on specific fields

For real multi-omics CSVs with explicit blocks such as transcriptomics, proteomics, or metabolomics, define the block columns yourself and fit the correction/integration step on the training split only. This keeps the workflow leakage-safe.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from tabnetics.multiomics import MINTIntegrator
from tabnetics.pipeline import DistributionFeatureSelectionPipeline, DFFSConfig

df = pd.read_csv("data/my_multiomics_dataset.csv")

label_col = "label"
study_fields = ["study_id", "plate_id"]
transcript_cols = [c for c in df.columns if c.startswith("rna_")]
protein_cols = [c for c in df.columns if c.startswith("prot_")]
clinical_cols = ["age", "stage_code"]

y = df[label_col].to_numpy()
study_labels = (
    df[study_fields[0]].astype(str)
    + "__"
    + df[study_fields[1]].astype(str)
).to_numpy()

idx = np.arange(len(df))
train_idx, test_idx = train_test_split(
    idx,
    test_size=0.20,
    random_state=42,
    stratify=y,
)

train_blocks = [
    (
        df.iloc[train_idx][transcript_cols]
        .apply(pd.to_numeric, errors="coerce")
        .to_numpy(dtype=float),
        "transcriptomics",
    ),
    (
        df.iloc[train_idx][protein_cols]
        .apply(pd.to_numeric, errors="coerce")
        .to_numpy(dtype=float),
        "proteomics",
    ),
]
test_blocks = [
    (
        df.iloc[test_idx][transcript_cols]
        .apply(pd.to_numeric, errors="coerce")
        .to_numpy(dtype=float),
        "transcriptomics",
    ),
    (
        df.iloc[test_idx][protein_cols]
        .apply(pd.to_numeric, errors="coerce")
        .to_numpy(dtype=float),
        "proteomics",
    ),
]

mint = MINTIntegrator(n_components=2)
mint.fit(train_blocks, y[train_idx], study_labels[train_idx])
Z_train = mint.transform(train_blocks, study_labels[train_idx])
Z_test = mint.transform(test_blocks, study_labels[test_idx])

X_train = np.hstack([
    Z_train,
    df.iloc[train_idx][clinical_cols]
    .apply(pd.to_numeric, errors="coerce")
    .to_numpy(dtype=float),
])
X_test = np.hstack([
    Z_test,
    df.iloc[test_idx][clinical_cols]
    .apply(pd.to_numeric, errors="coerce")
    .to_numpy(dtype=float),
])

pipeline = DistributionFeatureSelectionPipeline(DFFSConfig(random_seed=42))
result = pipeline.run_pre_split(
    X_train,
    y[train_idx],
    X_test,
    y[test_idx],
    dataset_name="my_multiomics_dataset",
    seed=42,
)

print(f"Balanced accuracy: {result.balanced_accuracy:.3f}")

If you want explicit multi-block integration without study/batch correction, replace MINTIntegrator with MultiBlockPLSDA. If those block matrices contain missing values, impute them on the training split before calling fit(...) and transform(...).
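
One simple way to impute on the training split only is per-column median imputation: learn the medians from the training block, then reuse them for the test block. The two helper functions below are hypothetical and use only NumPy; they are a sketch of the leakage-safe pattern, not a Tabnetics API.

```python
import numpy as np


def fit_median_imputer(X_train):
    """Learn per-column medians from the training block, ignoring NaNs."""
    medians = np.nanmedian(X_train, axis=0)
    # A column that is entirely NaN has no median; fall back to 0.0.
    return np.where(np.isnan(medians), 0.0, medians)


def apply_median_imputer(X, medians):
    """Return a copy of X with NaNs replaced by the training medians."""
    X = np.array(X, dtype=float, copy=True)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X


X_train_block = np.array([[1.0, np.nan], [3.0, 10.0], [np.nan, 20.0]])
X_test_block = np.array([[np.nan, np.nan], [5.0, 1.0]])

medians = fit_median_imputer(X_train_block)          # uses training rows only
X_train_filled = apply_median_imputer(X_train_block, medians)
X_test_filled = apply_median_imputer(X_test_block, medians)
```

The test block never contributes to the medians, which keeps the imputation step consistent with the rest of the leakage-safe workflow.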

Configuration

DFFSConfig is a dataclass with ~120 fields. Most have sensible defaults. The key groups are:

Core

Parameter	Default	Description
`random_seed`	`42`	Global random seed
`test_size`	`0.20`	Fraction held out for test
`n_final_features`	`50`	Target number of features after selection
`n_jobs`	`1`	Parallelism for FS methods and distribution fitting
`fs_fraction`	`0.40`	Fraction of training data used for feature selection

Distribution fitting

Parameter	Default	Description
`dist_criterion`	`"simple"`	Criterion: `simple`, `cvm_p`, `ks_p`, `aic`, `bic`, `aicc`, `cv`, `cv_loglik`, `crps`, `mnpo_oracle`
`apply_cdf_transform`	`True`	Apply CDF transform after fitting
`df_stage_position`	`"after_fs"`	`before_fs` or `after_fs`
`max_dist_features`	`256`	Max features to distribution-fit (skip rest)
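
As a rough illustration of what a CDF transform does, the sketch below fits a normal distribution to training values and maps both training and test values through the fitted CDF, which places them on the (0, 1) scale. This assumes apply_cdf_transform follows the usual probability integral transform; the normal family is only a stand-in for whatever distribution the selector actually picks.

```python
from statistics import NormalDist

train_values = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2]
test_values = [5.0, 5.6]

# Fit on training values only, then reuse the fitted distribution for test data.
fitted = NormalDist.from_samples(train_values)

u_train = [fitted.cdf(v) for v in train_values]
u_test = [fitted.cdf(v) for v in test_values]

print([round(u, 3) for u in u_train])
print([round(u, 3) for u in u_test])
```

Values near the fitted mean map close to 0.5, and values in the tails map close to 0 or 1.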

Prefilter (Tier 1)

Parameter	Default	Description
`use_rank_prefilter`	`True`	Enable univariate prefilter before FS
`prefilter_top_k`	`600`	Keep top-k features from prefilter
`prefilter_strategies`	`("mi_ftest_blend","rf_importance","wsnr","bh_fdr")`	Prefilter scoring strategies
`batch_correction`	`"none"`	Batch correction: `none`, `combat`, `combat_seq`, `cdf_center`, `center_scale`
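
The bh_fdr strategy name refers to Benjamini-Hochberg false-discovery-rate control. The sketch below shows the standard step-up procedure on a list of p-values in pure Python. It is illustrative only; the dependency table indicates Tabnetics relies on statsmodels for this correction, and how the strategy turns the result into a score is not covered here.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a mask (True = kept as a discovery) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank

    # Everything at or below that rank is kept, even if it failed its own test.
    mask = [False] * m
    for i in order[:cutoff_rank]:
        mask[i] = True
    return mask


p_values = [0.2, 0.001, 0.029, 0.9, 0.025]
bh_mask = benjamini_hochberg(p_values, alpha=0.05)
print(bh_mask)
```

Note the step-up behaviour: 0.025 misses its own threshold (0.02 at rank 2) but is still kept because 0.029 passes at rank 3.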

Screening (Tier 2)

Parameter	Default	Description
`screening_enabled`	`True`	Enable interaction-aware screening
`screening_method`	`"evalue"`	Screening method: `stir`, `evalue`, `none`

Feature selection

Parameter	Default	Description
`enabled_methods`	7-method default stack	Tuple of method keys to run
`fs_portfolio_size`	`5`	MNPO portfolio max candidates
`fs_oracle_weighting_mode`	`"banzhaf"`	Oracle weighting: `tritrust`, `uniform`, `shapley`, `banzhaf`

Classification

Parameter	Default	Description
`model_candidates`	`("lr","svm_rbf","svm_linear","dlda","knn","rf","nb","elastic_net_lr")`	Classifier pool (legacy default; regime pools override this when regime gating is enabled)
`folding_method`	`"pls_da"`	Dimensionality reduction: `none`, `rff`, `tensor_sketch`, `pls_da`
`scaler_mode`	`"standard"`	Input scaling: `standard`, `robust`, `quantile`

Regime-aware classifier pools

When regime_gating_enabled=True, the pipeline ignores model_candidates and uses regime-appropriate pools defined in REGIME_POOLS:

Regime	Pool size	Key families
`hdlss_extreme` (n/p < 0.1)	22	LR, elastic-net LR, SVM-linear, BC-SVM, RP-ensemble, DLDA, shrinkage-LDA, NSC, PLS-DA, NB, DBDA, GQDA, SGLNN, RFF-LR, near-subspace, spatial-median-DA, copula-DA, CPDA, TabM, RealMLP, TabM-official, RealMLP-TD
`hdlss_moderate` (0.1 ≤ n/p < 1)	27	All of extreme + SVM-RBF, GPC, KNN, vote-ensemble, TabPFN
`standard` (n/p ≥ 1)	All available	Full pool including tree models (RF, XGBoost, LightGBM, CatBoost)

Note: TabM-official and RealMLP-TD (tabm_official, realmlp_td) rely on pytabkit, which is included in plain tabnetics and tabnetics[full]. If pytabkit is missing in a custom environment, these paths fail gracefully.
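
The regime thresholds in the table reduce to a simple rule on the sample-to-feature ratio. The function below is a hypothetical restatement of that rule, not the library's implementation. Whether Tabnetics computes n/p on the training split or on the full data is not stated here, so the sketch simply takes both counts as arguments.

```python
def regime_for(n_samples, n_features):
    """Map a dataset shape to the regime names used in the table above."""
    ratio = n_samples / n_features
    if ratio < 0.1:
        return "hdlss_extreme"
    if ratio < 1:
        return "hdlss_moderate"
    return "standard"


print(regime_for(72, 7129))    # shape of leukemia_golub
print(regime_for(150, 600))    # hypothetical moderate case
print(regime_for(5000, 40))    # hypothetical n >> p case
```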

Dual implementations. TabM and RealMLP each have two backends:

tabm / realmlp: numpy-only approximations (zero extra deps, millisecond fit times on HDLSS data)
tabm_official / realmlp_td: official PyTorch implementations from pytabkit (higher fidelity, requires PyTorch)

Both variants are in the regime pools. The oracle evaluates whichever are available and selects the best combination.

Opt-in features

Parameter	Default	Description
`enable_ratio_features`	`False`	Construct log-ratio features (pairs of original features)
`regime_gating_enabled`	`False`	Route datasets to regime-appropriate profiles
`eval_models_enabled`	`True`	Multi-classifier evaluation proxy

Standalone feature selection

Use FeatureSelector directly when you only need feature selection without the full pipeline:

from tabnetics.feature_selection import FeatureSelector

fs = FeatureSelector(
    random_state=42,
    selection_strategy="mnpo_portfolio",
    portfolio_size=6,
    n_folds=5,
    n_bootstrap_iterations=10,
)

X_selected, fs_result = fs.fit_transform(
    X_train, y_train, n_final_features=30, return_result_object=True
)

# Selected feature indices
print(fs_result.selected_feature_indices)

# Per-method results
for method, info in fs_result.method_results.items():
    print(f"{method}: {len(info.get('selected', []))} features")

Configuring the MNPO oracle

from tabnetics.feature_selection import OracleConfig

oracle = OracleConfig.from_preset("full")
# Presets: perf_only, perf_complexity, perf_complexity_stability, full, minimal_cvar

# Or customize:
oracle = OracleConfig(
    weighting_mode="banzhaf",
    diversity_mode="mi_redundancy",
    use_cvar=True,
    cvar_alpha=0.33,
)

Standalone distribution fitting

Fit a single feature to the best parametric distribution:

from tabnetics.distribution.selector import UnifiedDistributionSelectorV6

selector = UnifiedDistributionSelectorV6(robust_mode=True, n_jobs=1)
best_name, best_fit, all_fits = selector.select_best_distribution(
    data_column,          # 1-D numpy array
    criterion="simple",   # or cvm_p, ks_p, aic, bic, cv, crps, mnpo_oracle
)

print(f"Best distribution: {best_name}")
print(f"Parameters: {best_fit.params}")
print(f"KS p-value: {best_fit.ks_pvalue:.4f}")

Available criteria:

Criterion	Description
`simple`	Fast prescreening with KS p-value and CVM ranking (default)
`cvm_p`	Cramér–von Mises p-value
`ks_p`	Kolmogorov–Smirnov p-value
`aic` / `bic` / `aicc`	Information criteria
`cv` / `cv_loglik`	Cross-validated log-likelihood
`crps`	Continuous ranked probability score
`mnpo_oracle`	Game-theoretic multi-criterion selection
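
For reference, the three information criteria follow the textbook formulas AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L, and AICc = AIC + 2k(k+1)/(n-k-1), where k is the number of fitted parameters and n the sample size. The sketch below evaluates them for a normal fit; it illustrates the formulas only and makes no claim about how the selector computes or ranks them internally.

```python
import math


def gaussian_log_likelihood(data):
    """Log-likelihood of a normal fit at its maximum-likelihood parameters."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # MLE variance divides by n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def information_criteria(log_lik, k, n):
    """Return AIC, BIC, and small-sample corrected AICc (needs n > k + 1)."""
    aic = 2 * k - 2 * log_lik
    bic = k * math.log(n) - 2 * log_lik
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    return {"aic": aic, "bic": bic, "aicc": aicc}


sample = [2.1, 1.9, 2.4, 2.0, 1.8, 2.2, 2.3, 1.7]
log_lik = gaussian_log_likelihood(sample)
criteria = information_criteria(log_lik, k=2, n=len(sample))  # k=2: mean and variance
print(criteria)
```

Lower values are better for all three. AICc adds a penalty that matters most when n is small relative to k, which is the usual situation in HDLSS work.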

Running benchmarks

Command line

# Run on a specific dataset
tabnetics-benchmark --datasets leukemia_golub --seeds 11 23 37

# Run on a named dataset group
tabnetics-benchmark --dataset-sets fs_easy --max-workers 8

# Run with a specific method profile
tabnetics-benchmark --datasets leukemia_golub dlbcl_shipp \
    --fs-method-set mnpo_broad_stable \
    --seeds 42

# Full benchmark with distribution fitting diagnostics
tabnetics-benchmark --dataset-sets core \
    --dist-criterion simple \
    --df-stage-position after_fs \
    --compute-budget standard \
    --max-workers 16 \
    --seeds 11 23 37

Key CLI flags

Flag	Default	Description
`--datasets`	—	Space-separated dataset IDs
`--dataset-sets`	—	Named groups: `core`, `fs_easy`, `fs_medium`, `fs_hard`, `smoke`
`--seeds`	`11 23 37`	Random seeds for repeated runs
`--max-workers`	`1`	Parallel dataset workers
`--test-size`	`0.20`	Test split fraction
`--fs-method-set`	—	Named method profile (see profiles)
`--dist-criterion`	`simple`	Distribution fitting criterion
`--df-stage-position`	`after_fs`	`before_fs` or `after_fs`
`--dataset-integrity-policy`	`error`	Fail, skip, or fallback when class-diversity sanity checks fail
`--compute-budget`	`standard`	`fast`, `standard`, `thorough`
`--prefilter-top-k`	`600`	Prefilter feature count
`--screening-enabled`	`False`	Enable Tier-2 screening
`--screening-method`	`none`	`stir`, `evalue`, `none`
`--eval-models-enabled`	`False`	Multi-classifier evaluation proxy
`--multiomics-adapter`	`none`	Benchmark-time shortcut for synthetic block construction (`split_halves`)
`--enable-classifier-conformal`	`False`	Turn on classifier-side conformal diagnostics
`--classifier-conformal-method`	`split`	`split`, `aps`, `raps`, or `cross`
`--task-timeout-sec`	`300`	Per-dataset timeout
`--quiet-worker-logs`	`False`	Suppress worker output
`--enable-nestedcv-audit`	`False`	Nested CV robustness audit

Programmatic

import sys

from tabnetics.benchmarks.cli import main

# Equivalent to the CLI invocation: set the arguments, then call the entry point
sys.argv = ["tabnetics-benchmark", "--datasets", "leukemia_golub", "--seeds", "42"]
main()

Validation-catalog benchmark datasets are not allowed to silently fall back to synthetic proxies. When those datasets are selected, the benchmark CLI requires the authoritative HuggingFace bundle via TABNETICS_HF_ORG or TABNETICS_HF_REPO_ID. That bundle is an operational mirror of the public upstream datasets rather than a separate private corpus.
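
If you launch benchmark runs from a script, it can help to check for the bundle variables before starting a long job. The helper below is a hypothetical pre-flight check using only the standard library; it tests that at least one of the two variables is set and does not validate the values.

```python
import os

BUNDLE_VARS = ("TABNETICS_HF_ORG", "TABNETICS_HF_REPO_ID")


def hf_bundle_configured(env=None):
    """True if at least one bundle variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in BUNDLE_VARS)


if not hf_bundle_configured():
    print("Set TABNETICS_HF_ORG or TABNETICS_HF_REPO_ID before running validation-catalog datasets.")
```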

Validation campaigns

Tabnetics ships three packaged validation surfaces:

tabnetics-validation-plan / python -m tabnetics.validation.generate_plan
tabnetics-validation-shard / python -m tabnetics.validation.core.shard_runner
tabnetics-validation-suite / python -m tabnetics.validation.suite

Typical workflow:

export TABNETICS_HF_ORG=klokedm
export TABNETICS_HF_REPO_ID=klokedm/tabnetics-validation

# 1. Build a sharded campaign plan.
tabnetics-validation-plan --plan-kind validation17 --num-pods 4

# 2. Run a small local slice with the unified suite.
tabnetics-validation-suite \
    --dataset-sets fs_easy \
    --seeds 11 23 37 \
    --dataset-integrity-policy error \
    --output-dir run_artifacts/validation/unified_suite

# 3. Execute one shard from a generated plan.
tabnetics-validation-shard \
    --shard-id 1 \
    --plan pod_validation/plan_4_val17.json \
    --shards pod_validation/shards_4_val17.json

Use the suite for smaller slices or component ablations. Use the plan + shard flow when you want a reproducible multi-job campaign across a larger benchmark catalog.

Datasets

Tabnetics ships with a registry of 70+ HDLSS benchmark datasets. Operationally, many validation-catalog runs are loaded through the HuggingFace bundle, but the underlying data come from public upstream sources such as OpenML, GEO, CuMiDa, UCSC Xena, public HuggingFace datasets, and Orange / University of Ljubljana (biolab.si) mirrors where applicable.

Registry

from tabnetics.datasets import CATALOG, DATASET_SETS

# List all registered datasets
for ds_id, spec in CATALOG.items():
    print(f"{ds_id}: {spec.display_name} ({spec.n_samples}×{spec.n_features}, {spec.n_classes} classes)")

# List named dataset groups
print(list(DATASET_SETS.keys()))
# ['all', 'smoke', 'core', 'extended', 'fs_all', 'fs_easy', 'fs_medium', ...]

Dataset groups

Group	Description
`smoke`	3 datasets for quick sanity checks
`core`	All non-extended datasets
`extended`	Full catalog including CuMiDa and TCGA
`fs_easy` / `fs_medium` / `fs_hard` / `fs_very_hard`	FS pipeline datasets by difficulty tier

Example datasets

ID	Name	Samples	Features	Classes
`leukemia_golub`	Leukemia (Golub)	72	7,129	2
`dlbcl_shipp`	DLBCL (Shipp)	77	5,469	2
`ovarian_petricoin`	Ovarian Cancer (Petricoin)	253	15,154	2
`srbct_khan`	SRBCT (Khan)	83	2,308	4
`prostate_singh`	Prostate Cancer (Singh)	102	12,600	2
`carcinom_11class`	Carcinom 11-class	174	9,182	11
`nci9_60_9class`	NCI9	60	9,712	9
`gla_bra_180`	GLA-BRA-180	180	49,151	4

Reproducibility and data source policy

The public Pages site reflects the packaged implementation, not earlier draft workflows. The operational rules to keep in mind are:

df_stage_position="after_fs" is the current default in the packaged pipeline and benchmark CLI.
Validation-catalog and evidence-bearing benchmark runs treat the HuggingFace bundle as the authoritative reproducibility mirror of the public upstream sources.
The benchmark CLI and validation suite default to --no-synthetic-fallback; if you explicitly re-enable fallback for non-evidence workflows, do so knowingly and record the source policy in your artifacts.
dataset_integrity_policy="error" is the default so mislabeled or collapsed class structures fail fast instead of silently contaminating results.
The built-in multiomics_adapter="split_halves" is intended for benchmark-time stress tests; use explicit omics blocks with MultiBlockPLSDA or MINTIntegrator on real data.

If you are preparing a public benchmark or a paper-facing validation run, keep the HF bundle, no-synthetic-fallback, and integrity-error settings unchanged.

Uncertainty and conformal outputs

Classifier-side conformal prediction is available as an opt-in diagnostic layer:

tabnetics-benchmark \
    --datasets leukemia_golub \
    --enable-classifier-conformal \
    --classifier-conformal-method aps \
    --classifier-conformal-output-sets

Interpret those outputs as uncertainty diagnostics:

coverage and mean prediction-set size
singleton rate / compactness of the prediction sets
optional per-sample prediction sets when artifact size is acceptable

They are not expected to improve balanced accuracy. This follows the MAPIE interpretation in Taquet et al. 2022: conformal wrappers calibrate prediction sets around a fixed base classifier. For singleton-oriented efficiency and reject-option interpretations, see Wang, Sun & Dobriban 2025 and Hallberg Szabadváry et al. 2025.
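
To show how these diagnostics relate to a fixed base classifier, here is a minimal split-conformal sketch in pure Python. It uses the common score s = 1 - p(true class). Whether the packaged split method uses exactly this score is an assumption, and all function names are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.2):
    """Finite-sample quantile of calibration scores s = 1 - p(true class)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every class is kept
    return scores[rank - 1]


def prediction_sets(test_probs, threshold):
    """Keep every class whose score 1 - p does not exceed the threshold."""
    return [{c for c, p in enumerate(probs) if 1.0 - p <= threshold} for probs in test_probs]


def summarize(sets, labels):
    """Coverage, mean set size, and singleton rate of the prediction sets."""
    n = len(sets)
    return {
        "coverage": sum(y in s for s, y in zip(sets, labels)) / n,
        "mean_set_size": sum(len(s) for s in sets) / n,
        "singleton_rate": sum(len(s) == 1 for s in sets) / n,
    }


# Class probabilities from an already-trained classifier (made-up numbers).
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05], [0.3, 0.7], [0.6, 0.4],
             [0.15, 0.85], [0.7, 0.3], [0.4, 0.6], [0.8, 0.2]]
cal_labels = [0, 1, 0, 1, 0, 1, 1, 0, 1]
test_probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.4, 0.6]]
test_labels = [1, 1, 1, 0]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
sets = prediction_sets(test_probs, threshold)
report = summarize(sets, test_labels)
print(sets, report)
```

The classifier's point predictions never change in this procedure, which is why the wrapper affects set size and coverage rather than balanced accuracy.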

Oracle presets

The MNPO oracle can be configured with presets via OracleConfig.from_preset():

Preset	Oracles	Use case
`perf_only`	Performance	Fastest; single-criterion selection
`perf_complexity`	Performance + Complexity	Prefer simpler feature sets
`perf_complexity_stability`	Performance + Complexity + Stability	Add bootstrap stability
`full`	Performance + Stability + Complexity + Robust + Diversity	Production default — all 5 oracles
`minimal_cvar`	Performance + CVaR	Tail-risk focus for small datasets
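
For intuition on the CVaR option, the sketch below computes a lower-tail conditional value at risk over cross-validation scores: the mean of the worst alpha fraction of folds. The exact definition Tabnetics uses is not spelled out here, so treat the rounding rule and the function name as assumptions.

```python
import math


def cvar_lower_tail(scores, alpha=0.33):
    """Mean of the worst ceil(alpha * n) scores, where higher scores are better."""
    k = max(1, math.ceil(alpha * len(scores)))
    worst = sorted(scores)[:k]
    return sum(worst) / k


fold_scores = [0.90, 0.80, 0.60, 0.70, 0.95, 0.50]
mean_score = sum(fold_scores) / len(fold_scores)
tail_score = cvar_lower_tail(fold_scores, alpha=0.33)  # averages the two worst folds
print(round(mean_score, 3), round(tail_score, 3))
```

A candidate with a good mean but a poor tail score is one that fails badly on some folds, which is the risk a tail-focused preset is meant to penalize on small datasets.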

Feature selection methods

The registry contains 35+ methods, grouped here by paradigm:

Stability

Key	Label
`stability_lasso`	Stability Selection (Lasso)
`stability_subsample`	Stability Selection (complementary subsampling)
`tigress_stability`	TIGRESS-style Stability Selection
`subspace_stability`	Subspace Stability Selection
`decorrelated_stability`	Decorrelated Stability Selection
`ipss`	Integrated Path Stability Selection (IPSS)
`cluster_stability`	Cluster Stability Selection

Wrapper

Key	Label
`rfecv`	Recursive Feature Elimination
`boruta`	Boruta
`iterative_redundancy_pruning`	Iterative redundancy-pruning wrapper
`iterative_redundancy_pruning_bounded`	Iterative redundancy-pruning wrapper (runtime-bounded)

Filter

Key	Label
`mutual_information`	Mutual Information
`anova_f`	ANOVA F-test
`chi_square`	Chi-Square univariate filter
`relieff`	ReliefF instance-based filter
`fcbf`	FCBF correlation-based filter
`cmim`	CMIM conditional MI filter
`hsic_lasso`	HSIC Lasso-style kernelized selection
`mrmr_jmi`	mRMR/JMI redundancy-aware selection

Embedded

Key	Label
`gradient_boosting`	Gradient Boosting
`linear_svm`	Linear SVM
`treeshap`	TreeSHAP embedded selector
`oaenet`	OAENet adaptive elastic-net selector
`slce_centroid_encoder`	SLCE centroid-encoder selection
`group_sparse_lasso`	Group sparse lasso

Pairwise / AUC

Key	Label
`wmw_auc`	WMW univariate AUC filter
`joint_auc_l1`	Joint AUC-aware L1 selector (binary only)
`ktsp`	k-TSP pairwise rank selection

Knockoff

Key	Label
`copula_knockoff`	Copula knock-off selection

Multiclass

Key	Label
`ova_ensemble`	OVA multiclass ensemble selection
`ecoc_class_aware`	ECOC class-aware decomposition selection
`joint_multiclass_support`	Joint multiclass shared-support selection
`dove_class_specific`	DOvE-style class-specific multiclass selection
`sparse_multinomial`	Sparse multinomial multiclass selection
`nearest_shrunken_centroid`	Nearest shrunken centroids multiclass selection
`class_pareto_front`	Class-specific Pareto-front multiclass selection
`sir_sdr` / `save_sdr` / `pfc_sdr`	Sufficient dimension reduction selectors

Method profiles

Named method sets for benchmark runs. Use with --fs-method-set <profile> on the CLI.

Profile	Methods	Notes
`strict_plus_mrmr`	GB, SVM, MI, ANOVA, mRMR	5-method baseline
`strict_plus_mrmr_auc`	Baseline + WMW AUC	6-method
`mnpo_copula_extended`	Baseline + copula knockoff	Knockoff expansion
`mnpo_ipss_extended`	Baseline + IPSS	Stability expansion
`mnpo_broad_stable`	14 production-safe methods	Adds Boruta, copula knockoff, decorrelated stability, ReliefF, stability lasso, RFECV, HSIC lasso
`mnpo_v14_core`	15 methods	broad_stable + joint multiclass support
`mnpo_v14_core_plus_ipss`	16 methods	v14_core + IPSS
`mnpo_broad_all`	36 methods	Exhaustive — all non-deprecated selectors

TabArena pipeline profiles

Pipeline-level profiles for the TabArena general tabular benchmark. Use with --profile <name> via python -m experiments.benchmarking.tabarena_benchmark.

Profile	Classifier pool	Key settings	Best for
`hdlss`	LR, SVM-RBF, RF, KNN, elastic-net, NB	CDF transform on, HDLSS screening/folding active	HDLSS baseline (p » n)
`general`	LR, SVM-RBF, RF, KNN, elastic-net, NB, XGBoost, LightGBM	CDF off, FLAML-backed MNPO hybrid selection, 10-candidate cap	Moderate general tabular
`general_full` (default)	16 classifiers (full surface excl. TabPFN)	CDF off, FLAML-backed MNPO, no runtime cap, tritrust oracle k=3, ensemble, post-Val-17 FS defaults	Broad general tabular
`general_tabular`	12 tree-weighted classifiers (RF, ExtraTrees, XGBoost, LightGBM, CatBoost, LR, elastic-net, SVM-linear, SVM-RBF, KNN, NB, copula-DA)	CDF off, legacy selection with FLAML tuning, no screening/folding, adaptive FS fraction by N/p ratio, no HDLSS machinery	N » p datasets (many samples, few features)

The general_tabular profile adapts feature selection by the sample-to-feature ratio:

N/p > 100 or p ≤ 20: fs_fraction=1.0 (keep all features through FS)
N/p > 10 or p ≤ 50: fs_fraction=0.90
Otherwise: fs_fraction=0.50
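
The three rules above are evaluated in order, so they can be restated as a short function. This is a hypothetical restatement for illustration, not the profile's actual code.

```python
def adaptive_fs_fraction(n_samples, n_features):
    """Return the fs_fraction tier implied by the N/p rules listed above."""
    ratio = n_samples / n_features
    if ratio > 100 or n_features <= 20:
        return 1.0
    if ratio > 10 or n_features <= 50:
        return 0.90
    return 0.50


for n, p in [(5000, 30), (1000, 15), (2000, 100), (300, 40), (500, 200)]:
    print(n, p, adaptive_fs_fraction(n, p))
```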

The prefilter is activated only when p > 200, with BH correction disabled per Val-18 P03 evidence. FS still uses the post-Val-17 Banzhaf-weighted defaults. Classifier selection stays on the legacy FLAML path, because the strongest MNPO-collapse evidence so far is specific to HDLSS/small-sample data rather than general tabular data.

Classifier-side conformal uses APS (per Val-18 C07 evidence). Screening and folding are always off, since both are HDLSS-specific. HDLSS-oriented classifiers (GPC, PLS-DA, NSC, vote_ensemble, shrinkage_LDA) are excluded in favour of tree ensembles, with copula_da kept as an extra generative family.

Multi-omics

Tabnetics includes DIABLO-style multi-block PLS-DA and MINT batch correction for multi-omics integration:

from tabnetics.multiomics import MultiBlockPLSDA

# X_blocks: list of arrays, one per omics layer
# y: shared class labels
model = MultiBlockPLSDA(n_components=3)
model.fit(X_blocks, y)
scores = model.transform(X_blocks)

For the full pipeline, multiomics_adapter="split_halves" is the built-in benchmark-style shortcut: it derives two blocks from the first and second half of the feature matrix. For real CSVs with named omics fields, prefer the explicit field-based pattern in Your own CSV files and build the blocks yourself with MultiBlockPLSDA or MINTIntegrator.
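
The split_halves idea can be pictured as a column slice of the feature matrix. The sketch below is a hypothetical reimplementation with NumPy; in particular, sending the extra column to the second block when the feature count is odd is an assumption, not documented behaviour.

```python
import numpy as np


def split_halves(X):
    """Split the columns of X into two pseudo-omics blocks."""
    half = X.shape[1] // 2
    return [X[:, :half], X[:, half:]]


X_demo = np.arange(6 * 5, dtype=float).reshape(6, 5)  # 6 samples, 5 features
block_a, block_b = split_halves(X_demo)
print(block_a.shape, block_b.shape)
```

Blocks built this way have no biological meaning, which is why the shortcut suits benchmark-time stress tests rather than real multi-omics data.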

For the broader modeling context, see Cai et al. 2022, which reviews why multi-omics integration remains a meaningful lever for cancer classification even when small-sample settings make the engineering path harder.

See tabnetics.multiomics for full API.

License

Apache 2.0 — see LICENSE.

Documentation and webpages on this site are generated from authoritative internal sources using a combination of deterministic rules and generative AI. Errors are possible. Please report issues via GitHub Discussions or by email.