vbpca-py
Variational Bayesian PCA for incomplete data with full posterior uncertainty, automatic component pruning, native missingness handling, and C++-accelerated kernels.
vbpca-py is a variational Bayesian PCA framework for incomplete data. It jointly infers latent components, noise variance, and effective dimensionality while propagating full posterior uncertainty (covariances on loadings, scores, and bias) through the entire estimation pipeline.
Key features:
- Native per-entry missingness handling: samples that share an observation pattern reuse the same matrix factorizations, so no imputation is required
- Automatic Relevance Determination prunes uninformative components; built-in model selection sweep identifies optimal rank
- Missing-aware preprocessing pipeline: one-hot encoding, scaling, power transforms, winsorization
- C++-accelerated kernels via pybind11 with runtime autotuning for performance-critical updates
- Full scikit-learn estimator API (`fit`, `transform`, `score`)
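
The shared-observation-pattern idea in the feature list can be illustrated with a small NumPy sketch. This is not the vbpca-py implementation or its API: the function names `group_by_pattern` and `posterior_mean_scores` are hypothetical, and the model is simplified (fixed loadings `W`, zero bias, unit Gaussian prior on scores, no posterior covariances). It only shows why grouping samples by which features they observe lets one linear solve serve every sample in the group.

```python
import numpy as np


def group_by_pattern(X):
    """Group column (sample) indices of X by their pattern of observed rows.

    Hypothetical helper: X is (n_features, n_samples), NaN marks a missing entry.
    Returns a list of (observed_rows_mask, column_indices) pairs.
    """
    observed = ~np.isnan(X)
    groups = {}
    for j in range(X.shape[1]):
        key = observed[:, j].tobytes()  # hashable fingerprint of the pattern
        if key not in groups:
            groups[key] = (observed[:, j].copy(), [])
        groups[key][1].append(j)
    return list(groups.values())


def posterior_mean_scores(X, W, noise_var):
    """Posterior-mean scores with one linear system per observation pattern.

    Simplifying assumptions: loadings W (n_features, k) are fixed, the bias is
    zero, and scores have a standard normal prior.
    """
    k = W.shape[1]
    S = np.zeros((k, X.shape[1]))
    for rows, cols in group_by_pattern(X):
        W_obs = W[rows]  # loadings restricted to the observed features
        A = W_obs.T @ W_obs + noise_var * np.eye(k)
        # A is built once and solved against all columns sharing this pattern.
        S[:, cols] = np.linalg.solve(A, W_obs.T @ X[np.ix_(rows, cols)])
    return S


# Toy data: 3 features, 4 samples; samples 0 and 2 are missing feature 1.
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [np.nan, 5.0, np.nan, 6.0],
    [7.0, 8.0, 9.0, 10.0],
])
W = np.random.default_rng(0).normal(size=(3, 2))
S = posterior_mean_scores(X, W, noise_var=0.5)
```

With four samples but only two distinct patterns, the loop builds and solves two systems instead of four; the saving grows with the number of samples per pattern.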
vbpca-py has been applied to genetic, cultural, and ecological datasets. It is available on PyPI and archived at Zenodo.
The companion paper (Macdonald et al., 2024) develops the Bayesian rank estimation methodology using posterior predictive eigenvalue testing.