vbpca-py | Joshua C. Macdonald

vbpca-py is a variational Bayesian PCA framework for incomplete data. It jointly infers latent components, noise variance, and effective dimensionality while propagating full posterior uncertainty (covariances on loadings, scores, and bias) through the entire estimation pipeline.

Key features:

Native per-entry missingness handling via shared observation patterns that reuse matrix factorizations — no imputation required
Automatic Relevance Determination (ARD) prunes uninformative components; a built-in model-selection sweep identifies the optimal rank
Missing-aware preprocessing pipeline: one-hot encoding, scaling, power transforms, winsorization
C++-accelerated kernels via pybind11 with runtime autotuning for performance-critical updates
Full scikit-learn estimator API (fit, transform, score)
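
As a rough illustration of the shared-observation-pattern idea from the first feature above, the sketch below groups rows by their missingness mask and computes one Cholesky factorization per distinct pattern instead of one per row. It is not vbpca-py's implementation: the function name, the argument layout, and the simplification that loadings are point estimates (no loading covariance) are all assumptions made for this example.

```python
import numpy as np


def scores_by_pattern(Y, W, mu, noise_var):
    """Posterior means of latent scores, one factorization per missingness pattern.

    Hypothetical helper, not part of vbpca-py. Y is (n_samples, n_features)
    with NaN marking missing entries, W is (n_features, k) point-estimate
    loadings, mu is the (n_features,) bias, noise_var is a positive scalar.
    """
    mask = ~np.isnan(Y)
    k = W.shape[1]
    means = np.zeros((Y.shape[0], k))

    # Group row indices by their observation pattern (the boolean mask).
    groups = {}
    for i, row_mask in enumerate(mask):
        groups.setdefault(row_mask.tobytes(), []).append(i)

    for rows in groups.values():
        obs = mask[rows[0]]          # observed features shared by this group
        W_o = W[obs]                 # loadings restricted to observed features
        # With a standard-normal prior on scores, the posterior precision
        # depends only on which features are observed, so it is shared.
        precision = np.eye(k) + W_o.T @ W_o / noise_var
        chol = np.linalg.cholesky(precision)   # factorized once per pattern
        resid = Y[rows][:, obs] - mu[obs]
        rhs = W_o.T @ resid.T / noise_var
        # Solve (L L^T) x = rhs for every row in the group at once.
        z = np.linalg.solve(chol, rhs)
        means[rows] = np.linalg.solve(chol.T, z).T

    return means, len(groups)
```

The second return value is the number of factorizations performed, which equals the number of distinct missingness patterns rather than the number of rows. Missing entries are simply left out of each pattern's system, so no imputed values enter the computation.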

vbpca-py has been applied to genetic, cultural, and ecological datasets. It is available on PyPI and archived on Zenodo.

The companion paper (Macdonald et al., 2024) develops the Bayesian rank estimation methodology using posterior predictive eigenvalue testing.

References