Alzheimer's Multi-Omics Biomarker Discovery

Role: Statistical Research Assistant, Columbia University Irving Medical Center — Taub Institute · Stack: R, Cox / GEE, sPLS, Lasso/Ridge/RF/SVM/GBM, PCA/K-means/DBSCAN

A graduate-level biostatistics project on large-scale multi-omics analysis for Alzheimer’s disease biomarker discovery.

Highlights

Integrated genomic, metabolomic, and clinical data to surface 13 key biomarkers (p < 0.01) from ~3,000 metabolites, and uncovered a confounder that had gone undetected for eight months.
Handled a high-dimensional, small-sample setting (~3,000 variables, far more than the number of samples) with missing-data treatment tailored to the MCAR, MAR, or MNAR mechanism, and a three-stage validation workflow (EDA → statistics → ML).
Compared 10+ ML algorithms and selected sPLS, which reached 84% classification accuracy while remaining interpretable; built 20-year onset-risk models with Cox proportional hazards regression and family-based GEE; and characterized GWAS and metabolite associations.
Research competition top-3; Chair’s Award; full-time offer from the Taub Institute.
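
The missingness-aware treatment mentioned in the highlights can be illustrated with a minimal Python sketch. The project lists R as its stack, so this is not its actual code, and the rules here are assumptions chosen for illustration: features missing in more than half of the samples are dropped, features flagged as falling below a detection limit (a common MNAR pattern in metabolomics) get half-minimum imputation, and everything else is treated as MCAR/MAR and gets median imputation. The threshold, the `below_lod` flag, the function names, and the toy data are all hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional, Set

Column = List[Optional[float]]  # None marks a missing measurement


def missing_fraction(values: Column) -> float:
    """Share of samples with no recorded value for this feature."""
    return sum(v is None for v in values) / len(values)


def impute_feature(values: Column, below_lod: bool,
                   max_missing: float = 0.5) -> Optional[List[float]]:
    """Return an imputed copy of one feature, or None if it should be dropped.

    below_lod=True is a hypothetical flag meaning the gaps are believed to be
    values under the detection limit (MNAR); otherwise MCAR/MAR is assumed.
    """
    observed = [v for v in values if v is not None]
    if not observed or missing_fraction(values) > max_missing:
        return None  # too sparse to impute credibly
    # MNAR (left-censored): half of the smallest observed value,
    # which assumes measurements are positive concentrations.
    # MCAR/MAR: the median of the observed values.
    fill = min(observed) / 2 if below_lod else median(observed)
    return [fill if v is None else v for v in values]


def treat_missing(table: Dict[str, Column],
                  lod_features: Set[str]) -> Dict[str, List[float]]:
    """Apply impute_feature to every column, dropping the sparse ones."""
    cleaned = {}
    for name, values in table.items():
        imputed = impute_feature(values, below_lod=name in lod_features)
        if imputed is not None:
            cleaned[name] = imputed
    return cleaned


# Toy data: three made-up metabolite columns across four samples.
toy = {
    "met_a": [2.0, None, 4.0, 6.0],    # treated as MCAR/MAR: median fill
    "met_b": [0.8, 1.2, None, 1.0],    # treated as MNAR: half-minimum fill
    "met_c": [None, None, None, 5.0],  # 75% missing: dropped
}
cleaned = treat_missing(toy, lod_features={"met_b"})
```

Single-value imputation like this understates uncertainty, so a real analysis would more likely use multiple imputation or a model-based approach for the MAR case. The sketch only shows why the assumed mechanism changes the treatment.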

Approach

A staged pipeline integrates three data modalities, treats missingness explicitly, and validates findings from exploratory analysis through statistics to machine learning.

flowchart TB
    G[Genomic] --> INT[Multi-omics integration]
    M[Metabolomic] --> INT
    C[Clinical] --> INT
    INT --> MISS[Missing-data treatment<br/>MCAR / MAR / MNAR]
    MISS --> VAL[EDA, statistics, ML validation]
    VAL --> SEL[10+ models compared<br/>sPLS selected, 84% accuracy]
    SEL --> OUT[13 biomarkers<br/>+ 20-year risk models]
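
The integration step at the top of the flowchart can be sketched as a join of the three modalities on a shared sample ID. This is a minimal Python illustration, not the project's R code. It assumes each modality is a mapping from sample ID to feature values and that only samples present in every modality are kept (an inner join). The function name `integrate`, the feature names, and the sample IDs are hypothetical.

```python
from typing import Dict

Modality = Dict[str, Dict[str, float]]  # sample ID -> {feature name: value}


def integrate(modalities: Dict[str, Modality]) -> Dict[str, Dict[str, float]]:
    """Inner-join modalities on sample ID, prefixing features by modality."""
    if not modalities:
        return {}
    # Keep only samples measured in every modality.
    shared = set.intersection(*(set(m) for m in modalities.values()))
    merged = {}
    for sample_id in sorted(shared):
        row = {}
        for mod_name, table in modalities.items():
            for feature, value in table[sample_id].items():
                # The prefix avoids name collisions across modalities.
                row[f"{mod_name}.{feature}"] = value
        merged[sample_id] = row
    return merged


# Made-up toy inputs: S3 lacks metabolomic data, S4 has clinical data only.
genomic = {"S1": {"snp_1": 1.0}, "S2": {"snp_1": 0.0}, "S3": {"snp_1": 2.0}}
metabolomic = {"S1": {"met_a": 3.2}, "S2": {"met_a": 2.7}}
clinical = {"S1": {"age": 71.0}, "S2": {"age": 68.0}, "S4": {"age": 80.0}}

merged = integrate({
    "genomic": genomic,
    "metabolomic": metabolomic,
    "clinical": clinical,
})
```

An inner join discards any sample missing a whole modality, which shrinks an already small sample. Keeping those samples and handling the gaps in the missing-data stage is the alternative design, at the cost of more imputation.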