1-2. Estimate single-cell density at observed time point. (Optional)¶

Pseudodyanmics+ leverage a concpet called Physics-Informed Neural Networks (PINN) which can solve the forward and inverse problem of PDE. We applies the continuous PINN which requires:

the model correctly predict density at observed time points and data points
the model follows the PDE (physics formular) anywhere (collocation point : unobserved cell state) and anytime (collocation time : unobserved time point)

In this notebook, we will go through several density estimator that could served as the label for solving PINN.

Gaussian KDE
Mellon (the latest)
HBE (scalabl)

Remember that density estimation in high-dimensional space is not retrival. None of the method above captured the true density as the true manifold of the single-cell data is never known.

%load_ext autoreload
%autoreload 2

import os, sys
if sys.platform.startswith("darwin"):
    os.environ['KMP_DUPLICATE_LIB_OK']='True'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scanpy as sc
sc.settings.set_figure_params(frameon=False, dpi=30)

import pseudodynamics as pdp
os.chdir(pdp.main_dir)
print("workding directory changed to:", pdp.main_dir)

workding directory changed to: /Users/weizhongzheng/Documents/python_project/pseudodynamics_plus

adata = sc.read_h5ad('data/tom_pos.h5ad')
adata

AnnData object with n_obs × n_vars = 49390 × 4814
    obs: 'n_genes', 'n_counts', 'mt_count', 'mt_frac', 'doublet_scores', 'predicted_doublets', 'xist_logn', 'Ygene_logn', 'xist_bin', 'Ygene_bin', 'sex_adata', 'biosample_id', 'cellid', 'RBG', 'SLXid', 'index', '10xsample_description', 'sex_mixed', 'sex_meta', 'mouse_id', 'sortedcells', 'expected_cells_10x', 'cellranger_cellsfound', 'chemistry', 'tom', 'expdate', 'batch', 'timepoint_tx_days', 'start_age', 'sample_id', 'countfile', 'S_score', 'G2M_score', 'phase', 'leiden', 'SLX', 'plate_sorted', 'plate_rearranged', 'well_sorted', 'well_rearranged', 'set_index', 'CI_index', 'mouse_platelabel', 'sort_method', 'sample.name', 'population', 'sex', 'countfolder', 'batch_plate_sorted', 'data_type', 'sex_combined', 'longname', 'anno_man', 'leiden_DM', 'HSCscore', 'nn_HSCscore', 'isroot', 'dpt_pseudotime', 'leiden_orig', 'logk', 'net_prolif', 'log10SR', 'log_density_at_E3', 'log_density_at_E7', 'log_density_at_E12', 'log_density_at_E12_clip', 'log_density_at_E3_clip', 'log_density_at_E7_clip', 'log_density_at_E27', 'log_density_at_E27_clip', 'log_density_at_E49', 'log_density_at_E49_clip', 'log_density_at_E76', 'log_density_at_E76_clip', 'log_density_at_E112', 'log_density_at_E112_clip', 'log_density_at_E161', 'log_density_at_E161_clip', 'log_density_at_E269', 'log_density_at_E269_clip', 'pseudo_bulk', 'rs50', 'split', 'fine_anno'
    var: 'symbol', 'gene_ids-10x', 'feature_types-10x', 'genome-10x', 'n_counts-10x', 'highly_variable-10x', 'means-10x', 'dispersions-10x', 'dispersions_norm-10x', 'gene_removed-10x', 'mean-10x', 'std-10x', 'geneid-SS2', 'chromosome-SS2', 'is_mito-SS2', 'highly_variable', 'mean', 'std'
    uns: 'DM', 'DM_EigenValues', 'anno_man_colors', 'biosample_id_colors', 'data_type_colors', 'density_predictor', 'diffmap_evals', 'fine_anno_colors', 'iroot', 'isroot_colors', 'leiden', 'leiden_DM_colors', 'leiden_DM_sizes', 'leiden_colors', 'leiden_sizes', 'neighbors', 'original_pop', 'paga', 'pagaDM', 'pagaPCA', 'pca', 'phase_colors', 'pop', 'predicted_doublets_colors', 'pseudo_bulk', 'pseudo_bulk_settings', 'rank_genes_groups', 'rs50', 'umap_init_pos', 'umap_init_pos_DM'
    obsm: 'DM_EigenVectors', 'DM_scaled', 'Delta_DM', 'TIGON_u', 'X_diffmap', 'X_pca', 'X_pca_harmony', 'X_pca_scaled', 'X_umap', 'X_umap_2d', 'X_umap_3d', 'X_umap_DM_2d', 'X_umap_DM_3d', 'mellon_u', 'surrogate_u'
    varm: 'PCs'
    obsp: 'DM_Kernel', 'DM_Similarity', 'DM_connectivities', 'DM_distances', 'connectivities', 'distances'

1. Gaussian KDE¶

This is the conventional density estimation method. In the manuscript, we benchmarked several methods, and this traditional Gaussian KDE works stably and can reproduce the shift in dense cell states along the differrentaition trajectory.

Gaussian KDE is used by default when runing the training script main_train.py. Two args can be passed to adjust

--bw : the kernel size
--knn_volume : scale density by a cell’s KNN volumn (unstable).

# list cell by time
obs = adata.obs
CB_ay = [obs.query('`timepoint_tx_days` == @t').index for t in adata.uns['pop']['t']]


# cell state coordiate

cellstate_key = "DM_scaled"
n_dim = 5

# scale to 0-1
DM_x = adata.obsm[cellstate_key][:,:n_dim]

DM_min = DM_x.min(axis=0, keepdims=True)
DM_x_range = (DM_x.max(axis=0, keepdims=True) - DM_min)

# update 
norm_DMx = (DM_x - DM_min) / DM_x_range
adata.obsm[cellstate_key] = norm_DMx

# calculate
DM_normed = [adata[cbs].obsm[cellstate_key] for cbs in CB_ay]
print(DM_normed[0].shape)

(394, 5)

# bandwidth
bw = None
bw = 0.5

from scipy.stats import gaussian_kde

kde_kernel = []
kde_u = []

for dm in DM_normed:
    # print(dm.shape)
    # kde_fn = gaussian_kde(dm.T, bw_method='silverman')       # take in [n_dim, n_sample]
    kde = gaussian_kde(dm.T, bw_method=bw)
    kde_kernel.append(kde)
    u_t = kde(DM_x.T)  # full DM
    kde_u.append(u_t)

kde_u = np.stack(kde_u)

pdp.pl.params_in_umap(adata, kde_u, param='KDE u')

2. Mellon¶

Mellon use a gaussian process to model the cell density. See Otto, D.J., Jordan, C., Dury, B. et al. 2024 for the manuscript and mellon github for the package.

Our package envelope the Mellon model in a simple way.

pdp.tl.compute_mellon_u
pdp.tl.compute_mellon_timesense_u

pseudodyanmics+ integrate mellon for solving sc population dynamics. Use a different training script dudt_train_mellon.py

# For time sensitive model as an example

cellstate_key = 'DM_EigenVectors'
timepoint_key = 'timepoint_tx_days'    # 

log_u, mellon_fns = pdp.tl.compute_mellon_u(adata, 
                            cellstate_key = cellstate_key,
                            timepoint_key = timepoint_key, 
                            n_dimension=None)

# like KDE, each time point has its own density function 
print("# of estimators : ", len(mellon_fns))

# a big difference is that Mellon returns log density
pdp.pl.params_in_umap(adata, log_u, param='mellon log u')

To use time-sensitive mellon model, we need to specify the ls_time parameter, which prefers short and identical interval between timepoints. For our long-term in vivo dataset, we use take the log-time for demonstration (which is a strong assumption).

adata.obs['log_timepoint'] = adata.obs['timepoint_tx_days'].apply(lambda x: np.log(x))

# For time sensitive model as an example

cellstate_key = 'DM_EigenVectors'
timepoint_key = 'log_timepoint'    # 

log_u, density_predictor = pdp.tl.compute_mellon_timesense_u(adata, 
                            cellstate_key = cellstate_key,
                            timepoint_key = timepoint_key, 
                            n_dimension=None)

# this time-dependent predictor allows us to predict density at any time point

t = np.log(46) # an 
X = adata.obsm[cellstate_key] 
density_d46 = density_predictor(X, t)

We can save this precomputed predictor for later use during training

density_predictor.to_json(f"data/tom_pos_mellon_timecontinuous_predictor.json") 

3. Efficient kernel density estimator for atlas level data¶

Traditionaly gaussidan KDE scales badly to sample size. For atals with millions of cells, we can use a hash-based estimator for sample-efficiency.
The package efficient_kde provide several estimators for accelaration. see the original github for details.
The manuscript by Moses Charikar and Paris Siminelakis FOCS 2017

!git clone https://github.com/talwagner/efficient_kde.git

from efficient_kde.kde import FastLaplacianKDE, kde

# Hyperparameter for HBE
bandwidth = 1
L = 90

# list cell by time
obs = adata.obs
CB_ay = [obs.query('`timepoint_tx_days` == @t').index for t in adata.uns['pop']['t']]


# cell state coordiate
# scale to 0-1
DM_x = adata.obsm['DM_EigenVectors_multiscaled']

DM_min = DM_x.min(axis=0, keepdims=True)
DM_x_range = (DM_x.max(axis=0, keepdims=True) - DM_min)

# update 
norm_DMx = (DM_x - DM_min) / DM_x_range
adata.obsm['DM_EigenVectors_multiscaled'] = norm_DMx

# calculate
DM_normed = [adata[cbs].obsm['DM_EigenVectors_multiscaled'] for cbs in CB_ay]

hbe_kernel = []
hbe_u = []

for dm in DM_normed:
    # print(dm.shape)
    # kde_fn = gaussian_kde(dm.T, bw_method='silverman')       # take in [n_dim, n_sample]
    hbe = FastLaplacianKDE(dm, bandwidth=0.05, L=30)
    hbe_kernel.append(hbe.kde)
    u_t = hbe.kde(DM_x)  # full DM
    hbe_u.append(u_t)

hbe_u = np.stack(hbe_u)

pdp.pl.params_in_umap(adata, hbe_u, param='HBE u')

What’s next

set up training configuration