1-1. Prepare Training Data

In this notebook, we go through steps to :

  • embed the population size measurement into the AnnData.uns['pop'].

  • rename timepoint obs key

  • perform dimension reduction : diffusion map

  • sample local state transition

  • train test split

%load_ext autoreload
%autoreload 2

import os, sys
if sys.platform.startswith("darwin"):
    os.environ['KMP_DUPLICATE_LIB_OK']='True'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scanpy as sc
sc.settings.set_figure_params(frameon=False, dpi=30)

import our python package as pdp (pseudo dynamics plus)

import pseudodynamics as pdp
os.chdir(pdp.main_dir)
print("workding directory changed to:", pdp.main_dir)
workding directory changed to: /Users/weizhongzheng/Documents/python_project/pseudodynamics_plus

mouse in vivo BM haematopoiesis

In this tutorial, we use a special persistent labling single-cell data from Kucinski & Barile et. al, Cell Stem Cell. 2024.

Persistent labeling technique can induces fluorescence in Hoxb5-expressing cells and propogate the label to their progenies. As a result, we can identify offsprings of haematopoitic stem and progenitor cells (HSPC) produced after induction from the by checking the tomato fluorescence (Tom+).

This dataset contains 130700 cells from 9 timepoints spanning 9 month’s time. At each time point we counted the total amount of Tom+ cells in the bone marrow of the mice.

The data is available at figshare 👈, or you can download it with the bash script data_downloading.sh.

# we can download the expression matrix together with experimental settings

# create a data folder 
if not os.path.exists("data"):
    os.mkdir("data")
adata_full = sc.read_h5ad("data/tompos/combined_filt.h5ad")
adata_full
AnnData object with n_obs × n_vars = 130700 × 4814
    obs: 'n_genes', 'n_counts', 'mt_count', 'mt_frac', 'doublet_scores', 'predicted_doublets', 'xist_logn', 'Ygene_logn', 'xist_bin', 'Ygene_bin', 'sex_adata', 'biosample_id', 'cellid', 'RBG', 'SLXid', 'index', '10xsample_description', 'sex_mixed', 'sex_meta', 'mouse_id', 'sortedcells', 'expected_cells_10x', 'cellranger_cellsfound', 'chemistry', 'tom', 'expdate', 'batch', 'timepoint_tx_days', 'start_age', 'sample_id', 'countfile', 'S_score', 'G2M_score', 'phase', 'leiden', 'SLX', 'plate_sorted', 'plate_rearranged', 'well_sorted', 'well_rearranged', 'set_index', 'CI_index', 'mouse_platelabel', 'sort_method', 'sample.name', 'population', 'sex', 'countfolder', 'batch_plate_sorted', 'data_type', 'sex_combined', 'longname', 'anno_man', 'leiden_DM', 'HSCscore', 'nn_HSCscore', 'isroot', 'dpt_pseudotime'
    var: 'symbol', 'gene_ids-10x', 'feature_types-10x', 'genome-10x', 'n_counts-10x', 'highly_variable-10x', 'means-10x', 'dispersions-10x', 'dispersions_norm-10x', 'gene_removed-10x', 'mean-10x', 'std-10x', 'geneid-SS2', 'chromosome-SS2', 'is_mito-SS2', 'highly_variable', 'mean', 'std'
    uns: 'DM', 'anno_man_colors', 'biosample_id_colors', 'data_type_colors', 'diffmap_evals', 'iroot', 'isroot_colors', 'leiden', 'leiden_DM_colors', 'leiden_DM_sizes', 'leiden_colors', 'leiden_sizes', 'neighbors', 'paga', 'pagaDM', 'pagaPCA', 'pca', 'phase_colors', 'predicted_doublets_colors', 'rank_genes_groups', 'umap_init_pos', 'umap_init_pos_DM'
    obsm: 'X_diffmap', 'X_pca', 'X_pca_harmony', 'X_umap', 'X_umap_2d', 'X_umap_3d', 'X_umap_DM_2d', 'X_umap_DM_3d'
    varm: 'PCs'
    obsp: 'DM_connectivities', 'DM_distances', 'connectivities', 'distances'

Data overview and filtering

The preprocess single-cell landscape is batch-correted and cell type annotated.

sc.pl.umap(adata_full, 
           color=['timepoint_tx_days', 'tom', 'dpt_pseudotime',
                         'batch',  'anno_man', 'biosample_id', ], 
           ncols=3, wspace=0.3)

filtering Tom+ cells

For this particular data, the newly generated cells are labeled with fluorensence. We thus filtered Tomato-positive (Tom+) cells and count the number of all the Tom+ cells in the bone marrow. At each time point, we sequenced and FACS-counted cells from several mouse.

# select positive cells and saved under data
adata = adata_full[adata_full.obs['tom'] == 'pos'].copy()
adata.write_h5ad('data/tom_pos.h5ad')

Important !!!

To train the model, you need to placed the processed data under data.

1. Attaching population size measurement

One of the main innovations of pseudodyanmics+ is the use of population/tissue size measurement that comes from additional experiment, considering single-cell experiment has the issue of limiting capacity and sampling bias.

However, for dataset that without tissue size measurement, we can go back to sequenced cell number in the sc library. The parameters inferred in this manner may fail to reflect the absolute tissue growth and flux transiting between cell state.

The population info is stored as a dictionary and saved in adata.uns['pop']. The pop dictionary must contain the following keys:

  • 't' : the real experiment time point when the cells are counted and sequenced (assume on the same day)

  • 'mean' : the mean cell number of different experimental repeats for each time point

  • 'std' : the standard derivation of different experimental repeats

  • 'n_lib' : the number of repeats

For example, a single cell data with 3 time points (say day0, day4, day10) and 3 repeats each time looks like this:

    adata.uns['pop'] = {
        "t" = np.array([0, 4, 10]),
        "mean" = np.array([mean_d0, mean_d4, mean_d10]),
        "var" = np.array([var_d0, var_d4, var_d10]),
        "std" = np.array([std_d0, std_d4, std_d10]),
        "n_lib" = np.array([3, 3, 3])
    }
adata = sc.read_h5ad('data/tom_pos.h5ad')
adata
AnnData object with n_obs × n_vars = 49390 × 4814
    obs: 'n_genes', 'n_counts', 'mt_count', 'mt_frac', 'doublet_scores', 'predicted_doublets', 'xist_logn', 'Ygene_logn', 'xist_bin', 'Ygene_bin', 'sex_adata', 'biosample_id', 'cellid', 'RBG', 'SLXid', 'index', '10xsample_description', 'sex_mixed', 'sex_meta', 'mouse_id', 'sortedcells', 'expected_cells_10x', 'cellranger_cellsfound', 'chemistry', 'tom', 'expdate', 'batch', 'timepoint_tx_days', 'start_age', 'sample_id', 'countfile', 'S_score', 'G2M_score', 'phase', 'leiden', 'SLX', 'plate_sorted', 'plate_rearranged', 'well_sorted', 'well_rearranged', 'set_index', 'CI_index', 'mouse_platelabel', 'sort_method', 'sample.name', 'population', 'sex', 'countfolder', 'batch_plate_sorted', 'data_type', 'sex_combined', 'longname', 'anno_man', 'leiden_DM', 'HSCscore', 'nn_HSCscore', 'isroot', 'dpt_pseudotime', 'leiden_orig', 'logk', 'net_prolif', 'log10SR', 'log_density_at_E3', 'log_density_at_E7', 'log_density_at_E12', 'log_density_at_E12_clip', 'log_density_at_E3_clip', 'log_density_at_E7_clip', 'log_density_at_E27', 'log_density_at_E27_clip', 'log_density_at_E49', 'log_density_at_E49_clip', 'log_density_at_E76', 'log_density_at_E76_clip', 'log_density_at_E112', 'log_density_at_E112_clip', 'log_density_at_E161', 'log_density_at_E161_clip', 'log_density_at_E269', 'log_density_at_E269_clip', 'pseudo_bulk', 'rs50', 'split', 'fine_anno'
    var: 'symbol', 'gene_ids-10x', 'feature_types-10x', 'genome-10x', 'n_counts-10x', 'highly_variable-10x', 'means-10x', 'dispersions-10x', 'dispersions_norm-10x', 'gene_removed-10x', 'mean-10x', 'std-10x', 'geneid-SS2', 'chromosome-SS2', 'is_mito-SS2', 'highly_variable', 'mean', 'std'
    uns: 'DM', 'DM_EigenValues', 'anno_man_colors', 'biosample_id_colors', 'data_type_colors', 'density_predictor', 'diffmap_evals', 'fine_anno_colors', 'iroot', 'isroot_colors', 'leiden', 'leiden_DM_colors', 'leiden_DM_sizes', 'leiden_colors', 'leiden_sizes', 'neighbors', 'original_pop', 'paga', 'pagaDM', 'pagaPCA', 'pca', 'phase_colors', 'pop', 'predicted_doublets_colors', 'pseudo_bulk', 'pseudo_bulk_settings', 'rank_genes_groups', 'rs50', 'umap_init_pos', 'umap_init_pos_DM'
    obsm: 'DM_EigenVectors', 'DM_scaled', 'Delta_DM', 'TIGON_u', 'X_diffmap', 'X_pca', 'X_pca_harmony', 'X_pca_scaled', 'X_umap', 'X_umap_2d', 'X_umap_3d', 'X_umap_DM_2d', 'X_umap_DM_3d', 'mellon_u', 'surrogate_u'
    varm: 'PCs'
    obsp: 'DM_Kernel', 'DM_Similarity', 'DM_connectivities', 'DM_distances', 'connectivities', 'distances'

load FACS data

Here we are loading the FACS experiment data. The BM cells of the same mouse are sorted for tomato positive and count total cell number before sending for sequencing.

pop_df = pd.read_csv("data/tompos/model_input_tompos.csv")
pop_df.head()
biosample_id 0 1 2 3 4 5 6 7 8 ... 20 24 25 26 28 sc_ncells_filt flow_total start_age time sc_ncells_total
0 2w4w_SITTC4_146819.61b 42 16 8 5 15 7 3 19 28 ... 0 0 0 0 0 157 634.907992 young 27 161
1 2w4w_SITTC4_146819.61f 184 122 103 109 118 97 73 149 146 ... 4 11 2 5 2 1304 2320.519107 young 27 1357
2 2w4w_SITTD4_146819.61c 257 122 107 76 116 120 78 175 161 ... 7 4 3 2 1 1422 3535.791587 young 27 1449
3 2w4w_SITTD4_146819.61h 98 61 42 45 42 26 28 52 76 ... 2 2 0 1 0 547 895.421264 young 27 563
4 2w4w_SITTE4_146818.66d 35 10 10 9 6 16 4 19 16 ... 0 0 0 1 0 137 456.845376 young 12 143

5 rows × 26 columns

We summarize the total cell number over different mouse for each time point

pop_by_time = pop_df.groupby("time").agg({"flow_total":["mean", "std","count"], "sc_ncells_filt":["mean", "std"]})
pop_by_time
flow_total sc_ncells_filt
mean std count mean std
time
3 147.000000 45.745674 4 101.00 36.487441
7 388.645708 254.329995 6 150.50 69.856281
12 576.692900 169.724308 4 155.25 106.859331
27 1846.659988 1347.949806 4 857.50 606.941238
49 3123.277705 2110.007646 4 1143.00 813.058833
76 8233.932637 446.605979 2 2678.00 560.028571
112 13949.548025 6560.283561 4 3003.25 1733.310392
161 4941.850616 3795.978265 4 1854.25 1635.804262
269 41722.985970 1788.570460 2 4320.75 1446.069241
import matplotlib.pyplot as plt


def plot_mean_std(col_name, ax):
    time_points = pop_by_time.index
    mean_values = pop_by_time[(col_name, 'mean')]
    std_values = pop_by_time[(col_name, 'std')]

    ax.plot(range(len(time_points)), mean_values, marker='o', label=col_name)
    ax.fill_between(range(len(time_points)), 
                    mean_values - std_values, 
                    mean_values + std_values, 
                    alpha=0.3, color='gray', label=col_name+"_error")
    ax.set_xticks(range(len(time_points)))
    ax.set_xticklabels(time_points)

fig = plt.figure(dpi=80, figsize=(6, 3))
ax = fig.gca()
plot_mean_std('flow_total', ax=ax)

Here we found that the measurment at Day 161 seems to be an outlier. We do the following correction :

  • we estimate the average growth rate from Day112 to Day269 with \( N_t = N_0 * e^{g \Delta t}\)

  • next we impute the mean pop size at Day161 with the average growth rate

time_points = pop_by_time.index
mean_values = pop_by_time[('flow_total', 'mean')].values
g = np.log(mean_values[-1]  / mean_values[-3]) / (time_points[-1] - time_points[-3]) 
print(f'the estimated mean g is {g:.5f}')

impute_Day161 = np.exp(g*(time_points[-2] - time_points[-3]) ) * mean_values[-3]
print(f'the imputed mean pop size is {impute_Day161:.2f}')

# replace the mean value
mean_values[-2] = impute_Day161
the estimated mean g is 0.00698
the imputed mean pop size is 19636.45
fig = plt.figure(dpi=80, figsize=(6, 3))
ax = fig.gca()
plot_mean_std('flow_total', ax)
plot_mean_std('sc_ncells_filt', ax)
ax.legend()
<matplotlib.legend.Legend at 0x3264a7f50>
../_images/6063deecc2c92a0cda9f039c7e042a2b5a50adaccb03c0982500b90ccb034f6f.png
# assign to uns['pop']

adata.uns['pop'] = {
    't' : time_points.to_numpy(),
    'mean' : mean_values,
    'var' : pop_by_time[('flow_total', 'std')].values ** 2,  # variance per timepoint
    'std' : pop_by_time[('flow_total', 'std')].values,
    'n_lib' : pop_by_time[('flow_total', 'count')].values
}

adata.uns['pop']

2. rename Timepoint Key

We use a unique obs key timepoint_tx_days to store the timepoint.

if 'timepoint_tx_days' not in adata.obs_keys():
    adata.obs['timepoint_tx_days'] = adata.obs['day'].astype(int) # example

check if the time point from the pop info match with that in the sequencing data

adata.obs['timepoint_tx_days'].isin(adata.uns['pop']['t']).all()
True

3. Compute diffusion map (DM)

The example dataset has pre-computed diffusion maps. For reproducibility, plesae keep the original diffusion map and pseudotime.
Here, for demonstration, we show two popular methods of getting DM cooridnates and pseudotime.

  • scanpy diffusion map and dpt_psuedotime

  • palantir diffusion map and palantir pseudotime

adata
AnnData object with n_obs × n_vars = 49390 × 4814
    obs: 'n_genes', 'n_counts', 'mt_count', 'mt_frac', 'doublet_scores', 'predicted_doublets', 'xist_logn', 'Ygene_logn', 'xist_bin', 'Ygene_bin', 'sex_adata', 'biosample_id', 'cellid', 'RBG', 'SLXid', 'index', '10xsample_description', 'sex_mixed', 'sex_meta', 'mouse_id', 'sortedcells', 'expected_cells_10x', 'cellranger_cellsfound', 'chemistry', 'tom', 'expdate', 'batch', 'timepoint_tx_days', 'start_age', 'sample_id', 'countfile', 'S_score', 'G2M_score', 'phase', 'leiden', 'SLX', 'plate_sorted', 'plate_rearranged', 'well_sorted', 'well_rearranged', 'set_index', 'CI_index', 'mouse_platelabel', 'sort_method', 'sample.name', 'population', 'sex', 'countfolder', 'batch_plate_sorted', 'data_type', 'sex_combined', 'longname', 'anno_man', 'leiden_DM', 'HSCscore', 'nn_HSCscore', 'isroot', 'dpt_pseudotime', 'leiden_orig', 'logk', 'net_prolif', 'log10SR', 'log_density_at_E3', 'log_density_at_E7', 'log_density_at_E12', 'log_density_at_E12_clip', 'log_density_at_E3_clip', 'log_density_at_E7_clip', 'log_density_at_E27', 'log_density_at_E27_clip', 'log_density_at_E49', 'log_density_at_E49_clip', 'log_density_at_E76', 'log_density_at_E76_clip', 'log_density_at_E112', 'log_density_at_E112_clip', 'log_density_at_E161', 'log_density_at_E161_clip', 'log_density_at_E269', 'log_density_at_E269_clip', 'pseudo_bulk', 'rs50', 'split', 'fine_anno'
    var: 'symbol', 'gene_ids-10x', 'feature_types-10x', 'genome-10x', 'n_counts-10x', 'highly_variable-10x', 'means-10x', 'dispersions-10x', 'dispersions_norm-10x', 'gene_removed-10x', 'mean-10x', 'std-10x', 'geneid-SS2', 'chromosome-SS2', 'is_mito-SS2', 'highly_variable', 'mean', 'std'
    uns: 'DM', 'DM_EigenValues', 'anno_man_colors', 'biosample_id_colors', 'data_type_colors', 'density_predictor', 'diffmap_evals', 'fine_anno_colors', 'iroot', 'isroot_colors', 'leiden', 'leiden_DM_colors', 'leiden_DM_sizes', 'leiden_colors', 'leiden_sizes', 'neighbors', 'original_pop', 'paga', 'pagaDM', 'pagaPCA', 'pca', 'phase_colors', 'pop', 'predicted_doublets_colors', 'pseudo_bulk', 'pseudo_bulk_settings', 'rank_genes_groups', 'rs50', 'umap_init_pos', 'umap_init_pos_DM'
    obsm: 'DM_EigenVectors', 'DM_scaled', 'Delta_DM', 'TIGON_u', 'X_diffmap', 'X_pca', 'X_pca_harmony', 'X_pca_scaled', 'X_umap', 'X_umap_2d', 'X_umap_3d', 'X_umap_DM_2d', 'X_umap_DM_3d', 'mellon_u', 'surrogate_u'
    varm: 'PCs'
    obsp: 'DM_Kernel', 'DM_Similarity', 'DM_connectivities', 'DM_distances', 'connectivities', 'distances'
if 'X_diffmap' not in adata.obsm_keys():
    # use batch corrected cell representation like Harmonied PC or SCVI latent
    sc.pp.neighbors(adata, n_pcs=30, use_rep='X_pca_harmony') 
    sc.tl.diffmap(adata, n_comps=10)

adata.uns['iroot']= np.argmin(adata.obs['HSCscore'].values)
sc.tl.dpt(adata)

sc.pl.umap(adata, color=['HSCscore', 'dpt_pseudotime'])
# next we scaled the diffusion map to 0-1.
DM = adata.obsm['X_diffmap']
DM_min = DM.min(axis=0,keepdims=True)
DM_range = DM.max(axis=0,keepdims=True) - DM_min
DM_scaled = (DM - DM_min)/DM_range

Use palantir to compute DM

import palantir
dm_res = palantir.utils.run_diffusion_maps(adata, pca_key='X_pca_harmony')
ms_data = palantir.utils.determine_multiscale_space(adata)

print(f"shape of palantir DM : {adata.obsm['DM_EigenVectors'].shape} and scaled DM : {adata.obsm['DM_EigenVectors_multiscaled'].shape} ")

4. Sample local cell state changes in diffusion map

Pseudodynamics+ can make use of the local cell state changes in diffusion map to assists learning velocity, inspired by TrajectoryNet by Tong et. al.. This state changes is defined as \(\Delta x = \sum_{i \in N(x)} \frac{w_{i,x}}{w_{i,x} + \sum_{j \in N(x)} w_{i,j}} \Delta x_i\) ,Here we provdie two ways of sampling this local state transition:

  • by default we rely on KNN and pseudotime with \(w_{i,j}\) denoting the connectivities in KNN.

  • alternatively, when a transition matrix is provided, e.g. cellrank. The \(w_{i,j}\) denote the probability of transitioning from \(i\) to \(j\).

delta_DM , neighbors = pdp.tl.sample_deltax(adata, xkey='DM_scaled', pseudotimekey='dpt_pseudotime', repeat=10)
delta_DM_ay = np.stack(delta_DM)

print(delta_DM_ay.shape)
adata.obsm['Delta_DM'] = delta_DM_ay.mean(axis=-1) # use the averaged ∆DM
(49390, 10, 30)

5. train test split

def train_test_split_adata(adata,  leaveout=None, val_size=0.1, test_size=0.1, timepoint_key='timepoint_tx_days'):

    obs = adata.obs
    val_test_cbs = obs.query(f"`{timepoint_key}` not in @leaveout").sample(frac=0.2).index
    test_cb = np.random.choice(val_test_cbs, size=len(val_test_cbs)//2,replace=False)

    split_mapper = {cb:'test' for cb in test_cb}
    val_cbs = {cb:'val' for cb in val_test_cbs if cb not in test_cb}
    train_cbs = {cb:'train' for cb in adata.obs_names if cb not in val_test_cbs}

    split_mapper.update(val_cbs)
    split_mapper.update(train_cbs)

    return adata.obs.index.map(split_mapper)

adata.obs['split'] = train_test_split_adata(adata, leaveout=[3])
adata.write_h5ad("data/tom_pos.h5ad")

What’s next

  • Estimate single-cell density at observed time point (optional).

  • set up training configuration