Topology is a field of mathematics dealing with "forms". Topological Data Analysis (TDA) is a data analysis method that captures the shape represented by a large number of data in a high-dimensional space as a feature of the data and analyzes it. TDA is applied in various fields such as natural language processing and image recognition. I tried to analyze the scRNA-seq data using this TDA. Single cell RNA-seq(scRNA-seq) Within an individual multicellular organism, all cells basically carry the same gene. However, the genes expressed in it differ from cell to cell. RNA-seq is a technology that detects all genes expressed in cells, and single cell RNA-seq (scRNA-seq) performs that RNA-seq at the single cell level. scRNA-seq is a suitable technique for analyzing cell diversity at the gene expression level. Recently, the diversity of cancer stem cells and the diversity of cells in the differentiation process have been well studied.
Currently, more than 20,000 genes have been identified in humans. RNA-seq can be used to obtain the expression levels of all genes. Since scRNA-seq handles data from thousands to tens of thousands of cells, it is necessary to analyze thousands to tens of thousands of data in about 20,000 dimensions. TDA is a good method for analyzing such a large amount of high-dimensional data.
In this analysis, we used data published in a public database. Published in the GEO database, the accession number is GSE67310is. You can get the data that the expression level called GSE67310_iN_data_log2FPKM_annotated.txt.gz is converted into the FPKM value. Information such as the expression level of each gene, time point (time_point), and cell type (assignment) is described. iNeuron Forcible expression of the Ascl1 gene in mouse embryonic fibroblast causes cell reprogramming, returning to an undifferentiated state and allowing it to differentiate into neurons. Cells are collected at 5 time points (day0, day2, day5 day20, day22) from the start of differentiation and scRNA-seq is performed.
Run Vietris-Rips filtration to get a simplicial complex. A primary unit is a line segment connecting data points. I tried to illustrate this as an undirected graph using networkx.
Load libraries
import numpy as np
import pandas as pd
import gudhi as gd
import networkx as nx
Reading data
x = pd.read_csv('GSE67310_iN_data_log2FPKM_annotated.txt', delimiter = '\t')
Triming data
y = x.drop('cell_name', axis = 1)
y = y.drop('assignment', axis = 1)
y = y.drop('log_tauGFP_intensity', axis = 1)
y = y.drop('experiment', axis = 1)
y = y.drop('time_point', axis = 1)
y.index = x.cell_name
Creating color table by day
day_color = pd.DataFrame()
for i in range(len(x)):
if x.time_point[i] == 0:
day_color[y.index[i]] = 'red'
elif x.time_point[i] == 2:
day_color[y.index[i]] = 'yellow'
elif x.time_point[i] == 5:
day_color[y.index[i]] = 'green'
elif x.time_point[i] == 20:
day_color[y.index[i]] = 'purple'
else:
day_color[y.index[i]] = 'blue'
Creating color table by cell type
type_color = pd.Series()
for i in range(len(x)):
if x.assignment[i] == 'MEF':
type_color[y.index[i]] = 'red'
elif x.assignment[i] == 'd2_induced':
type_color[y.index[i]] = 'yellow'
elif x.assignment[i] == 'd2_intermediate':
type_color[y.index[i]] = 'orange'
elif x.assignment[i] == 'd5_earlyiN':
type_color[y.index[i]] = 'skyblue'
elif x.assignment[i] == 'd5_earlyMyocyte':
type_color[y.index[i]] = 'lightgeen'
elif x.assignment[i] == 'd5_intermediate':
type_color[y.index[i]] = 'brown'
elif x.assignment[i] == 'd5_failedReprog':
type_color[y.index[i]] = 'gray'
elif x.assignment[i] == 'd22_failedReprog':
type_color[y.index[i]] = 'black'
elif x.assignment[i] == 'Neuron':
type_color[y.index[i]] = 'blue'
elif x.assignment[i] == 'Myocyte':
type_color[y.index[i]] = 'green'
else:
type_color[y.index[i]] = 'white'
Computing Vietris-Rips complex
rips = gd.RipsComplex(y.values, max_edge_length = 250)
Computing simplex tree
simplex_tree = rips.create_simplex_tree(max_dimension = 2)
Computing skeleton
skeleton = simplex_tree.get_skeleton(2)
Getting persistence diagram
diag = simplex_tree.persistence()
Plotting persistence diagram
gd.plot_persistence_diagram(diag)
Plotting persistence density
gd.plot_persistence_density(diag)
Constructing netowrk
g = nx.Graph()
for i in range(len(skeleton)):
if len(skeleton[i][0]) == 2:
g.add_edge(y.index[skeleton[i][0][0]], y.index[skeleton[i][0][1]])
layout = nx.kamada_kawai_layout(g)
nx.draw_networkx_nodes(g,layout,lineidths=0.2, edgecolors='black', node_size=20, node_color = day_color[list(g.nodes())].values)
nx.draw_networkx_edges(g, layout, width = 0.2, edge_color = 'gray')
nx.draw_networkx_nodes(g,layout,lineidths=0.2, edgecolors='black', node_size=20, node_color = type_color[list(g.nodes())].values)
nx.draw_networkx_edges(g, layout, width = 0.2, edge_color = 'gray')
When color-coded by time point, it spreads from day 0 (red) to day 22 (blue), and it can be seen that the gene expression pattern is diversifying. If you color-code by cell type, you can see that the cells on day 22 are divided into Neuron (blue) and Myocyte (green).
I analyzed the scRNA-seq data with TDA. The cells are separated by time point and cell type, and I think we were able to cluster them neatly. Isn't TDA effective for analyzing scRNA-seq data?
GUDHI Python modules documentation Treutlein B, Lee QY, Camp JG, Mall M et al. Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq. Nature 2016 Jun 16;534(7607):391-5. GSE67310
Recommended Posts