[PYTHON] I tried to analyze scRNA-seq data using Topological Data Analysis (TDA)

What is Topological Data Analysis?

Topology is a field of mathematics that deals with shape. Topological Data Analysis (TDA) treats a large set of data points in a high-dimensional space as a shape, and uses features of that shape to characterize and analyze the data. TDA has been applied in fields such as natural language processing and image recognition. Here I apply it to scRNA-seq data.

Single cell RNA-seq (scRNA-seq)

Within an individual multicellular organism, essentially all cells carry the same set of genes. However, which of those genes are expressed differs from cell to cell. RNA-seq is a technology that comprehensively measures the genes expressed in cells, and single cell RNA-seq (scRNA-seq) performs RNA-seq at the level of individual cells. This makes scRNA-seq well suited to analyzing cell diversity at the gene expression level. Recently it has been used extensively to study the diversity of cancer stem cells and of cells undergoing differentiation.

Analysis of scRNA-seq data

Currently, more than 20,000 genes have been identified in humans, and RNA-seq can measure the expression level of all of them. Because an scRNA-seq experiment covers thousands to tens of thousands of cells, the analysis has to handle thousands to tens of thousands of data points, each in a space of about 20,000 dimensions. TDA is well suited to analyzing large, high-dimensional data of this kind.
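To make the dimensions concrete, here is a minimal sketch in which each cell is a vector with one coordinate per gene, and the distance between two cells is the Euclidean distance between their vectors. The three toy cells and their values are made up for illustration and are not taken from any real data set; a real profile would have roughly 20,000 entries instead of four.

```python
from itertools import combinations
from math import comb, dist

# Toy log2 FPKM profiles: 3 cells x 4 genes (made-up values).
cells = {
    "cell_a": [0.0, 3.0, 4.0, 0.0],
    "cell_b": [0.0, 0.0, 0.0, 0.0],
    "cell_c": [0.0, 3.0, 4.0, 1.0],
}

def pairwise_distances(profiles):
    """Euclidean distance between every pair of cells, keyed by (name, name)."""
    return {
        (a, b): dist(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }

distances = pairwise_distances(cells)

# Number of distinct cell pairs in an experiment with 10,000 cells.
n_pairs_10k = comb(10_000, 2)
```

With 10,000 cells there are already about 50 million pairwise distances, so the number of cells matters as much as the number of genes.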

Data set

In this analysis, I used data published in the public GEO database under accession number GSE67310. The file GSE67310_iN_data_log2FPKM_annotated.txt.gz contains the expression levels as log2-transformed FPKM values, together with per-cell annotations such as the time point (time_point) and the cell type (assignment). The code below reads the decompressed file, GSE67310_iN_data_log2FPKM_annotated.txt.

iNeuron

Forced expression of the Ascl1 gene in mouse embryonic fibroblasts directly reprograms the cells into neurons (induced neurons, iN). Cells were collected at 5 time points (day 0, day 2, day 5, day 20, day 22) after the start of reprogramming, and scRNA-seq was performed on them.

Python code

Run a Vietoris-Rips filtration to obtain a simplicial complex. Its 1-simplices are line segments connecting pairs of data points. I drew these as an undirected graph using networkx.
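As a rough illustration of what the complex contains at a single scale, the sketch below builds the edges and triangles by hand: two points are joined when their distance is at most a threshold, and three points form a triangle when all three of their edges are present. The function name `rips_at_scale` and the toy points are my own for illustration; GUDHI does this far more efficiently and across all scales at once.

```python
from itertools import combinations
from math import dist

def rips_at_scale(points, max_edge_length):
    """Return (edges, triangles) of the Vietoris-Rips complex at one threshold."""
    n = len(points)
    # 1-simplices: pairs of points no farther apart than the threshold.
    edges = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= max_edge_length
    }
    # 2-simplices: triples whose three edges are all present.
    triangles = [
        (i, j, k)
        for i, j, k in combinations(range(n), 3)
        if (i, j) in edges and (i, k) in edges and (j, k) in edges
    ]
    return sorted(edges), triangles

# Toy 2-D points (hypothetical); the last one is far from the others.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]

edges_small, triangles_small = rips_at_scale(points, 1.2)
edges_large, triangles_large = rips_at_scale(points, 1.5)
```

At threshold 1.2 only the two unit-length edges appear; at 1.5 the diagonal is added and the triangle is filled in. Tracking how such features appear and disappear as the threshold grows is what the filtration records.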

Load libraries
import numpy as np
import pandas as pd
import gudhi as gd
import networkx as nx
Reading data
x = pd.read_csv('GSE67310_iN_data_log2FPKM_annotated.txt', delimiter = '\t')
Trimming data
# Keep only the gene expression columns and index the rows by cell name
y = x.drop(columns = ['cell_name', 'assignment', 'log_tauGFP_intensity', 'experiment', 'time_point'])
y.index = x.cell_name
Creating color table by day
# One color per cell, indexed by cell name; any other time point (day 22) is blue
day_palette = {0: 'red', 2: 'yellow', 5: 'green', 20: 'purple'}
day_color = pd.Series(x.time_point.map(day_palette).fillna('blue').to_numpy(), index = y.index)
Creating color table by cell type
# One color per cell, indexed by cell name; cell types not listed here are white
type_palette = {
    'MEF': 'red',
    'd2_induced': 'yellow',
    'd2_intermediate': 'orange',
    'd5_earlyiN': 'skyblue',
    'd5_earlyMyocyte': 'lightgreen',
    'd5_intermediate': 'brown',
    'd5_failedReprog': 'gray',
    'd22_failedReprog': 'black',
    'Neuron': 'blue',
    'Myocyte': 'green',
}
type_color = pd.Series(x.assignment.map(type_palette).fillna('white').to_numpy(), index = y.index)
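Computing Vietoris-Rips complex
# Pass the point cloud by keyword; recent GUDHI versions only accept keyword arguments here
rips = gd.RipsComplex(points = y.values, max_edge_length = 250)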
Computing simplex tree
simplex_tree = rips.create_simplex_tree(max_dimension = 2)
Computing skeleton
skeleton = simplex_tree.get_skeleton(2)
Getting persistence diagram
diag = simplex_tree.persistence()
Plotting persistence diagram
gd.plot_persistence_diagram(diag)
Plotting persistence density
gd.plot_persistence_density(diag)
Constructing network
import matplotlib.pyplot as plt
g = nx.Graph()
# 1-simplices (simplices with two vertices) are the edges of the complex
for simplex, filtration in skeleton:
  if len(simplex) == 2:
    g.add_edge(y.index[simplex[0]], y.index[simplex[1]])
layout = nx.kamada_kawai_layout(g)
Plotting network colored by time point
plt.figure()
nx.draw_networkx_nodes(g, layout, linewidths = 0.2, edgecolors = 'black', node_size = 20, node_color = day_color[list(g.nodes())].values)
nx.draw_networkx_edges(g, layout, width = 0.2, edge_color = 'gray')
Plotting network colored by cell type
plt.figure()
nx.draw_networkx_nodes(g, layout, linewidths = 0.2, edgecolors = 'black', node_size = 20, node_color = type_color[list(g.nodes())].values)
nx.draw_networkx_edges(g, layout, width = 0.2, edge_color = 'gray')
plt.show()
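Besides plotting it, the persistence result can be inspected numerically. The sketch below assumes the format that `simplex_tree.persistence()` returns, a list of `(dimension, (birth, death))` tuples, and computes the lifetime of each finite interval in one dimension. The helper name `lifetimes` and the toy diagram are hypothetical and are not results from this data set.

```python
def lifetimes(diag, dimension):
    """Lifetimes (death - birth) of finite intervals in one dimension, longest first."""
    return sorted(
        (
            death - birth
            for dim, (birth, death) in diag
            if dim == dimension and death != float("inf")
        ),
        reverse=True,
    )

# Hypothetical diagram in the (dimension, (birth, death)) format.
toy_diag = [
    (0, (0.0, float("inf"))),  # the component that never dies
    (0, (0.0, 1.5)),
    (0, (0.0, 0.5)),
    (1, (2.0, 5.0)),           # one loop
]

h0 = lifetimes(toy_diag, 0)
h1 = lifetimes(toy_diag, 1)
```

Long lifetimes correspond to points far from the diagonal in the persistence diagram, which are usually read as the more robust features.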

Figures: persistence diagram, persistence density, network colored by time point, network colored by cell type.

When the network is color-coded by time point, it spreads out from day 0 (red) to day 22 (blue), which suggests that the gene expression patterns diversify over time. When it is color-coded by cell type, the day 22 cells split into Neuron (blue) and Myocyte (green).
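One crude way to put a number on this visual impression is to count how many edges of the network join two cells with the same label. This is only a sketch of a possible check, not part of the original analysis; the function name `same_label_fraction` and the toy edges and labels are hypothetical.

```python
def same_label_fraction(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    edges = list(edges)
    if not edges:
        raise ValueError("the graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Toy network (hypothetical): two groups joined by a single edge.
toy_edges = [("c1", "c2"), ("c3", "c4"), ("c4", "c5"), ("c2", "c3")]
toy_labels = {
    "c1": "Neuron",
    "c2": "Neuron",
    "c3": "Myocyte",
    "c4": "Myocyte",
    "c5": "Myocyte",
}

fraction = same_label_fraction(toy_edges, toy_labels)
```

With the objects from the code above, something like `same_label_fraction(g.edges(), dict(zip(x.cell_name, x.assignment)))` should work, assuming cell names are unique. A high value on its own is not conclusive, since it also depends on how unevenly the labels are distributed.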

Conclusion

I analyzed the scRNA-seq data with TDA. In the resulting network the cells separate by time point and by cell type, so I think they clustered neatly. TDA appears to be an effective approach for analyzing scRNA-seq data.

References

GUDHI Python modules documentation
Treutlein B, Lee QY, Camp JG, Mall M et al. Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq. Nature 2016 Jun 16;534(7607):391-5.
GSE67310 (GEO accession)
