Automatic acquisition of gene expression data with Python and R

This article is a translation of Blue Collar Bioinformatics' "Automated retrieval of expression data with Python and R". It is a little old, but translating it was also a way for me to study. Corrections for any typos or mistranslations are welcome.

The translation begins here

This February I will travel to Japan to take part in the Bio Hackathon 2010, which focuses on the integration of biological data, so I have been thinking through a use case as preparation.

When wrestling with a dataset, what additional data would help solve the problem? A typical example is gene expression data: you have a list of genes of interest and want to know their expression levels.

In a world where the dream of the Semantic Web had come true, you would query a set of RDF triples something like this:
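Something like the following hypothetical SPARQL sketch (the predicates here are invented for illustration, not drawn from any real ontology):

SELECT ?gene ?expvalue WHERE {
    ?exp exp:cellType "proB" .
    ?exp exp:condition "wild type" .
    ?exp exp:gene ?gene .
    ?exp exp:value ?expvalue .
}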

Then you could pull out the expression levels for the cell type of interest (proB) under normal conditions, and from there the expression levels of the genes at hand. Ideally, the expression values would carry metadata with them, so that you could tell from the experimental context what a number like 7.65 actually means.

You can't actually issue this query over any Semantic Web channel today, but there are plenty of tools for solving a similar problem automatically.

NCBI's Gene Expression Omnibus (GEO) hosts expression data online. You can query its datasets with the EUtils API via Biopython. Once you have identified a dataset of interest, you can retrieve the data with Bioconductor's GEOquery and map it to UCSC RefGene identifiers. All of these steps can be chained together by running R from Python with rpy2.
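A minimal sketch of the rpy2 pattern the code below relies on: push values into the embedded R session, run R code, and read the results back in Python.

import rpy2.robjects as robjects

# push a Python value into the embedded R session
robjects.r.assign("x", 3)
# run R code and pull the result back as an R vector
result = robjects.r('x * 2')
print result[0]  # 6.0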

Translator's note: nowadays pipeR may be the better choice.

First, set the top-level goal: here, obtaining wild-type expression data for mouse proB cells.

import os
import cPickle

def main():
    organism = "Mus musculus"
    cell_types = ["proB", "ProB", "pro-B"]
    email = "[email protected]"
    save_dir = os.getcwd()
    exp_data = get_geo_data(organism, cell_types, email, save_dir,
                            _is_wild_type)


def _is_wild_type(result):
    """Judge from the first sample title whether the result is wild type."""
    return result.samples[0][0].startswith("WT")

Next, a query is sent to GEO to find the available data based on the organism and cell type. In real work you might replace the module-private _is_wild_type function with something interactive, letting the biologist choose the sample that matches their experiment. Note also that the final result is pickled and saved locally; by querying only once this way, the load on the external server is reduced.

def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
    save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
    if not os.path.exists(save_file):
        results = cell_type_gsms(organism, cell_types, email)
        for result in results:
            if is_desired_result(result):
                with open(save_file, "w") as out_handle:
                    cPickle.dump(result, out_handle)
                break

    with open(save_file) as save_handle:
        result = cPickle.load(save_handle)
    return result
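As mentioned above, _is_wild_type could be swapped for an interactive chooser. A hypothetical sketch (the function name and prompt are invented, assuming the result object exposes a summary text and (title, accession) sample pairs):

def _choose_interactively(result):
    """Hypothetical replacement for _is_wild_type: let the user decide."""
    print result.summary
    for title, accession in result.samples:
        print "  %s (%s)" % (title, accession)
    return raw_input("Use this result? [y/N] ").lower().startswith("y")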

Sending the query to GEO and fetching the results is handled by Biopython's Entrez interface. After building up the query, the results are parsed into a simple object holding the description, along with the title and accession ID for each sample.

from Bio import Entrez

def cell_type_gsms(organism, cell_types, email):
    """Use Entrez to search GEO for a specific organism and cell types.
    """
    Entrez.email = email
    search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
    print "Searching GEO and retrieving results: %s" % search_term

    hits = []
    handle = Entrez.esearch(db="gds", term=search_term)
    results = Entrez.read(handle)
    for geo_id in results['IdList']:
        handle = Entrez.esummary(db="gds", id=geo_id)
        summary = Entrez.read(handle)
        samples = []
        for sample in summary[0]['Samples']:
            for cell_type in cell_types:
                if sample['Title'].find(cell_type) >= 0:
                    samples.append((sample['Title'], sample['Accession']))
                    break
        if len(samples) > 0:
            hits.append(GEOResult(summary[0]['summary'], samples))
    return hits
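The GEOResult object used here is defined in the full script linked at the end; a minimal sketch consistent with how it is used above (the attribute names are assumptions inferred from those uses):

class GEOResult:
    """Plain container pairing a GEO summary with (title, accession) samples."""
    def __init__(self, summary, samples):
        self.summary = summary
        self.samples = samples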

With that, the first barrier is cleared: choosing the experiment of interest. Next comes retrieving the actual expression values, by building a dictionary that maps RefGene IDs to expression levels. The result looks like this:

WT ProB, biological rep1
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]),
('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
('NM_177325', [7.3312364040000002, 11.831475960000001]),
('NM_177323', [6.9779868059999997, 6.3926399939999996]),
('NM_177322', [5.0833683379999997])]

The real hassle is fetching the expression values and mapping them to gene IDs, and Bioconductor's GEOquery library does this for you. The gene-to-expression table and the higher-level metadata are stored in local files: the former as an R-style tab-delimited table, the latter as JSON. Both are saved locally in an easy-to-read format.

def _write_gsm_map(self, gsm_id, meta_file, table_file):
    """Retrieve GEO expression data with Bioconductor, saving table and metadata.
    """
    robjects.r.assign("gsm.id", gsm_id)
    robjects.r.assign("table.file", table_file)
    robjects.r.assign("meta.file", meta_file)
    robjects.r('''
        library(GEOquery)
        library(rjson)
        gsm <- getGEO(gsm.id)
        write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
                    col.names = TRUE)
        cat(toJSON(Meta(gsm)), file = meta.file)
        ''')

This function creates the mapping from probe names to expression levels. To make it more useful, the microarray probes should be matched to actual biological gene IDs. This is done by calling the GEOquery library again: the platform data is retrieved as an R data frame, and only the columns of interest are saved to a tab-delimited file.

def _write_gpl_map(self, gpl_id, gpl_file):
    """Use R to retrieve GEO platform data and save it in tabular form."""
    robjects.r.assign("gpl.id", gpl_id)
    robjects.r.assign("gpl.file", gpl_file)
    robjects.r('''
        library(GEOquery)
        gpl <- getGEO(gpl.id)
        gpl.map <- subset(Table(gpl), select = c("ID", "RefSeq.Transcript.ID"))
        write.table(gpl.map, file = gpl.file, sep = "\t", row.names = FALSE,
                    col.names = TRUE)
        ''')

Finally, everything so far is tied together: download and parse each mapping file, then chain the correspondences expression value -> probe ID -> transcript ID. The final dictionary maps gene IDs to their associated expression values, and that is what gets returned.

import collections
import csv
import json

def get_gsm_tx_values(self, gsm_id, save_dir):
    """Create a map of transcript IDs to expression values from a GEO GSM file.
    """
    gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
    gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
    if (not os.path.exists(gsm_meta_file) or
            not os.path.exists(gsm_table_file)):
        self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

    with open(gsm_meta_file) as in_handle:
        gsm_meta = json.load(in_handle)
    id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
    tx_to_vals = collections.defaultdict(list)
    with open(gsm_table_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab')
        reader.next()  # skip the header line
        for probe_id, probe_val in reader:
            for tx_id in id_to_tx.get(probe_id, []):
                tx_to_vals[tx_id].append(float(probe_val))
    return tx_to_vals
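The get_gpl_map call above reads the platform table written by _write_gpl_map back into a probe-to-transcript dictionary. It is not shown in the post; here is a minimal sketch, assuming a file naming scheme that mirrors the GSM files and that GEO platform tables separate multiple RefSeq IDs with " /// ":

def get_gpl_map(self, gpl_id, save_dir):
    """Read the probe ID -> transcript ID table, downloading it if needed."""
    gpl_file = os.path.join(save_dir, "%s-map.txt" % gpl_id)  # assumed naming
    if not os.path.exists(gpl_file):
        self._write_gpl_map(gpl_id, gpl_file)
    id_to_tx = {}
    with open(gpl_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab')
        reader.next()  # skip the header line
        for probe_id, tx_str in reader:
            # multiple RefSeq IDs per probe assumed " /// " separated
            id_to_tx[probe_id] = [t for t in tx_str.split(" /// ") if t]
    return id_to_tx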

The complete script (https://github.com/chapmanb/bcbb/blob/022c50b3132bc3d009ed1d533a3954bd564a9ed3/rest_apis/find_geo_data.py) ties all of these snippets together: we started with a query and ended with the expression values we needed.

Compared with the RDF ideal, though, this is quite a detour. Making question-driven data access as simple as possible remains a recurring challenge for bioinformatics tool builders.


The translation ends here

The script itself is available at https://github.com/chapmanb/bcbb/blob/022c50b3132bc3d009ed1d533a3954bd564a9ed3/rest_apis/find_geo_data.py.

The author's bcbb repository contains many other useful bioinformatics programs.
