PubMed publishes abstracts of papers, but it can be very tedious to manually look up the abstracts of hundreds or thousands of reports.

Therefore, in this article, we will use a word cloud to text-mine a large number of papers. A word cloud shows at a glance which words are used most often in the papers.

Environment

All the analysis in this article was run on Google Colaboratory.

Getting the abstracts of the papers

This time I will analyze 1000 papers. Downloading them one by one by hand would be a waste of time, so I will write a program to download them automatically.

First, install biopython.

!pip install biopython

Next, load biopython and register your email address.

from Bio import Entrez
Entrez.email = "My email address"

Specify the keyword you want to look up. The search results for these keywords are stored in the variable pmids in list format.

term = "covid-19 age risk"
handle = Entrez.esearch(db="pubmed", term=term, retmax=1000)
record = Entrez.read(handle)
pmids = record["IdList"]

Of the search results, only PMID was stored in pmids. PMID is the identification number of the article stored in PubMed.

Next, create a function that takes a PMID and returns the abstract of the paper.

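def get_abstract(pmid):
    """Return the abstract text for a PMID, or a placeholder if it cannot be read."""
    try:
        handle = Entrez.efetch(db="pubmed", id=pmid, rettype="medline", retmode="xml")
        article = Entrez.read(handle)["PubmedArticle"][0]["MedlineCitation"]["Article"]
        # AbstractText is a list of sections; join them into one string
        return " ".join(article["Abstract"]["AbstractText"])
    except Exception:
        # Papers without an abstract raise an error here; return a placeholder instead
        return "emoriiin979"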

PubMed also stores papers that have no abstract, and trying to read the AbstractText key from them raises an error. In that case the function returns my user name ("emoriiin979") as a placeholder.

As long as this placeholder is excluded from the analysis later, it does not affect the result.
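As an alternative to a placeholder word, the missing abstracts can be represented as None and skipped when the text is joined. The following is only a minimal sketch, not the method used in this article: the helper names extract_abstract and join_abstracts are made up for illustration, and the hand-made dictionaries merely imitate the nested keys that get_abstract reads from the parsed response.

```python
def extract_abstract(record):
    """Return the joined abstract text, or None if the record has no abstract.

    `record` is assumed to have the same nesting that get_abstract reads:
    record["PubmedArticle"][0]["MedlineCitation"]["Article"]["Abstract"]["AbstractText"]
    """
    try:
        article = record["PubmedArticle"][0]["MedlineCitation"]["Article"]
        sections = article["Abstract"]["AbstractText"]
    except (KeyError, IndexError):
        return None  # no abstract (or no article) in this record
    return " ".join(str(s) for s in sections)


def join_abstracts(abstracts):
    """Concatenate the abstracts that exist and count the ones that are missing."""
    present = [a for a in abstracts if a is not None]
    missing = len(abstracts) - len(present)
    return " ".join(present), missing


# Hand-made records standing in for parsed PubMed responses (illustration only)
with_abstract = {"PubmedArticle": [{"MedlineCitation": {"Article": {
    "Abstract": {"AbstractText": ["Background text.", "Results text."]}}}}]}
without_abstract = {"PubmedArticle": [{"MedlineCitation": {"Article": {}}}]}

text_demo, missing_demo = join_abstracts(
    [extract_abstract(with_abstract), extract_abstract(without_abstract)]
)
print(text_demo)     # Background text. Results text.
print(missing_demo)  # 1
```

With this approach no extra word has to be filtered out afterwards, and the number of papers without an abstract is available directly.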

Now let's use this function to access PubMed's API and get the abstracts.

from tqdm import tqdm
from time import sleep

text = ""
for pmid in tqdm(pmids):
    text += " " + get_abstract(pmid)
    sleep(0.5)  # pause between requests so as not to overload PubMed

Because the WordCloud library takes a single string as input, the abstracts of 994 papers (6 papers had no abstract) are concatenated into one string.

Since there is a 0.5 second pause after each access to PubMed, it takes about 20 minutes to get 1000 papers.
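One possible way to shorten this, which I did not use here, is to request several papers per call. To my understanding Entrez.efetch accepts a comma-separated list of IDs, so the PMIDs could be split into batches first. The sketch below only shows the batching and the time spent in sleep(); the helper names and the batch size of 100 are arbitrary assumptions, and the PMIDs are dummy values.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def total_pause_seconds(n_requests, pause=0.5):
    """Time spent in sleep() alone, ignoring the time of the requests themselves."""
    return n_requests * pause


# Dummy PMIDs for illustration; the real ones come from record["IdList"]
demo_pmids = [str(n) for n in range(1, 1001)]
batches = chunked(demo_pmids, 100)  # batch size 100 is an arbitrary choice

print(len(batches))                          # 10 requests instead of 1000
print(total_pause_seconds(len(demo_pmids)))  # 500.0 seconds when fetching one by one
print(total_pause_seconds(len(batches)))     # 5.0 seconds when fetching in batches
```

Each batch would then be passed to Entrez.efetch as id=",".join(batch), and the parsed result should contain one entry per paper under "PubmedArticle", so get_abstract would have to loop over that list instead of reading only index 0.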

Create a word cloud

Now that we have the strings to use for the analysis, it's time to create the word cloud.

First, load the library and download the dictionaries used for the analysis.

import nltk
from nltk import tokenize
from nltk import stem
from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("stopwords")

Next, the string is decomposed (tokenized) into words.

For example, if you give the string "I like an apple.", it will be decomposed into ["I", "like", "an", "apple", "."].

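words = tokenize.word_tokenize(text)
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words("english"))
filtered_words = [w for w in words if w not in stop_words]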

Here, we are also removing words that are not used in the analysis (stopwords).
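Note that the comparison above is case-sensitive, while the NLTK English stopword list is written in lowercase, so a capitalized word such as "The" at the start of a sentence is not removed. A minimal sketch of a case-insensitive filter follows. The function name remove_stopwords is made up for illustration, and the small hand-written set only stands in for stopwords.words("english").

```python
def remove_stopwords(tokens, stop_words):
    """Drop tokens whose lowercase form is a stopword; keep the original casing."""
    stop_words = {s.lower() for s in stop_words}
    return [t for t in tokens if t.lower() not in stop_words]


# Tiny hand-written stopword set, standing in for stopwords.words("english")
demo_stop_words = {"the", "of", "in", "is"}
demo_tokens = ["The", "risk", "of", "death", "is", "higher", "in", "COVID-19", "patients"]

print(remove_stopwords(demo_tokens, demo_stop_words))
# ['risk', 'death', 'higher', 'COVID-19', 'patients']
```

The original casing of the remaining words is kept, so abbreviations such as "COVID-19" still look the same in the word cloud.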

Next, we will lemmatize the words, that is, convert each word to its dictionary form. For example, if you give the word "apples", it will be converted to "apple".

lemmatizer = stem.WordNetLemmatizer()

lem_text1 = ""
for word in filtered_words:
    lem_text1 += lemmatizer.lemmatize(word) + " "

This completes the normalization of the word list used for analysis.

Finally, create a word cloud with the WordCloud library.

from wordcloud import WordCloud

wc1 = WordCloud(background_color="white", width=600, height=400, min_font_size=15)
wc1.generate(lem_text1)
wc1.to_file("wordcloud1.png")

The created word cloud is as follows.

In this figure, the larger a word is drawn, the more frequently it appears in the abstracts.

Since "covid-19" is included in the search keyword this time, the frequency of appearance of "COVID" and "CoV" is naturally high.

Exclude unnecessary words

Words such as "COVID" and "CoV" are bound to appear and carry little information, so I would like to exclude them from the analysis.

First, select the words you don't think you need from the current word cloud.

more_stopwords = [
    "patient",
    "study",
    "infection",
    "pandemic",
    "result",
    "coronavirus",
    "among",
    "outcome",
    "data",
    "may",
    "included"
]

Then remove these words from filtered_words and recreate the word cloud.

lem_text2 = ""
for word in filtered_words:
    tmp = lemmatizer.lemmatize(word)
    if tmp not in more_stopwords:
        if "COVID" not in tmp and "CoV" not in tmp:
            lem_text2 += tmp + " "

wc2 = WordCloud(background_color="white", width=600, height=400, min_font_size=15)
wc2.generate(lem_text2)
wc2.to_file("wordcloud2.png")

"COVID" and "CoV" appear in several spelling variations such as "COVID-19", so they cannot be excluded by exact match. Instead, any word that contains either string is excluded by partial match.
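The partial match above is case-sensitive, so a spelling such as "Covid" would still get through. Simply ignoring case for "CoV" would be risky, because ordinary words such as "recovery" also contain "cov". One possible way around this is a regular expression with word boundaries. This is only a sketch under my own assumptions: the pattern and the function name keep_word are not part of the original analysis.

```python
import re

# Hypothetical pattern: any word containing "covid", or "cov" as a standalone
# piece (as in "SARS-CoV-2"), ignoring case. The word boundaries (\b) keep
# ordinary words such as "recovery" from being excluded.
EXCLUDE_PATTERN = re.compile(r"covid|\bcov\b", re.IGNORECASE)


def keep_word(word, exact_stopwords):
    """Return False for exact stopwords and for words matching EXCLUDE_PATTERN."""
    if word in exact_stopwords:
        return False
    return EXCLUDE_PATTERN.search(word) is None


demo_words = ["COVID-19", "Covid", "SARS-CoV-2", "CoV", "recovery", "patient", "age"]
kept = [w for w in demo_words if keep_word(w, {"patient", "study"})]
print(kept)  # ['recovery', 'age']
```

In the loop above, the two conditions on tmp could then be replaced by a single call such as keep_word(tmp, more_stopwords).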

The result of excluding these words is as follows.

The words selected earlier appear to have been mostly excluded.

By repeating this process, you should eventually arrive at a word cloud that contains only the words you need.

Conclusion

In this article, we used biopython and WordCloud to analyze the frequency of use of words in PubMed's abstracts.

My impression from this analysis is that the accuracy of NLTK's tokenization and lemmatization was a little poor.

In tokenization, commas (,) and periods (.) were not removed. In lemmatization, "accompanies" was not converted to "accompany" and was output as it was. One likely cause is that WordNetLemmatizer.lemmatize treats every word as a noun unless a part of speech such as pos="v" is passed. Improving this accuracy is a task for the future.
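For the punctuation problem, one simple workaround is to drop the tokens that consist only of punctuation marks before lemmatizing. The following is a minimal sketch using only the standard library. The function name is made up for illustration, and it only handles ASCII punctuation, which is an assumption about the input.

```python
import string


def drop_punctuation_tokens(tokens):
    """Remove tokens made up only of ASCII punctuation (empty tokens are dropped too)."""
    punctuation = set(string.punctuation)
    return [t for t in tokens if not all(ch in punctuation for ch in t)]


demo_tokens = ["I", "like", "an", "apple", ".", ",", "COVID-19", "(", "n=5", ")"]
print(drop_punctuation_tokens(demo_tokens))
# ['I', 'like', 'an', 'apple', 'COVID-19', 'n=5']
```

Tokens that merely contain a punctuation character, such as "COVID-19", are kept, because only tokens made up entirely of punctuation are removed.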

In addition to the word cloud, there seem to be other analysis methods for the frequency of word appearance, so I would like to investigate and use these as well.
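As one simple example of such a method, the word frequencies can be counted directly with collections.Counter from the standard library. This is only a sketch: the function name top_words and the short demo string are made up for illustration, and in practice the input would be a string such as lem_text2.

```python
from collections import Counter


def top_words(text, n=3):
    """Count whitespace-separated words and return the n most frequent ones."""
    return Counter(text.split()).most_common(n)


demo_text = "age risk age mortality risk age"
print(top_words(demo_text))  # [('age', 3), ('risk', 2), ('mortality', 1)]
```

Unlike the word cloud, this gives the exact counts, which makes it easier to decide which words to add to more_stopwords.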

That's all.

[PYTHON] Analyze PubMed paper abstracts in word cloud
