Personal blog of Matthias M. Fischer


Tracking Trends in Biomedical Research Topics

Posted: 12th January 2022

Preface

In 2018, I read two blog posts, one by Jeremy Fox and one by Brian McGill, about "bandwagons" in ecological research, i.e. topics that many researchers jump onto because they are "hot" or "sexy" at a given point in time.

Inspired by these posts, I wrote a little Python script using the Biopython library to access the PubMed database of biomedical papers. I intended to use the script to track the occurrence of keywords in the biomedical literature over time, and thus to identify which topics are currently "hot" or have been in the past. However, after finishing the script I never really used it and quickly forgot about it.

Some days ago, however, I stumbled upon it again, gave it a little overhaul and finally started using it a bit. In this post, I want to introduce the script and share some findings I obtained by playing around with it.

The Script

The script is available in this GitHub repository. Its most important part is the function _number_by_year(query, year), which, as the name suggests, returns the number of publications in a given year that contain the keyword(s) in query.

To be more precise, the function reads as follows:

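from Bio import Entrez

def _number_by_year(query, year):
    # Search PubMed for the query, restricted to one publication year
    handle = Entrez.esearch(db='pubmed',
                            retmax='200000000',
                            retmode='xml',
                            term=query+" "+str(year)+"[pdat]")
    # Parse the XML response and release the connection
    results = Entrez.read(handle)
    handle.close()
    # The number of returned PubMed IDs is the number of matching papers
    return len(results["IdList"])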

What exactly is going on here? First, we use the function esearch() from the Entrez module of the Biopython library to query PubMed, a database of biomedical publications. We specify that we want the results in XML format and search for the keyword(s) contained in query. However, we limit the search to the publication year given in year by using the [pdat] search field tag.

This returns a handle, from which we can then read the results using the Entrez.read() function. The returned object contains, among some other small pieces of information, a list of PubMed IDs fulfilling our search criteria, whose length we return. In this way, we get the number of eligible publications without having to query detailed information (authors, title, keywords, abstract, affiliations, journal name, etc.) for every single publication. This saves a lot of time by greatly reducing bandwidth and computational load. (We also consume far fewer of NCBI's/PubMed's resources this way.)
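The same number can also be read directly from the search result instead of from the length of the ID list. The following is a minimal sketch, not the code from the repository: it assumes that the parsed esearch result has a "Count" entry holding the total number of matches, and the names build_term, count_from_result and number_by_year_count are made up for illustration. This may matter for very common keywords, since the server can limit how many IDs it returns per request, in which case the length of the list would undercount.

```python
def build_term(query, year):
    """Combine the keyword(s) with a publication-year restriction."""
    return f"{query} {year}[pdat]"


def count_from_result(result):
    """Read the total number of hits from a parsed esearch result."""
    return int(result["Count"])


def number_by_year_count(query, year, email):
    """Hypothetical variant of _number_by_year() that reads the reported total."""
    # Imported here so the two helpers above work without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks users to identify themselves
    handle = Entrez.esearch(db="pubmed",
                            retmax=0,  # we only need the total, not the IDs
                            retmode="xml",
                            term=build_term(query, year))
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return count_from_result(result)
```

A call such as number_by_year_count("cancer", 2020, email="you@example.org") needs network access; the e-mail address is a placeholder.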

Establishing a Baseline

It is well known that the number of research papers published each year increases steadily. Thus, simply checking whether the number of occurrences of a specific keyword increases over time does not suffice to decide whether the keyword is indeed becoming more popular. We therefore first establish a baseline of overall growth by tracking the number of publications containing very general words like "cancer" or "cell". Doing so, we quickly notice that the number of new publications indeed increases steadily by approximately 5% every year, nicely reflecting the ever-increasing overall number of published papers. In the following sections, we examine some specific keywords and compare their numbers to this baseline.
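To make the comparison against the baseline concrete, here is a small sketch of the two calculations involved: the average annual growth rate of a series of yearly counts, and a keyword's share relative to a baseline keyword. The function names and the numbers in the usage example are invented for illustration; they are not results from PubMed.

```python
def average_annual_growth(counts):
    """Compound average growth rate per year for consecutive yearly counts."""
    if len(counts) < 2 or counts[0] <= 0:
        raise ValueError("need at least two years and a positive first count")
    years_elapsed = len(counts) - 1
    return (counts[-1] / counts[0]) ** (1 / years_elapsed) - 1


def relative_share(keyword_counts, baseline_counts):
    """Yearly keyword counts divided by the baseline counts of the same years."""
    if len(keyword_counts) != len(baseline_counts):
        raise ValueError("both series must cover the same years")
    return [k / b for k, b in zip(keyword_counts, baseline_counts)]


# Made-up numbers, purely for illustration
baseline = [1000, 1050, 1102, 1158]   # grows by roughly 5% per year
keyword = [10, 12, 15, 19]            # grows faster than the baseline

print(f"baseline growth: {average_annual_growth(baseline):.1%}")
print(f"keyword growth:  {average_annual_growth(keyword):.1%}")
print([round(s, 4) for s in relative_share(keyword, baseline)])
```

A keyword gains popularity in relative terms only if its share rises, i.e. if it grows faster than the baseline.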

Examining some established Topics

Let's start by examining some topics / keywords that by now have been established for some time. The following figure shows the results of tracking six of them:

Let's look at these plots one after another:

Some more modern topics

Now for some more modern topics:

Let's go through these four subplots one by one again.

Conclusions

Here, I have presented a very simple Python tool, available in this GitHub repository, that allows tracking a keyword's popularity in biomedical research papers over time. The trends I have examined match my personal experiences and expectations nicely.

Interestingly, none of the examined fields seems to "die out", i.e. to show a continuously decreasing number of annual publications. Instead, at most we see convergence to an approximately constant number. However, given the rising number of papers published per year (~5% increase per year), this might just be the way a topic "dies" or starts to "die", as people shift their attention away from it.
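A quick back-of-the-envelope calculation supports this reading. Assuming, for illustration, that a topic's annual count stays exactly constant while the total output grows by 5% per year, its share of all papers shrinks by a factor of 1/1.05 every year. The helper names below are hypothetical.

```python
import math


def share_after(years, growth=0.05):
    """Remaining fraction of the initial share for a topic with constant counts."""
    return 1 / (1 + growth) ** years


def halving_time(growth=0.05):
    """Years until the share of a constant-count topic has halved."""
    return math.log(2) / math.log(1 + growth)


print(f"share left after 10 years: {share_after(10):.0%}")
print(f"share halves after about {halving_time():.1f} years")
```

Under these assumptions, the share drops to roughly 61% of its initial value within a decade and halves after about 14 years, even though the absolute count never falls.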

As seen with "model(l)ing", alternative spellings can be problematic and change the results. This applies even more to abbreviations with multiple meanings, such as "ALT", which might refer to either "alternative lengthening of telomeres" or "alanine transaminase". Thus, a bit of caution is advisable when defining the search query.
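One way to guard against this, sketched below, is to spell out the variants explicitly and join them with OR, and to search for the full phrase instead of an ambiguous abbreviation. The sketch relies on PubMed's Boolean OR operator and on double quotes for phrase searching; the helper name any_of is made up for illustration.

```python
def any_of(*terms):
    """Build a query matching any of the given terms; phrases are quoted."""
    if not terms:
        raise ValueError("at least one search term is required")
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


# Cover both spellings of the same keyword
print(any_of("modeling", "modelling"))
# Use the full phrase rather than the ambiguous abbreviation "ALT"
print(any_of("alternative lengthening of telomeres"))
```

The resulting string can be passed as the query argument of _number_by_year().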