Tracking Trends in Biomedical Research Topics
Posted: 12th January 2022
Preface
In 2018, I read two blog posts, one by Jeremy Fox and one by Brian McGill, which talk about "bandwagons" in ecological research, i.e. topics many researchers jump onto because they are "hot" or "sexy" at a given point in time.
Inspired by these posts, I wrote a little Python script using the Biopython library to access the pubmed database of biomedical papers. I intended to use the script to track the occurrence of keywords in the biomedical literature over time, and thus to identify which topics are currently "hot" or have been in the past. However, after finishing the script I never really used it and quickly forgot about it.
Some days ago, however, I stumbled upon it again, gave it a little overhaul and finally started using it a bit. In this post, I want to introduce the script itself, as well as share some findings I obtained by playing around with it.
The Script
The script is available in this github repository. Its most important part is the function _number_by_year(query, year), which, as the name already suggests, returns the number of publications in a given year containing the keyword(s) given in query.
To be more precise, the function reads as follows:
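from Bio import Entrez

def _number_by_year(query, year):
    # Search pubmed for the query, restricted to one publication year.
    handle = Entrez.esearch(db='pubmed',
                            retmax='200000000',
                            retmode='xml',
                            term=query+" "+str(year)+"[pdat]")
    results = Entrez.read(handle)
    handle.close()
    # Count the pubmed IDs contained in the search result.
    return len(results["IdList"])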
What exactly is going on here? First, we use the function esearch() from the Entrez module of the Biopython library to query the pubmed database, which is a database of biomedical publications. We specify that we want to get the results in the XML format, and search for the keyword(s) contained in query. The very large retmax value asks for all matching IDs to be returned rather than only the first few. However, we limit our search to the specific publication year contained in year by using the [pdat] search field tag.
This returns a handle, from which we can then read the results using the Entrez.read() function. The returned object contains, among some other small pieces of information, a list of pubmed IDs fulfilling our search criteria, whose length we return. In this way, we get the number of eligible publications without having to query detailed information (authors, title, keywords, abstract, affiliations, journal name, etc.) for every single publication. This saves a lot of time by greatly reducing bandwidth and computational load. (We also consume far fewer of NCBI's/pubmed's resources this way.)
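As a possible variation on the function above: the parsed ESearch result also carries a "Count" field with the total number of matches, so the ID list does not have to be transferred at all. The following is only a sketch under stated assumptions, not the code from the repository. The names build_term and count_by_year, the email parameter, and the explicit parentheses and AND in the search term are all introduced here for illustration.

```python
def build_term(query, year):
    """Combine a keyword query with a publication-year filter."""
    # Parentheses keep a query such as "a OR b" grouped before the year filter.
    return f"({query}) AND {year}[pdat]"


def count_by_year(query, year, email):
    """Return the total hit count that ESearch reports for one year."""
    # Imported here so that build_term() can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    # retmax=0 requests no IDs at all; only the summary fields are needed.
    handle = Entrez.esearch(db="pubmed", term=build_term(query, year), retmax=0)
    try:
        results = Entrez.read(handle)
    finally:
        handle.close()
    # "Count" is returned as a string, hence the conversion.
    return int(results["Count"])
```

A call could look like count_by_year("epigenetics", 2015, "you@example.org"), where the address is a placeholder. Reading the count directly would also sidestep any server-side limit on how many IDs a single request returns, should one apply.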
Establishing a Baseline
It is well known that the number of research papers published annually increases steadily. Thus, simply checking whether the number of occurrences of a specific keyword increases over time does not suffice to tell whether that keyword is indeed becoming more popular. We therefore first want to establish a baseline of overall growth. We do so here by tracking the number of publications containing very general words like "cancer" or "cell". Doing so, we quickly notice that the number of new publications indeed increases steadily by approximately 5% every year, nicely reflecting the ever-increasing number of papers published annually overall. In the following sections, we will examine some specific keywords and compare their numbers to this baseline.
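One way such a yearly growth rate could be estimated from a series of annual counts is a straight-line fit to the logarithm of the counts. The sketch below is an illustration only and not necessarily how the 5% figure was obtained; the function name annual_growth_rate and the example numbers are made up.

```python
import math


def annual_growth_rate(counts_by_year):
    """Estimate the average yearly growth rate from {year: count}.

    Fits a straight line to log(count) versus year by ordinary least
    squares; a slope s corresponds to a growth rate of exp(s) - 1.
    """
    years = sorted(y for y, c in counts_by_year.items() if c > 0)
    if len(years) < 2:
        raise ValueError("need at least two years with non-zero counts")
    logs = [math.log(counts_by_year[y]) for y in years]
    mean_year = sum(years) / len(years)
    mean_log = sum(logs) / len(logs)
    covariance = sum((y - mean_year) * (l - mean_log)
                     for y, l in zip(years, logs))
    variance = sum((y - mean_year) ** 2 for y in years)
    return math.exp(covariance / variance) - 1


# Made-up counts that grow by exactly 5% per year (not real pubmed data).
example = {2000 + i: 1000 * 1.05 ** i for i in range(10)}
print(f"{annual_growth_rate(example):.3f}")  # prints 0.050
```

Fitting in log space matches the semilog view used for the plots: exponential growth appears as a straight line, and its slope is what the fit recovers.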
Examining Some Established Topics
Let's start by examining some topics / keywords that by now have been established for some time. The following figure shows the results of tracking six of them:
Let's look at these plots one after another:
- Epigenetics started around the 90s, growing exponentially (straight line in the semilog plot) and way faster than the baseline until approximately 2015 or so, when it started to slow down. Currently, its growth is approximately equal to the baseline, indicating that the "boom phase" of epigenetics has indeed ended. I personally noticed this nicely during my undergrad studies. When I started in 2014, epigenetics was still very much a "hot topic"; over the subsequent years, however, I noticed people speaking less and less about it.
- Apoptosis shows a very similar, albeit shifted, trend with a short boom phase between the early and late 1990s, and stable growth at baseline rate afterwards.
- Oncogene boomed a bit before "apoptosis" did, starting in the early 1980s and converging to baseline growth around the late 80s / early 90s. Since then, it has been growing steadily at baseline pace; lately, however, the number of new annual publications has actually decreased a tiny bit. I really do wonder whether this is just noise, or whether it might hint at people starting to lose interest in the general concept of oncogenes. I look forward to seeing the numbers of the next five years or so.
- p53 started out in the early 80s and boomed until the late 90s; since then, however, the number of new annual publications has been growing more slowly than the baseline. Thus, the fraction of new papers containing this term is actually decreasing steadily.
- Telomerase, at some point seen as one of the most important targets for treating cancer, shows a similar, but more extreme trend. Having boomed throughout the 90s, the topic now shows a nearly constant number of publications per year. This might in part be due to researchers becoming less enthusiastic about targeting telomerase, as inhibitors have not proven to be as successful as initially hoped (due to other, alternative mechanisms of telomere lengthening that have been discovered to play a role in some cancers as well). The expected reduction in telomerase papers, however, has likely been mitigated by the ongoing research on some telomerase components such as dyskerin, which play a role in non-neoplastic disease.
- Bacteriophage, finally, shows a peculiar trend. Or rather the absence of any trend. The number of new bacteriophage-related publications has been pretty much constant since the 1970s. As somebody who worked on phages during his undergrad years and is still fascinated by them, I have to admit that I find this sad ;-). Then again, I guess the biomedical applications of bacteriophages are indeed rather limited, so it does not really surprise me that the phage field hasn't been flourishing all too much.
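The comparison against the baseline used in the list above can also be made explicit by dividing a keyword's yearly count by the baseline count of the same year. This is a minimal sketch of that idea, not part of the original script; relative_share is a hypothetical helper and all numbers are invented for illustration.

```python
def relative_share(keyword_counts, baseline_counts):
    """Return {year: keyword count / baseline count} for shared years."""
    shared = sorted(set(keyword_counts) & set(baseline_counts))
    return {year: keyword_counts[year] / baseline_counts[year]
            for year in shared if baseline_counts[year] > 0}


# Invented numbers for illustration only (not real pubmed counts):
# the keyword stays flat while the baseline grows by 5% per year.
baseline = {2010 + i: round(100000 * 1.05 ** i) for i in range(5)}
keyword = {year: 2000 for year in baseline}

shares = relative_share(keyword, baseline)
for year, share in shares.items():
    print(year, f"{share:.4f}")
```

In this toy case the share shrinks every year even though the absolute count is constant, which is the situation described for telomerase and bacteriophage.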
Some More Modern Topics
Now for some more modern topics:
Let's go through these four subplots one by one again.
- CRISPR. We probably all know about this one, right? Its boom did not surprise me at all. Interestingly, in the last few years, the tremendous growth of the CRISPR field has started to slow down, which matches my personal observations from conversations with colleagues. I am under the impression that the novelty effect of CRISPR has worn off by now, and people are now working on the (more boring) tasks of optimising methods and protocols, and using it as one tool among others.
- Organoid is an interesting one. It is currently in its boom stage, as expected; surprisingly, however, the keyword already showed up way earlier. I have checked some of the "organoid" papers from the 70s, finding that they used the word more in the sense of "organelle" or "cell component" or "cluster of cells," which should explain this observation.
- Coronavirus research shows a huge spike in 2020. No surprise there ;-). Interestingly, in 2021 the number of coronavirus-related papers has only grown approximately at baseline pace. I do wonder about the reasons for that. While I have some speculations about possible reasons, I'm not at all sure about them at this point in time. I'm looking forward to seeing how the numbers will develop over the next years and after the pandemic is over.
- Model(l)ing has been on the rise continuously, growing stably at a pace greater than baseline. Being a theoretical/mathematical biologist myself, it makes me immensely happy to see that our field has become, and continues to become, more quantitative in nature. This is, in my opinion, a very good trend from which the field will continue to benefit a lot.
Conclusions
Here, I have presented a very simple Python tool, available in this github repository, that allows one to track the development of a keyword's popularity in biomedical research papers over time. The trends I have examined match nicely with my personal experiences and expectations.
Interestingly, none of the examined fields seems to "die out", i.e. show a continuously decreasing number of annual publications. Instead, at most we see convergence to an approximately constant number. However, given the rising number of papers published per year (~5% increase per year), this might just be the way a topic "dies" or starts to "die", as people shift their attention away from it.
As seen with "model(l)ing", alternative spellings can be problematic and change the results. This is even more so the case with abbreviations that have multiple meanings such as "ALT", which might refer to either "alternative lengthening of telomeres", or to "alanine transaminase." Thus, a bit of caution is advisable when defining the search query.
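One way to handle spelling variants is to combine them with PubMed's OR operator, and to spell out an ambiguous abbreviation as a quoted phrase. The helper below is a small sketch of that idea; any_of is a hypothetical name and not part of the script in the repository.

```python
def any_of(*variants):
    """Join spelling variants into one OR-group, quoting multi-word phrases."""
    if not variants:
        raise ValueError("need at least one variant")
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return "(" + " OR ".join(quoted) + ")"


print(any_of("modeling", "modelling"))
# prints (modeling OR modelling)
print(any_of("alternative lengthening of telomeres"))
# prints ("alternative lengthening of telomeres")
```

The resulting string can be passed as the query in place of a single keyword. Whether the variants are pooled or tracked separately remains a choice that affects the counts.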