Predicting amount of funding the NIH awarded research grants (R01s)

The National Institutes of Health (NIH) awards funds to grants related to public health research. Comprehensive information about the grants it funds is publicly available and can be downloaded from here. Ultimately, we want to determine the likelihood that a grant receives a certain amount of money, given that the NIH decided to fund the proposal (the NIH is not required to publish data on rejected proposals). In today's highly competitive funding environment, this information could be relevant to those submitting grant proposals. Our analysis is limited to R01 grants (referred to simply as grants here), which are the major source of funding for most academic research labs.

Insights regarding which grants receive high or low levels of funding

Based on the analysis of NIH grant data, the following were found to be important predictors for total funds awarded to an R01 research grant:

Research topic: It is unsurprising that the research topic, as analyzed from grant text data, plays a major role in determining the amount of awarded funds. We identified 20 topics and counted the number of grants in each topic that fell in a funding quartile (see table below; counts were from a test subset of the full dataset). We highlight a few topics associated with high or low funding:
- Grants associated with child obesity/diabetes research were the most common highly funded research topic, with 1635 grants in the top funding quartile (above the 75th percentile) compared to 619 in the bottom quartile (below the 25th percentile).
- Although fewer total HIV grants were funded compared to child obesity, an HIV proposal is almost 4x as likely to be in the top quartile as in the bottom quartile (706:192).
- Surprisingly, cancer research proposals were almost 2.5x more likely to be funded in the bottom quartile (1119 cancer grants) than in the top quartile (459 grants).
- Also surprising given the hype around large-scale sequencing, genetics and genomics research had more grants funded in the bottom quartile than any other group.

research topic	<$319,102.5	$319,102.5-$371,250	$371,250-$464,929.5	>$464,929.5
bacteriology	308	300	437	244
cancer cell biology	766	829	799	313
cancer research	1072	692	387	459
cardiac research	118	244	493	269
cell signaling	380	373	318	123
child obesity/diabetes	521	411	550	1635
clinical research	1078	1021	1016	1613
drug development	951	555	414	249
gene expression	479	447	455	369
genetics and genomics	550	354	265	388
hiv	159	133	284	706
human disease mechanisms	1404	1655	1449	643
kidney research and imaging	341	361	283	432
liver research	118	177	114	70
lung research	209	374	497	295
neuroscience	567	978	909	603
parasitology (malaria)	69	97	130	88
sleep/ageing	320	237	226	315
substance abuse/addiction	247	216	197	469
virology	266	330	491	405

Study section: Grants are reviewed and funded by one of 250 different study sections, depending on the research topic, and our trained model identified the study section as the most important predictor of funds. For example, almost 4800 grants funded by the NIH's special emphasis panels were in the top funding quartile, compared to only ~2500 in the bottom quartile. A grant that is reviewed and accepted by one of these panels is thus considerably more likely to be highly funded than poorly funded.
Organization: The organization/institute associated with the grant was also predictive of the proposal's funding quartile; however, given the number of unique organizations (1067), it is difficult to resolve the contribution of any individual organization.

The notebooks containing the full analyses are as follows:

Downloading and cleaning grant and principal investigator data
Scraping grant information
Initial exploration of R01 information. We only have abstract information for grants from 2008 and later and will therefore only analyze grants from these years.
Analysis of grant abstract text data and predictive modeling

Unsupervised clustering of abstract text

Term frequency-inverse document frequency (tf-idf) is a common method of assessing the importance of a word in a document corpus. It weights the frequency of a given term in an individual document with the inverse frequency of that term in the entire corpus. The general idea is that a term that appears in all the documents is unlikely to be useful for identifying individual documents/subjects and is therefore unimportant. A term that appears frequently in a small number of documents, however, is likely very important in distinguishing those documents. We will use tf-idf vectorization to transform our abstract text and perform K-Means clustering. To determine the number of clusters to fit, we can plot the sum of squared errors vs. the number of clusters, commonly known as an elbow plot.
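The elbow-plot inputs described above can be sketched as follows. This is a minimal illustration using scikit-learn, not the notebook's actual code: the toy abstracts are invented stand-ins, no stemming is applied, and the range of cluster counts is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for grant abstracts (hypothetical; the real corpus is far larger).
abstracts = [
    "hiv infection and antiretroviral therapy",
    "hiv transmission prevention in adolescents",
    "breast cancer tumor cell signaling",
    "prostate cancer tumor therapy target",
    "sleep and cognitive decline in older adults",
    "age related memory and sleep disorders",
    "alcohol and tobacco addiction treatment",
    "cocaine abuse and addiction behavior",
]

# Each row is one abstract; each column holds the tf-idf weight of one term.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(abstracts)

# Fit K-Means for several cluster counts and record the sum of squared
# distances to the nearest centroid (scikit-learn exposes this as inertia_).
cluster_counts = [1, 2, 3, 4, 5, 6]
sse = []
for k in cluster_counts:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)

# Plotting sse against cluster_counts gives the elbow plot.
for k, err in zip(cluster_counts, sse):
    print(f"k={k}: SSE={err:.3f}")
```

The "elbow" is the point where adding clusters stops reducing the error substantially; on a toy corpus this small the curve is not very informative, so the sketch only shows how the values are produced.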

Although the elbow plot suggests that using 40-50 clusters may lead to the best separation of our data, that is a large number of clusters. As we will want to use distance from cluster centroid as features in later predictive modeling, a large number of clusters will lead to a high-dimensional feature space. We therefore initially fit 20 clusters and isolated the top 10 words closest to each cluster centroid, listed below.

cluster 0	cluster 1	cluster 2	cluster 3	cluster 4	cluster 5	cluster 6	cluster 7	cluster 8	cluster 9	cluster 10	cluster 11	cluster 12	cluster 13	cluster 14	cluster 15	cluster 16	cluster 17	cluster 18	cluster 19
alcohol	receptor	brain	hiv	liver	cell	age	patient	children	infect	parasit	heart	gene	imag	lung	diseas	cancer	protein	dna	pathogen
smoke	tumor	disord	infect	diseas	stem	sleep	treatment	obes	virus	iron	failur	express	kidney	injuri	mechan	breast	drug	genet	bacteri
tobacco	signal	neuron	1	hepat	tumor	relat	pain	intervent	vaccin	malaria	cardiac	diseas	method	inflamm	human	tumor	structur	genom	bacteria
cocain	activ	function	aid	fatti	cancer	cognit	improv	risk	immun	drug	diseas	human	develop	diseas	regul	prostat	diseas	human	antibiot
addict	target	memori	risk	alcohol	immun	diseas	effect	diabet	viral	infect	caus	regul	diseas	inflammatori	signal	cell	develop	diseas	infect
effect	regul	neural	intervent	cancer	diseas	declin	clinic	behavior	cell	diseas	death	mutat	patient	asthma	cell	therapi	target	sequenc	resist
abus	cell	circuit	drug	cirrhosi	regul	older	care	adolesc	host	caus	blood	develop	mri	pulmonari	develop	target	function	repair	host
drink	mechan	develop	prevent	metabol	develop	function	develop	effect	respons	develop	attack	genet	clinic	mechan	function	develop	biolog	mutat	diseas
behavior	therapeut	mechan	transmiss	caus	mechan	increas	therapi	prevent	human	human	lead	identifi	detect	airway	molecular	patient	enzym	cancer	immun
treatment	role	cognit	antiretrovir	develop	function	popul	diseas	social	develop	host	mechan	caus	improv	acut	pathway	treatment	membran	chromosom	human

We can immediately see that K-Means identified different research subjects, such as research regarding addiction and addictive substances (cluster 0) and HIV/AIDS research (cluster 3). This is confirmation that our unsupervised clustering was able to identify structure within our abstract data.

We can visualize these data using t-SNE, a non-linear dimensionality reduction technique that can map high-dimensional clusters to two-dimensional space. We first reduced the number of dimensions using PCA before running t-SNE.

We can see pretty distinct clusters (although some are broken into multiple clusters), again suggesting our K-Means clustering did a relatively good job separating grants based on abstract text.
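The PCA-then-t-SNE step can be sketched as below. The input here is synthetic dense data generated with `make_blobs`, standing in for the document vectors; the number of PCA components and the perplexity are arbitrary choices for illustration, not the values used in the notebooks.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for document vectors: 90 points, 50 dimensions, 3 groups.
X, labels = make_blobs(n_samples=90, n_features=50, centers=3, random_state=0)

# Step 1: linear reduction with PCA to cut noise and speed up t-SNE.
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)

# Step 2: non-linear embedding into two dimensions for plotting.
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X_reduced)

print(embedding.shape)  # (90, 2)
```

Each row of `embedding` is an (x, y) position that can be scattered and colored by cluster label. For a sparse tf-idf matrix, `TruncatedSVD` is a frequently used substitute for the PCA step because it operates on sparse input directly.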

Predicting funding groups based on grant abstracts

We will first plot the distribution of funds awarded to grants in our dataset.

Other than a small number of outliers, the majority of grants received less than $1,000,000. Let's re-plot the distribution of funds for grants that received less than $1,000,000.

We see that most grants receive similar amounts of funding. For our first model, we will attempt to predict which funding group quartile a grant falls into based on tf-idf of the abstracts. Grants will be divided into funding groups as follows:

low: grants awarded less than $319,102.5 (below the 25th percentile)
med-low: grants awarded $319,102.5-$371,250 (25th-50th percentile)
med-high: grants awarded $371,250-$464,929.5 (50th-75th percentile)
high: grants awarded more than $464,929.5 (above the 75th percentile)

Dividing the grants in this manner ensures each class has the same number of observations and thus we will not have to worry about class imbalance. Although we will lose some information by grouping grants in this manner rather than running a regression, we want to make predictions based on grant abstracts and multinomial Naive Bayes classifiers are good default models that can handle high-dimensional text data. Furthermore, because most grant funds are clustered together, we will not lose too much information.
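Splitting a numeric column into four equally sized groups can be done with `pandas.qcut`, as in this minimal sketch. The award amounts are made-up values, not taken from the dataset.

```python
import pandas as pd

# Hypothetical award amounts in dollars (invented for illustration).
funds = pd.Series(
    [250_000, 300_000, 330_000, 350_000, 380_000, 420_000, 500_000, 900_000]
)

labels = ["low", "med-low", "med-high", "high"]

# qcut places bin edges at the 25th, 50th and 75th percentiles,
# so each group receives (approximately) the same number of grants.
funding_group = pd.qcut(funds, q=4, labels=labels)

print(funding_group.value_counts().sort_index())
```

Because the edges are computed from the data, the classes come out balanced by construction, which is the property the paragraph above relies on.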

The confusion matrix resulting from fitting a multinomial Naive Bayes classifier trained on our tf-idf abstract data is shown below:

Given that this is a naive model trained only on abstract text data, our predictions are relatively good. We see that the best predictions by far are for highly funded grants, but the model predicts the true class label for the other funding groups at higher frequencies than the incorrect classes. Furthermore, when our naive classifier mis-classified grants, it tended to predict the adjacent funding group at higher frequencies than a more extreme funding group.
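A tf-idf plus multinomial Naive Bayes classifier and its confusion matrix can be sketched as follows. The texts and their funding-group labels are invented purely to make the example run; they are not drawn from the grant data, and the tiny sample says nothing about real accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

groups = ["low", "med-low", "med-high", "high"]

# Hypothetical (abstract, funding group) pairs, invented for illustration only.
train = [
    ("drug target enzyme structure", "low"),
    ("protein membrane drug development", "low"),
    ("cell signaling receptor pathway", "med-low"),
    ("receptor regulation tumor signaling", "med-low"),
    ("lung injury airway inflammation", "med-high"),
    ("pulmonary asthma inflammation mechanism", "med-high"),
    ("hiv prevention intervention adolescents", "high"),
    ("children obesity diabetes intervention", "high"),
]
test = [
    ("enzyme drug structure", "low"),
    ("receptor signaling pathway", "med-low"),
    ("airway inflammation asthma", "med-high"),
    ("obesity intervention children", "high"),
]

train_text, train_y = map(list, zip(*train))
test_text, test_y = map(list, zip(*test))

# Vectorize the text and fit the classifier in a single pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_text, train_y)
pred_y = model.predict(test_text)

# Rows are true groups, columns are predicted groups, ordered as in `groups`.
cm = confusion_matrix(test_y, pred_y, labels=groups)
print(cm)
```

With the groups ordered from low to high, entries just off the diagonal correspond to predictions of an adjacent funding group, which is how the pattern described above can be read from the matrix.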

Regression analysis of grant data

We will now predict funding amount using regression analysis and will use the following features:

Distance: Due to dimensionality concerns we will not directly use our tf-idf matrix to perform regression analysis. Instead, we will use the distance of each grant from the center of the 20 clusters identified by K-Means.
Application type: The type of grant application (new, renewal, etc.)
Funding year
Study section: NIH grant proposals are reviewed by different study sections, depending on the topic of the proposal
Support year: How many years the grant has been funded by NIH
Number of associated investigators (PIs)
Organization: Organizations are identified via a DUNS number
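The distance features in the list above can be derived from a fitted K-Means model, since scikit-learn's `KMeans.transform` returns each sample's Euclidean distance to every centroid. The sketch below assumes toy abstracts, 3 clusters instead of 20, and a made-up `support_year` column to show how the distances sit alongside other features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for grant abstracts (hypothetical).
abstracts = [
    "hiv infection and antiretroviral therapy",
    "hiv transmission prevention in adolescents",
    "breast cancer tumor cell signaling",
    "prostate cancer tumor therapy target",
    "sleep and cognitive decline in older adults",
    "age related memory and sleep disorders",
    "alcohol and tobacco addiction treatment",
    "cocaine abuse and addiction behavior",
]

X_text = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_text)

# One column per cluster: Euclidean distance from each abstract to that centroid.
distances = km.transform(X_text)

# Hypothetical non-text feature, invented for illustration.
support_year = np.array([1, 2, 1, 3, 5, 2, 4, 1])

# Low-dimensional feature matrix: 3 distance columns plus 1 extra column.
features = np.column_stack([distances, support_year])
print(features.shape)  # (8, 4)
```

This replaces a vocabulary-sized tf-idf matrix with one column per cluster, which is the dimensionality reduction the Distance item describes.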

First we fit a random forest regressor to predict funds based on the features listed above. We then convert the predicted funds to funding groups to compare with the multinomial Naive Bayes and plot the confusion matrix.

The random forest trained on the listed features performs better than the Naive Bayes trained on text data, with a larger number of true positives and fewer false positives across all classes.
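The fit-then-bin procedure (predict dollar amounts with a random forest, then convert them to funding groups for comparison) can be sketched as below. The features and target are synthetic, and only the three group boundaries are taken from the text above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for the real features (all hypothetical).
centroid_dist = rng.random((n, 3))          # distances to 3 cluster centroids
support_year = rng.integers(1, 6, size=n)   # years of prior support
n_pis = rng.integers(1, 4, size=n)          # number of investigators
X = np.column_stack([centroid_dist, support_year, n_pis])

# Synthetic funding amounts in dollars, loosely tied to two of the features.
y = 300_000 + 150_000 * centroid_dist[:, 0] + 20_000 * n_pis + rng.normal(0, 10_000, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = forest.predict(X_test)

# Funding-group boundaries quoted in the text above.
edges = [319_102.5, 371_250, 464_929.5]
group_names = np.array(["low", "med-low", "med-high", "high"])

# np.digitize returns 0..3 depending on which interval each amount falls into.
pred_group = group_names[np.digitize(pred, edges)]
true_group = group_names[np.digitize(y_test, edges)]

cm = confusion_matrix(true_group, pred_group, labels=list(group_names))
print(cm)
```

Binning the regression output this way puts both models on the same four-class footing, so their confusion matrices can be compared directly.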

Let's take a look at the features the random forest used to split the data to make predictions.

We can see the most important feature the random forest model used for splits was the grant study section, followed by what organization (DUNS number) the grant was associated with and distance from the centroids represented by genetics and genomics and child obesity/diabetes.
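Feature importances of this kind can be read from a fitted scikit-learn forest through its `feature_importances_` attribute. In the sketch below the feature names and the data are hypothetical, categorical features are represented as plain integer codes for brevity, and the target is constructed so that the first feature dominates.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Hypothetical feature names, chosen to echo the features discussed above.
feature_names = ["study_section_code", "org_code", "dist_cluster_a", "support_year"]

X = np.column_stack([
    rng.integers(0, 10, size=n),   # study section as an integer code
    rng.integers(0, 50, size=n),   # organization as an integer code
    rng.random(n),                 # distance to one cluster centroid
    rng.integers(1, 6, size=n),    # support year
])

# Synthetic target driven almost entirely by the first column.
y = 300_000 + 40_000 * X[:, 0] + rng.normal(0, 5_000, size=n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; larger values mean the feature reduced error more in splits.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

These scores indicate how much each feature contributed to the splits, not the direction of its effect on funding.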

Conclusions

Although the abstract alone can lead to better-than-random predictions of how much funding a grant was awarded, combining abstract information with other data about the grant results in relatively good predictive power. Our final model was a random forest trained on these data. Although we can determine which features the random forest used to split the data, and therefore which variables were deemed important, it is more difficult to determine how each feature relates to the response. We do know that the study section the grant was assigned to, as well as the associated organization, seem to have a large effect on how much funding the grant is ultimately awarded. Furthermore, the strength of a grant's assignment to a certain research topic, as measured by Euclidean distance from K-Means cluster centroids, was also an important predictor of funding, particularly for the child obesity/diabetes and genetics/genomics clusters.

Our model is based on only a static snapshot of NIH grants, specifically grants funded between 2008 and 2016, and would need to be re-trained as more grants are funded. This is particularly true of the unsupervised clustering we performed; in fact, we could possibly get better predictions by fitting more clusters, although the increase in dimensionality may make model training more difficult.

Although grant information is available from as early as 1985, as of 2017/05/09, when the data were downloaded, abstracts were only available from 2008 and later. As of 2017/06, it seems the NIH may have updated these earlier grants to include abstracts and to fix DUNS numbers. It would be interesting to investigate these earlier grants and see if grant funding and abstracts have changed over time.

yuwie10/nih-awards
