Is it possible to measure the quality of research applications in a reliable way?

Most researchers have told, or at least heard, stories of the following type: "We got five points on our research application last year, then we developed it on the basis of the reviews we received, and this year we got a four. That's really strange…" The point of this short story is that the quality assessments assigned to applications by research councils, foundations and the like tend to appear far less systematic than expected.

This blog post will therefore deal with the assessment of research applications and whether it is possible to assess the quality of such applications reliably. It may seem strange to ask the question at all: why would such assessments otherwise be made? And the assessments have far-reaching financial consequences, since in many cases a lot of money is distributed. Does such an activity not rest on a secure foundation? A basic requirement for an assessment is that it is reliable, that is, the grade given to an application should not be arbitrary; independent experts are expected to assess the same application in largely the same way. What does the research say about researchers' ability to assess applications in a uniform manner?


Reliability in the assessment of research applications

Research on the assessment of research applications provides an unusually coherent picture with regard to the ability to achieve reliability in ratings. After analyzing applications to the Australian Research Council in a large project, Marsh et al. (2008) conclude: "Peer reviews lacked reliability". Interestingly, the outcome was no better for applications in the natural sciences than in the social sciences and the humanities. In a more recent study mimicking the assessment procedure of the National Institutes of Health in the United States, the authors draw the following almost devastating conclusions:

”We examined 43 individual reviewers' ratings and written critiques of the same group of 25 NIH grant applications. Results showed no agreement among reviewers regarding the quality of the applications in either their qualitative or quantitative evaluations. Although all reviewers received the same instructions on how to rate applications and format their written critiques, we also found no agreement in how reviewers "translated" a given number of strengths and weaknesses into a numeric rating.” (Pier et al., 2018)

It seems to be a consistent result in studies of independent assessments of research applications that interrater reliability is at embarrassingly low levels, embarrassing at least for those who claim that quality can be assessed in this way. Note that I am only talking about reliability, that is, the ability to make similar judgments, and not the more advanced and complex question of validity, which concerns whether it really is quality that is being measured. However, as we know, reliability is a necessary prerequisite for validity, which is why I will confine the discussion to reliability.

One response to the fact that independent judgments have low reliability is to argue that panels of experts arrive at better assessments of scientific applications than individual assessors do. The idea is that when all assessors jointly bring their perspectives to bear on an application, the final assessment will be better than if each one assesses from their own perspective. I have myself been quite skeptical of that kind of argument, since it seems more like a legitimization of a decision-making process than something built on evidence. Such arguments are to be expected, because it is in the interest of many to show that assessments are made through an exact process.

However, I want to warn against arguments that rest only on trust in processes and are not substantiated by empirical facts. Interestingly, I have found a study by Fogelholm et al. (2012) that examined whether discussion in group panels improves the reliability of assessments of research applications. Their conclusion was: ”This indicates that panel discussions per se did not improve the reliability of the evaluation. These quantitative and experimental data support the conclusion by Obrecht et al., who based their findings on mainly qualitative data”. In fact, Fogelholm and his collaborators recommend against having panels at all, because they are costly without contributing to better reliability.

It should be noted that there seem to be some candidates for how reliability could be increased, of which the most promising, at least for raising reliability somewhat, seems to be to use several independent assessors. Another proposal is that researchers should only assess applications in areas they really master, which is not the case in, for example, the educational sciences, where assessors encounter applications in areas they have little knowledge of. It is also the case that reliability increases when there are many really bad applications. However, the current funding system has meant that universities and colleges, at least in Sweden, arrange workshops and similar activities aimed at writing successful research applications, which reduces the number of substandard applications.
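The claim that several independent assessors raise reliability has a classical-test-theory basis: the Spearman-Brown prophecy formula predicts the reliability of an average of n independent raters from the reliability of a single rater. The figures below are illustrative assumptions, not values from the cited studies, but they show why averaging helps only slowly when single-rater reliability is low.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Predicted reliability of the mean of n independent raters' scores,
    per the Spearman-Brown prophecy formula: n*r / (1 + (n-1)*r)."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)

# Assuming a (hypothetical) single-reviewer reliability of 0.2,
# even four independent reviewers only reach 0.5 on average.
for n in (1, 2, 4, 8):
    print(n, round(spearman_brown(0.2, n), 2))
```

The formula also makes the cost trade-off explicit: doubling the number of reviewers never doubles reliability, which is one reason funders hesitate to add assessors despite the reliability problem.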



Thus, research indicates that assessments of applications have a very low level of reliability and thus hardly any validity. Further, panel discussions did not seem to increase reliability when examined systematically. Perhaps I should point out that the reliability (and validity) of assessments of research applications is not my research area, and it may be that there is some study I have missed. I hope that readers who know of any such study will get in touch with me. Moreover, new studies may appear that put things in a new light.

But what conclusions can be drawn if the pattern I have found is correct? One obvious conclusion is that researchers who are surprised by the grading of applications, as in the opening example, have no reason to be surprised. Different assessments of the same application seem to be the rule rather than the exception. More generally, the researchers, and there are quite a few, who believe that gradings of research applications have an objective character can abandon this idea. Many researchers, including myself, are convinced that we can assess the quality of an application. The facts, however, point to a need for greater humility on this issue.

A second conclusion concerns the importance of discussing how research funds should be distributed. Professional assessments are often used because no other method has proven better. If it now turns out that the emperor is naked, we cannot pretend not to see it, and we should therefore seriously discuss how many resources should be devoted to peer review of applications.

Thirdly, the outcome may not be so surprising on closer reflection. Researchers simply have very different opinions about what the most urgent research is.

Fourth, and finally, research applications are not the only area where peer review is conducted. There are many reasons to discuss the possibilities and limitations of such processes in other contexts as well.


Fogelholm, M. et al. (2012) Panel discussion does not improve reliability of peer review for medical research grant proposals. Journal of Clinical Epidemiology, 65, 47-52.

Marsh, H., Jayasinghe, U. and Bond, N. (2008) Improving the Peer-Review Process for Grant Applications: Reliability, Validity, Bias, and Generalizability. American Psychologist, 63 (3), 160-168.

Pier, E. et al. (2018) Low agreement among reviewers evaluating the same NIH grant applications. Proceedings of the National Academy of Sciences of the United States of America, 115 (12), 2952-2957.


