To mark this year’s Peer Review Week Ed Gerstner, Director, Research Environment Alliances, writes on the importance of training when it comes to research integrity and open data.
The question of whether jelly beans cause acne, posed in a comic by XKCD’s Randall Munroe, illustrates one of the reasons why open data is essential to improving the integrity of research.
The story goes like this. Someone hears that jelly beans cause acne and demands that scientists investigate. In my telling of this story, which I’ve been giving to researchers around the world for the better part of the last decade, twenty different labs seek to answer this question one colour at a time. One lab looks for a link between purple jelly beans and acne. Another looks for a link between yellow jelly beans and acne. Still another, red jelly beans. And so on.
Most of the labs find no correlation between the consumption of their particular colour of jelly bean and acne, except for one, which observes a correlation between green jelly beans and acne with a confidence level of 95%.
In an ideal world, the researchers in the green jelly bean lab would use this unexpected result as the starting point for further research. In a world where their careers depend on having a prolific publication record, they would be better off seeking to publish without delay. And in Munroe’s telling, the result will inevitably end up making the news.
Although seemingly fanciful, variants of this story are played out again and again in real life. If you run any experiment enough times, you are likely to get a seemingly extraordinary result. Of course, it’s only extraordinary if you ignore all the other results. But what if you aren’t aware of all the other results, because the experiments that found nothing were conducted in labs other than your own?
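The arithmetic behind this is standard statistics rather than anything stated in the article, but it makes the point concrete: if each of twenty labs runs an independent test at a 5% false-positive rate (the flip side of 95% confidence), the chance that at least one lab sees a "significant" result by pure chance is about 64%. A minimal sketch, using only Python's standard library:

```python
import random

ALPHA = 0.05      # per-test false-positive rate (95% confidence)
N_COLOURS = 20    # twenty labs, one jelly bean colour each

# Analytic probability that at least one lab reports a
# "significant" link purely by chance, assuming no real effect.
p_at_least_one = 1 - (1 - ALPHA) ** N_COLOURS
print(f"P(at least one false positive) = {p_at_least_one:.2f}")

# Monte Carlo check: simulate many rounds of 20 null experiments,
# counting rounds where any lab crosses the significance threshold.
random.seed(0)
rounds = 100_000
hits = sum(
    any(random.random() < ALPHA for _ in range(N_COLOURS))
    for _ in range(rounds)
)
print(f"Simulated frequency            = {hits / rounds:.2f}")
```

Both numbers come out near 0.64: a green-jelly-bean headline is the expected outcome of twenty tests, not evidence of anything, which is exactly why the nineteen null results matter.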
Some argue that a solution to this problem is for authors to publish their negative results. I don’t think so. The reason this doesn’t happen isn’t a shortage of journals willing to publish them; there are plenty. The reason, I think, is simply that most researchers don’t want to spend their valuable time writing negative results papers, and most readers don’t want to spend their valuable time reading them. There are exceptions, such as negative results that overturn current understanding.
Individually, such negative results are unexceptional; in aggregate, they would let us scrutinize extraordinary claims. Without some way of making them available, we will continue to see spurious positives hit the headlines.
The answer to this, I believe, is open data. We don’t need more negative result papers. But we still need the data from those experiments that return negative results.
And there are myriad other benefits to the widespread sharing of research data.
This won’t be news to many, and it is one reason the US Office of Science and Technology Policy (OSTP) recently announced that, from the end of 2025, all relevant data from federally funded research must be made openly available.
However, it’s not enough for researchers to simply upload their data to a repository. In order for others to be able to even discover the existence of those data, they need to be accompanied by appropriate metadata.
In areas of research where open data sharing has been common practice for many years, such as genetics and proteomics, this may not be a problem. But most researchers don’t know how to generate the metadata needed to make their data findable, accessible, interoperable, and reusable (FAIR, for short) for their colleagues.
Mark Musen, Director of the Stanford Center for Biomedical Informatics Research, argues that an important step to making data FAIR is for researchers from each community to come together to develop standards for the metadata they need for the datasets they generate to be useful to others in that community.
That would be a good start. But an even more pressing need is training.
In 2019, Nature hosted a meeting in Melbourne with stakeholders from all parts of the Australian research community to discuss issues around research integrity. Acting on a key recommendation of that meeting, we partnered with the Australian Academy of Science to survey the level of understanding and training provided. One of the most notable results concerned what researchers felt was lacking in their training: eight of the top ten most common responses spoke to the need for more training around data, including curation, long-term storage and management, understanding and ensuring compliance with policies on access, ownership, sharing and re-use, and associated metadata.
This echoes results from other surveys, such as the 2020 State of Open Data report, which found that 49% of researchers surveyed said they would find it difficult, without further training, to develop a practical data management plan (an increasingly common requirement by funders worldwide).
The OSTP mandate on sharing data represents a real opportunity for improving the rigour, impact, and integrity of research. It also has the potential to be a significant burden on already stretched researchers. There is still time before the mandate comes into force.
The time to give researchers the training they need to embrace this opportunity is now.
About the Author
Ed Gerstner is Director, Research Environment Alliances at Springer Nature.