By Christopher Eaker (Data Curation Librarian, University of Tennessee Libraries)
Against the Grain Vol. 33 No. 1
Primary research data, the data collected directly as part of the research process, is increasingly being included as an essential part of the article publication process. Over the years, research funders and journal publishers have been asking researchers to take better care of their data; as a result, datasets have come to be seen as valuable assets that must be preserved and shared rather than by-products of the research process (Uhlir, 2010). Much data is produced in research projects, and including these datasets as high-class counterparts to the articles they support helps to “shed light” on datasets that in the past were part of the dark “long tail” of data (Heidorn, 2008). Reasons to make data publicly available include discovering new results from existing data, reproducing and verifying the results of research, and meeting funder or journal requirements (Borgman, 2012). Most federal funders and many journal publishers require the primary data to be made public at the same time as the article (Briney et al, 2017). Authors and researchers have struggled to meet this demand for a number of reasons, not the least of which is the additional effort involved in preparing a dataset for archiving and publication. For years, many university libraries have been building services to aid researchers in preparing their datasets for archiving and preservation. This process is called data curation and has been formalized by the Data Curation Network (DCN) using the acronym C.U.R.A.T.E.D. (Johnston et al, 2018). These steps are explained below.
The University of Illinois School of Information Sciences defines data curation as “the active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education” (n.d.). The keywords here are active and ongoing. As mentioned, datasets do not automatically prepare themselves for archiving and publication; they must be actively prepared by a data curator. Further, they must be assessed in an ongoing manner to ensure they are available over the long term. Curated data is better suited for publication alongside articles since it is better prepared. That extra preparation directly and positively impacts its discoverability and its usability. Additional tangential benefits include a higher citation count for the articles that include the data (Christensen et al, 2019; Piwowar & Vision, 2013). However, even though journal publishers are increasingly including data availability policies in their author instructions, additional citations do not necessarily follow without enforcement of these policies (Christensen et al, 2019).
While many publishers have data availability policies, some are taking a hands-off approach and trusting the authors to submit their data to a place where others can access it. Others are taking a stricter stance. The publisher of Science has stated its data availability policy with strong wording: “Science Journals generally require all data underlying the results in published papers to be publicly and immediately available,” and “Citations to unpublished data…cannot be used to support significant claims in the paper” (Science Editorial Board, n.d.).
As more data is made openly accessible as a part of journal articles or federal funder requirements, the importance of data curation cannot be overemphasized. Data is not intrinsically useful. Furthermore, datasets do not simply become useful because they are publicly available. Data is useful only insofar as it meets the needs of the user. Likewise, more data does not mean more value (Binggeser, 2017). Data is of the highest value for those who collected it. Others who were not involved in the data collection and analysis efforts can find data less useful for their needs, especially if the data is not properly curated. Including as supplemental information a dataset that has not been properly prepared for public use reduces the usefulness of the data. Data must be cleaned and prepared properly for it to be useful. And this process does not happen by accident; it must be purposely conducted by someone trained in properly curating a dataset for public use (Johnston et al, 2018).
It is difficult to say how much value other user groups would place on data. Data that has no value beyond the data creator today may be far more useful in the future. Palmer, Weber, and Cragin (2011) identified three areas of assessment to determine the value of a publicly available dataset: preservation readiness, potential user communities, and fit for purpose. Curating a dataset increases its preservation readiness, thereby increasing its fit for purpose within a specific community and increasing its usability by other potential user communities. Data that is well prepared has a higher value than data that is not well prepared. Curating the dataset increases the chances it can be used as secondary research data by user communities outside of the original. As datasets get reused by more and more user communities, their value increases (Uhlir, 2010).
What value does the curation process provide for data? The data curation steps formalized by the DCN in the C.U.R.A.T.E.D. acronym include the following: Check (the files for completeness and viability), Understand (the contents), Request (additional information), Augment (metadata), Transform (to open formats), Evaluate (for FAIRness), and Document (the curation process) (Johnston et al, 2018).
When checking the files for completeness, data curators must ensure that all necessary files, including metadata records, are included in the dataset prior to publishing. They must also check viability by ensuring that all files are complete within themselves, contain accurate information, and load in appropriate software packages properly. If files are missing or corrupt, their usefulness to others suffers.
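The completeness and viability checks described above can be partly automated. The following is a minimal sketch in Python, not a DCN tool: the function name check_dataset, the list of expected files, and the rule that every CSV row must match the header width are all assumptions made for illustration. A real curation workflow would also open each file in its intended software and review the contents by hand.

```python
import csv
import tempfile
from pathlib import Path


def check_dataset(folder, expected_files):
    """Return a list of problems found in a dataset folder (hypothetical helper)."""
    problems = []
    for name in expected_files:
        path = Path(folder) / name
        if not path.is_file():
            # Completeness: the file was promised but is not in the package.
            problems.append(f"missing: {name}")
        elif path.stat().st_size == 0:
            # Viability: an empty file cannot be reused.
            problems.append(f"empty: {name}")
        elif path.suffix.lower() == ".csv":
            # Viability: every row should have as many fields as the header.
            with path.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.reader(handle))
            if rows and any(len(row) != len(rows[0]) for row in rows[1:]):
                problems.append(f"ragged rows: {name}")
    return problems


# Illustrative run on a throwaway folder with one damaged and one missing file.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "README.txt").write_text("Survey data, 2020\n", encoding="utf-8")
    Path(folder, "data.csv").write_text("id,value\n1,3.2\n2\n", encoding="utf-8")
    report = check_dataset(folder, ["README.txt", "data.csv", "codebook.csv"])

print(report)
```

In this example the report flags data.csv for a row with too few fields and codebook.csv as missing, which are the kinds of findings a curator would take back to the data creator.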
Understanding the data files’ contents may be the most difficult part of the curation process, as most data curators do not have expertise in the discipline represented within the files. This should not deter them from performing this task. Putting themselves in the position of someone needing to reuse the dataset, the data curator attempts to determine if enough information is present to understand and make adequate use of it.
When information is discovered to be missing, or additional information would assist in understanding the dataset, the curator requests this information from the data creator. This additional information augments the full package that is archived openly. The metadata is expanded so that as much information is available as possible, which aids in discovery and later reuse of the dataset. As the pool of potential user groups for a dataset widens, the descriptive information needs to be more detailed.
Ideally, the data creator will provide the data files in a format that is preservation ready. In other words, the data files will be in open formats not tied to a proprietary software package. If not, the data curator will transform them as needed. Data files in open formats are far less likely to become obsolete over time. Archived datasets can include both the original, often proprietary, file format as well as the open format. The availability of both provides access to a wider pool of potential user groups.
When all these tasks have been completed, the data curator will evaluate whether or not the dataset is considered FAIR, another acronym meaning “Findable, Accessible, Interoperable, and Reusable.” These principles were developed so that data users can “more easily discover, access, interoperate, and sensibly re-use” the data being created (FORCE11, n.d.).
As all curation tasks are conducted and completed, the data curator must document each step along the way. This documentation provides a backdrop for the current state of the data and the context in which the data is provided. New users will be interested in this information to determine how the dataset was changed from the point of submission to the point of being accessed for their use.
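One way to keep such documentation is a simple, machine-readable curation log. The sketch below is only an illustration under stated assumptions: the log_step function, the field names, and the use of a SHA-256 checksum to record the state of a file are choices made for this example, not part of the C.U.R.A.T.E.D. specification.

```python
import hashlib
import json
from datetime import datetime, timezone


def log_step(log, step, note, file_bytes=None):
    """Append one curation action to the log (hypothetical helper)."""
    entry = {
        "step": step,
        "note": note,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    if file_bytes is not None:
        # A checksum lets later users confirm which version of the file
        # the note refers to.
        entry["sha256"] = hashlib.sha256(file_bytes).hexdigest()
    log.append(entry)
    return entry


# Illustrative run: two steps recorded for a small, made-up data file.
curation_log = []
original = b"id,value\n1,3.2\n"
log_step(curation_log, "Check", "All files listed by the author are present.", original)
log_step(curation_log, "Request", "Asked the data creator for units of 'value'.")

print(json.dumps(curation_log, indent=2))
```

Writing the log as JSON keeps it in an open format, so it can be archived alongside the dataset and read without special software.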
As more journals require authors to make their data publicly available along with articles, the skills data curation specialists offer and the services they provide will become more important. More information science graduate programs are training new crops of information specialists to become data curators. These positions often lie within the university libraries, where librarians have a strong reputation for describing and organizing information for access. Funders and journal publishers have a vested interest in helping authors make their data FAIR, i.e., usable by others. FAIR data will be key to reversing the reproducibility crisis and providing the foundations to build upon current research and extend the boundaries of knowledge (Baker, 2016; Jeffries, 2019; Weir, 2015). Data curation is the first step in this process. Researchers who must make data publicly available are encouraged to seek the expertise of a competent data curator to help them prepare their data prior to archiving.
Baker, M. (2016). Reproducibility crisis. Nature, 533(26), 353-366.
Binggeser, P. (2017). Data does not have intrinsic value. towards data science. Retrieved January 28, 2021, from https://towardsdatascience.com/data-does-not-have-intrinsic-value-2824c2409d86.
JLSC Editorial Board. (2014). The article is not enough: Introducing the JLSC data sharing policy. Journal of Librarianship and Scholarly Communication, 2(3), 1186. doi:10.7710/2162-3309.1186
Science Editorial Board. (n.d.). Science journals: Editorial policies. Retrieved from Science website: https://www.sciencemag.org/authors/science-journals-editorial-policies.
Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 63(6), 1059-1078. doi:10.1002/asi.22634
Briney, K., Goben, A., & Zilinski, L. (2017). Institutional, funder, and journal data policies. In L. Johnston (Ed.), Curating research data: Practical strategies for your digital repository: Association of College and Research Libraries.
Christensen, G., Dafoe, A., Miguel, E., Moore, D. A., & Rose, A. K. (2019). A study of the impact of data sharing on article citations using journal policies as a natural experiment. PLoS ONE, 14(12), e0225883. doi:10.1371/journal.pone.0225883
FORCE11 (n.d.) Guiding principles for findable, accessible, interoperable, and re-useable data. Retrieved January 28, 2021, from https://www.force11.org/fairprinciples.
Heidorn, P. B. (2008). Shedding light on the dark data in the long tail of science. Library Trends, 57(2), 280-299. doi:10.1353/lib.0.0036.
Jeffries, J. (2019, November 18). Living in the reproducibility crisis. Early Career Research Community, PLoS blogs. Retrieved January 28, 2021 from https://ecrcommunity.plos.org/2019/11/18/living-in-the-reproducibility-crisis/.
Johnston, L. R., Carlson, J., Hudson-Vitale, C., Imker, H., Kozlowski, W., Olendorf, R., Stewart, C., Blake, M., Herndon, J., McGeary, T. M., Hull, E., & Coburn, E. (2018). Data curation network: A cross-institutional staffing model for curating research data. International Journal of Digital Curation, 13(1), 125-140. doi:10.2218/ijdc.v13i1.616
Palmer, C. L., Weber, N. M., & Cragin, M. H. (2011). The analytic potential of scientific data: Understanding reuse value. Proceedings of the American Society for Information Science and Technology, 48(1), 1-10. doi:10.1002/meet.2011.14504801174
Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175. doi:10.7717/peerj.175
Uhlir, P. F. (2010). Information gulags, intellectual straightjackets, and memory holes: Three principles to guide the preservation of scientific data. Data Science Journal, 9, ES1-ES5. doi:10.2481/dsj.Essay-001-Uhlir
University of Illinois School of Information Sciences. (n.d.) Data Curation. Retrieved January 27, 2021 from https://ischool.illinois.edu/research/areas/data-curation.
Weir, K. (2015, October). A reproducibility crisis? Monitor on Psychology, 46(9). Retrieved January 28, 2021 from http://www.apa.org/monitor/2015/10/share-reproducibility.