Research Data Management: Data Curation
What is Data Curation?
Data curation is a process through which data are organized, described, cleaned, enhanced, and preserved. This includes establishing file naming conventions and organizational schemes, generating metadata, identifying storage solutions, classifying data, and more! For an overview of data curation, check out the ICPSR video below:
Data curation ensures compliance with funder requirements, facilitates the preparation of Data Management & Sharing Plans (DMSPs), and helps your research adhere to FAIR data principles. Good data management also means following each stage of the Data Lifecycle to ensure compliance with national standards and responsible research practices.
What Are FAIR Data?
FAIR data are data that meet the principles of Findability, Accessibility, Interoperability, and Reusability:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
R1. meta(data) are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
Source: Mark D. Wilkinson et al., "The FAIR Guiding Principles for scientific data management and stewardship," Scientific Data 3, Article number: 160018 (2016).
What is the Data Lifecycle?
The Data Lifecycle refers to the series of stages a particular set of data goes through, from the project planning stage through the final storage, sharing, and, potentially, reuse of the data. Although data management experts may identify different numbers of stages with varying degrees of granularity, the most basic definition of the data lifecycle consists of roughly seven key stages:
- Planning - Determine your project's data needs and scope. What data already exist, and which data need to be generated to address your research questions? How and from where will you collect your data, and which tools will you need to collect, process, analyze, and store them?
- Collecting - Gather your research data. There are many ways to collect data depending on your discipline and the project at hand. Some examples include forms and surveys, interviews, direct observation, web scraping, or manual transcription. Data can also consist of image, audio, or video files and their attendant metadata.
- Processing - All data need to be processed in some way before they can be analyzed. Your processing needs, however, will depend on the original source of information, be it a survey or form, archival documents, or automated results from a digital tool. In addition to format, you will also need to consider the various interpretive or ethical demands of your research. Data processing may include content analysis and coding, data wrangling (also known as data cleaning), data compression, and data encryption.
- Analyzing - Study your processed data to identify patterns and trends that will help you to draw meaningful insights related to your research questions. Depending on your discipline, analytical techniques may include statistical modeling, machine learning, algorithms, and text/data mining. Data visualization also plays a critical role in the analysis stage of the data lifecycle.
- Preserving - After you have collected and processed your data, they must be stored for future use. There are many ways to preserve your data, from cloud storage to hard drives and data repositories. Depositing your data into a data repository ensures that your data can be referenced through a persistent identifier (PID). Aside from being generally considered a best practice in data management, having a PID for your data also satisfies many publishing and funder requirements. But remember never to place all of your eggs in one basket. Just because your data have a PID and are in a repository does not mean you should not have them backed up elsewhere. Repositories may only preserve datasets for limited timespans and are not, in and of themselves, permanent storage options.
- Sharing - Communicate the results of your analysis and share your research findings with key stakeholders and audiences. Sharing can come in the form of scholarly publications (i.e., journal articles, books, etc.), websites, interactive data visualizations, posters, and more. Consider different sharing platforms and formats for different audiences to make sure your work is being disseminated to all relevant and potentially interested parties. For scholarly output, consider open access publishing even if your funder does not require it. If you need assistance choosing where to publish your work, see our Academic Journals: How to Choose Where to Publish guide.
- Reusing - Promote the use of your data by other researchers. Reusing data allows different researchers to independently publish findings about the same data, which has a number of benefits, including gaining references to your work, avoiding unnecessary repetition of data collection, verifying findings, making research more robust, and generating new knowledge. Ensuring that your data are clearly described, curated, and shared makes them more accessible for reuse.
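The Preserving stage above warns against relying on a single copy of your data. One common way to verify that a backup or deposited copy has not been corrupted over time is to record a checksum when you deposit the data and recompute it later. A minimal sketch in Python (the file path in any real use would be your own):

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading in chunks
    so that large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the checksum recorded at deposit time matches the one computed from a backup, the two copies are byte-for-byte identical.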
What is a Persistent Identifier?
A persistent identifier is a long-lasting reference to a document, file, web page, or other object. Unlike URLs, which may break, a PID reliably points to a digital entity. Examples of PIDs include Open Researcher and Contributor Identifiers (ORCIDs), Digital Object Identifiers (DOIs), and Archival Resource Keys (ARKs). Many of you may already have an ORCID, which is used to link researchers to their professional activities; it helps us, for example, to distinguish between J. Doe, the entomologist, who wrote a series of articles on acoustic communication among insects and J. Doe, the art historian, who wrote a monograph on postmodern feminist architecture. DOIs and ARKs are used to ensure that publications, datasets, catalogue records, and other digital materials are linked to the correct metadata record.
For more on PIDs, see our Persistent Identifiers guide.
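To see how a PID works in practice: any DOI can be turned into a resolvable URL by prefixing it with the doi.org proxy. A small sketch in Python, using the DOI of the FAIR principles paper cited above:

```python
def doi_to_url(doi):
    """Turn a bare DOI string into a resolvable URL via the doi.org proxy."""
    return "https://doi.org/" + doi.strip()

# The FAIR Guiding Principles paper (Wilkinson et al., 2016):
fair_paper = doi_to_url("10.1038/sdata.2016.18")
```

Because the proxy, not the publisher's website, handles resolution, the link keeps working even if the article moves to a new URL.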
What Are Metadata?
In order for your data to be understandable to any future users, or even to you and your collaborators later down the line, they need to be comprehensively and consistently described. Metadata are structured data that help explain the who, what, where, when, why, and how of your research project or dataset. Consider these guidelines when generating metadata for your data to adhere to best practices and maintain compliance with FAIR data principles.
| Question | Consider |
|---|---|
| WHO | Who contributed to your project? Who is the principal investigator? Who should be contacted with questions about the data or the findings? Who contributed to the project in other ways (authors, research assistants, etc.)? |
| WHAT | What kind of data and analysis were used? What methods or tools did you employ? |
| WHERE | Where do these data pertain to? Does this project involve a particular geographic region or regions? |
| WHEN | When were the data collected? When was the analysis performed? Are there any other important dates associated with the data? When were the data last updated? |
| WHY | Why were these data collected? Which questions were you looking to answer? |
| HOW | How were the data collected? What data collection methods or tools were used? What parameters were set? |
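Answers to these questions can be captured in a machine-readable form alongside your data. A hedged sketch in Python, where every field name and value is an invented placeholder rather than part of any particular metadata standard:

```python
import json

# Illustrative metadata record answering who/what/where/when/why/how.
# All field names and values here are invented examples, not a standard.
record = {
    "who": {"principal_investigator": "J. Doe", "contributors": ["research assistant A"]},
    "what": {"data_type": "survey responses", "methods": ["online questionnaire"]},
    "where": {"geographic_coverage": "Salt Lake County, Utah"},
    "when": {"collected": "2023-01-15/2023-06-30", "last_updated": "2023-07-10"},
    "why": {"research_question": "How do commuting patterns vary by season?"},
    "how": {"collection_tool": "web survey", "parameters": "stratified random sample"},
}

# Serializing to JSON gives a plain-text record that travels with the dataset.
metadata_json = json.dumps(record, indent=2)
```

Even a simple record like this is far easier for a future user (or repository) to work with than metadata buried in a manuscript or an email thread.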
Creating a README File
Create a README plain text file that explains your process, including your file naming conventions, variable naming, folder structure, necessary documentation, and processing steps. For more information on README files and best practices, view the video from Harvard's LMA Research Data Management Working Group below.
Look at a sample README plain text file to get an idea of what information should be included. If you are planning on depositing your data into the University of Utah's research data repository, The Hive, you will be required to complete a README file using their README template.
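As a rough illustration of the kind of information a README might contain, here is a generic sketch. All names and values below are invented placeholders, and this is not The Hive's required template:

```text
Project title:           Seasonal Commuting Survey (example)
Principal investigator:  J. Doe
Date of data collection: 2023-01-15 to 2023-06-30

FILE ORGANIZATION
  /raw        - unmodified data exactly as collected
  /processed  - cleaned data; see processing steps below
  /docs       - codebook, consent forms, survey instrument

FILE NAMING
  <project>_<YYYYMMDD>_<description>_v<NN>.<ext>
  e.g., commute_20230630_responses_v02.csv

VARIABLES
  resp_id - anonymized respondent identifier
  mode    - primary commute mode (coded; see codebook in /docs)

PROCESSING STEPS
  1. Removed test and duplicate responses.
  2. Recoded free-text commute modes to a controlled vocabulary.
```

A README in this spirit lets a future user reconstruct what each file is and how it was produced without contacting the original team.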
Standardizing Your Documentation
While any metadata are a good start, using community standards for creating metadata allows your data to be searchable across data repositories, even down to the variable level. The best data repositories use standardized metadata to allow users to search by investigator, geography, title, subject, and research question. This type of searching is only possible with structured metadata.
A widely used standard in the social, behavioral, and economic sciences is the Data Documentation Initiative (DDI). Expressed in XML, DDI allows data to be described across the research data lifecycle. To learn more about DDI, visit their website at www.ddialliance.org.
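To give a flavor of what DDI markup looks like, here is a heavily abridged sketch in the style of DDI-Codebook. Element names follow DDI-Codebook conventions, but required attributes, namespaces, and many mandatory elements are omitted here, and the study and variable are invented examples; consult the schemas at ddialliance.org for the real structure:

```xml
<!-- Abridged, illustrative sketch only; not a valid DDI instance. -->
<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>Seasonal Commuting Survey (example)</titl>
      </titlStmt>
    </citation>
  </stdyDscr>
  <dataDscr>
    <var name="mode">
      <labl>Primary commute mode</labl>
    </var>
  </dataDscr>
</codeBook>
```

Because study- and variable-level descriptions live in predictable elements like these, repositories that ingest DDI can support search down to the individual variable.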
To research other community standards, visit the Digital Curation Centre's Disciplinary Metadata Webpage.
Data Curation Resources @ The U
The Hive is the University of Utah's research data repository, provided jointly by the J. Willard Marriott Library and the Eccles Health Sciences Library. It is designed to broadly disseminate the intellectual contributions in research and creativity produced by the University's faculty, staff, and students and to ensure their longevity. Here you will find information on preparing, uploading, and depositing your datasets and the corresponding documentation. For more information about depositing your data into The Hive, email us at email@example.com.