Metadata is data that provides information about other data. Although landing and download pages for a dataset may also contain metadata, in the context of data management and preservation "metadata" usually refers to the formalized text file that accompanies digital files and provides descriptions of the methodology, sources, and format. You may have downloaded data off the internet and seen a "README.txt" file that came with it - that is metadata!
If you want to make sure your data is properly cited, metadata is a great way to provide information on preferred citation, use restrictions, and data limitations.
Good metadata not only provides contextual information but also makes your research more discoverable through search.
Search algorithms are usually text-based and do not always transfer well to digital objects. Code, images, video, tabular data, audio, and large files often rely exclusively on their metadata files for findability.
When creating metadata, it is usually best to follow a predetermined format or schema that suits your project, field of study, and file type. This helps users quickly find the information they need and allows for machine readability (in addition to making sure you provide enough information).
If you aren't sure what your industry's standards are, contact your subject librarian and visit the sites below.
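As a sketch of what machine-readable metadata can look like, the snippet below writes a small metadata record as JSON. The field names here are illustrative assumptions, not a specific standard - consult your field's actual schema (for example, Dublin Core or DataCite) for the required elements.

```python
import json

# A minimal, hypothetical metadata record. These field names are
# illustrative only; a real project should follow an established
# schema for its discipline.
metadata = {
    "title": "Example survey dataset",
    "creator": "Jane Researcher",
    "date_created": "2024-01-15",
    "file_format": "CSV",
    "methodology": "Online survey; responses collected over six weeks",
    "preferred_citation": "Researcher, J. (2024). Example survey dataset.",
    "use_restrictions": "CC-BY 4.0",
    "known_limitations": "Self-reported responses; no follow-up wave",
}

# Writing the record as JSON keeps it both human- and machine-readable,
# which helps search tools index the dataset.
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Because the file is structured rather than free text, other tools (and other researchers) can parse the preferred citation, use restrictions, and limitations without guessing where they live in a README.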
We have all probably misplaced a file, accidentally deleted one, or had data become corrupted. (If you haven't - be thankful!) Depending on the situation and the amount of work lost, replicating or recovering a file can be time-consuming and costly. Thankfully, technology has made it easier to prevent or mitigate the damage from these situations, but only if you set it up that way.
One of the stories popularly cited as a case for data management comes from Pixar Animation. During the creation of Toy Story 2, a simple mistake deleted over 90% of the film's files - over two months and hundreds of hours of work by the studio. This wasn't the first time files had been accidentally deleted, so the company had taken to tape backups - but hadn't tested those backups properly. After extracting information from the backup, it became clear that less than a third of the data had been restored. Thankfully, an employee had been working from home and had a two-week-old but complete copy of the movie on her own computer. After one painstakingly careful transport and a week of validating, moving, and checking files, they were able to recover enough information (minus a mysterious 3,000 files) to complete the movie. They were very lucky.
What would happen if 90% of your research files were deleted today? Would you be able to restore them? Would you even know what files were missing?
It may seem like it isn't worth the time to invest in learning data management now, but the returns are high, especially when you save the day (and time and money) by recovering a lost file. Technology isn't perfect, and mistakes, drive crashes, and computer failures happen.
One of the overarching processes in the research data lifecycle is validating, checking, and providing quality assurance and control of your data over the course of your research project. Just as it is important to have your papers peer-reviewed, having someone double-check your data, workflows, and processes is essential to producing high-quality research.
As a part of your data management plan (DMP), you may be required to identify a person in charge of ensuring the DMP is followed in addition to providing check-ins to your funder over the course of your project. Having workflows, plans, and policies in order before you start your project can help make those processes as smooth as possible. Sometimes these plans are called Quality Assurance Plans (QAP).
Here are some considerations when thinking about quality assurance:
Quality Assurance: Prevention through Design - One of the most effective ways to ensure data quality is planning how information will be stored, entered, edited, and manipulated before starting the project. Additionally, utilizing specialized software to limit inputs to expected values (such as domain management and reference data) can ensure consistency and limit input mistakes.
Quality Control: Finding and Fixing Issues - After data is collected, quality control can be applied to determine which values are "good" and which are "bad". This could look like running a script that checks for abnormalities, or simply spot-checking for outliers or nonsensical values.
Documentation - Once you have completed your QA/QC, be sure to include your process in your metadata so other users know that your data is sound.
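The prevention and detection ideas above can be sketched in a few lines of code. The field names, allowed values, and thresholds below are illustrative assumptions for a hypothetical tree survey, not a prescribed standard - a real project would tailor these rules to its own data.

```python
# Quality assurance (prevention): restrict a field to a set of
# expected values, a simple form of domain/reference-data management.
ALLOWED_SPECIES = {"oak", "maple", "pine"}

def validate_entry(entry):
    """Reject records whose 'species' is not an expected value."""
    return entry.get("species") in ALLOWED_SPECIES

# Quality control (detection): after collection, flag outliers or
# nonsensical values using a plausible range for the measurement.
def flag_bad_heights(records, low=0.0, high=120.0):
    """Return records whose height (meters) falls outside a plausible range."""
    return [r for r in records if not (low <= r["height_m"] <= high)]

records = [
    {"species": "oak", "height_m": 21.5},
    {"species": "okk", "height_m": 18.0},    # typo: caught by the QA check
    {"species": "pine", "height_m": 950.0},  # implausible: flagged by QC
]

valid = [r for r in records if validate_entry(r)]
flagged = flag_bad_heights(records)
```

Entry forms and survey software often offer the same idea as drop-down menus or input validation; the point is that constraining values at entry time prevents errors that are expensive to find later.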