Module 2 Principles and power of structured data formats
Guru Madhavan, Senior Director of Programs at the National Academy of Engineering, wrote a book in 2015 called Applied Minds: How Engineers Think. In this book, he described a powerful tool for engineers—standards:
“Standards are for products what grammar is for language. People sometimes criticize standards for making life a matter of routine rather than inspiration. Some argue that standards hinder creativity and keep us slaves to the past. But try imagining a world without standards. From tenderloin beef cuts to the geometric design of highways, standards may diminish variety and authenticity, but they improve efficiency. From street signs to nutrition labels, standards provide a common language of reason. From Internet protocols to MP3 audio formats, standards enable systems to work together. From paper sizes … to George Laurer’s Universal Product Code, standards offer the convenience of comparability.”77
Standards can be a powerful tool for biomedical researchers, as well, including when it comes to recording data. The format in which experimental data is recorded can have a large influence on how easy and likely it is to implement reproducibility tools in later stages of the research workflow. Recording data in a “structured” format brings many benefits. In this module, we will explain what makes a dataset “structured” and why this format is a powerful tool for reproducible research.
Every extra step of data cleaning is another chance to introduce errors in experimental biomedical data, and yet laboratory-based researchers often share experimental data with collaborators in a format that requires extensive additional cleaning before it can be input into data analysis.78 Recording data in a “structured” format brings many benefits for later stages of the research process, especially in terms of improving reproducibility and reducing the probability of errors in analysis.79 Data that is in a structured, tabular, two-dimensional format is substantially easier for collaborators to understand and work with, without additional data formatting.80 Further, by using a consistent structured format across many or all data in a research project, it becomes much easier to create solid, well-tested code scripts for data pre-processing and analysis and to apply those scripts consistently and reproducibly across datasets from multiple experiments.81 However, many biomedical researchers are unaware of this simple yet powerful strategy in data recording and how it can improve the efficiency and effectiveness of collaborations.82 In this module, we’ll walk through several types of standards that can be used when recording biomedical data.
Objectives. After this module, the trainee will be able to:
- Define ontology, minimum information, and file format
- List the elements of a structured data format
- Explain how standards can improve scientific data recording
- Find existing ontologies for biological and biomedical research
2.1 Data recording standards
Many people and organizations (including funders) are excited about the idea of developing and using data standards. Good standards—ones that are widely adopted by researchers—can help ensure that data submitted to repositories are widely used and that software can be developed that is interoperable with data from many research groups.
For a simple example, think about recording dates. The minimum information standard for a date might always be the same—a recorded value must include the day of the month, the month, and the year. However, this information can be structured in a variety of ways. In scientific data, it’s common to record this information going from the largest to the smallest unit, so March 12, 2006, would be recorded as “2006-03-12”. Another convention (especially in the US) is to record the month first (e.g., “3/12/06”), while another (more common in Europe) is to record the day of the month first (e.g., “12/3/06”).
If you are trying to combine data from different datasets with dates, and each uses a different structure, it’s easy to see how mistakes could be introduced unless the data is very carefully reformatted. For example, March 12 (“3-12” with month-first, “12-3” with day-first) could easily be mistaken for December 3, and vice versa. Even if errors are avoided, combining data in different structures will take more time than combining data in the same structure, because of the extra reformatting needed to get all data into a common structure.
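To make the ambiguity concrete, here is a minimal sketch in Python (the language is incidental; any statistical program behaves similarly) showing how the same recorded string yields two different dates depending on which convention the reader assumes, while a value recorded from largest to smallest unit parses only one way.

```python
from datetime import datetime

# The same string means two different dates under different conventions
ambiguous = "3/12/06"
as_us = datetime.strptime(ambiguous, "%m/%d/%y")   # month first: March 12, 2006
as_eu = datetime.strptime(ambiguous, "%d/%m/%y")   # day first: December 3, 2006
print(as_us.date(), as_eu.date())                  # 2006-03-12 2006-12-03

# A value recorded in the largest-to-smallest style is unambiguous
iso = "2006-03-12"
print(datetime.strptime(iso, "%Y-%m-%d").date())   # 2006-03-12
```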
Standards can operate both at the level of individual research groups and at the level of the scientific community as a whole. The potential advantages of community-level standards are big: they offer the chance to develop common-purpose tools and code scripts for data analysis, as well as make it easier to re-use and combine experimental data from previous research that is posted in open data repositories. If a software tool can be reused, then more time can be spent in developing and testing it, and as more people use it, bugs and shortcomings can be identified and corrected. Community-wide standards can lead to databases with data from different experiments, and from different laboratory groups, structured in a way that makes it easy for other researchers to understand each dataset, find pieces of data of interest within datasets, and integrate different datasets.83 Similarly, with community-wide standards, it can become much easier for different research groups to collaborate with each other or for a research group to use data generated by equipment from different manufacturers.84 As an article on interoperable bioscience data notes,
“Without community-level harmonization and interoperability, many community projects risk becoming data silos.”85
However, there are important limitations to community-wide standards, as well. It can be very difficult to impose such standards top-down and community-wide, particularly for low-throughput data collection (e.g., laboratory bench measurements), where research groups have long been in the habit of recording data in spreadsheets in a format defined by individual researchers or research groups. One paper highlights this point:
“The data exchange formats PSI-MI and MAGE-ML have helped to get many of the high-throughput data sets into the public domain. Nevertheless, from a bench biologist’s point of view benefits from adopting standards are not yet overwhelming. Most standardization efforts are still mainly an investment for biologists.”86
Further, in some fields, community-wide standards have struggled to remain stable, which can frustrate community members, as scripts and software must be revamped to handle shifting formats.87 In some cases, a useful compromise is to follow a general data recording format, rather than one that is very prescriptive. For example, committing to recording data in a format that is “tidy” (which we discuss extensively in module 3) may be much more flexible—and able to meet the needs of a large range of experimental designs—than the use of a common spreadsheet template or a more prescriptive standardized data format.
2.2 Elements of a data recording standard
Standards can clarify several elements: the vocabulary used within data, the content that should be included in a dataset, and the format in which that content is stored. One article names these three facets of a data standard as ontologies, minimum information, and file formats.88
2.2.1 Ontology standards
The first facet of a data standard is called an ontology (sometimes called a terminology).89 An ontology helps define a vocabulary that is controlled and consistent. It helps researchers, when they want to talk about an idea or thing, to use one word, and just one word, and to ensure that it will be the same word used by other researchers when they refer to that idea or thing. Ontologies also help to define the relationships between ideas or concrete things in a research area,90 but here we’ll focus on their use in providing a consistent vocabulary to use when recording data.
Let’s start with a very simple example to give you an idea of what an ontology is. What do you call a small mammal that is often kept as a pet and that has four legs and whiskers and purrs? If you are recording data that includes this animal, do you record this as “cat” or “feline” or maybe, depending on the animal, even “tabby” or “tom” or “kitten”? Similarly, do you record tuberculosis as “tuberculosis” or “TB” or maybe even “consumption”? If you do not use the same word consistently in a dataset to record an idea, then while a human might be able to understand that two words should be considered equivalent, a computer will not be able to immediately tell.
At a larger scale, if a research community can adopt an ontology—one they agree to use throughout their studies—it will make it easier to understand and integrate datasets produced by different research laboratories. If every research group uses the term “cat” in the example above, then code can easily be written to extract and combine all data recorded for cats across a large repository of experimental data. On the other hand, if different terms are used, then it might be necessary to first create a list of all terms used in datasets in the repository, then pick through that list to find any terms that are interchangeable with “cat”, then write a script to pull data with any of those terms.
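As an illustration, the short sketch below (in Python, with made-up records and a hand-written synonym table) shows the kind of harmonization step that becomes necessary when a controlled term is not used from the start, and how simple the query becomes once every record uses the ontology’s term.

```python
import pandas as pd

# Hypothetical records from two labs that describe the same species differently
records = pd.DataFrame({
    "animal": ["cat", "Feline", "tabby", "dog", "CANINE"],
    "weight_kg": [4.1, 3.8, 4.5, 12.0, 11.2],
})

# Map every variant (after lower-casing) onto the single controlled term
to_controlled_term = {
    "cat": "cat", "feline": "cat", "tabby": "cat",
    "dog": "dog", "canine": "dog",
}
records["animal"] = records["animal"].str.lower().map(to_controlled_term)

# Once the vocabulary is consistent, a simple filter reliably pulls every cat record
cats = records[records["animal"] == "cat"]
print(cats)
```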
Several ontologies already exist or are being created for biological and other biomedical research.91 For biomedical science, practice, and research, the BioPortal website (http://bioportal.bioontology.org/) provides access to over 1,000 ontologies, including several versions of the International Classification of Diseases, the Medical Subject Headings (MeSH), the National Cancer Institute Thesaurus, the Orphanet Rare Disease Ontology, and the National Center for Biotechnology Information (NCBI) Organismal Classification. For each ontology, the website provides a link for downloading it in several formats.
Try downloading one of the ontologies using a plaintext file format (the “CSV” choice in the download options at the BioPortal link). Once you do, you can open it in your favorite spreadsheet program and explore how it defines specific terms to use for each idea or thing you might need to discuss within that topic area, as well as synonyms for some of the terms.
To use an ontology when recording your own data, just make sure you use the ontology’s suggested terms in your data. For example, if you’d like to use the Ontology for Biomedical Investigations (http://bioportal.bioontology.org/ontologies/OBI) and you are recording the number of live births a woman has had, you should name that column of the data “number of live births”, not “# live births” or “live births (N)” or anything else. Other collections of ontologies exist for fields of scientific research, including the Open Biological and Biomedical Ontology (OBO) Foundry (http://www.obofoundry.org/).
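If you want to check your own data against an ontology programmatically, something like the following sketch could work; the file names and the column name of the BioPortal export (“Preferred Label”) are assumptions here, so adjust them to match whichever ontology you actually download.

```python
import pandas as pd

# Read an ontology exported from BioPortal as CSV; the file name and the
# "Preferred Label" column name are assumptions and may differ by ontology
ontology = pd.read_csv("OBI.csv.gz")
preferred_terms = set(ontology["Preferred Label"].dropna().str.lower())

# Compare the column names in your own dataset against the ontology's terms
my_data = pd.read_csv("experiment_01.csv")   # hypothetical dataset
unmatched = [col for col in my_data.columns if col.lower() not in preferred_terms]
print("Columns that don't match an ontology term:", unmatched)
```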
If there are community-wide ontologies in your field, it is worthwhile to use them when recording experimental data in your research group. Even better is to not only use the defined terms consistently, but also to follow any capitalization conventions. While most statistical programs provide tools to change capitalization (for example, to change all letters in a character string to lower case), this process requires an extra step of data cleaning and adds an extra chance for confusion or for errors to be introduced into the data.
2.2.2 Minimum information standards
Another part of a data standard is minimum information. Within a data recording standard, minimum information (sometimes also called minimum reporting guidelines92 or reporting requirements)93 specifies what should be included in a dataset.94 Using minimum information standards helps ensure that data within a laboratory, or data posted to a repository, contain a set of required elements. This makes it easier to re-use the data, either to compare it to data that a lab has newly generated or to combine several posted datasets for a new, integrated analysis. These considerations are growing in importance with the increasing prevalence of research repositories and research consortia in many fields of biomedical science.95
One article that discusses software for systems biology provides a definition as well as examples of minimum information within this field:
“Minimum information is a checklist of required supporting information for datasets from different experiments. Examples include: Minimum Information About a Microarray Experiment (MIAME), Minimum Information About a Proteomic Experiment (MIAPE), and the Minimum Information for Biological and Biomedical Investigations (MIBBI) project.”96
2.2.3 Standardized file formats
While using a standard ontology and a standard for minimum information is a helpful start, it just means that each dataset has the required elements somewhere and uses a consistent vocabulary—it doesn’t specify where those elements are in the data or that they’ll be in the same place in every dataset that meets those standards. As a result, datasets that all meet a common standard can still be very hard to combine, or to create common data analysis scripts and tools for, since each dataset may require a different process to pull out a given element.
Computer files serve as a way to organize data, whether that’s recorded datapoints or written documents or computer programs.97 A file format defines the rules for how the bytes in the chunk of memory that makes up a certain file should be parsed and interpreted anytime you want to meaningfully access and use the data within that file.98 There are many file formats you may be familiar with—a file that ends in “.pdf” must be opened with a Portable Document Format (PDF) reader like Adobe Acrobat, or it won’t make much sense (you can try this out by opening a “.pdf” file with a text editor, like TextEdit or Notepad). The PDF reader software has been programmed to interpret the data in a “.pdf” file based on rules defining what data is stored where in the section of computer memory for that file. Because most “.pdf” files conform to the same file format rules, powerful software can be built that works with any file in that format.
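In fact, many formats announce themselves in their very first bytes. The sketch below (with a hypothetical file name) peeks at the start of a file to see whether it declares the PDF signature, which is one of the cues a PDF reader uses before applying the rest of the format’s parsing rules.

```python
# Peek at the first few bytes of a file; PDF files begin with the byte
# signature "%PDF", which tells a reader how to parse everything that follows
with open("report.pdf", "rb") as f:    # hypothetical file name
    header = f.read(4)

if header == b"%PDF":
    print("This file follows the PDF format rules")
else:
    print("Not a PDF; some other set of parsing rules applies")
```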
For certain types of biomedical data, the challenge of standardizing a format has similarly been addressed through well-defined rules for not only the content of data, but also the way that content is structured. This is done through standardized file formats (sometimes also called data exchange formats).99 These formats often define not only the upper-level file format (e.g., use of a comma-separated plain text, or “.csv”, file format), but also how data within that file type should be organized. If data from different research groups and experiments is recorded using the same file format, researchers can develop software tools that can be used repeatedly to interpret and visualize that data. On the other hand, if different experiments record data using different formats, bespoke analysis scripts must be written for each separate dataset.
This is not only a blow to the efficiency of data analysis, but also a threat to its accuracy. If a set of tools can be developed that will work over and over, more time can be devoted to refining those tools and testing them for potential errors and bugs, while one-off scripts often can’t be curated with similar care. One paper highlights the problems that come with working with files that don’t follow a defined format:
“Vast swathes of bioscience data remain locked in esoteric formats, are described using nonstandard terminology, lack sufficient contextual information, or simply are never shared due to the perceived cost or futility of the exercise.”100
Some biomedical data file formats have been created to help smooth the transfer of data that’s captured by complex equipment into software that can analyze that data. For example, many immunological studies need to measure immune cell populations, and to do so they use a piece of equipment called a flow cytometer, which probes cells in a sample with lasers and measures the resulting intensities to determine characteristics of each cell. The data created by this equipment are large (often measurements from several lasers are taken for a million or more cells in a single run). The data are also complex, as they need to record not only the intensity measurements from each laser, but also some metadata about the equipment and the characteristics of the run.
If every model of flow cytometer used a different file format for saving the resulting data, then a different set of analysis software would need to be developed to accompany each piece of equipment. For example, a laboratory at a university with flow cytometers from two different companies would need licenses for two different software programs to work with data recorded by flow cytometers, and they would need to learn how to use each software package separately. There is a chance that software could be developed that used shared code for data analysis, but only if it also included separate sets of code to read in data from all types of equipment and to reformat them to a common format.
This isn’t the case, however. Instead, there is a commonly agreed-on file format that flow cytometers use to record the data they collect, called the FCS file format. This format has been defined through a series of papers (e.g., Josef Spidlen et al.101), with several separate versions as the file format has evolved. It provides clear specifications for where to save each relevant piece of information in the block of memory devoted to the data recorded by the flow cytometer. As a result, people have been able to create software, both proprietary and open-source, that can be used with any data recorded by a flow cytometer, regardless of which company manufactured the piece of equipment that was used to generate the data.
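This is why, in practice, one small piece of code can read the output of many different instruments. The sketch below assumes the third-party Python package fcsparser and a hypothetical file name; other open-source FCS readers exist as well.

```python
# A sketch of reading flow cytometry data, assuming the third-party
# fcsparser package (pip install fcsparser). Because every instrument writes
# the same FCS format, one reader works regardless of the manufacturer.
import fcsparser

meta, events = fcsparser.parse("sample_run.fcs")   # hypothetical file name

print(meta.get("$CYT"))   # instrument name stored in the FCS metadata block
print(events.shape)       # one row per cell, one column per laser/channel
print(events.head())
```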
Other types of biomedical data also have some standardized file formats, including the FASTQ file format for sequencing data and the mzML file format for metabolomics data. In some cases these were defined by an organization, society, or initiative (e.g., the Metabolomics Standards Initiative),102 while in some cases the file format developed by a specific equipment manufacturer has become popular enough that it’s established itself as the standard for recording a type of data.103
2.3 Defining data recording standards for data recorded “by hand”
If some of the data you record from your experiments comes from complex equipment, like flow cytometers or mass spectrometers, you may be recording much of that data in a standardized format without any extra effort, because that format is the default output format for the equipment. However, you may have more control over other data recorded from your experiments, including smaller, less complex data that you record directly into a laboratory notebook or spreadsheet. You can derive a number of benefits from defining and using a standard for collecting these data, which one paper describes as the output of “traditional, low-throughput bench science.”104
This type of data is often written down in an ad hoc way—however the particular researcher doing the experiment thinks makes sense—and that format might change with each experiment, even if many experiments collect similar data. As a result, it becomes harder to create standardized data processing and analysis scripts that work with this data or to integrate the data with other data collected through the experiment. Further, if everyone in a laboratory sets up their spreadsheets for data recording in their own way, it is much harder for one person in the group to look at data another person recorded and immediately find what they need within the spreadsheet.
As a step in a better direction, the head of a research group may designate some common formats (e.g., a spreadsheet template) that all researchers in the group will use when recording the data from a specific type of experiment. One key advantage of using standardized data formats even for recording simple, “low-throughput” data is that everyone in the research group will be able to understand and work with data recorded by anyone else in the group—data will not become impenetrable once the person who recorded it leaves the group. Also, once a group member is used to the format, the process of setting up to record data from a new experiment will be quicker, as it won’t require the effort of deciding on and setting up a de novo format for a spreadsheet or other recording file. Instead, a template file can be created and copied as a starting point for any new data recording.
It also allows your team to create tools or scripts that read in and analyze the data and that can be re-used across multiple experiments with minor or no changes. This helps improve the efficiency and reproducibility of data analysis, visualization, and reporting steps of the research project.
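For example, a group that settles on one spreadsheet template could keep a small helper like the one sketched below (the column names and file names are hypothetical) and reuse it, unchanged, for every experiment recorded with that template.

```python
import pandas as pd

# Hypothetical helper for a lab that records bench data with one shared
# template: a single row of column names, then one row per observation
REQUIRED_COLUMNS = {"sample_id", "treatment", "measurement", "recorded_on"}

def read_bench_data(path):
    """Read one experiment's data and confirm it follows the group template."""
    data = pd.read_csv(path)
    missing = REQUIRED_COLUMNS - set(data.columns)
    if missing:
        raise ValueError(f"{path} is missing template columns: {missing}")
    return data

# The same function works, unchanged, for every experiment using the template
experiments = [read_bench_data(p) for p in ["exp_01.csv", "exp_02.csv"]]
combined = pd.concat(experiments, ignore_index=True)
```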
Developing these kinds of standards does require some extra time commitment.105 First, time is needed to design the format, and it does take a while to develop a format that is inclusive enough that it has a place to put all data you might want to record for a certain type of experiment. Second, it will take some time to teach each laboratory member what the format is and some oversight to make sure they comply with it when they record data.
On the flip side, the longer-term advantages of using a defined, structured format will outweigh the short-term time investments for many laboratory groups for frequently used data types. By creating and using a consistent structure to record data of a certain type, members of a laboratory group can increase their efficiency (since they do not need to re-design a data recording structure repeatedly). They can also make it easier for downstream collaborators, like biostatisticians and bioinformaticians, to work with their output, as those collaborators can create tools and scripts that can be recycled across experiments and research projects if they know the data will always come to them in the same format. One paper suggests that the balance can be found, in terms of deciding whether the benefits of developing a standard outweigh the costs, by considering how often data of a certain type is generated and used:
“To develop and deploy a standard creates an overhead, which can be expensive. Standards will help only if a particular type of information has to be exchanged often enough to pay off the development, implementation, and usage of the standard during its lifespan.”106
These benefits are even more dramatic if data format standards are created and used by a whole research field (e.g., if a standard data recording format is always used by researchers conducting a certain type of drug development experiment). In that case, the tools built at one institution can be used at other institutions. However, this level of field-wide coordination can be hard to achieve, so a more realistic immediate goal might be formalizing data recording structures within your research group or department, while keeping an eye out for formats that are gaining popularity as standards in your field to adopt within your group.
Once you commit to creating a defined, structured format, you’ll need to decide what that structure should be. There are many options here, and it’s very tempting to use a format that is easy on human eyes.107 For example, it may seem appealing to create a format that could easily be copied and pasted into presentations and Word documents and that will look nice in those presentation formats. To facilitate this use, a laboratory might set up a recording format based on a spreadsheet template that includes multiple tables of different data types on the same sheet, or multi-level column headings.
Unfortunately, many of these characteristics—the ones that make a format attractive to human eyes—will make it harder for a computer to parse. For example, if you include two tables in the same spreadsheet, it might be easier for a person to look at two small data tables without having to toggle to different parts of the spreadsheet. However, if you want to read that data into a statistical program (or work with a collaborator who would), it will likely take some complex code to tell the computer how to find the second table in the spreadsheet. The same applies if you include some blank lines at the top of the spreadsheet, use multi-level headers, or use “summary” rows at the bottom of a table. Further, any information you’ve encoded with colors or text boxes in the spreadsheet will be lost when the data’s read into a statistical program. These design elements make it much harder to read the data embedded in a spreadsheet into other computer programs, including programs for more complex data analysis and visualization, like R and Python.
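To see how quickly the extra code accumulates, here is a sketch of reading a spreadsheet with a hypothetical layout that has blank rows at the top, a two-level column header, and a summary row at the bottom; every one of those design choices needs its own workaround before analysis can even begin.

```python
import pandas as pd

# Hypothetical layout: two blank rows above the table, a two-level column
# header, and a "Totals" summary row at the bottom of the sheet
# (reading ".xlsx" files also requires an Excel engine such as openpyxl)
messy = pd.read_excel(
    "plate_counts.xlsx",
    skiprows=2,          # skip the blank rows before the table starts
    header=[0, 1],       # stitch the two header rows into multi-level names
)
messy = messy.iloc[:-1]  # drop the "Totals" row, which is not an observation

# Collapse the multi-level header into single column names the code can use
messy.columns = ["_".join(str(part) for part in col).strip("_")
                 for col in messy.columns]
```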
As one article notes:
“Data should be formatted in a way that facilitates computer readability. All too often, we as humans record data in a way that maximizes its readability to us, but takes a considerable amount of cleaning and tidying before it can be processed by a computer. The more data (and metadata) that is computer readable, the more we can leverage our computers to work with this data.”108
One of the easiest formats for a computer to read is a two-dimensional “box” of data, where the first row of the spreadsheet gives the column names and each row contains an equal number of entries. This type of two-dimensional tabular structure forms the basis for several popular “delimited” file formats that serve as a lingua franca across many simple computer programs, like the comma-separated values (CSV) format, the tab-separated values (TSV) format, and the more general delimiter-separated values (DSV) format. These are common formats for data exchange across databases, spreadsheet programs, and statistical programs.109
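Reading any of these delimited files takes a single call in most analysis environments; the sketch below uses Python’s pandas library with hypothetical file names, and only the separator argument changes between the variants.

```python
import pandas as pd

# The same two-dimensional table can travel as any delimited format;
# only the separator changes (file names are hypothetical)
from_csv = pd.read_csv("counts.csv")             # comma-separated
from_tsv = pd.read_csv("counts.tsv", sep="\t")   # tab-separated
from_dsv = pd.read_csv("counts.txt", sep=";")    # another delimiter
```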
Any deviation from this two-dimensional “box” shape can create problems when a computer program tries to parse the data. Anything in a data format that requires extra coding when reading the data into another program introduces a new opportunity for errors at the interface between data recording and data analysis. If there are strong reasons to use a format that requires these extra steps, it will still be possible to create code to read in and parse the data in statistical programs, and if the same format is used consistently, then scripts can be developed and thoroughly tested to allow this. However, keep in mind that this will be an extra burden on any data analysis collaborators who are using a program besides a spreadsheet program. The extra time required could be large, since this code should be vetted and tested thoroughly to ensure that the data cleaning process is not introducing errors. By contrast, if the data is recorded in a two-dimensional format with a single row of column names as the first row, data analysts can likely read it quickly and cleanly into other programs, with a low risk of errors in the transfer of data from the spreadsheet. In module 3, we’ll go into detail about a more refined format for these two-dimensional data called the tidy data format.
2.4 Discussion questions
This module discusses standards and how they can facilitate scientific research. Give some examples of standards you have come across in your own research. Did you follow them? Why or why not? What do you see as the advantages and disadvantages of these standards?
The module discusses ontologies in the context of “controlled vocabularies”, which can ensure researchers always use the same term when they are describing the same thing. Do you have any examples from your own research where people were using different terms to describe the same thing? What do you see as the advantages and disadvantages of having a controlled vocabulary?
Find an example from your research field of minimum information or minimum reporting guidelines. What do you see as the advantages or disadvantages to the scientific field as a whole of establishing these guidelines? What are the advantages or disadvantages to you as a researcher?
The module discusses how research data are sometimes recorded directly by equipment, while other times they are “low-throughput” data that are recorded “by hand”. Can you give examples of each type of data that you’ve come across in your own scientific work? Discuss how these data did or did not follow standards in how they were recorded. The standards could include ontology, minimum information, and file format.