Module 1 Separating data recording and analysis
Do you use spreadsheets within your scientific work? If so, you’re not alone. In fact, studies have surveyed scientists about their work practices, and they’ve found that spreadsheets are a common tool. Examples include surveys of over 250 biomedical researchers at the University of Washington,6 and of neuroscience researchers at the University of Newcastle. In both these studies, most respondents reported that they used spreadsheets and other general-purpose software in their research.7 A working group on bioinformatics and data-intensive science similarly found spreadsheets were the most common tool used across attendees.8
These software tools, such as Microsoft Excel or Google Sheets, allow manual or automated entry of data into rows and columns of cells. Standard or custom formulas and other operations can be applied to the cells, and are commonly used to reformat or clean the data, calculate various statistics, and generate simple plots, all of which are embedded as additional data entries and programming elements within the spreadsheet. While these tools greatly improved on the paper worksheets on which they were originally based,9 this all-in-one practice impedes the transparency and reproducibility of both recording and analysis of the large and complex data sets that are routinely generated in life science experiments.
To improve the computational reproducibility of a research project, it is critical for biomedical researchers to learn the importance of maintaining recorded experimental data as “read-only” files, separating data recording from any data pre-processing or data analysis steps.10
In this module, we’ll talk about why spreadsheets are so popular, as well as some of their features that are beneficial for researchers. However, they can also introduce many problems, particularly when they are used in a way that combines data collection with data pre-processing and analysis. We’ll walk through some of these problems, and in later modules we’ll present alternatives, where spreadsheets are limited to recording data (if they’re used at all), while pre-processing and analysis are done with other tools.
Objectives. After this module, the trainee will be able to:
- Explain why spreadsheets are a popular tool among scientists
- Explain the difference between data recording and data analysis
- Understand why collecting data in spreadsheets with embedded formulas impedes reproducibility
- Compare uses of spreadsheets that make them a solid research tool with uses that can be problematic
- Discuss downsides to using spreadsheets for data analysis in scientific research
1.1 Popularity of spreadsheets
All of us authors are old enough to remember when home computers were a novelty. When you first got a computer in your home, it opened up all kinds of new powers.
For one of us, one particularly exciting piece of software was something called The Print Shop. This software let you be an amateur graphic designer. You could design things like signs and invitations. Because the printer paper at the time was connected from one sheet to the next, with perforations between, you could even make long banners. “Happy Birthday” banners, “Congratulations” banners, “Welcome Home” banners: you could do it all. For someone who’d never had these tools before, it was thrilling.
This was evidently how early spreadsheet software made business executives feel. Before these programs, if an executive wanted to crunch some numbers, they’d have to send a request to their accounting department. The initial spreadsheet program (VisiCalc) disrupted this process. It allowed one person to quickly apply and test different models or calculations on recorded data.11 These spreadsheet programs allowed non-programmers to engage with data, including data processing and analysis tasks, in a way that previously required programming expertise.12 With spreadsheet programs an executive could just play with the numbers themselves.
Because an early target for spreadsheet programs was these business executives, the programs were designed to be very simple and easy to use—just one step up in complexity from crunching numbers on the back of an envelope.13 Spreadsheet programs in fact became so popular within businesses that many credit them with driving the uptake of personal computers.14
Spreadsheets have become so popular in part because so many people know how to use them, at least in basic ways, and so many people have the software on their computers that files can be shared with a near guarantee that everyone will be able to open them.15 Spreadsheets use the visual metaphor of a traditional gridded ledger sheet,16 providing an interface that is easy for users to immediately understand and for which they can easily create a mental map.17 This visually clear interface also means that spreadsheets can be printed or incorporated into other documents “as-is”, as a workable and understandable table of data values. In fact, some of the most popular plug-in software packages for the early spreadsheet program Lotus 1-2-3 were programs for printing and publishing spreadsheets.18 This “What You See Is What You Get” interface was a huge advance over previous methods of data analysis, beginning with the first spreadsheet program, VisiCalc, which provided a “window to the data” that was accessible to business executives and others without programming expertise.19 Several surveys of researchers have found that spreadsheets were popular because of their simplicity and ease-of-use.20 By contrast, databases and scripted programming languages can be perceived as requiring a cognitive load and lengthy training that is not worth the investment when an easier tool is available.21
Software tools like The Print Shop and spreadsheet programs were perfectly designed for amateurs to begin to do some of the things that otherwise required outsourcing to a professional. They are fantastic tools for amateur exploration. They make it fun to test out ideas.
However, these types of software tools are so easy and convenient to use that it can be tempting to let them replace more solid, production-level tools. It’s easy, in other words, to make them the only tool used to tackle a problem, rather than just the first tool used to explore a solution. These tools are also often the cheapest option, either in monetary cost or in the time investment to learn them. However, they often fail when they’re used as a replacement for more solid options. This can be the case with spreadsheet programs in biomedical research, where spreadsheets are often used not only as a straightforward way to record data (for which they can be a very solid tool), but also to develop complex pipelines that process and analyze the data once it’s been collected.
1.2 Hazards of combining recording and analysis
In some cases, researchers use spreadsheets solely to record data, as a simple type of database.22 However, biomedical researchers often use spreadsheets to both record and analyze experimental data.23 In this case, data processing and analysis are implemented through formulas and macros embedded within the spreadsheet.
When a spreadsheet has formulas or macros within it, the spreadsheet program creates an internal record of how cells are connected through these formulas. For example, say the value in a specific cell is converted from Fahrenheit to Celsius to fill a second cell, and then that value is combined with other values in a column to calculate the mean temperature across several observations. In this case, the spreadsheet program has internally saved how the later cells depend on the earlier ones. When you change the value recorded in a cell of a spreadsheet, the spreadsheet program queries this record and only recalculates the cells that depend on that cell. This process allows the program to quickly “react” to any change in cell inputs, immediately providing an update to all downstream calculations and analyses.24 Since early in their development, spreadsheet programs have also included macros, “a single computer instruction that stands for a sequence of operations.”25
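To make this dependency tracking concrete, here is a toy Python sketch of the temperature example above (an illustration of the idea only, not how any real spreadsheet program is implemented): each cell records which cells it depends on, and changing one value recalculates only the downstream cells.

```python
# A toy model of spreadsheet dependency tracking (illustrative only).

class Cell:
    def __init__(self, value=None, formula=None, depends_on=()):
        self.value = value            # raw entry or last computed result
        self.formula = formula        # function of the sheet, or None for raw data
        self.depends_on = depends_on  # names of cells this cell reads

sheet = {
    "A1": Cell(value=98.6),           # raw Fahrenheit reading
    "B1": Cell(formula=lambda s: (s["A1"].value - 32) * 5 / 9,
               depends_on=("A1",)),   # Celsius conversion
    "B2": Cell(value=36.5),           # another Celsius observation
    "B3": Cell(formula=lambda s: (s["B1"].value + s["B2"].value) / 2,
               depends_on=("B1", "B2")),  # mean temperature
}

def set_value(sheet, name, value):
    """Change one cell, then recalculate only the cells downstream of it."""
    sheet[name].value = value
    changed = {name}
    while True:
        # Find formula cells whose inputs changed but that we haven't updated yet.
        dirty = [n for n, c in sheet.items()
                 if c.formula and set(c.depends_on) & changed and n not in changed]
        if not dirty:
            break
        for n in dirty:
            sheet[n].value = sheet[n].formula(sheet)
            changed.add(n)

set_value(sheet, "A1", 100.4)
print(sheet["B1"].value, sheet["B3"].value)  # B1 and B3 updated; B2 untouched
```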
There are some important downsides to using these tools within scientific research to create spreadsheets that combine data recording with data analysis. Those include:
- Raw data are often lost
- Analysis steps are often opaque
- There is a higher potential for errors in the analysis
- There are better software tools available for data analysis
- Collaboration with statisticians becomes more difficult
Let’s take a look at each of these.
1.2.1 Raw data often lost
One of the key tenets of ensuring that research is computationally reproducible is to always keep a copy of all raw data, as well as a record of every step taken to get from the raw data, through a cleaned version of the data, to the results of data analysis. However, maintaining an easily accessible copy of all original raw data for a project is a common problem among biomedical researchers,26 especially as team members move on from a laboratory group.27 In fact, one study of operational spreadsheets found:
“The data used in most spreadsheets is undocumented and there is no practical way to check it. Even the original developer would have difficulty checking the data.”28
One thing that can contribute to this problem is the use of spreadsheets to jointly record and analyze data. First, data in a spreadsheet are typically not saved as “read-only”, so they can be accidentally overwritten: in situations where spreadsheets are shared among multiple users, original cell values can easily be written over, and it may not be clear who last changed a value, when it was changed, or why.29 Further, raw and processed data are combined in a spreadsheet, which makes it hard to identify which data points within the spreadsheet make up the raw data and which are the result of processing that raw data.
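One simple safeguard, as a hedged sketch: after recording, raw data files can be made read-only at the file-system level so they cannot be accidentally overwritten. The file path below is hypothetical.

```python
import os
import stat

raw_file = "data/raw/plate_reader_2024-05-01.csv"  # hypothetical raw data file

# Remove write permission for owner, group, and others, keeping read access,
# so later analysis steps can read the raw data but not modify it.
mode = stat.S_IMODE(os.stat(raw_file).st_mode)
os.chmod(raw_file, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```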
Another issue is that many spreadsheets use a proprietary format. In the development of spreadsheet programs, the use of proprietary binary file formats helped software vendors retain users by raising the barriers to switching to a new program (since the new program wouldn’t be able to read their old files).30 However, these file formats may be hard to open in the future, as software changes and evolves;31 by comparison, plain text files should remain widely accessible through general purpose tools—a text editor is a type of software available on all computers, for example—regardless of changes to proprietary software like Microsoft Excel.
1.2.2 Analysis steps are often opaque
To keep analysis steps clear—whether the calculation is being done in scripted code, in spreadsheets, or in pen-and-paper calculations—it is important to document what is being done at each step and why.32 Scripted languages allow for code comments, which are written directly into the script but not evaluated by the computer, and so can be used to document steps within the code without changing its operation. Further, the program file itself often presents a linear, step-by-step view of the pipeline, stored separately from the data.33 Calculations done with pen and paper (e.g., in a laboratory notebook) can be annotated with text to document the steps. Spreadsheets, on the other hand, are often poorly documented, or documented in ways that are hard to keep track of.
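As a minimal sketch of what such commented, scripted documentation can look like (file and column names are hypothetical):

```python
# clean_temperatures.py -- a hypothetical cleaning script, stored separately
# from the data it reads. Comments document what each step does and why,
# without changing how the code runs.

import pandas as pd

# Step 1: read the raw data file (treated as read-only; never edited in place).
raw = pd.read_csv("data/raw/temperatures.csv")

# Step 2: convert Fahrenheit readings to Celsius, since downstream analyses
# expect SI units.
raw["temp_c"] = (raw["temp_f"] - 32) * 5 / 9

# Step 3: drop readings flagged as equipment errors during collection.
clean = raw[raw["flag"] != "error"]

# Step 4: write the cleaned data to a separate file, leaving the raw file untouched.
clean.to_csv("data/processed/temperatures_clean.csv", index=False)
```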
Within spreadsheets, the logic and methods behind the pipeline of data processing and analysis are often not documented at all, or are documented only through cell comments (which are hard to see as a whole) or in emails rather than in the spreadsheet file itself. One study that investigated a large collection of spreadsheets found that most do not include documentation explaining the logic or implementation of the data processing and analysis implemented within the spreadsheet.34 A survey of neuroscience researchers at a UK institute found that about a third of respondents included no documentation for spreadsheets used in their research laboratories.35
When spreadsheet pipelines are documented, it is often through methods that are hard to find and interpret later. One study of scientific researchers found that, when research spreadsheets were documented, it was often through “cell comments” added to specific cells in the spreadsheet, which can be hard to piece together to understand the flow and logic of the spreadsheet as a whole.36
In some cases, teams use email chains, rather than the document itself, to discuss and document functionality and changes in spreadsheets. They pass versions of the spreadsheet file back and forth as email attachments and discuss the spreadsheet in the email body.
One research team investigated over 700,000 emails from employees of Enron that were released during legal proceedings.37 They specifically investigated the spreadsheets attached to these emails (over 15,000 spreadsheets) and how teams discussed the spreadsheets within the emails themselves. They found that the logic and methods of calculations within the spreadsheets were often documented within the bodies of emails. This means that, if someone needs to figure out why a step was taken or identify when an error was introduced into a spreadsheet, they must dig through the chain of old emails documenting that spreadsheet, rather than finding the relevant documentation within the spreadsheet’s own file.
Adding to this issue, data processing and analysis pipelines for spreadsheets are often not carefully designed; instead, it is more typical for a spreadsheet user to start by directly entering data and formulas without a clear overall plan.38 As a result, research spreadsheets often do not follow a common structure for the research field or for the laboratory group.39
Another problem comes up because there may only be one person on the team who fully understands the spreadsheet: the person who created the spreadsheet.40 This is particularly common if the spreadsheet includes complex macros or a complicated structure in the analysis pipeline.41 This practice creates a heavy dependence on the person who created that spreadsheet anytime the data or results in that spreadsheet need to be interpreted. This is particularly problematic in projects where the spreadsheet will be shared for collaboration or adapted to be used in a future project, as is often done in scientific research groups. In this case, it can be hard to “onboard” new people to use the file, and much of the work and knowledge about the spreadsheet can be lost when that person moves on from the business or laboratory group.42
If you share a spreadsheet with numerous and complex macros and formulas included to clean and analyze the data, it can take an extensive amount of time, and in some cases may be impossible, for the researcher you share it with to decipher what is being done to get from the original data input in some cells to the final results shown in others and in graphs. Further, if others can’t figure out the steps being done through macros and formulas in a spreadsheet, they will not be able to check it for problems in the logic of the overall analysis pipeline or for errors in the specific formulas used within that pipeline. They also will struggle to extend and adapt the spreadsheet to be used for other projects. These problems come up not only when sharing with a collaborator, but also when reviewing spreadsheets that you have previously created and used (as many have noted, your most frequent collaborator will likely be “future you”). In fact, one survey of biomedical researchers at the University of Washington noted that,
“The profusion of individually created spreadsheets containing overlapping and inconsistently updated data created a great deal of confusion within some labs. There was little consideration to future data exchange of submission requirements at the time of publication.”43
1.2.3 Potential for errors
Because spreadsheets often do a poor job of making the analysis steps transparent, they can be prone to bugs in analysis. As one early article on the history of spreadsheet programs notes:
“People tend to forget that even the most elegantly crafted spreadsheet is a house of cards, ready to collapse at the first erroneous assumption. The spreadsheet that looks good but turns out to be tragically wrong is becoming a familiar phenomenon.”44
Indeed, previous studies have found that errors are very common within spreadsheets.45 For example, one study of 50 operational spreadsheets found that about 90% contained at least one error.46
In part, it is easier to make errors in spreadsheets and harder to catch errors in later work with a spreadsheet because the formulas and connections between cells aren’t visible when you look at the spreadsheet—they’re behind the scenes.47 This makes it hard to get a clear and complete view of the pipeline of analytic steps in data processing and analysis within a spreadsheet, or to discern how cells are connected within and across sheets of the spreadsheet.
Some characteristics of spreadsheets may heighten chances for errors. These include high conditional complexity, which can result from lots of branching of data flow through if / else structures, as well as formulas that depend on a large number of cells or that incorporate many functions.48 Following the logical chain of spreadsheet formulas can be particularly difficult when several calculations are chained in a row.49 In some cases, if you are trying to figure out very long chains of dependent formulas across spreadsheet cells, you may even have to sketch out by hand the flow of information through the spreadsheet to understand what’s going on.50 When a spreadsheet uses macros, it can also make it particularly hard to figure out the steps of an analysis and to diagnose and fix any bugs in those steps.51 One study investigated how spreadsheets are used in practice and noted that, “Many spreadsheets are so chaotically designed that auditing (especially of a few formulas) is extremely difficult or impossible.”52
In some cases, formula dependencies might span different sheets of a spreadsheet file. These cross-sheet dependencies can make the analysis steps even more opaque,53 as a change in a cell value on one sheet might not be immediately visible as a change in a cell on another sheet (the same is true for spreadsheets so large that all the cells in a sheet are not concurrently visible on the screen). Other common sources of errors include incorrect references to cells inside formulas, incorrect use of formulas,54 and errors introduced through the common practice of copying and pasting when developing spreadsheets.55
Some methods have been brought from more traditional programming work into spreadsheet programming to help limit errors, including assertions, which allow users to validate data or test logic within their spreadsheets.56 However, these are often not implemented, perhaps in part because many spreadsheet users see themselves as “end-users”, creating spreadsheets for their own personal use rather than as something robust to future use by others, and so don’t seek out the strategies programmers adopt when creating stable tools for others to use.57 In practice, though, a spreadsheet is often used much longer, and by more people, than originally intended. From early in the history of spreadsheet programs, users have shared spreadsheet files with interesting functionality with other users,58 and the lifespan of a spreadsheet can keep extending—a spreadsheet created by one user for their own personal use can end up being used and modified by that person or others for years.59
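The assertions mentioned above have a direct analogue in scripted languages. As a hedged sketch (file and column names are hypothetical), a script can validate data before any analysis runs, failing loudly if an assumption is violated:

```python
import pandas as pd

data = pd.read_csv("data/raw/measurements.csv")  # hypothetical file

# Each assertion states an assumption about the data; if it is violated,
# the script stops here with an informative message instead of silently
# producing wrong results downstream.
assert data["animal_id"].notna().all(), "missing animal identifiers"
assert not data["animal_id"].duplicated().any(), "duplicate animal identifiers"
assert data["temp_c"].between(25, 45).all(), "temperature outside plausible range"
```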
1.2.4 Better software tools are available
While spreadsheets serve as a widely used tool for data recording and analysis, in many cases spreadsheet programs are poorly suited to pre-processing and analyzing scientific data compared to other programs. As tools and interfaces continue to develop that make other software more user-friendly to those new to programming, scientists may want to reevaluate the costs and benefits, in terms of both time required for training and aptness of tools, of spreadsheet programs compared to scripted programming languages like R and Python.
Several problems have been identified with spreadsheet programs in the context of recording and, especially, analyzing scientific data. First, some statistical methods may be inferior to those available in other statistical programming languages. Many statistical operations require computations that cannot be performed exactly by a computer, since the computer must ultimately solve many mathematical problems (e.g., those involving calculus) using numerical approximations. The choice of the algorithms used for these approximations heavily influences how closely a result approximates the true answer. Since the most popular spreadsheet program (Excel) is closed-source, it is hard to identify and diagnose such problems, and there is likely less of an incentive for problems in statistical methodology to be fixed (rather than using development time and funds to increase easier-to-see functionality in the program).
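A generic illustration of why algorithm choice matters (a classic numerical-analysis example, not a claim about any specific program’s current implementation): the one-pass “calculator” formula for the sample variance suffers catastrophic cancellation when the data have a large mean, while the mathematically equivalent two-pass formula does not.

```python
# Sample variance two ways, on data with a large mean and small spread.
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
n = len(data)
mean = sum(data) / n

# One-pass formula: (sum of x^2 - n * mean^2) / (n - 1).
one_pass = (sum(x * x for x in data) - n * mean * mean) / (n - 1)

# Two-pass formula: sum of squared deviations from the mean, / (n - 1).
two_pass = sum((x - mean) ** 2 for x in data) / (n - 1)

print(one_pass)  # garbage from catastrophic cancellation (can even be negative)
print(two_pass)  # 30.0, the correct sample variance
```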
A series of papers examined the quality of statistical methods in several statistical software programs, including Excel, starting in the 1990s.60 In the earliest studies, they found some concerns across all programs considered.61 One of the biggest concerns, however, was that there was little evidence over the years that the identified problems in Excel were resolved, or at least improved, over time.62 The authors note that there may be little incentive for checking and fixing problems with algorithms for statistical approximation in closed-source software like Excel, where sales might depend more on the more immediately evident functionality in the software, while problems with statistical algorithms might be less evident to potential users.63
Open-source software, on the other hand, offers pathways for identifying and fixing any problems in the software, including for statistical algorithms and methods implemented in the software’s code. Since the full source code is available, researchers can closely inspect the algorithms being used and compare them to the latest knowledge in statistical computing methodology. Further, if an inferior algorithm is in use, most open-source software licenses allow a user to adapt and extend the software, including to implement better statistical algorithms.
Another problem is that spreadsheet programs can include automated functionality that’s meant to make something easier for most users, but that might invisibly create problems. A critical example has been identified when using Excel for genomics data. When Excel encounters a cell value in a format that seems like it could be a date (e.g., “Mar-3-06”), it will try to convert that cell to a “date” class. Many software programs save dates in this special “date” class, where the value is printed and visually appears in a format like “3-Mar-06” but is saved internally by the program as a number (for Microsoft Excel, the number of days since January 1, 1900).64 By doing this, the software can more easily undertake calculations with dates, like calculating the number of days between two dates or which of two dates is earlier. Bioinformatics researchers at the National Institutes of Health found that Excel was performing this type of automatic and irreversible date conversion for 30 gene names, including “MAR3” and “APR-4”, resulting in these gene names being lost for further analysis.65
Avoiding this automatic date conversion required specifying that columns susceptible to these problems, including columns of gene names, should be retained in a “text” class in Excel’s file import process. While this problem was originally identified and published in 2004,66 along with tips to identify and avoid the problem, a study in 2016 found that approximately a fifth of genomics papers investigated in a large-scale review had gene name errors resulting from Excel automatic conversion, with the rate of errors actually increasing over time.67
Other automatic conversion problems caused the loss of clone identifiers that were composed of digits and the letter “E.”68 These were assumed to be expressing a number in scientific notation and so were automatically and irreversibly converted to a numeric class. Further automatic conversion problems can be caused by cells that start with an operator (e.g., “+ control”) or with leading zeros in a numeric identifier (e.g., “007”).69
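In a scripted import, the analogous safeguard is to declare identifier columns as text explicitly, so nothing is silently reinterpreted as a date or a number. A hedged sketch (file and column names are hypothetical):

```python
import pandas as pd

# Force identifier columns to be read as plain text, so values like
# "MAR3" or clone IDs containing "E" are never converted to dates or
# scientific-notation numbers, and leading zeros are preserved.
genes = pd.read_csv(
    "gene_counts.csv",
    dtype={"gene_name": str, "clone_id": str, "sample_id": str},
)
```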
Finally, spreadsheet programs can be limited as analysis needs become more complex or data grow large.70 For example, spreadsheets can be problematic when integrating or merging large, separate datasets.71 Further, while spreadsheet programs continue to expand in their capacity for data, for very large datasets they continue to face limits that may be reached in practical applications;72 until recently, for example, Excel could not handle more than one million rows of data per spreadsheet. Even when spreadsheets can handle larger data, running data processing and analysis pipelines across large datasets can be slow compared to code implemented in other programming languages.
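To make the contrast concrete, here is a hedged sketch (hypothetical file and column names) of how a scripted language can summarize a file far larger than a spreadsheet’s row limit, by streaming it in chunks rather than loading it all at once:

```python
import pandas as pd

total, n = 0.0, 0
# Read 100,000 rows at a time, so even a file with tens of millions of
# rows never has to fit in memory, let alone in a spreadsheet.
for chunk in pd.read_csv("huge_measurements.csv", chunksize=100_000):
    total += chunk["temp_c"].sum()
    n += len(chunk)

print(total / n)  # overall mean, computed without holding all rows at once
```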
1.2.5 Difficulty collaborating with statisticians
Modern biomedical research requires large teams, often including statisticians and bioinformaticians, to enable sophisticated processing and analysis of experimental data. However, the practice of combining data recording and analysis, especially through the use of spreadsheet programs, can create barriers to working across disciplines. One group defined these issues as “data friction” and “science friction”—the extra steps and work required at each interface where data passes, for example, from a machine to analysis or from a collaborator in one discipline to one in a separate discipline.73
When collaborating with statisticians or bioinformaticians, one key source of this “data friction” can be the use of spreadsheets to jointly record and analyze experimental data. First, spreadsheets are easy to print or copy into another format (e.g., a PowerPoint presentation or Word document), and so researchers often design spreadsheets to be immediately visually appealing to viewers. For example, a spreadsheet might be designed to include hierarchically organized headers (e.g., headings and subheadings, some within a cell merged across several columns), or to show the result of a calculation at the bottom of a column of observations (e.g., “Total” in the last cell of the column).74 Multiple separate small tables might be included in the same sheet, with empty cells used for visual separation, or a sheet might use a “horizontal single entry” design, where the headers are in the leftmost column rather than the top row.75
These design choices make it much more difficult for the contents of the spreadsheet to be read into other statistical programs. Data in these layouts require several extra coding steps, in some cases fairly complex ones using regular expressions or logical rules, to parse out the values and convert them to the needed shape before the statistical work can begin. This is a poor use of time for a collaborating statistician, especially if it can be avoided through the design of the data recording template. Further, it introduces many more chances for errors in cleaning the data.
Further, information embedded in formulas, macros, and extra formatting like color or text boxes is lost when the spreadsheet file is read into other programs. Spreadsheets allow users to use highlighting to represent information (e.g., measurements for control animals shown in red, those for experimental animals in blue) and to include information or documentation in text boxes. For example, one survey study of biomedical researchers at the University of Washington included this quote from a respondent: “I have one spreadsheet that has all of my chromosomes … and then I’ve gone through and color coded it for homozygosity and linkage.”76 All the information encoded in this sheet through color will be lost when the data from the spreadsheet are read into another statistical program.
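The remedy is to record such information as explicit data rather than formatting. A minimal sketch (values are hypothetical): instead of color-coding control versus treatment rows, store group membership in its own column, which survives import into any statistical program.

```python
import pandas as pd

measurements = pd.DataFrame({
    "animal_id": ["A01", "A02", "A03", "A04"],
    "temp_c":    [36.9, 37.4, 38.1, 37.0],
    # Group membership as data, not as cell color:
    "group":     ["control", "control", "treatment", "treatment"],
})
```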
1.3 Approaches to separate recording and analysis
In the remaining modules in this section, we will present and describe techniques that can be used to limit or remove these problems. First, in the next few modules, we will walk through techniques to design data recording formats so that data is saved in a consistent format across experiments within a laboratory group, and in a way that removes “data friction” for collaboration with statisticians or later use in scripted code. These techniques can be immediately used to design a better spreadsheet to be used solely for data collection.
In later modules, we will discuss the use of project directories to coordinate data recording and analysis steps within a directory, while using separate files for data recording versus data processing and analysis. These more advanced formats will enable the use of quality assurance / control measures like testing of data entry and analysis functionality, better documentation of data analysis pipelines, and easy use of version control to track projects and collaborate transparently and with a recorded history.
1.4 Discussion questions
What types of data do you record in your research? Do you use spreadsheets to record data in your scientific research? For data that aren’t recorded in spreadsheets, what alternatives do you use (paper laboratory notebooks, electronic laboratory notebooks, data recorded directly by equipment, other)?
If you use spreadsheets in your scientific research, what do you like and dislike about them? The reading in this module mentioned that some of the upsides include that they are easy to learn and that many people have access to the software needed to use them. Do these advantages resonate with you?
There are a variety of ways that scientists use spreadsheets in their research. Describe examples of when you’ve used spreadsheets to:
- Record data
- Clean data
- Preprocess data
- Analyze data
- Create graphs
- Create reports
Based on the reading and your own experience, what are potential advantages and pitfalls of using spreadsheets for each of these purposes?
Describe how you document any steps you take to clean or preprocess your data once you record it. Based on the reading, list some advantages and disadvantages of the method you use versus alternative methods of documentation.
What does it mean to consider raw data as “read-only”? What are some ways to ensure that raw data are treated as read-only? Do you have any examples from your research when raw data were not treated as read-only? What problems could that cause for research reproducibility?
When someone new joins your research group, how do you teach them how to record experimental data? Do you have any experiences when this did not go smoothly?
In your research, have you ever used a spreadsheet that included formulas or macros? How did you document these steps and check them for potential errors? What do you see as the advantages and disadvantages of including formulas or macros in spreadsheets?