Module 13 Selecting software options for pre-processing

Module 12 described some common themes and processes in pre-processing biomedical data. While we’ve covered some key processes of pre-processing, we haven’t yet talked about the tools you can use to implement them. Pre-processing steps are often combined into a pipeline (also called a workflow), and these pipelines can become fairly long and complex when the data themselves are complex.

Most pre-processing pipelines will be run on the computer, with software tools. An exception might be very simple pre-processing tasks; one example is generating the average cage weight for a group of mice based on the total cage weight and the number of mice. However, even simple processes like this, which can be done by hand, can also be done with a computer, and doing so can help avoid errors and provide a record of the calculation that was used for the pre-processing.
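As a minimal sketch of what this simple calculation might look like in a script, the Python snippet below computes the average weight per mouse for one cage. The function name, argument names, and example values are our own assumptions for illustration, and we assume that the "average cage weight" means the total cage weight divided by the number of mice in the cage.

```python
def average_mouse_weight(total_cage_weight_g: float, n_mice: int) -> float:
    """Return the average weight per mouse (in grams) for one cage."""
    if n_mice <= 0:
        raise ValueError("n_mice must be a positive integer")
    return total_cage_weight_g / n_mice


# Hypothetical example values: one cage of 4 mice weighing 100 g in total
cage_average = average_mouse_weight(total_cage_weight_g=100.0, n_mice=4)
print(cage_average)  # 25.0
```

Even for a calculation this small, the script doubles as a written record of exactly how the value was derived.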

You will have a choice about which type of software you use for pre-processing. There are two key dimensions that separate these choices—first, whether the software is point-and-click versus script-based, and, second, whether the software is proprietary versus open-source. It is important to note that, in some cases, it may make sense to develop a pipeline that chains together a few different software programs to complete the required pre-processing.

In this module, we’ll talk about the advantages and disadvantages of these different types of software. For reproducibility and rigor, there are many advantages to using software that is script-based and open-source for data pre-processing, and so in later modules, we’ll provide more information on how you can use this type of software for pre-processing biomedical data. We also recognize, however, that there are some cases where such software may not be a viable option for some or all of the data pre-processing for a project.

Objectives. After this module, the trainee will be able to:

Describe software approaches for pre-processing data
Compare the advantages and disadvantages of Graphical User Interface–based versus scripted approaches and of open-source versus proprietary approaches to pre-processing

13.1 GUI-based software versus script-based software

When you pick software for pre-processing, the first key dimension to consider is whether the software is “point-and-click” or script-based. Let’s start with a definition of each.

Point-and-click software is more formally known as GUI-based software, where GUI stands for “graphical user interface”. These are programs where your hand is on the mouse most of the time, and you use the mouse to select actions and options from buttons and other widgets that are shown by the software on the screen. This type of software is also sometimes called “widget-based”, as it is built around widgets like drop-down menus and slider bars.²⁹⁰

A basic example of GUI-based software is your computer’s calendar application (“application” is a common synonym for “software”). To navigate across dates on your calendar, you use your mouse to click on arrows or dates. The software includes some text entry—for example, if you add something to your calendar, you can click on a textbox and enter a description of the activity using your keyboard. However, the basic way that you navigate and use the software is via your computer mouse.

Script-based software uses a script, rather than clickable buttons and graphics, as its main interface. A script, in this case, is a line-by-line set of instructions describing what actions you want the software to perform. With script-based software, you typically keep your hands on the keyboard more often than on the mouse. Many script-based software programs will also allow you to send the lines of instructions one at a time in an area referred to as a console, which returns the result from each line after you run it. Script-based software is also sometimes called software that is “used programmatically.”²⁹¹ Several script-based software programs are commonly used with biomedical data, including R, Python, and Unix bash scripts, as well as some less common but emerging software programs like Julia.

When comparing point-and-click software to script-based software for pre-processing, there are a few advantages to point-and-click software, but many more to script-based software. In terms of code rigor and reproducibility, script-based software comes out well ahead, especially when used to its full advantage.

Let’s start, though, by acknowledging some appealing features of point-and-click software. These features likely contribute to its wide popularity and to the fact that the vast majority of software that you use in your day-to-day life outside of research is probably point-and-click.

First, GUI-based software is often easier to learn to use, at least in terms of basic use. The visual icons help you navigate choices and actions in the software. Most GUI-based software programs are designed to take underlying processes and make them easier for a new user to access and use. They do this through an interface that is visual, rather than language- and script-based. Further, many people are most familiar with point-and-click software, since so many everyday applications are of this type, and so its interface can feel more familiar to users. They also are easier for a new user to pick up because they typically provide a much smaller set of options than a full programming language does.

By contrast, script-based software can take a larger initial investment of time and energy to learn. This is because each of these tools is a coding language, and each one is just that: a language. It is built on an (often large) vocabulary that you must learn to be proficient, since you must learn the names and options for a large set of functions within the language. Further, it has rules and logic you must learn, both for how to structure and access data and for how the inputs and outputs of different functions can be chained together to build pipelines for pre-processing and analysis.
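To make the idea of chaining functions a bit more concrete, here is a small sketch in Python. The three functions, the background value, and the example measurements are all hypothetical; the point is only that the output of each step becomes the input to the next, which is the basic shape of a pre-processing pipeline.

```python
from statistics import mean


def remove_missing(values):
    """Drop missing measurements (recorded here as None)."""
    return [v for v in values if v is not None]


def subtract_background(values, background):
    """Subtract a constant background reading from every measurement."""
    return [v - background for v in values]


def summarize(values):
    """Return the number of measurements and their mean."""
    return {"n": len(values), "mean": mean(values)}


# Hypothetical raw measurements, with one missing value
raw = [12.0, None, 15.0, 18.0]

# Each step takes the previous step's output as its input
cleaned = remove_missing(raw)
corrected = subtract_background(cleaned, background=2.0)
summary = summarize(corrected)
print(summary)  # {'n': 3, 'mean': 13.0}
```

Because each step is its own function, a step can be swapped out or reused in another pipeline without rewriting the rest.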

Script-based software also requires you to be precise in this language. As Brian Kernighan writes in his book D is for Digital:

“A computer is the ultimate sorcerer’s apprentice, able to follow instructions tirelessly and without error, but requiring painstaking accuracy in the specification of what to do.”²⁹²

However, while there is a higher investment required to learn script-based software versus point-and-click software, there is also a higher payoff from that effort. Script-based software creates a full framework for you to combine tools in interesting ways and to build new tools when you need them. With point-and-click software, there’s always a layer between the user and the computer logic, and you are constrained to only use tools that were designed by the person who programmed the point-and-click software. By contrast, with script-based software, you have more direct access to the underlying computer logic, and with many popular script-based languages (R, Python), you have extraordinary power and flexibility in what you can ask the program to do.

As an analogy, think about traveling to a country where you don’t yet speak the language. You have a few choices in how you could communicate. You could memorize a few key phrases that you think you’ll need, or get a phrase book that lists these key phrases. Another choice is to try to learn the language, including learning the grammar of the language, and how thoughts are put together into phrases. Learning the language, even at a basic level, will take much more time. However, it will allow you much greater ability to express yourself. If you only know set phrases, then you may know how to ask someone at a bakery for a loaf of bread, if the person who wrote the phrase book decided to include that, but not how to ask at a hotel for an extra blanket, if that wasn’t included. By contrast, if you’ve learned the language, you have learned how to form a question, and so you can extrapolate to express a great variety of things.

GUI-based software can be like using a phrase book for a foreign language—if the person who developed the tool didn’t imagine something that you need, you’re stuck. Scripted software is more like learning a language—you have to learn the rules (grammar) and vocabulary (names of functions and their parameters), but once you do, you can combine them to address a wide variety of tasks, including things no one else has yet thought of.

In the late 1990s, a famous computer scientist named Richard Hamming wrote a book called “The Art of Doing Science and Engineering”, in which he talks a lot about the process of building things and the role that programming can play in this process. He predicted at the time that by 2020, it would be the experts in a particular field who do the programming for that field, rather than experts in computer programming trying to build tools for other fields.²⁹³ He notes:

“What is wanted in the long run, of course, is that the man with the problem does the actual writing of the code with no human interface, as we all too often have these days, between the person who knows the problem and the person who knows the programming language. This date is unfortunately too far off to do much good immediately, but I would think by the year 2020 it would be fairly universal practice for the expert in the field of application to do the actual program preparation rather than have experts in computers (and ignorant in the field of application) do the program preparation.”²⁹⁴

The rise of open-source, scripted programs like Python and R is rapidly helping to achieve this vision—scientists in a variety of fields now write their own small software programs and tools, building on the framework of larger open-source languages. Training programs in many scientific fields recommend or require at least one course in programming in these languages, often taught in conjunction with data analysis and data management.

Another element that has helped make script-based software more accessible is the development of programming languages that are easier to learn and use. Very early programming languages required the programmer to understand a lot about how the computer was built and organized, including thinking about where and how data were stored in the computer’s memory. As programming languages have developed, such “low-level” languages have remained in use, as they often allow for unmatched speed in processing. However, “higher-level” programming languages have become more common, and while these might be somewhat slower in computational processing power, they are much faster for humans to learn and to create tools with, as they abstract away many of the details that make low-level programming more difficult.

Because of the development of easier-to-learn high-level programming languages like R and Python, it is possible for a scientist to become proficient in one of these script-based programs in about a year. In our own experience, we have found that often one semester of a dedicated course or serious self-study, followed with several months of regularly applying the software to research data, is enough for a scientist to become productive in using a script-based software like R or Python for research. With another year or so of regular use, scientists can often start making their own small software extensions to the language. However, in a 2017 article on analyzing single-cell RNA-sequencing data, the author noted that “relatively few biologists are comfortable working in those environments”, referring to Unix and R,²⁹⁵ and noted that this was a barrier to using many of the available tools for working with single-cell RNA-sequencing data at the time.

It is true that this is a substantially larger investment in training than a short course or workshop, which might be adequate for learning the basics of many GUI-based software programs. Time can be a critical barrier, especially for scientists who are advanced in their career and may have minimal time for further training. Further, it’s more of a barrier in analyzing some types of biomedical data, due to the extreme size and complexity of the data.²⁹⁶ However, it is much less of a time investment than it takes to become an expert in a scientific field. It takes years of training to become an expert in cellular biology or immunology, for example. Richard Hamming’s vision was that the experts can ask the best and most creative questions of the data, and that it is best to remove the barrier of a separate person as the computer programmer, so that the expert can directly create the program and leverage the full capabilities of the computer. Higher-level programming languages now are accessible enough that this vision is playing out across scientific fields.

Script-based approaches also encourage the user to learn how the underlying process works. The approach encourages the user to think more like a car owner who gets under the hood from time to time than like one who only drives the car. This approach does take more time to learn and develop, but the upside is that the user will often have a much deeper understanding of what is happening in each step, as well as how to repair or adjust individual steps to fix a pipeline or adapt it to meet another analysis need.

Another advantage of script-based software—and one that is related to the idea of experts in a scientific field directly programming—is that often the most cutting edge algorithms and pipelines will be available first in scripted languages, and only later be added into point-and-click software programs. This means that you may have earlier access to new algorithms and approaches if you are comfortable coding in a script-based language.

For example, an article about single-cell RNA-sequencing from 2017 noted that, at the time, there were “very few, if any, ‘plug-and-play’ packages” for working with scRNA-seq data, and of those available, they were “user-friendly but have the drawback that they are to some extent a ‘black box’, with little transparency as to the precise algorithmic details and parameters employed.”²⁹⁷ Similarly, another article in the same year noted that, at the time, “most scRNA-seq tools exist as Unix programs or packages in the programming language R”, although “some ready-to-use pipelines have been developed.”²⁹⁸

Another key advantage of script-based software is that, in writing the script, you are thoroughly documenting the steps you took to pre-process the data. When you create a code script, the script itself includes all the steps and details of the process. In combination with information about the version of all software used and the raw data input to the pipeline, it creates a fully reproducible record of the data pre-processing and analysis.
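As one hedged illustration of recording software versions alongside a script, the Python sketch below collects the version of Python, a description of the operating system, and the versions of any packages you name. The function name `describe_environment` and the choice to record `pip` are assumptions made for the example; in practice you would list the packages your own pipeline uses.

```python
import json
import platform
from importlib import metadata


def describe_environment(package_names):
    """Collect version information for Python and the named packages."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Hypothetical usage: record the environment for a pipeline that uses pip
record = describe_environment(["pip"])
print(json.dumps(record, indent=2))
```

Saving this output next to the script and the raw data gives a future reader the three pieces of information described above.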

This means both that you will be able to redo all the steps yourself in the future, if you need to, and that other researchers can explore and replicate what you did. You may want to share your process with others in your laboratory group, for example, so they can understand the choices you made and steps you took in pre-processing the data. You may also want to share the process with readers of the articles you publish, and this may in fact be required by the journal. Well-documented code also makes it much easier to write up the methods section later in manuscripts that leverage the data collected in the experiment.

By contrast, while you could write down the steps that you took and the buttons you pressed when using GUI-based software, it’s very easy to forget to record a step. Some GUI-based programs are taking steps to ameliorate this, either by allowing a user to save or download a full record of the steps taken in a given pipeline or by allowing the user to develop a full, recorded workflow (one example is FlowJo Envoy’s workflow model for analyzing data from flow cytometry). There are also some movements towards “integrative frameworks”, which can help improve reproducibility for pipelines that span different types of software (Galaxy, GeneProf).²⁹⁹

When you use a code script, it will not run if you forget a step or a detail of that step. It is like writing a recipe that can be applied again and again. By writing a script, you encode the process a single time, so you can take the time to check and recheck to make sure that you’ve encoded the process correctly. This helps in avoiding small errors when you do the pre-processing—if you are punching numbers into a calculator over and over, it’s easy to mistype a number or forget a step every now and then, while the code will ensure that the same process is run every time and that it faithfully uses the numbers saved in the data for each step, rather than relying on a person correctly entering each number in the calculation.

Scripts can be used across projects, as well, and so they can ensure consistency in the calculation across projects. If different people do the calculation in the lab for different projects or experiments, and they are doing the calculations by hand, they might each do the calculation slightly differently, even if it’s only in small details like how they report rounded numbers. A script will do the exact same thing every time it is applied. You can even share your script with colleagues at other labs, if you want to ensure that your data pre-processing is comparable for experiments conducted in different research groups, and many scientific journals will allow supplemental material with code used for data pre-processing and analysis, or links within the manuscript to a repository of this code posted online.
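The rounding example can be sketched in code. In the Python snippet below, a single function fixes one rounding rule (round half up, which is our assumption for illustration) and one number of decimal places, so that every project reports values the same way. The function name `report_value` and the example measurements are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def report_value(value, places=1):
    """Round a measurement half-up to a fixed number of decimal places.

    Returns a string so that trailing zeros (e.g. "25.0") are preserved.
    """
    quantum = Decimal(f"1e-{places}")  # places=1 gives a step of 0.1
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    return str(rounded)


# Hypothetical measurements from two different projects
project_a = [24.95, 25.25]
project_b = [2.25]

print([report_value(v) for v in project_a])  # ['25.0', '25.3']
print([report_value(v) for v in project_b])  # ['2.3']
```

Because the rounding rule lives in one shared function rather than in each person's habits, everyone who uses the script reports numbers identically.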

There are also gains in efficiency when you use a script. This gain often fully pays back the investment in learning the software, since it can make data pre-processing and analysis much more efficient over the long term. For small pre-processing steps, the gains might seem small for each experiment, and certainly when you first write the script, it will likely take longer to write and test the script than it would to just do the calculation by hand (even longer if you’re just starting to learn how to write code scripts). However, since the script can be applied again and again, with very little extra work to apply it to new data, you’ll save yourself time in the future, and over a lot of experiments and projects, this can add up. This makes it particularly useful to write scripts for pre-processing tasks that you find yourself doing again and again in the lab.

13.2 Open-source versus proprietary software

When selecting software for pre-processing, the other dimension to consider is whether it is open-source or proprietary. Open-source software is software where you can access, explore, and build on all the underlying code for the software. It also most often is free. By contrast, the code that powers proprietary software is typically kept private, so you can use the product but cannot explore the way that it is built or extend it in the same way that you can open-source software. In biomedical research, many script-based languages are open-source, while many GUI-based programs are proprietary. However, this is not a hard and fast rule, and there are examples of open-source GUI-based software (for example, the Inkscape program for vector graphic design) as well as proprietary script-based software (for example, Matlab and SAS). There are advantages and disadvantages to both types of software, but in terms of rigor and reproducibility, open-source software often has the advantage.

Transparency is a key element of reproducibility.³⁰⁰ As Gordon Lithgow and coauthors note in a commentary on reproducibility, “Improved reproducibility comes from pinning down methods.”³⁰¹ If the algorithms of software can be investigated, then scientists who are using two different programs (for example, one program in Python and one in R) can determine if their choice of program is causing differences in their results. By contrast, if two research groups use two different types of proprietary software, the algorithms that underlie the processing are often kept secret and so cannot be compared. In that case, if the two groups conduct the same experiment and get different results, it is impossible to rule out the choice of software as the cause of the difference.

Gordon Lithgow, Monica Driscoll, and Patrick Phillips wrote a commentary for Nature describing their experiences in replicating research. They describe the advice they give students who are trying to do an experiment that should work:

“If there is nothing wrong with the reagents and reproducibility is still an issue, then as I like to tell students, there are two options: (1) the physical constants of the universe and hence the laws of physics are in a state of flux in their round-bottomed flask, or (2) the researcher is doing something wrong and either doesn’t know it or doesn’t want to know it. Then I ask them which explanation they think I’m leaning towards.”³⁰²

If you get different results from another group, it is critical to have a detailed description of the methods that each group used to figure out why the groups are getting different results. Open-source software provides this at the level of computational analysis, because the openness of the software means anyone can explore the exact details of how each algorithm runs.

One key advantage of open-source software is that all code is open, so you can dig down to figure out exactly how each step of the program works. Further, in many cases for open-source scientific software, the algorithms and their principles have gone through peer review as part of the academic publication process. With proprietary software, on the other hand, details of algorithms may be considered protected intellectual property, and so it may be hard to find out the details of how the underlying algorithms work.³⁰³ Also, the algorithms may not have gone through peer review, especially if they are considered private intellectual property.

Another advantage of open-source software is that older versions of the software are often well-archived and easily available to reinstall if you need to reproduce an analysis that was done using an earlier version of the software. Another advantage is that open-source software is often free. This makes it economical to test out, and it means that trainees from a lab will have no problem continuing to use the software as they move to new positions. The cost with open-source software, then, comes not with the price to buy the software, but with the investment that is required to learn it.

One facet where proprietary software has an advantage is that it will often have more comprehensive company-based user support than open-source software. The companies that make and sell proprietary software will usually have a user support team to answer questions and help develop pipelines and may also offer training programs or materials.

Some open-source software also has robust user support, although it is sometimes less well organized under a common source. In some cases, this has developed as a result of a large community of users who help each other. Message boards like Stack Overflow provide a forum for users to ask and respond to questions. Some companies also exist that provide, as their business model, user support for open-source software. While open-source software is usually free, these companies make money by providing support for that software.

User support is sparser for some of the smaller software packages that are developed as extensions of open-source software. For example, many of the packages for pre-processing types of biomedical data are built by small bioinformatics teams or individuals at academic or research institutions. Often this software is developed by a single person or very small team as one part of their job profile, with limited resources for user support and for providing training. These extensions build on larger, more supported open-source software (e.g., R or Python), but the extension itself is built and maintained by a very small team that may not have the capacity to respond quickly to user questions. Many open-source software developers try to create helpful documentation in the form of help files and package vignettes (tutorials on how to use the software they created), but from a practical point of view it is difficult for small open-source developers to provide the same level of user support that a large proprietary software company can.

This is often the case with cutting-edge open-source software for biomedical pre-processing. These just-developed software packages are less likely to be comprehensively documented than longer-established software. Further, it can take a while for a community of users to develop once software is available. While this is a limitation of new software whether it is open-source or proprietary, it can be more of a problem for open-source software, where there is typically no company-based helpline and so the community of users is often one of the main sources of help and troubleshooting.
