Module 6 Organizing project files
In earlier modules, we discussed separating data collection from data analysis. Keeping collection and analysis in separate files makes the file for each step simpler. Further, separating steps into different files lets us save the files in plain text, which makes it easier to track them using version control software (discussed in later modules). This helps create a record of changes made to the data or analysis code during the research process.
While this practice helps reproducibility, it results in more files for an experiment. Instead of the data and its analysis living in a single spreadsheet file, you may end up with multiple files of data collected from the experiment, as well as separate files with scripts for processing, analyzing, and visualizing the data. With more complex experiments, there may be different data files containing the data collected from different assays. For example, you may run an experiment where you collect data from each research animal on bacterial load, flow cytometry measurements, and antibody levels measured through ELISA. As a result, you may have one raw data file from each assay and, for some assays, even one file per study subject (e.g., flow cytometry). The files for a research project will also include writing and presentations (posters and slides) associated with the project, as well as code scripts for pre-processing data, for conducting data analysis, and for creating and sharing final figures and tables.
In this and the next few modules, we’ll discuss how you can organize the files for an experiment using a single directory that is designed to follow a similar format across all your projects. The modules will discuss the advantages of well-designed project directories, tips for arranging files within a project directory, and how to create a directory template that allows you to use consistent file organization across many experiments.
Objectives. After this module, the trainee will be able to:
- Explain how poor file organization can impede reproducibility
- List benefits of good file organization
- List several principles for organizing research project files
- Define the design concept of “discoverability”
- Apply the idea of discoverability in organizing project files
- Explain how a project directory template works
6.1 Advantages of organizing project files
As the files for a project accumulate, do you have a clear plan for keeping them organized? Many biomedical researchers do not. One study, for example, surveyed over 250 biomedical researchers at the University of Washington. The authors noted that, “Some researchers admitted to having no organizational methodology at all, while others used whatever method best suited their individual needs.”147 One respondent answered, “They’re not organized in any way—they’re just thrown into files under different projects,” while another said “I grab them when I need them, they’re not organized in any decent way,” and another, “It’s not even organized—a file on a central computer of protocols that we use, common lab protocols but those are just individual Word files within a folder so it’s not searchable per se.”148
This lack of organization can make scientists reluctant to share their research files, impeding reproducibility. In an article on organizing project files for research, Marwick notes:
“Virtually all researchers use computers as a central tool in their workflow. However, our formal education rarely includes any training in how to organise our computer files to make it easy to reproduce results and share our analysis pipeline with others. Without clear instructions, many researchers struggle to avoid chaos in their file structures, and so are understandably reluctant to expose their workflow for others to see. This may be one of the reasons that so many requests for details about method, including requests for data and code, are turned down or go unanswered.”149
Sharing data and code is crucial to research reproducibility, especially for projects that include extensive preprocessing and complex analysis of data, as many biomedical research projects now do. As a further bonus, when research articles include data, they tend to be more impactful, as measured by the citations the paper receives.150
In an earlier module, we introduced Adam Savage’s idea of “knolling” to keep a workspace tidy (module 3). He was talking about a physical workspace. When you are working with data, computer files and directories are your workspace. For any type of work, the design of the workspace plays a critical role in how the workers approach tasks and solve problems. Rod Judkins, who is a lecturer at St Martin’s College of Art, highlights this in a book on creative thinking:
“Your working environment, whether it’s a supermarket, office, studio, or building site, persuades you to work and think in certain ways. The more aware you are of that, and the more you understand your medium, the more you can use it to your advantage.”151
Adam Savage describes how important this is in another type of work: gourmet cooking. He describes how the idea of an organized workspace is captured by the technique of mise en place—laying out all the elements needed for the work ahead of time and in an organized way—introduced by the famous French chef Auguste Escoffier:
“Kitchens are pressure cookers in which wasted movement and hasty technique can ruin a dish, slice an artery, burn a hand, land you in the weeds, and ultimately kill a restaurant. Mise en place is the only way to reliably create a perfect dish, to exact specifications, over and over again, night after night, for paying customers who demand nothing less.”152
Good organization of your files can similarly encourage clear thinking, and it can help you in reasoning through how to analyze data. One article notes that “mundane issues such as organizing files and directories and documenting progress … are important because poor organizational choices can lead to significantly slower research progress.”153 In fact, if files are organized in a consistent way across multiple projects, this can even allow you to start automating some necessary tasks through code that is built to work with that consistent structure.154
Organization also helps you find things, and find them quickly, even when you come back to a project after a while away from it (for example, while the paper was out for review). You can teach others how to find things quickly and consistently across your multiple projects, as well as where to put things they’re contributing.
Good file organization will also help you find information you need when it’s time to write up your results. As one article notes, with good organization, “methods and data sections in papers practically write themselves, with no time wasted in frenzied hunting for missing information.”155
Finally, good file organization can improve your efficiency. An article on organizing computational biology projects highlights this:
“Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization will be much, much easier.”156
6.2 How to organize project files
Now that we’ve explained why to organize project files, let’s talk about how you can do that. We’ll cover higher-level principles in this module. In the next few modules, we’ll move into more details and examples.
First, and at a minimum, you should get in the habit of storing all of the files for an experiment in the same place. Specifically, project files should all be in a single directory within the file system of a computer.157 While this can be an individual’s computer, it may also be on a dedicated server or through an online, cloud-based program.
There are a number of advantages to keeping all of a project’s files inside a dedicated file directory. First, it provides a clear and obvious place to search for all project files as you work on the project, including after lulls (like waiting for reviews from a paper submission).
One article about the reproducibility of scientific papers talks about how helpful this organization can be, describing the experience for a project that involved a large research group:
“Instead of squirrelling away data in individual folders and lab books, researchers now archive all published data in a designated central drive, so that the information is accessible for the long haul. Initially, people thought the process was just extra bureaucratic work, or that it had been invented so I could police their data. Now, it has become the norm, and researchers tell me they save time and worry by having their data organized and archived.”158
By keeping all project files within a single directory, you also make it easier to share those files as a unit. There are several reasons you might want to share these files. An obvious one is sharing the project files across members of your research team, so they can collaborate on the project. However, there are also other reasons you’d need to share files, and one that is growing in importance is that you may be asked to share files (data, code scripts, etc.) when you publish a paper describing your results.
When files are all stored in one directory, the directory can be compressed and shared as an email attachment (if the file size is small enough) or through a file sharing platform like Google Drive. When all the materials for a project are stored in a single directory, it also becomes easier to share the set of files through version control and online version control platforms.159 In later modules in this book (modules 9–11), we will introduce Git version control software and the GitHub platform for sharing files under this type of version control. This is one example of a more dynamic way of sharing files, but it requires the files to be stored in a single directory.
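As a minimal sketch of the compress-and-share step (the directory and file names here are made up for illustration), standard shell tools can bundle the whole project directory into a single archive:

```shell
# Create a small stand-in project directory (names are hypothetical).
mkdir -p my_project/raw_data my_project/scripts
echo "id,cfu" > my_project/raw_data/bacterial_load.csv

# Bundle the entire directory into one compressed archive,
# which can then be emailed or uploaded to a file-sharing platform.
tar -czf my_project.tar.gz my_project

# A collaborator can unpack it with: tar -xzf my_project.tar.gz
```

Because everything lives under one top-level directory, a single `tar` command captures the full project, subdirectories and all.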
To gain the advantages of directory-based project file organization, all the files need to be within a single directory, but they don’t all have to be at the same “level” in that directory. Instead, you can use subdirectories to structure and organize these files, while still retaining all the advantages of directory-based file organization. Computer file systems are built around a hierarchical design, with subdirectories nested inside directories. You can leverage this structure to manage the complexity and breadth of files for your project.
This will help limit the number of files in each “level” of the directory, so none becomes an overwhelming collection of files of different types. It can help you navigate the files in the directory, and also help someone else quickly figure out what’s in the directory and where everything is. However, to leverage these gains, you need to be thoughtful about exactly how you organize the files into subdirectories.
As you decide how to organize files, keep in mind a concept called discoverability. In the classic design book The Design of Everyday Things, Don Norman presents discoverability as a key principle of good design, explaining it as a user’s ability to figure out, from the design of something, how to use that thing quickly, easily, and correctly.
He illustrates this with an example of discoverability in the design of doors. For a door, the location of a pull handle and a push bar immediately shows someone how to use the door: pull on the side of the door where you see a pull handle and push where you see a push bar. If the door lacks these cues, it is harder for a user to “discover” how to use it at first glance, and they might try to push when they need to pull or vice versa.
The same idea applies when you design an organizational system for project files. You want to make sure that a new user (or you in the future) will be able to easily navigate through the directory to find what they need. One article on organizing research project files notes that, when it comes to deciding how to organize your files, “The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.”160 Another notes, “The key principle is to organize the [project directory] so that another person can know what to expect from the plain meaning of the file and directory names.”161
One way to improve discoverability is to name your files and subdirectories in meaningful ways. The computer will give you wide flexibility in setting names for files and subdirectories, but a human will find it much easier to navigate a directory when the names are clear labels that describe the contents. For example, if you have data from different assays, you might organize them all into a directory named “raw_data” that is then divided into subdirectories named with the type of assay.
As you develop names that are discoverable, keep in mind that your users may include some people outside your field, for whom some shorthand common in the field might be unclear. For example, in some studies of infectious bacterial disease, the bacterial load is measured in an assay that counts colony forming units. Among bench scientists in this field, the assay is often called “CFUs”. If you are collaborating with a statistician, however, they may find the files more discoverable if you named the subdirectory with these files something like “bacterial_load” rather than “cfus”, as they may not be familiar with that shorthand.
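Putting these ideas together, a single project directory with clearly named per-assay subdirectories might be laid out as follows. The layout and names here are purely illustrative, not a prescribed standard; note the use of “bacterial_load” rather than the field shorthand “cfus”:

```shell
# Hypothetical project layout: one top-level directory, with raw data
# split into per-assay subdirectories named for discoverability.
mkdir -p example_project/raw_data/bacterial_load \
         example_project/raw_data/flow_cytometry \
         example_project/raw_data/elisa \
         example_project/scripts \
         example_project/reports
```

A collaborator browsing this tree can guess where each assay’s data lives without asking, which is exactly the discoverability the naming is meant to provide.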
Another way to improve discoverability is to follow any standards that exist for organizing project files.162 The use of standards or conventions tends to make it easier for users to navigate (“discover”) new instances of a certain type of thing. In module 2, we discussed this role of standards when it comes to the format you use to record your data. When it comes to project file organization, standards will come in the form of which subdirectories are included, how they’re organized hierarchically, and how subdirectories and files are named.
These standards can exist at several levels: at a top level for your discipline, but also just for your lab group, or even for you as an individual. It is very helpful when standards exist at a discipline-wide level, as following this type of high-level standard will immediately make your work discoverable (in the design sense) to a wide group of people. As one article notes, “Using widely held conventions… will help other people to understand how your files relate to each other without having to ask you.”163
As an example, when people develop R packages, each package consists of a set of files, and there is a very clear and strictly enforced standard for how these files are arranged in a directory and how the subdirectories are named. Because this standard is enforced, many different people can create packages that all work in a similar way.
On the opposite end of the spectrum, if there are not clear standards at the level of your discipline, you could create a clear standard that you plan to follow either for your lab group or even for your individual work. If you’re consistent in organizing your files using that standard, it will make it easier to navigate files as you move from one project to another.
As an added bonus, subdirectory organization can also be used in clever ways within code scripts applied to files in the directory. For example, all scripting languages include functions that will list the files in a specified subdirectory. If you keep all your raw data files of a certain type (for example, all output from flow cytometry for the project) within a single subdirectory, you can use this type of function in code scripts to list all the files in that directory and then apply code that you’ve developed to preprocess or visualize the data across all those files. This code will continue to work as you add files to that directory, since each time it runs, it starts by listing that subdirectory and works with all the files there at that moment.
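As a small sketch of this pattern in the Unix shell (the file and directory names are invented for the example), a loop can list whatever files are currently in a subdirectory and apply the same step to each:

```shell
# Set up a subdirectory with a few example raw data files.
mkdir -p demo_project/raw_data/flow_cytometry
touch demo_project/raw_data/flow_cytometry/mouse_01.csv
touch demo_project/raw_data/flow_cytometry/mouse_02.csv

# Loop over every CSV currently in that subdirectory and apply the
# same step to each file (here, just reporting its name). Adding a
# new data file later requires no change to this code.
for f in demo_project/raw_data/flow_cytometry/*.csv; do
  echo "processing $f"
done
```

In R or Python, `list.files()` or `os.listdir()` plays the same role as the shell glob here: the script discovers its inputs from the directory rather than from a hard-coded list.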
This type of automation can be a huge efficiency boost for your project. One article describes how this type of automation can increase efficiency with a comparison to a simpler task in working with computer files:
“Organizing data files into a single directory with consistent filenames prepares us to iterate over all of our data, whether it’s the four example files used in this example, or 40,000 files in a real project. Think of it this way: remember when you discovered you could select many files with your mouse cursor? With this trick, you could move 60 files as easily as six files. You could also select certain file types (e.g., photos) and attach them all to an email with one movement. By using consistent file naming and directory organization, you can do the same programmatically using the Unix shell and other programming languages.”164
A final way to improve your directory organization is to make sure the directory is not cluttered with unnecessary files. Unnecessary files can include old versions of project files, which have been superseded by newer versions. In later modules (modules 9–11), we’ll describe how version control can help avoid this clutter from old versions of files while retaining information from older versions as files evolve.
6.3 What is a project directory template?
Louis Pasteur famously said that “chance favors the prepared mind.” In file organization, as with so much else, time spent preparing can pay off many times over later. In this case, the next step is to not only use a structured directory for each project or experiment, but to start using the same, standardized structure for every one of your projects and experiments—in other words, to create a standard for file organization and to use it consistently.
In other modules, we talk about how templates can be used to improve the rigor and reproducibility of collecting and reporting on data. Just as it’s possible to create templates for data collection and for reports, it’s also possible to create a template for how you organize file directories for your scientific projects, creating and applying standards for things like which subdirectories are included and how files are named. This takes more work—to design a structure that can be used across many projects, rather than to set something up ad hoc as you start each new experiment. However, the gains in terms of organization and efficiency can be extraordinary.
This involves first designing a common template for the directory structure of your projects. Once you have decided on a structure for this template, you can create a version of it on your computer—a file directory with all the subdirectories included, but without any files (or with only template files you’d want as a starting point in each project, like templates for data collection and reports as presented in modules 4 and 5). When you start a new project, you can then copy this template directory, rename it, and start using it for your new research project. If you are using R and begin to use R Projects (described in the next section), you can also create an RStudio Project template to serve as this kind of starting point each time you start a new project.
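A minimal sketch of this copy-and-rename workflow in the shell (the template contents and the new project’s name are hypothetical):

```shell
# Build the empty template directory once, containing the
# subdirectories every project should start with (illustrative names).
mkdir -p project_template/raw_data project_template/scripts \
         project_template/reports

# Starting a new project is then a single copy, renamed for the study.
cp -r project_template cfu_timecourse_study
```

The one-time cost of designing `project_template` is amortized across every project that starts from it, and every copy is guaranteed to share the same structure.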
In other areas of science and engineering, this idea of standardized directory structures has enabled powerful ways for open-source software developers to work together. For example, anyone may create their own extensions to the R programming language and share them with others through GitHub or one of several large repositories. As mentioned earlier in this module, this is coordinated by enforcing a common directory structure on these extension “packages”—to create a new package, you must put certain types of files in certain subdirectories within a project directory. With these standardized rules of directory structure and content, each of these packages can interact with the base version of R, since functions can tap into any new package by assuming where each type of file will be within the package’s directory of files.
In a similar way, if you impose a common directory structure across all the project directories in your research lab, your collaborators will quickly learn where to find each element, even in projects they are new to. You will also all be able to write code that can easily be applied across all project directories. This improves reproducibility and comparability by ensuring that you conduct the same pre-processing and analysis across all projects (or, if you are conducting things differently for different projects, that you are deliberate and aware that you are doing so). Creating a project template that you copy and rename as you start a new project is one way to facilitate this.
As you use a template for a project, you can customize it as needed. For example, if the template includes a subdirectory for flow cytometry data, but you are not running that assay in this experiment, you can remove that subdirectory. Similarly, you can customize the template’s report files as you go to help them work well for the specific experiment. However, aim to keep to the standard format as much as possible, since it’s the standardization across projects that provides many of the advantages.
In module 7, we will walk through the steps of designing a project template that you can use across experiments for your laboratory group. In module 8, we’ll walk through an example of creating and using this kind of project template for an example set of studies.