To improve the computational reproducibility of a research project, researchers can use a single ‘Project’ directory to collectively store all research data, metadata, preprocessing code, and research products (e.g., paper drafts, figures). We will explain how this practice improves reproducibility and list some common components and subdirectories to include in the structure of a ‘Project’ directory, including subdirectories for raw and preprocessed experimental data.
Objectives. After this module, the trainee will be able to:
One of the most powerful features of modern computers is their file directory system. [More on these.]
It is useful to leverage this system to organize all the files related to a project. These include data files: both “raw” data, directly output from measurement equipment or directly recorded from observations, and any “cleaned” version of that data, produced by preprocessing steps that prepare it for visualization and analysis in papers and reports. They also include the files with writing and presentations (posters and slides) associated with the project, as well as code scripts for preprocessing data, conducting data analysis, and creating and sharing final figures and tables.
There are a number of advantages to keeping all files related to a single project inside a dedicated file directory on your computer. First, this provides a clear and obvious place to search for all project files throughout your work on the project, including after lulls in activity (for example, while waiting for reviews from a paper submission). By keeping all project files within a single directory, you also make it easier to share the collection of files for the project. There are several reasons you might want to share these files. An obvious one is that you likely will want to share the project files across members of your research team, so they can collaborate on the project. There are other reasons to share files as well; one that is growing in popularity is that you may be asked to share files (data, code scripts, etc.) when you publish a paper describing your results.
When files are all stored in one directory, the directory can be compressed and shared as an email attachment or through a file sharing platform like Google Drive. As you learn more tools for reproducibility, you can also share the directory through more dynamic platforms that let everyone with access continue to change and contribute to the files in the directory in a way that is tracked and reversible. In later modules in this book, we will introduce the git version control software and the GitHub platform for sharing files under this type of version control; this is one example of this more dynamic way of sharing files within a directory.
To gain the advantages of directory-based project file organization, all the files need to be within a single directory, but they don’t all have to be within the same “level” in that directory. Instead, you can use subdirectories to structure and organize these files, while still retaining all the advantages of directory-based file organization. This will help limit the number of files in each “level” of the directory, so none becomes an overwhelming slew of files of different types. It can help you navigate the files in the directory, and also help someone you share the directory with figure out what’s in it and where everything is.
Subdirectory organization can also be used in clever ways within code scripts applied to files in the directory. For example, every scripting language has functions that will list all the files in a specified subdirectory. If you keep all your raw data files of a certain type (for example, all output from running flow cytometry for the project) within a single subdirectory, you can use this type of function to list all the files in that directory and then apply code you’ve developed to preprocess or visualize the data across all those files. This code will continue to work as you add files to that directory, since each time it runs it starts by looking in that subdirectory and works with all the files there as of that moment.
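As a minimal sketch of this idea in the Unix shell (the file and directory names here are hypothetical), the loop below picks up every CSV file currently in a raw-data subdirectory and applies a placeholder preprocessing step to each:

```shell
# Set up a small example raw-data subdirectory (hypothetical names).
mkdir -p raw_data/flow
printf 'id,value\n1,10\n' > raw_data/flow/run_01.csv
printf 'id,value\n2,20\n' > raw_data/flow/run_02.csv

# Loop over every CSV file present at run time; the same loop keeps
# working as new data files are added to the subdirectory.
for f in raw_data/flow/*.csv; do
  echo "preprocessing $f"   # stand-in for a real preprocessing command
done
```

In R, `list.files("raw_data/flow", pattern = "\\.csv$")` plays the same role, returning a vector of filenames you can loop over in a script.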
It is worthwhile to take some time to think about the types of files that are often generated by your research projects, because there are also big advantages to creating a standard structure of subdirectories that you can use consistently across the directories for all the projects in your research program. Of course, some projects may not include certain files, and some might have a new or unusual type of file, so you can customize the directory structure to some degree for these types of cases, but it is still a big advantage to include as many common elements as possible across all your projects.
For example, you may want to always include a subdirectory called “raw_data,” and consistently call it “raw_data,” to store data directly from observations or directly output from laboratory equipment. You may want to include subdirectories in that “raw_data” subdirectory for each type of data—maybe a “cfu” subdirectory, for example, with results from plating data to count colony forming units, and another called “flow” for output from a flow cytometer. By using the same structure and the same subdirectory names, you will find that code scripts are easier to reuse from one project to another. Again, most scripting languages allow you to leverage order in how you’ve arranged your files in the file system, and so using the same order across different projects lets you repeat and reuse code scripts more easily from one project to another.
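A sketch of creating such a standard structure from the shell (the subdirectory names are illustrative, following the “raw_data” example above):

```shell
# Create a standard project skeleton in one command; the -p flag builds
# any missing parent directories and is safe to rerun.
mkdir -p my_project/raw_data/cfu \
         my_project/raw_data/flow \
         my_project/cleaned_data \
         my_project/code \
         my_project/reports

# Confirm the raw-data layout.
ls my_project/raw_data
```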
Finally, if you create a clear and clean organizational structure for your project directories, you will find it much easier to navigate your files in all directories, and new lab members and others you share the directories with will be able to quickly learn to navigate them, too. In other areas of science and engineering, standardized directory structures have enabled powerful techniques for open-source software developers to work together. For example, anyone may create their own extensions to the R programming language and share them with others through GitHub or several large repositories. This is coordinated by enforcing a common directory structure on these extension “packages”: to create a new package, you must put certain types of files in certain subdirectories within a project directory. With these standardized rules of directory structure and content, each of these packages can interact with the base version of R, since functions can tap into any new package by assuming where each type of file will be within the package’s directory of files. In a similar way, if you impose a common directory structure across all the project directories in your research lab, your collaborators will quickly learn where to find each element, even in projects they are new to, and you will all be able to write code that can easily be applied across all project directories. This improves reproducibility and comparability across projects, by assuring that you are conducting the same preprocessing and analysis in all of them (or, if you are conducting things differently for different projects, that you are deliberate and aware of doing so).
Figure [x] gives an example of a project directory organization that might make sense for an immunology research laboratory.
Once you have decided on a structure for your directory, you can create a template of it: a file directory with all the subdirectories included, but without any files (or with only template files you’d want to use as a starting point in each project). When you start a new project, you can then just copy this template and rename it. If you are using R and begin to use RStudio Projects (described in the next section), you can also create an RStudio Project template to serve as this kind of starting point each time you start a new project.
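The template-and-copy approach can be sketched in the shell (directory names are hypothetical): build the empty template once, and starting a new project becomes a single copy-and-rename step.

```shell
# Build the empty template once.
mkdir -p project_template/raw_data project_template/code project_template/reports

# Starting a new project is then one copy-and-rename step.
cp -R project_template flu_vaccine_study

# The new project directory starts with the full standard structure.
ls flu_vaccine_study
```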
If you are using the R programming language for data preprocessing, analysis, and visualization—as well as RMarkdown for writing reports and presentations—then you can use RStudio’s “Project” functionality to make it even more convenient to work with files within a research project’s directory. You can make any file directory a “Project” in RStudio by choosing “File” -> “New Project” in RStudio’s menu. This gives you the option to create a project from scratch or to make an existing directory an RStudio Project.
When you make a file directory an RStudio Project, it doesn’t change much in the directory itself except adding a “.Rproj” file. This file keeps track of some things about the file directory for RStudio, including … Also, when you open one of these Projects in RStudio, it will move your working directory into that project’s top-level directory. This makes it very easy and practical to write code using relative pathnames that start from this top level of the project directory. This is very good practice, because these relative pathnames will work equally well on someone else’s computer. If instead you use absolute pathnames (i.e., giving directions to the file from the root directory of your computer), then when someone else tries to run the code on their own computer, it won’t work, and they’ll need to change the filepaths in the code, since everyone’s computer has its files organized differently. For example, if you have the project directory stored in your “Documents” folder on your personal computer, while a colleague has stored the project directory in his or her “Desktop” directory, then the absolute filepaths for each file in the directory will be different for each of you. The relative pathnames, starting from the top level of the project directory, will be the same for both of you, regardless of where you each stored the project directory on your computer.
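A small shell sketch of why relative paths travel well (paths and names are hypothetical): the project directory is moved, as if stored elsewhere on a collaborator’s machine, and the same relative path keeps working.

```shell
# A project with one data file, referenced by a relative path.
mkdir -p demo_project/data
echo "obs1" > demo_project/data/measurements.csv

# From the project's top level, a relative path finds the file.
(cd demo_project && cat data/measurements.csv)

# "Move" the whole project, as a collaborator storing it elsewhere would.
mv demo_project moved_project

# The same relative path still works from the new top level, while any
# absolute path recorded earlier would now be wrong.
(cd moved_project && cat data/measurements.csv)
```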
There are some other advantages, as well, to turning each of your research project directories into RStudio Projects. One is that it is very easy to connect each of these Projects with GitHub, which facilitates collaborative work on the project across multiple team members while tracking all changes under version control. This functionality is described in a later module in this book.
As you continue to use R and RStudio’s Project functionality, you may want to take the template directory for your projects and create an RStudio Project template based on its structure. Once you do, when you start a new research project, you can create the full directory for your project’s files from within RStudio by going to “File” -> “New Project” and then choosing to create a new project based on that template. The new project will already be set up with the “.Rproj” file that allows you to easily navigate into and out of the project and to connect it to GitHub, along with all the other advantages of making a file directory an RStudio Project. The next module gives step-by-step directions for making a directory an RStudio Project, as well as for creating your own RStudio Project template to quickly create a new directory of project files each time you start a new research project.
[Visual—project directory as a mise en place for cooking—everything you need for the analysis, plus the recipe for someone to repeat later.]
[Reference: The Usual Suspects—you’ll typically have the same types of data files, analysis, types of figures, etc., come up again and again for different research projects. Leverage tools to improve efficiency when working with these “usual suspects.” The first time you follow a protocol that is new to you, or the first time you cook a recipe, it takes much longer and much more thought than it should as you do it over and over—there are some recipes where I only use the cookbook now to figure out the oven temperature or the exact measurement of an ingredient. These tools will help you streamline your project file organization and move towards reuse of modular tools and ideas (e.g., remembering how to make a vinaigrette and applying that regardless of the type of salad) across projects.]
[Analogies for moving to do things more programmatically—Tom Sawyer outsourcing the fence painting, sorcerer’s apprentice (all the mops, plus some difficulties when you first start, before you get the hang of it).]
[File extensions give an idea of the power of consistent file names. While some operating systems don’t require these, by naming all the files that should be opened with, for example, Word “.docx,” the operating system can easily do a targeted search that looks for files with certain key words in the name while limiting the search only to Word files. You can leverage this same power yourself, and in a way that’s more customized to your project or typical research approach, by using consistent conventions to name your files.]
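For example (the file names here are hypothetical), a consistent extension lets a search target only one file type:

```shell
# Two files of different types mentioning the same keyword.
mkdir -p project_notes
echo "flow cytometry summary" > project_notes/flow_summary.docx
echo "plot(flow)"             > project_notes/flow_figures.R

# Restrict a keyword search to Word files only.
find project_notes -name '*flow*.docx'
```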
One study surveyed over 250 biomedical researchers at the University of Washington. They noted that, “a common theme surrounding data management and analysis was that many researchers preferred to utilize their own individual methods to organize data. The varied ways of managing data were accepted as functional for most present needs. Some researchers admitted to having no organizational methodology at all, while others used whatever method best suited their individual needs.” (Anderson et al. 2007) One respondent answered, “They’re not organized in any way—they’re just thrown into files under different projects,” while another said “I grab them when I need them, they’re not organized in any decent way,” and another, “It’s not even organized—a file on a central computer of protocols that we use, common lab protocols but those are just individual Word files within a folder so it’s not searchable per se.” (Anderson et al. 2007)
“In general, data reuse is most possible when: 1) data; 2) metadata (information describing the data); and 3) information about the process of generating those data, such as code, are all provided.” (Goodman et al. 2014)
“So far we have used filenames without ever saying what a legal name is, so it’s time for a couple of rules. First, filenames are limited to 14 characters. Second, although you can use almost any character in a filename, common sense says you should stick to ones that are visible, and that you should avoid characters that might be used with other meanings. … To avoid pitfalls, you would do well to use only letters, numbers, the period and the underscore until you’re familiar with the situation [i.e., characters with pitfalls]. (The period and the underscore are conventionally used to divide filenames into chunks…) Finally, don’t forget that case distinctions matter—junk, Junk, and JUNK are three different names.” (Kernighan and Pike 1984)
“The [Unix] system distinguishes your file called ‘junk’ from anyone else’s of the same name. The distinction is made by grouping files into directories, rather in the way that books are placed on shelves in a library, so files in different directories can have the same name without any conflict. Generally, each user has a personal or home directory, sometimes called login directory, that contains only the files that belong to him or her. When you log in, you are ‘in’ your home directory. You may change the directory you are working in—often called your working or current directory—but your home directory is always the same. Unless you take special action, when you create a new file it is made in your current directory. Since this is initially your home directory, the file is unrelated to a file of the same name that might exist in someone else’s directory. A directory can contain other directories as well as ordinary files … The natural way to picture this organization is as a tree of directories and files. It is possible to move around within this tree, and to find any file in the system by starting at the root of the tree and moving along the proper branches. Conversely, you can start where you are and move toward the root.” (Kernighan and Pike 1984)
“The name ‘/usr/you/junk’ is called the pathname of the file. ‘Pathname’ has an intuitive meaning: it represents the full name of the path from the root through the tree of directories to a particular file. It is a universal rule in the Unix system that wherever you can use an ordinary filename, you can use a pathname.” (Kernighan and Pike 1984)
“If you work regularly with Mary on information in her directory, you can say ‘I want to work on Mary’s files instead of my own.’ This is done by changing your current directory with the cd command… Now when you use a filename (without the /’s) as an argument to cat or pr, it refers to the file in Mary’s directory. Changing directories doesn’t affect any permissions associated with a file—if you couldn’t access a file from your own directory, changing to another directory won’t alter that fact.” (Kernighan and Pike 1984)
“It is usually convenient to arrange your own files so that all the files related to one thing are in a directory separate from other projects. For example, if you want to write a book, you might want to keep all the text in a directory called ‘book.’” (Kernighan and Pike 1984)
“Suppose you’re typing a large document like a book. Logically this divides into many small pieces, like chapters and perhaps sections. Physically it should be divided too, because it is cumbersome to edit large files. Thus you should type the document as a number of files. You might have separate files for each chapter, called ‘ch1,’ ‘ch2,’ etc. … With a systematic naming convention, you can tell at a glance where a particular file fits into the whole. What if you want to print the whole book? You could say $ pr ch1.1 ch1.2 ch1.3 ..., but you would soon get bored typing filenames and start to make mistakes. This is where filename shorthand comes in. If you say $ pr ch* the shell takes the * to mean ‘any string of characters,’ so ch* is a pattern that matches all filenames in the current directory that begin with ch. The shell creates the list, in alphabetical order, and passes the list to pr. The pr command never sees the *; the pattern match that the shell does in the current directory generates a list of strings that are passed to pr.” (Kernighan and Pike 1984)
“The current directory is an attribute of a process, not a person or a program. … The notion of a current directory is certainly a notational convenience, because it can save a lot of typing, but its real purpose is organizational. Related files belong together in the same directory. ‘/usr’ is often the top directory of a user file system… ‘/usr/you’ is your login directory, your current directory when you first log in. … Whenever you embark on a new project, or whenever you have a set of related files … you could create a new directory with mkdir and put the files there.” (Kernighan and Pike 1984)
“Despite their fundamental properties inside the kernel, directories sit in the file system as ordinary files. They can be read as ordinary files. But they can’t be created or written as ordinary files—to preserve its sanity and the users’ files, the kernel reserves to itself all control over the contents of directories.” (Kernighan and Pike 1984)
“A file has several components: a name, contents, and administrative information such as permissions and modification times. The administrative information is stored in the inode (over the years, the hyphen fell out of ‘i-node’), along with essential system data such as how long it is, where on the disc the contents of the file are stored, and so on. … It is important to understand inodes, not only to appreciate the options on ls, but because in a strong sense the inodes are the files. All the directory hierarchy does is provide convenient names for files. The system’s name for a file is its i-number: the number of the inode holding the file’s information. … It is the i-number that is stored in the first two bytes of a directory, before the name. … The first two bytes in each directory entry are the only connection between the name of a file and its contents. A filename in a directory is therefore called a link, because it links a name in the directory hierarchy to the inode, and hence to the data. The same i-number can appear in more than one directory. The rm command does not actually remove the inodes; it removes directory entries or links. Only when the last link to a file disappears does the system remove the inode, and hence the file itself. If the i-number in a directory entry is zero, it means that the link has been removed, but not necessarily the contents of the file—there may still be a link somewhere else.” (Kernighan and Pike 1984)
“The file system is the part of the operating system that makes physical storage media like disks, CDs and DVDs, removable memory devices, and other gadgets look like hierarchies of files and folders. The file system is a great example of the distinction between logical organization and physical implementation; file systems organize and store information on many different kinds of devices, but the operating system presents the same interface for all of them.” (Kernighan 2011)
“A folder contains the names of other folders and files; examining a folder will reveal more folders and files. (Unix systems traditionally use the word directory instead of folder.) The folders provide the organizational structure, while the files hold the actual contents of documents, pictures, music, spreadsheets, web pages, and so on. All the information that your computer holds is stored in the file system and is accessible through it if you poke around. This includes not only your data, but the executable forms of programs (a browser, for example), libraries, device drivers, and the files that make up the operating system itself. … The file system manages all this information, making it accessible for reading and writing by applications and the rest of the operating system. It coordinates accesses so they are performed efficiently and don’t interfere with each other, it keeps track of where data is physically located, and it ensures that the pieces are kept separate so that parts of your email don’t mysteriously wind up in your spreadsheets or tax returns.” (Kernighan 2011)
“File system services are available through system calls at the lowest level, usually supplemented by libraries to make common operations easy to program.” (Kernighan 2011)
“The file system is a wonderful example of how a wide variety of physical systems can be made to present a uniform logical appearance, a hierarchy of folders and files.” (Kernighan 2011)
“A folder is a file that contains information about where folders and files are located. Because information about file contents and organization must be perfectly accurate and consistent, the file system reserves to itself the right to manage and maintain the contents of folders. Users and application programs can only change the folder contents implicitly, by making requests of the file system.” (Kernighan 2011)
“In fact, folders are files; there’s no difference in how they are stored except that the file system is totally responsible for folder contents, and application programs have no direct way to change them. But otherwise, it’s just blocks on the disk, all managed by the same mechanisms.” (Kernighan 2011)
“A folder entry for this [example] file would contain its name, its size of 2,500 bytes, the date and time it was created or changed, and other miscellaneous facts about it (permissions, type, etc., depending on the operating system). All of that information is visible through a program like Explorer or Finder. The folder entry also contains information about where the file is stored on disk—which of the 100 million blocks [on the example computer’s hard disk] contain its bytes. There are different ways to manage that location information. The folder entry could contain a list of block numbers; it could refer to a block that itself contains a list of block numbers; or it could contain the number of the first block, which in turn gives the second block, and so on. … Blocks need not be physically adjacent on disk, and in fact they typically won’t be, at least for large files. A megabyte file will occupy a thousand blocks, and those are likely to be scattered to some degree. The folders and the block lists are themselves stored in blocks…” (Kernighan 2011)
“When a program wants to access an existing file, the file system has to search for the file starting at the root of the file system hierarchy, looking for each component of the file path name in the corresponding folder. That is, if the file is /Users/bwk/book/book.txt on a Mac, the file system will search the root of the file system for Users, then search within that folder for bwk, then within that folder for book, then within that for book.txt. … This is a divide-and-conquer strategy, since each component of the path narrows the search to files and folders that lie within that folder; all others are eliminated. Thus multiple files can have the same name for some component; the only requirement is that the full path name be unique. In practice, programs and the operating system keep track of the folder that is currently in use so searches need not start from the root each time, and the system is likely to cache frequently-used folders to speed up operations.” (Kernighan 2011)
“When quitting R, the option is given to save the ‘workspace image.’ The workspace consists of all values that have been created during a session—all of the data values that have been stored in RAM. The workspace is saved as a file called .Rdata and when R starts up, it checks for such a file in the current working directory and loads it automatically. This provides a simple way of retaining the results of calculations from one R session to the next. However, saving the entire R workspace is not the recommended approach. It is better to save the original data set and R code and re-create results by running the code again.” (Murrell 2009)
“Just as a well-organized laboratory makes a scientist’s life easier, a well-organized and well-documented project makes a bioinformatician’s life easier. Regardless of the particular project you’re working on, your project directory should be laid out in a consistent and understandable fashion. Clear project organization makes it easier for both you and collaborators to figure out exactly where and what everything is. Additionally, it’s much easier to automate tasks when files are organized and clearly named. For example, processing 300 gene sequences stored in separate FASTA files with a script is trivial if these files are organized in a single directory and are consistently named.” (Buffalo 2015)
“Project directory organization isn’t just about being tidy, but is essential to the way by which tasks are automated across large numbers of files” (Buffalo 2015)
“All files and directories used in your project should live in a single project directory with a clear name. During the course of a project, you’ll have amassed data files, notes, scripts, and so on—if these were scattered all over your hard drive (or worse, across many computers’ hard drives), it would be a nightmare to keep track of everything. Even worse, such a disordered project would later make your research nearly impossible to reproduce.” (Buffalo 2015)
“Naming files and directories on a computer matters more than you may think. In transitioning from a graphical user interface (GUI) based operating system to the Unix command line, many folks bring the bad habit of using spaces in file and directory names. This isn’t appropriate in a Unix-based environment, because spaces are used to separate arguments in commands. … Although Unix doesn’t require file extensions, including extensions in file names helps indicate the type of each file. For example, a file named osativa-genes.fasta makes it clear that this is a file of sequences in FASTA format. In contrast, a file named osativa-genes could be a file of gene models, notes on where these Oryza sativa genes came from, or sequence data. When in doubt, explicit is always better than implicit when it comes to filenames, documentation, and writing code.” (Buffalo 2015)
“Scripts and analyses often need to refer to other files (such as data) in your project hierarchy. This may require referring to parent directories in your directory’s hierarchy … In these cases, it’s important to always use relative paths … rather than absolute paths … As long as your internal project directory structure remains the same, these relative paths will always work. In contrast, absolute paths rely on your particular user account and directory structure details above the project directory level (not good). Using absolute paths leaves your work less portable between collaborators and decreases reproducibility.” (Buffalo 2015)
“Document the origin of all data in your project directory. You need to keep track of where data was downloaded from, who gave it to you, and any other relevant information. ‘Data’ doesn’t just refer to your project’s experimental data—it’s any data that programs use to create output. This includes files your collaborators send you from their separate analyses, gene annotation tracks, reference genomes, and so on. It’s critical to record this important data about your data, or metadata. For example, if you downloaded a set of genic regions, record the website’s URL. This seems like an obvious recommendation, but countless times I’ve encountered an analysis step that couldn’t be easily reproduced because someone forgot to record the data’s source.” (Buffalo 2015)
“Record data version information. Many databases have explicit release numbers, version numbers, or names (e.g., TAIR10 version of genome annotation for Arabidopsis thaliana, or Wormbase release WS231 for Caenorhabditis elegans). It’s important to record all version information in your documentation, including minor version numbers.” (Buffalo 2015)
“Describe how you downloaded the data. For example, did you use MySQL to download a set of genes? Or the UCSC Genome Browser? These details can be useful in tracking down issues like when data is different between collaborators.” (Buffalo 2015)
“Bioinformatics projects involve many subprojects and subanalyses. For example, the quality of raw experimental data should be assessed and poor quality regions removed before running it through bioinformatics tools like aligners or assemblers. … Even before you get to actually analyzing the sequences, your project directory can get cluttered with intermediate files. Creating directories to logically separate subprojects (e.g., sequencing data quality improvement, aligning, analyzing alignment results, etc.) can simplify complex projects and help keep files organized. It also helps reduce the risk of accidentally clobbering a file with a buggy script, as subdirectories help isolate mishaps. Breaking a project down into subprojects and keeping these in separate subdirectories also makes documenting your work easier; each README pertains to the directory it resides in. Ultimately, you’ll arrive at your own project organization system that works for you; the take-home point is: leverage directories to help stay organized.” (Buffalo 2015)
“Because automating file processing tasks is an integral part of bioinformatics, organizing our projects to facilitate this is essential. Organizing data into subdirectories and using clear and consistent file naming schemes is imperative—both of these practices allow us to programmatically refer to files, the first step to automating a task. Doing something programmatically means doing it through code rather than manually, using a method that can effortlessly scale to multiple objects (e.g., files). Programmatically referring to multiple files is easier and safer than typing them all out (because it’s less error prone).” (Buffalo 2015)
“Organizing data files into a single directory with consistent filenames prepares us to iterate over all of our data, whether it’s the four example files used in this example, or 40,000 files in a real project. Think of it this way: remember when you discovered you could select many files with your mouse cursor? With this trick, you could move 60 files as easily as six files. You could also select certain file types (e.g., photos) and attach them all to an email with one movement. By using consistent file naming and directory organization, you can do the same programmatically using the Unix shell and other programming languages.” (Buffalo 2015)
“Because lots of daily bioinformatics work involves file processing, programmatically accessing files makes our job easier and eliminates mistakes from mistyping a filename or forgetting a sample. However, our ability to programmatically access files with wildcards (or other methods in R or Python) is only possible when our filenames are consistent. While wildcards are powerful, they’re useless if files are inconsistently named. … Unfortunately, inconsistent naming is widespread across biology, and is the scourge of bioinformaticians everywhere. Collectively, bioinformaticians have probably wasted thousands of hours fighting others’ poor naming schemes of files, genes, and in code.” (Buffalo 2015)
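The programmatic file access these passages describe can be done in Python with glob patterns, the equivalent of shell wildcards. The sample filenames below are hypothetical, chosen only to show one consistent naming scheme:

```python
import tempfile
from pathlib import Path

# Four hypothetical files following one scheme: <sample>_<read>.fastq
names = ["zmaysA_R1.fastq", "zmaysA_R2.fastq",
         "zmaysB_R1.fastq", "zmaysB_R2.fastq"]

with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    for n in names:
        (data / n).touch()      # create empty placeholder files
    # Because the names are consistent, one pattern selects every
    # forward read -- whether there are 4 files or 40,000.
    r1 = sorted(p.name for p in data.glob("*_R1.fastq"))
    print(r1)  # ['zmaysA_R1.fastq', 'zmaysB_R1.fastq']
```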
“Another useful trick is to use leading zeros … when naming files. This is useful because lexicographically sorting files (as ls does) leads to correct ordering. … Using leading zeros isn’t just useful when naming filenames; this is also the best way to name genes, transcripts, and so on. Projects like Ensembl use this naming scheme in naming their genes (e.g., ENSG00000164256).” (Buffalo 2015)
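The effect of leading zeros on lexicographic sorting is easy to demonstrate (the gene names here are made up):

```python
# Without leading zeros, lexicographic sorting interleaves the numbers:
unpadded = ["gene_1", "gene_2", "gene_10"]
print(sorted(unpadded))   # ['gene_1', 'gene_10', 'gene_2'] -- wrong order

# With leading zeros, lexicographic order matches numeric order:
padded = [f"gene_{i:03d}" for i in (1, 2, 10)]
print(sorted(padded))     # ['gene_001', 'gene_002', 'gene_010']
```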
“In addition to simplifying working with files, consistent naming is an often overlooked component of robust bioinformatics. Bad naming schemes can easily lead to switched samples. Poorly chosen filenames can also cause serious errors when you or collaborators think you’re working with the correct data, but it’s actually outdated or the wrong file. I guarantee that out of all the papers published in the past decade, at least a few and likely many more contain erroneous results because of a file naming issue.” (Buffalo 2015)
“In order to read or write a file, the first thing we need to be able to do is specify which file we want to work with. Any function that works with a file requires a precise description of the name of the file and the location of the file. A filename is just a character value…, but identifying the location of a file can involve a path, which describes a location on a persistent storage medium, such as a hard drive.” (Murrell 2009)
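The same distinction between a filename and its location can be seen with Python’s pathlib (the names below are illustrative, and a Unix-style path is assumed):

```python
from pathlib import PurePosixPath

# A full description of a file combines its location (the path)
# with its name (a character value).
p = PurePosixPath("data") / "raw" / "counts.csv"
print(p.name)               # 'counts.csv' -- the filename itself
print(p.parent.as_posix())  # 'data/raw'   -- the location it lives in
```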
“A regular expression consists of a mixture of literal characters, which have their normal meaning, and metacharacters, which have a special meaning. The combination describes a pattern that can be used to find matches amongst text values.” (Murrell 2009)
“A regular expression may be as simple as a literal word, such as cat, but regular expressions can also be quite complex and express sophisticated ideas, such as [a-z]{3,4}[0-9]{3}, which describes a pattern consisting of either three or four lowercase letters followed by any three digits.” (Murrell 2009)
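The example pattern from the quote can be tried directly with Python’s re module (the anchors ^ and $ are added here so the whole string must match):

```python
import re

# [a-z]{3,4} -> three or four lowercase letters; [0-9]{3} -> three digits.
pattern = re.compile(r"^[a-z]{3,4}[0-9]{3}$")

for text in ["cat123", "wolf007", "ab12", "CAT123"]:
    print(text, bool(pattern.match(text)))
# cat123 True, wolf007 True, ab12 False, CAT123 False
```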
“… it’s important to mind R’s working directory. Scripts should not use setwd() to set their working directory, as this is not portable to other systems (which won’t have the same directory structure). For the same reason, use relative paths … when loading in data, and not absolute paths… Also, it’s a good idea to indicate (either in comments or a README file) which directory the user should set as their working directory.” (Buffalo 2015)
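The same portability advice applies outside R. In Python, prefer relative paths (resolved against the working directory your README names) over absolute ones; the filenames below are hypothetical:

```python
from pathlib import PurePosixPath

# Portable: relative to the project root named in the README.
rel = PurePosixPath("data/raw/sample_001.fastq")
print(rel.is_absolute())       # False

# Not portable: hard-codes one person's machine (hypothetical path).
absolute = PurePosixPath("/home/alice/project/data/raw/sample_001.fastq")
print(absolute.is_absolute())  # True
```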
“Centralize the location of the raw data files and automate the derivation of intermediate data. Store the input data on a centralized file server that is professionally backed up. Mark the files as read-only. Have a clear and linear workflow for computing the derived data (e.g., normalized, summarized, transformed, etc.) from the raw files, and store these in a separate directory. Anticipate that this workflow will need to be run several times, and version it. Use the BiocFileCache package to mirror these files on your personal computer. [footnote: A more basic alternative is the rsync utility. A popular solution offered by some organizations is based on ownCloud. Commercial options are Dropbox, Google Drive and the like].” (Holmes and Huber 2018)