3.2 Introduction to scripted data pre-processing in R

We will show how to implement scripted pre-processing of experimental data through R scripts. We will demonstrate the difference between interactive coding and code scripts, using R for examples. We will then demonstrate how to create, save, and run an R code script for a simple data cleaning task.

Objectives. After this module, the trainee will be able to:

Describe what an R code script is and how it differs from interactive coding in R
Create and save an R script to perform a simple data pre-processing task
Run an R script
List some popular packages in R for pre-processing biomedical data

3.2.1 Compiled versus interpreted programming languages

When computers were first being developed, they were very tricky to program, because humans had to translate the logic of a task down to the very granular level that the computers of the time could process. As computers advanced, programming techniques and languages advanced as well. These evolved to let a programmer write at a level of logic that is more straightforward for humans, while the inner design of the programming language does the work of translating those instructions for the computer.

One key development in programming languages was the rise of interpreted programming languages. These contrast with a type of programming language called a compiled language. With a compiled language, you must write the full set of instructions before the computer can run any of them. This full set of instructions is then sent through a program called a compiler, which translates the instructions for the computer, and then the program can be run, either once or repeatedly. By contrast, interpreted languages do this translation for the computer “on the fly,” so you can run each step of the instructions as you write it and check the output one step at a time.

It may be easier to understand this difference with an analogy, so we’ll make a comparison with teaching someone how to cook a recipe. With an interpreted language, it is as if you are in the kitchen with the person you are teaching. You can tell them to do the first step (“chop the onion into small dice”). Then, you can take a look at the result. If you don’t like it (“those dice aren’t small enough—make them smaller”), you can give a new instruction. You can work through the entire recipe like this, checking and adjusting as you go. By contrast, with a compiled language, it is as if you have to write down the whole recipe and mail it off to someone in a different city, and then hope it all works okay.

Compiled languages have a number of advantages, with the speed of running the code being a key one, and so they are still widely used. However, interpreted languages are often easier for a new programmer to learn, because this process of checking and adjusting lets you see the result of each thing you ask the computer to do. Interpreted languages are now often taught as a programmer’s first language, with Python as a particularly popular choice. Other interpreted languages include Julia and R, with R being particularly popular for data science in general and for bioinformatics and other biological research in particular.

3.2.2 Code scripts versus interactive coding

When you use an interpreted programming language, like R, you will likely start to explore your data by working interactively: running one call, looking at the result, and then running the next call, adapting as needed based on what you see at each step.
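
As an illustration of this interactive workflow, the calls below could be typed one at a time at the R console, with each result checked before the next call is written. This is only a minimal sketch: the data frame `growth`, its columns `sample_id` and `od600`, and the values in it are hypothetical and are not taken from a real experiment.

```r
# Hypothetical example data: optical density readings for four samples
growth <- data.frame(
  sample_id = c("A1", "A2", "A3", "A4"),
  od600 = c(0.12, 0.45, NA, 0.51)
)

# Call 1: look at the structure of the data
str(growth)

# Call 2: summarize the measurement column; the output reports one NA
summary(growth$od600)

# Call 3: based on what we saw, drop the row with the missing measurement
growth_clean <- growth[!is.na(growth$od600), ]

# Call 4: check the result before moving on
nrow(growth_clean)
```

Each call here was chosen in response to the output of the one before it, which is the kind of checking and adjusting that an interpreted language makes possible.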

3.2.3 Process of building a code script

A code script is essentially a recipe for cleaning and analyzing data.
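
To make the recipe idea concrete, the sketch below saves a few cleaning steps in a plain-text file and then runs them all at once with `source()`. It is an illustration under stated assumptions rather than a prescribed workflow: the file name `clean_growth.R` and the example data are hypothetical, and `writeLines()` is used here only as a stand-in for typing and saving the script in a text editor or RStudio.

```r
# Contents of a hypothetical script, which would normally be typed into a
# file such as "clean_growth.R" and saved from an editor
script_lines <- c(
  "# clean_growth.R: remove rows with missing measurements",
  "growth <- data.frame(",
  "  sample_id = c('A1', 'A2', 'A3', 'A4'),",
  "  od600 = c(0.12, 0.45, NA, 0.51)",
  ")",
  "growth_clean <- growth[!is.na(growth$od600), ]"
)

# Save the script to a temporary file (a stand-in for saving from an editor)
script_path <- file.path(tempdir(), "clean_growth.R")
writeLines(script_lines, script_path)

# Run every line of the saved script, in order, with a single call
source(script_path)

# Objects created by the script are now available in the session
print(growth_clean)
```

Because the steps are saved in a file, the same recipe can be rerun later or shared with a collaborator. A saved script can also be run from outside an R session with the `Rscript` command-line tool.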