R basics tutorial

This is a brief introduction to the very basics of R and to some common data manipulation tasks.

What is R and how does it work?

R is two things: a programming language and the software that interprets that language. The goal of both is to give us many ways to process and analyse data.

The basics of how R works are as follows. In general, 1) we have our data or some other information (a formula, a model, a single value, or many other things) stored in an object in the computer’s memory; 2) we perform an operation on it, most likely by calling a function; 3) the arguments of the function specify the details of the operation; and 4) this gives us a result, which we can print on the screen, store in another object for further use, or export.

Say you have a table with stem diameters and total biomass of trees of your forest study site, which is stored in an object called trees, and you want to plot this data. The function could be plot and the arguments of that function will specify on which axis each variable will be plotted, the type of the plot (e.g. scatterplot) and the color of the symbols that represent the data, and the results of that (i.e. the plot) will be shown on your screen.

Illustrating this example

We will input some data on tree stem diameters and biomass and then plot it. We will explain the details later.

# Here we define two objects with the data
diameter <- c(2, 7, 9, 15, 20, 33)
biomass <- c(1, 23, 30, 60, 85, 153)

# Plot defining the roles of the data, type of plot (points), and shape and color of the symbols
plot(x = diameter, y = biomass, type = "p", pch = 20, col = "black")

Figure 1: A plot of stem diameter vs biomass in forest trees

That was very easy!

Why use R?

  • It's free in every possible way
  • It's available for all major operating systems and is usable even on old computers
  • Nearly endless possibilities for your analysis via R packages. R packages extend R's functionality, and can be created by the same people who develop the analysis techniques
  • You can do all the steps of your analysis in the same software
  • It can produce high-quality plots and technical/scientific reports
  • It can help make science reproducible, by letting you share your code so that others can replicate your analysis
  • Having so many possibilities within your reach makes you want to learn more and more

On RStudio

RStudio is software that can make R easier to use. It gives you several tools in one place, such as a code editor (for writing the code for your analysis), the R console (where the code is run), and panes for viewing help files, plots, the objects in memory, available packages, etc.

Figure 2: Components of RStudio

RStudio is not the only software for working easily with R. We will see some more in the next part of the course.

Basic object/information types and their creation

Data in R can be of several types. The most common ones are logical (either TRUE or FALSE), character, and numeric (there are others, such as dates). The data can then be stored in different types of objects. The most common ones are 1) single values, which need no explanation, 2) vectors, which are collections of values of the same type, 3) data frames, which are tables whose columns are vectors of equal length, 4) matrices, which are two-dimensional arrangements of values of the same type, and 5) lists, which can contain any of the other types.

To create any object, we choose a name for it, followed by <- and then what the object should contain. Function calls look like this: function(argument1 = value, argument2 = value, argument3 = value). We can print an object on the console by running its name. Object names should typically not be quoted.

We create single values like this:

# logicals (TRUE or FALSE, in capital letters)
logic1 <- TRUE

# numeric
numeric1 <- 1

# character (always single or double quoted)
char <- "sample text"

There are several ways to create vectors:

# numeric vectors
vector1 <- c(1, 2, 3, 4)
vector2 <- seq(from = 1, to = 4, by = 1)
vector3 <- 1:4

# character and logical
vector4 <- c("Pinus", "Juniperus", "Quercus", "Salvia")
vector5 <- c(TRUE, FALSE, FALSE, TRUE)
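Because a vector holds values of a single type, R converts mixed values to a common type. The following is a minimal sketch of how class() can be used to check the type of a vector and of how this conversion behaves. The object names mixedVector and numLogical are only illustrative and are not used elsewhere in this tutorial.

```r
# class() reports the type of data stored in a vector
numbers <- c(1, 2, 3, 4)
genera <- c("Pinus", "Juniperus", "Quercus", "Salvia")
alive <- c(TRUE, FALSE, FALSE, TRUE)

class(numbers) # "numeric"
class(genera)  # "character"
class(alive)   # "logical"

# Mixing types in one vector: everything is converted to character
mixedVector <- c(1, "Pinus", TRUE)
class(mixedVector) # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
numLogical <- c(5, TRUE, FALSE)
numLogical # 5 1 0
```

This is one reason to keep each variable in its own vector: a single stray character value turns a whole numeric vector into text.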

To create a data frame:

# from existing vectors
dataframe1 <- data.frame(vector1, vector4, vector5)
# print the data frame by running its name
dataframe1
##   vector1   vector4 vector5
## 1       1     Pinus    TRUE
## 2       2 Juniperus   FALSE
## 3       3   Quercus   FALSE
## 4       4    Salvia    TRUE
# from existing vectors, changing column names
dataframe2 <- data.frame(number = vector1, genera = vector4, alive = vector5)
dataframe2
##   number    genera alive
## 1      1     Pinus  TRUE
## 2      2 Juniperus FALSE
## 3      3   Quercus FALSE
## 4      4    Salvia  TRUE
# created at the moment
dataframe3 <- data.frame(number = c(1, 2), genera = c("Pinus", "Juniperus"), alive = c(TRUE, FALSE))
dataframe3
##   number    genera alive
## 1      1     Pinus  TRUE
## 2      2 Juniperus FALSE
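Once a data frame exists, it is often useful to check its size and the type of each column. The sketch below uses a few base R functions for that. The data frame is rebuilt under the illustrative name plants so that the block runs on its own.

```r
# Rebuild the example data frame so this block runs on its own
plants <- data.frame(
  number = c(1, 2, 3, 4),
  genera = c("Pinus", "Juniperus", "Quercus", "Salvia"),
  alive = c(TRUE, FALSE, FALSE, TRUE)
)

nrow(plants)  # number of rows: 4
ncol(plants)  # number of columns: 3
names(plants) # column names: "number" "genera" "alive"
str(plants)   # compact overview: type and first values of each column
head(plants, n = 2) # first two rows
```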

To create a matrix:

matrix1 <- matrix(data = c(2, 5, 3, 7, 10, 8), nrow = 2, ncol = 3, byrow = FALSE)
matrix1
##      [,1] [,2] [,3]
## [1,]    2    3   10
## [2,]    5    7    8
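The byrow argument controls the order in which the values fill the matrix. As a small sketch (the object names values, byColumn and byRow are illustrative), here are the same six values arranged both ways:

```r
values <- c(2, 5, 3, 7, 10, 8)

# byrow = FALSE (the default): values fill the matrix column by column
byColumn <- matrix(data = values, nrow = 2, ncol = 3, byrow = FALSE)
byColumn
##      [,1] [,2] [,3]
## [1,]    2    3   10
## [2,]    5    7    8

# byrow = TRUE: the same values fill the matrix row by row
byRow <- matrix(data = values, nrow = 2, ncol = 3, byrow = TRUE)
byRow
##      [,1] [,2] [,3]
## [1,]    2    5    3
## [2,]    7   10    8

dim(byRow) # 2 rows and 3 columns
```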

To create a list:

# A named list
list1 <- list(sValue = logic1, sValue2 = numeric1, plants = dataframe2, matrix = matrix1)
list1
## $sValue
## [1] TRUE
## $sValue2
## [1] 1
## $plants
##   number    genera alive
## 1      1     Pinus  TRUE
## 2      2 Juniperus FALSE
## 3      3   Quercus FALSE
## 4      4    Salvia  TRUE
## $matrix
##      [,1] [,2] [,3]
## [1,]    2    3   10
## [2,]    5    7    8
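The elements of a list can be retrieved by name or by position. The following sketch uses a smaller list, rebuilt here under the illustrative name exampleList so that the block runs on its own.

```r
# A smaller list, rebuilt here so the block runs on its own
plants <- data.frame(number = c(1, 2), genera = c("Pinus", "Juniperus"))
exampleList <- list(sValue = TRUE, sValue2 = 1, plants = plants)

names(exampleList)      # names of the elements
exampleList$sValue2     # select an element by name with $
exampleList[["plants"]] # select an element by name with [[ ]]
exampleList[[1]]        # select an element by position

# Selections can be chained: the genera column of the data frame in the list
exampleList$plants$genera
```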

Exercise 1

If you are interested in practicing, try this: create a dataframe with 10 rows and at least one column of each data type (logical, numeric, and character). The character column should have some repeated elements, and a few missing values (NA) should be included here and there.

Good practices for coding

Before jumping into more coding action, here is my advice on some minimal good coding practices.

  • Comment your code (add comments with #) whenever needed. This will help you in the future when you come back and try to understand the code. When you are just starting to code in R, it's good to comment on everything, but as you gain practice, try to comment only to clarify certain parts of your code. Code purists may argue that code should be readable by itself (if it's well written and structured).
  • Use descriptive names for your objects. Avoid generic names. As you code, you will find that descriptive names are easier to use and make it easier to remember exactly what each object is.
  • Explicitly write the argument names of functions. This is not strictly needed, but it makes code easier to read and understand.
  • Leave spaces. E.g. 2 + 2 instead of 2+2, and objName <- c(1, 3) instead of objName<-c(1,3). Spaced code looks cleaner and is easier to read.
  • Follow the help pages. To use the help, type function names in the RStudio help tab (see Figure 2) or run ?functionName in the console. More explanation on this below.
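To make the advice above concrete, here is a small sketch of the same calculation written twice. The values and the object names (x, y, stemDiameter, meanDiameter) are made up for illustration only.

```r
# Harder to read: no comments, generic names, no spaces, unnamed arguments
x<-c(2,7,9,NA,20)
y<-mean(x,0,TRUE)

# Easier to read: the same calculation following the advice above
# Stem diameters of five trees; one measurement is missing (NA)
stemDiameter <- c(2, 7, 9, NA, 20)

# Mean diameter, ignoring the missing value
meanDiameter <- mean(x = stemDiameter, na.rm = TRUE)
meanDiameter
## [1] 9.5
```

Both versions give the same result, but only the second one tells a reader what the numbers are and why the missing value is being dropped.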

The help files

The help files have a very specific structure. It's a good idea to get a good understanding of what each part of the help is. Try searching for the help for the mean() function (e.g. type ?mean in the console or just search for mean in the RStudio help tab).

At the very top you will find the name of the function, the package it belongs to, and a human-readable title and description of the function. After that come several sections:

  • Usage: gives you a general sense of how to use the function. Pay attention to the order of the arguments: this is the order in which you should supply them if you decide not to spell out the argument names in your code. Also pay attention to which arguments are followed by = and a value. Those values are the defaults, i.e. the values the arguments take if you do not specify them.
  • Arguments: explains what each argument is and the type of information and object you should supply. E.g. in the mean() function, the argument x expects a vector as input, while na.rm expects a logical value (TRUE or FALSE).
  • Details: gives further details about what the function does, and specifics on certain arguments.
  • Value: explains what kind of data/object/information you should expect as the output.

The help also gives some references, related functions and examples.
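The points about argument order and default values can be checked directly in the console. The sketch below uses made-up values and looks at mean.default, the method that mean() uses for ordinary numeric vectors.

```r
# Illustrative values with one missing value
values <- c(4, 8, NA, 12)

# Show the arguments of the default mean method and their default values
args(mean.default)
## function (x, trim = 0, na.rm = FALSE, ...)
## NULL

# na.rm is not written, so it takes its default value (FALSE)
mean(x = values)
## [1] NA

# Named arguments can be given in any order
mean(na.rm = TRUE, x = values)
## [1] 8

# Unnamed arguments are matched by position: x, then trim, then na.rm
mean(values, 0, TRUE)
## [1] 8
```

The last call works, but it is harder to read than the named version, which is why spelling out argument names is recommended.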

Some very common functions

The following functions are used quite often, so it's good for you to know them. You should take some time to read their help files.

  • sum(), mean(), max(), min(): These functions are almost self-explanatory. Be sure to check the help file to know how to use them.
  • quantile(): Gives the quantiles of a vector, which are quite useful to get a quick picture of the distribution of our data.
  • summary(): It's a generic function that gives a summary of many kinds of objects. For vectors and dataframes it provides key descriptive statistics; applied to other objects (e.g. statistical models), it gives a summary of the results.
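As a quick illustration of the functions listed above, here is a minimal sketch that applies them to the stem diameter values from the first example.

```r
diameter <- c(2, 7, 9, 15, 20, 33)

sum(diameter)  # 86
mean(diameter) # 14.33333
max(diameter)  # 33
min(diameter)  # 2

# Quantiles: by default the minimum, the quartiles, and the maximum
quantile(x = diameter)
##    0%   25%   50%   75%  100% 
##  2.00  7.50 12.00 18.75 33.00 

# A specific quantile can be requested with the probs argument
quantile(x = diameter, probs = 0.9)
```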

Let's try one of these on a dataframe of tree measurements like the ones we plotted before.

# A dataframe of tree stem diameter and biomass
diameter <- c(2, 7, 9, 15, 20, 33)
biomass <- c(1, 23, 30, NA, 85, 153)
treesDF <- data.frame(diameter, biomass)

# Get a summary of this dataframe
summary(treesDF)
##     diameter        biomass     
##  Min.   : 2.00   Min.   :  1.0  
##  1st Qu.: 7.50   1st Qu.: 23.0  
##  Median :12.00   Median : 30.0  
##  Mean   :14.33   Mean   : 58.4  
##  3rd Qu.:18.75   3rd Qu.: 85.0  
##  Max.   :33.00   Max.   :153.0  
##                  NA's   :1

Let's try the mean for biomass. We can select a specific column with $ like this: objectName$columnName.

# If we don't include `na.rm = TRUE` we get NA as the result
# because the `biomass` data has a missing value
mean(x = treesDF$biomass)
## [1] NA
mean(x = treesDF$biomass, na.rm = TRUE)
## [1] 58.4

Exercise 2: Try using one of these functions on one column of the dataframe you created in Exercise 1.

Selecting data

We can make more specific selections of our data, and there are several ways to do it. One is to specify the locations of the data, and another is to select data that meets certain conditions.

# Different ways of selecting by specifying data locations
treesDF$diameter[3] # [rowNumber] 
treesDF[3, 2] # [rowNumber, columnNumber]
treesDF[1:3, 2] # several rows from a column
treesDF[1:3, ] # several rows from all columns
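The selections above return different kinds of objects depending on what is selected. The following sketch rebuilds treesDF so that it runs on its own, and also shows that a column can be selected by name instead of by number. The object name firstTrees is illustrative.

```r
diameter <- c(2, 7, 9, 15, 20, 33)
biomass <- c(1, 23, 30, NA, 85, 153)
treesDF <- data.frame(diameter, biomass)

# Columns can be selected by name instead of by number
treesDF[3, "biomass"] # same as treesDF[3, 2]
## [1] 30

# Selecting from a single column returns a vector
treesDF[1:3, "biomass"]
## [1]  1 23 30

# Leaving the column position empty returns a data frame with all columns
firstTrees <- treesDF[1:3, ]
class(firstTrees)
## [1] "data.frame"
dim(firstTrees) # 3 rows, 2 columns
```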

Exercise 3: With your dataframe from Exercise 1, try to find out how you can select rows 3, 5, and 8 from all columns, and how you can select all rows from columns 2 and 3.

For selecting data that meets certain conditions, we will define conditions based on the following comparison and logical operators:

  • > or < for greater than or less than
  • >= or <= for greater than or equal to, or less than or equal to
  • == or != for equal or not equal
  • & or | for AND or OR

You build a condition by comparing the object/information on the left side of the operator with the one on the right side. The result is a logical value (TRUE or FALSE) for each element compared.
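As a minimal sketch of how such conditions behave, here are a few comparisons applied to the stem diameter values used earlier.

```r
diameter <- c(2, 7, 9, 15, 20, 33)

# A condition returns one logical value per element
diameter > 10
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE

# Conditions can be combined with & (AND) and | (OR)
diameter > 5 & diameter < 20
## [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE
diameter < 5 | diameter > 30
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE

# A logical vector inside [] keeps only the elements that are TRUE
diameter[diameter > 10]
## [1] 15 20 33

# sum() counts the TRUE values, i.e. how many elements meet the condition
sum(diameter != 9)
## [1] 5
```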

A useful function for selecting data based on conditions is subset(). Take a look at its help file.

# On our tree dataframe, let's select the bigger trees,
# i.e. trees with biomass larger than 50.
# The argument subset states the condition
biggerTrees <- subset(x = treesDF, subset = treesDF$biomass > 50)
biggerTrees
##   diameter biomass
## 5       20      85
## 6       33     153
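Note that the tree with a missing biomass value is not in the result: subset() treats missing values in the condition as FALSE. The sketch below compares this with selection using [], which handles the missing value differently. treesDF is rebuilt so that the block runs on its own, and the object names withBrackets and withWhich are illustrative.

```r
diameter <- c(2, 7, 9, 15, 20, 33)
biomass <- c(1, 23, 30, NA, 85, 153)
treesDF <- data.frame(diameter, biomass)

# Inside subset(), columns can be referred to by name only
biggerTrees <- subset(x = treesDF, subset = biomass > 50)
nrow(biggerTrees)
## [1] 2

# With [], the missing value produces an extra row filled with NA
withBrackets <- treesDF[treesDF$biomass > 50, ]
nrow(withBrackets)
## [1] 3

# which() keeps only the positions where the condition is TRUE
withWhich <- treesDF[which(treesDF$biomass > 50), ]
nrow(withWhich)
## [1] 2

# Conditions can be combined, and select chooses the columns to keep
subset(x = treesDF, subset = diameter > 5 & biomass < 100, select = diameter)
```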

Exercise 4: From the dataframe you created in Exercise 1, subset it using a condition. Try also calculating the mean of a column of the subsetted dataframe.

What did we learn in this session?

  • R can be an amazing tool for data analysis
  • R stores information in objects, and processes it with functions and their arguments.
  • Data can be stored as single values, vectors (collections of values), dataframes (typical data tables, like the ones you make in Excel), matrices, and lists (collections of any of the other types).
  • Good coding practices are commenting, using spaces in your code, using descriptive names, and spelling out argument names.
  • Help files have a lot of useful information about how to use the functions.
  • Some basic functions: mean(), sum(), summary(), subset()
  • We can select data with $, [], and subset()

The next session will be about plotting with ggplot and statistical analyses, so real data analysis action.