This is a brief introduction to the very basics of R and some common tasks for data manipulation.
What is R and how does it work?
R is two things: a programming language and the software that interprets that language. Both exist to give us many ways to process and analyse data.
The basics of how R works are as follows. In general, 1) we have our data or some other information (a formula, a model, a single value, or many other things) stored in an object in the computer’s memory, 2) we perform an operation on it, most likely through a function, 3) the arguments of the function specify the details of that operation, and 4) this gives us a result, which we can print on the screen, store in another object for further work, or export.
Say you have a table with stem diameters and total biomass of the trees at your forest study site, stored in an object called trees, and you want to plot these data. The function could be plot, and its arguments would specify on which axis each variable is plotted, the type of plot (e.g. scatterplot), and the color of the symbols that represent the data. The result (i.e. the plot) is shown on your screen.
Illustrating this example
We will input some data on tree stem diameters and biomass and then plot it. We will explain the details later.
# Here we define two objects with the data
diameter <- c(2, 7, 9, 15, 20, 33)
biomass <- c(1, 23, 30, 60, 85, 153)
# Plot defining the roles of the data, type of plot (points), and shape and color of the symbols
plot(x = diameter, y = biomass, type = "p", pch = 20, col = "black")
That was very easy!
Why use R?
- It's free in every possible way
- It is available on any operating system and usable even on old computers
- Infinite possibilities for your analysis via R packages. R packages extend R's functionality and can be created by the same people who develop the analysis techniques
- You can do all steps of your analyses in the same software
- It can produce high-quality plots and technical/scientific reports
- It can help make science reproducible: you share your code and others can replicate your analysis
- Having so many possibilities within your reach makes you want to learn more and more
RStudio is software that can make R easier to use. It gives you several tools in one place, such as a code editor (for writing the code for your analysis), the R console (where the code is run), and panes for the help files, plots, the objects in memory, available packages, etc.
RStudio is not the only software for working easily with R. We will see some more in the next part of the course.
Basic objects/information types and their creation
Data in R can be of several types. The most common ones are logical (either TRUE or FALSE), character, and numeric (there are others, like dates). Data are then stored in different types of objects. The most common ones are 1) single values, which need no explanation, 2) vectors, collections of values of the same type, 3) data frames, tables whose columns are vectors, 4) matrices, and 5) lists, which can contain any of the other types.
To create any object, we choose a name for it, followed by <- and then what the object is. Functions look like this: function(argument1 = value, argument2 = value, argument3 = value). We can print an object on the console by running its name. Object names should typically not be quoted.
We create single values like this
# logicals (TRUE or FALSE, in capital letters)
logic1 <- TRUE
# numeric
numeric1 <- 1
# character (always single or double quoted)
char <- "sample text"
There are several ways to create vectors
# numeric vectors
vector1 <- c(1, 2, 3, 4)
vector2 <- seq(from = 1, to = 4, by = 1)
vector3 <- 1:4
# character and logical
vector4 <- c("Pinus", "Juniperus", "Quercus", "Salvia")
vector5 <- c(TRUE, FALSE, FALSE, TRUE)
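Because a vector holds values of a single type, R converts mixed inputs to the most general type. The following is a minimal sketch of how this could be checked. It relies on the base functions class() and length(), which are not introduced in the text above, and the object names are only illustrative.

```r
# Mixing types inside c() converts every value to the most general type
mixedVector <- c(1, "Pinus", TRUE)
mixedVector          # "1" "Pinus" "TRUE": all values became character
class(mixedVector)   # "character"

# Checking the type and the number of elements of a vector
generaNames <- c("Pinus", "Juniperus", "Quercus", "Salvia")
class(generaNames)   # "character"
length(generaNames)  # 4

isAlive <- c(TRUE, FALSE, FALSE, TRUE)
class(isAlive)       # "logical"
```

Checking the class of a vector right after creating it is a quick way to notice an accidental conversion, for example a number that was typed with quotes.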
To create a data frame
# from existing vectors
dataframe1 <- data.frame(vector1, vector4, vector5)
dataframe1
##   vector1   vector4 vector5
## 1       1     Pinus    TRUE
## 2       2 Juniperus   FALSE
## 3       3   Quercus   FALSE
## 4       4    Salvia    TRUE
# from existing vectors, changing column names
dataframe2 <- data.frame(number = vector1, genera = vector4, alive = vector5)
dataframe2
##   number    genera alive
## 1      1     Pinus  TRUE
## 2      2 Juniperus FALSE
## 3      3   Quercus FALSE
## 4      4    Salvia  TRUE
# created at the moment
dataframe3 <- data.frame(number = c(1, 2), genera = c("Pinus", "Juniperus"), alive = c(TRUE, FALSE))
dataframe3
##   number    genera alive
## 1      1     Pinus  TRUE
## 2      2 Juniperus FALSE
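After creating a data frame, it can help to check its size and the type of each column before working with it. The sketch below is one possible way to do that. It uses the base functions nrow(), ncol(), names() and str(), which are not covered in the text above, and plantsDF is a hypothetical name.

```r
# A small data frame with one column of each basic type
plantsDF <- data.frame(
  number = c(1, 2, 3),
  genera = c("Pinus", "Juniperus", "Quercus"),
  alive = c(TRUE, FALSE, TRUE)
)

nrow(plantsDF)   # number of rows: 3
ncol(plantsDF)   # number of columns: 3
names(plantsDF)  # column names: "number" "genera" "alive"
str(plantsDF)    # compact overview: type and first values of each column
```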
To create a matrix
matrix1 <- matrix(data = c(2, 5, 3, 7, 10, 8), nrow = 2, ncol = 3, byrow = FALSE)
matrix1
##      [,1] [,2] [,3]
## [1,]    2    3   10
## [2,]    5    7    8
To create a list
# A named list
list1 <- list(sValue = logic1, sValue2 = numeric1, plants = dataframe2, matrix = matrix1)
list1
## $sValue
## [1] TRUE
##
## $sValue2
## [1] 1
##
## $plants
##   number    genera alive
## 1      1     Pinus  TRUE
## 2      2 Juniperus FALSE
## 3      3   Quercus FALSE
## 4      4    Salvia  TRUE
##
## $matrix
##      [,1] [,2] [,3]
## [1,]    2    3   10
## [2,]    5    7    8
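Once a list exists, its elements can be pulled out again by name. The text above does not cover this, so the following is only a small sketch of one common approach, using $ and double brackets. The list siteInfo and its contents are made up for illustration.

```r
# A hypothetical named list mixing a character value, a number and a vector
siteInfo <- list(
  siteName = "Forest plot A",
  nTrees = 6,
  diameters = c(2, 7, 9, 15, 20, 33)
)

siteInfo$nTrees          # extract one element by name with $: 6
siteInfo[["diameters"]]  # the same idea using double brackets
names(siteInfo)          # names of all the elements in the list
```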
Exercise 1: If you want to practice, try this: create a dataframe with 10 rows and at least one column of each data type (logical, numeric, and character). The character column should have some repeated elements. Include a few missing values here and there (NA).
Good practices for coding
Before jumping into more coding action, here is my advice on some minimal good coding practices.
- Comment your code (add comments with #) whenever needed. This will help you in the future when you come back and try to understand the code. When you are just starting to code in R, it's good to comment everything, but as you gain practice, try commenting only to clarify certain parts of your code. Code purists may argue that code should be readable by itself (if it's well written and structured).
- Use descriptive names for your objects. Avoid generic names. As you code, you will find descriptive names easier to use, and it will be easier to remember exactly what they refer to.
- Explicitly write the argument names of functions. This is not strictly needed but makes code easier to read and understand.
- Leave spaces. E.g. 2 + 2 instead of 2+2, or objName <- c(1, 3) instead of objName<-c(1,3). Spaced code looks cleaner and is easier to read.
- Follow help pages. To use the help, type function names in the RStudio help tab (see Figure 2) or run ?functionName in the console. More explanation on this below.
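To make these recommendations concrete, here is a small sketch that writes the same calculation twice, once ignoring the advice and once following it. The object names and values are only illustrative.

```r
# Harder to read: generic names, no spaces, unnamed arguments
x<-c(2,7,9,15,20,33)
m<-mean(x,0,TRUE)

# Easier to read: descriptive names, spaces, named arguments
stemDiameter <- c(2, 7, 9, 15, 20, 33)  # stem diameters of six trees
meanDiameter <- mean(x = stemDiameter, trim = 0, na.rm = TRUE)

# Both versions compute the same value
m == meanDiameter
```

In the first version a reader has to open the help page of mean() to find out what 0 and TRUE stand for. In the second version the argument names already say it.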
The help files
The help files have a very specific structure. It's a good idea to get a good understanding of what each part of the help is. Try searching for the help for the mean() function (e.g. type ?mean in the console or just search mean in the RStudio help).
At the top you will find the name of the function, the package it belongs to, and a human-readable title and explanation of the function. After that there are several sections:
- Usage: gives you a general sense of how to use the function. Pay attention to the order of the arguments: this is the order in which you should input them if you decide not to spell out the argument names in your code. Also pay attention to which arguments have = and a value. Those values are the defaults, i.e. the values the arguments take if you do not write them.
- Arguments: gives an explanation of what each argument is and the type of information and object that you should input. E.g. in the mean() function, the argument x expects a vector as input, while na.rm expects a logical value (TRUE or FALSE).
- Details: gives further details about what the function does, and specifics on certain arguments.
- Value: explains what kind of data/object/information you should expect as the output.
The help also gives some references, related functions and examples.
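As an illustration of how the Usage section translates into code, the sketch below uses the base function sort(), which is not discussed above. Its help page shows the usage sort(x, decreasing = FALSE, ...), so decreasing has a default value and x comes first. The data values are only illustrative.

```r
diameter <- c(15, 2, 33, 9, 20, 7)

# Print the arguments of the function and their default values
args(sort)

sort(x = diameter)                     # default used: increasing order
sort(x = diameter, decreasing = TRUE)  # default overridden by name
sort(diameter, TRUE)                   # same result, relying on argument order
```

The last two calls return the same vector, but the named version can be understood without looking at the help page.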
Some very common functions
The following functions are used quite often, so it's good for you to know them. You should take some time to read their help files.
min(): These functions are almost self-explanatory. Be sure to check the help file to know how to use them.
quantile(): Gives the quantiles of a vector, which are quite useful to get a quick picture of the distribution of our data.
summary(): It's a generic function that summarizes many kinds of objects, such as vectors and dataframes, providing key statistical and data summaries. Applied to other objects (e.g. statistical models), it gives you a summary of the results.
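A quick sketch of some of these functions applied to a plain numeric vector, here the stem diameters from the first example. The calls to max() and sum() and the probs values are added only for illustration.

```r
diameter <- c(2, 7, 9, 15, 20, 33)

min(diameter)  # smallest value: 2
max(diameter)  # largest value: 33
sum(diameter)  # sum of all values: 86

# quantile() returns the minimum, quartiles and maximum by default
quantile(x = diameter)

# The probs argument selects other quantiles, e.g. the 10% and 90% ones
quantile(x = diameter, probs = c(0.1, 0.9))
```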
Let's try one of these on a dataframe of trees.
# A dataframe of tree stem diameter and biomass
diameter <- c(2, 7, 9, 15, 20, 33)
biomass <- c(1, 23, 30, NA, 85, 153)
treesDF <- data.frame(diameter, biomass)
# Get a summary of this dataframe
summary(treesDF)
##     diameter        biomass
##  Min.   : 2.00   Min.   :  1.0
##  1st Qu.: 7.50   1st Qu.: 23.0
##  Median :12.00   Median : 30.0
##  Mean   :14.33   Mean   : 58.4
##  3rd Qu.:18.75   3rd Qu.: 85.0
##  Max.   :33.00   Max.   :153.0
##                  NA's   :1
Let's try the mean for biomass. We can select a specific column with $ like this:
# If we don't include `na.rm = TRUE` we would get a weird
# result because `biomass` data has a missing value
mean(x = treesDF$biomass)
## [1] NA
mean(x = treesDF$biomass, na.rm = TRUE)
## [1] 58.4
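Other summary functions treat missing values in the same way as mean(). The sketch below shows this for sum() and max(), and uses the base function is.na() (not introduced above) to locate and count the missing values.

```r
biomass <- c(1, 23, 30, NA, 85, 153)

is.na(biomass)       # TRUE where a value is missing
sum(is.na(biomass))  # number of missing values: 1

sum(biomass)                # NA, because one value is unknown
sum(biomass, na.rm = TRUE)  # 292
max(biomass, na.rm = TRUE)  # 153
```

Counting the missing values before removing them is a cheap check that na.rm = TRUE is not hiding more gaps than expected.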
Exercise 2: Try using one of these functions on one column of the dataframe you created in Exercise 1.
We can make more specific selections of our data, and there are several ways to do it. One is to specify the locations of the data, and another is to select data that meet certain conditions.
# Different ways of selecting by specifying data locations
treesDF$diameter[3] # [rowNumber]
treesDF[3, 2] # [rowNumber, columnNumber]
treesDF[1:3, 2] # several rows from a column
treesDF[1:3, ] # several rows from all columns
Exercise 3: With your dataframe from Exercise 1, try to find out how you can select rows 3, 5, and 8 from all columns, and how you can select all rows from columns 2 and 3.
For selecting data that meets certain conditions, we will define conditions based on the following logical operators:
> and < for higher or lower than
>= and <= for equal or higher/lower than
== and != for equal or not equal
& and | for AND or OR
You build conditions by comparing the object/information on the left side of the operator with the one on the right side.
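Each comparison returns a logical vector with one TRUE or FALSE per value, which is what later drives the selection. A short sketch with the stem diameters used earlier (the thresholds are arbitrary and chosen only for illustration):

```r
diameter <- c(2, 7, 9, 15, 20, 33)

diameter > 10                   # FALSE FALSE FALSE TRUE TRUE TRUE
diameter == 9                   # TRUE only for the third value
diameter != 9                   # the opposite of the line above
diameter >= 7 & diameter <= 20  # TRUE where both conditions hold
diameter < 7 | diameter > 20    # TRUE where at least one condition holds
```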
A useful function for selecting data based on conditions is subset(). Take a look at its help file.
# On our tree dataframe, let's select the bigger trees.
# That is, e.g. trees with biomass larger than 50
# The argument subset states the condition
biggerTrees <- subset(x = treesDF, subset = treesDF$biomass > 50)
biggerTrees
##   diameter biomass
## 5       20      85
## 6       33     153
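Note that the tree with the missing biomass does not appear in the result: subset() drops rows for which the condition is NA. The sketch below contrasts this with bracket selection using the same condition, and also combines two conditions. It assumes the same tree data as above and writes column names directly inside subset(), which the function allows. The object names withSubset, withBrackets and mediumTrees are hypothetical.

```r
treesDF <- data.frame(
  diameter = c(2, 7, 9, 15, 20, 33),
  biomass = c(1, 23, 30, NA, 85, 153)
)

# subset() silently drops the row where the condition is NA
withSubset <- subset(x = treesDF, subset = biomass > 50)
withSubset

# Bracket selection with the same condition keeps an extra row filled with NA
withBrackets <- treesDF[treesDF$biomass > 50, ]
withBrackets

# Two conditions combined with &
mediumTrees <- subset(x = treesDF, subset = biomass > 20 & diameter < 20)
mediumTrees
```

When the data contain missing values, checking the number of rows of the result with nrow() is a simple way to confirm which behaviour you got.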
Exercise 4: Subset the dataframe you created in Exercise 1 using a condition. Try also calculating the mean of a column of the subsetted dataframe.
What did we learn in this session?
- R can be an amazing tool for data analysis
- R stores information in objects, and processes it with functions and arguments.
- Data can be stored as single values, vectors (collections of values), dataframes (a typical data table, like the ones you make in Excel), matrices and lists (collections of any of the other types).
- Good practices for coding are commenting, using spaces in your code, using descriptive names, and spelling out argument names.
- Help files have a lot of useful information about how to use the functions.
- Some basic functions: mean(), sum(), summary(), subset()
- We can select data with $, [ ], and subset()
The next session will be about plotting with ggplot and statistical analyses, so real data analysis action.