R is a powerful programming language for data analysis, graphics, and especially statistical analysis. It is freely available to the public at www.r-project.org with easy-to-install binaries for Linux, macOS, and Windows. This notebook provides an introduction to R, designed for students and researchers. No prior experience is required, although some familiarity with scripting languages such as Bash, Matlab, or Python is helpful.
A convenient way to program in R is to install an editor such as RStudio or Visual Studio Code. Here, we'll learn some of the fundamentals of R and how to load and process data.
When starting out in R (or any programming language), we learn three fundamental skills:
Comments are parts of a computer program used to describe a piece of code. We can make a comment using the # symbol.
Comments serve several purposes: making code easier to read, temporarily disabling a line of code, and sketching pseudocode before writing the real thing.
# Anything that starts with # is a comment and is not executed
# This is used to explain what the code does
# or to leave notes or reminders
# or to disable a line of code
# or to make the code more readable or organized
Different types of values, such as characters, numbers, or logical values, can be assigned to a variable.
x <- 10 # The most common form of assignment
y = 20 # Also works
If a variable name contains spaces, R will throw an error, because it reads each word as a separate thing: it will think you're trying to use the age variable, call the do function, access the patient… a complete mess.
You can use the underscore (_), the period (.), or capital letters (as in camelCase or PascalCase) to separate words in variable names, like this:
patient_age <- 25
patient.age <- 25
PatientAge <- 25
In other words, whenever you create a variable, avoid spaces and use legal characters. Note that although each spelling is acceptable, each creates a different variable.
When we speak or write with other people, we understand small errors, language variations, and misspellings. R is not like that.
In R, "age," "Age," "ages," and "Ages" are completely different names. This is because the language is case-sensitive.
Each name refers to exactly one object in the workspace, so every different spelling creates a new object. Keeping this in mind avoids confusion and ensures that you know exactly which variable you're using.
Therefore, create variable names that make sense, are easy to remember, and are different enough for YOU to work with.
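A minimal sketch of case sensitivity, using made-up variable names (`age` and `Age` are illustrative only): each spelling is a separate object, and a spelling that was never assigned does not exist.

```r
age <- 30   # lowercase name
Age <- 45   # a different object, despite the similar spelling

age == Age      # FALSE: the two objects hold different values
exists("AGE")   # FALSE, assuming nothing named AGE was created earlier
```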
Rules for naming variables:
The arrow (<-) is used to assign a value to an object. It's like saying, "Put this value here inside this little box." Or, "x is now worth 10 and y is now worth 20."
The equals sign (=) also works, but the "<-" arrow is the traditional and most widely used form in the R community, especially in statistics and data science. Many say the arrow shows where the value comes from and where it goes.
On the keyboard, you type "<" followed by "-" with no space between them, and R reads the pair as a single assignment operator.
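The assignment forms can be compared side by side. This is a small sketch with illustrative names (`x`, `y`, `z`); the rightward arrow `->` is not mentioned above but is also part of base R.

```r
x <- 10    # leftward arrow: the value on the right goes into x
y = 20     # equals sign: same effect at the top level of a script
30 -> z    # rightward arrow also works, though it is rarely used

x + y + z  # 60
```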
In R, data is typically organized into tables (data frames), which resemble Excel spreadsheets. Each column represents a variable (age, name, height, etc.) and each row represents an observation (a person, an animal, an experiment, etc.).
You can create data manually:
# There are several data types in R, but the most common are:
# Numbers
x <- 10.5 # Decimal numbers
# Integer
y <- 5L # Integer numbers
# Text (we call them "strings")
name <- "my name is:" # Text in quotes (fill in your name)
# Logical (boolean)
true <- TRUE # True
false <- FALSE # False
# Vectors
ages <- c(25, 30, 35) # Vector of numbers
c() is a function that stands for "combine" or "concatenate," used to create vectors.
All elements of a vector must be of the same type. If you don't use c(), R won't understand that you want to create a vector and will throw an error. When you use c(), R understands that you are combining these values into a single object, forming a vector.
Use c() whenever you want to create a vector with MULTIPLE values, whether numbers, text, or logical values.
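What "same type" means in practice can be seen by mixing types inside c(). In this sketch (the names `numbers` and `mixed` are illustrative), R does not raise an error; it converts every element to the most general type present, here text.

```r
numbers <- c(1, 2, 3)          # all numeric
mixed   <- c(1, "two", TRUE)   # mixed types are all converted to text

class(numbers)  # "numeric"
class(mixed)    # "character"
mixed           # "1" "two" "TRUE"
```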
A data frame is like an Excel spreadsheet: each column is a variable and each row is an observation. You can import your own data or build it within R.
my_dataframe <- data.frame(
Name = c("Camila", "Matias", "Juan"),
Age = c(21,33,19),
Student = c(TRUE, FALSE, TRUE)
)
my_dataframe
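Once a data frame exists, its columns and rows can be pulled out in a few common ways. The sketch below rebuilds a small table under the hypothetical name `people` so that it runs on its own.

```r
# Small example table, rebuilt here so the block is self-contained
people <- data.frame(
  Name = c("Camila", "Matias", "Juan"),
  Age = c(21, 33, 19),
  Student = c(TRUE, FALSE, TRUE)
)

people$Age                      # one column as a vector: 21 33 19
people[2, ]                     # second row, all columns
people[people$Student, "Name"]  # names on the rows where Student is TRUE
nrow(people)                    # 3 observations
ncol(people)                    # 3 variables
```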
A matrix is a table of only numbers (or a single type of data), organized into rows and columns, as in mathematics.
m <- matrix(1:9, nrow = 3, ncol = 3)
m
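A short sketch of how a matrix is filled and indexed. By default R fills a matrix column by column; the `byrow = TRUE` argument and the name `m_by_row` are additions for illustration.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)  # filled column by column

dim(m)     # 3 3
m[2, 3]    # row 2, column 3: 8
m[, 1]     # first column: 1 2 3
t(m)       # transpose: rows become columns

m_by_row <- matrix(1:9, nrow = 3, byrow = TRUE)  # fill row by row instead
m_by_row[2, 3]  # 6
```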
In programming, strings are sequences of characters—that is, any set of letters, numbers, symbols, or spaces. They are used to represent text.
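# Example string (assumed value; the original definition of my_string1 is not shown)
my_string1 <- "Introduction to R"
# Finding the length of a string
nchar(my_string1)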
# Joining strings together
paste(my_string1, "2023")
# Splitting strings
txt <- "R is the statistical analysis language"
unlist(strsplit(txt, split = " "))
A vector is a basic structure that stores a collection of elements of the same type: numbers, text (strings), logical values (TRUE/FALSE), etc. You can perform operations on whole vectors at once, such as arithmetic, filtering, and comparison.
# Vector
my_vector1 <- c(1,2,3)
my_vector2 <- c(4,5,6)
another_vector <- c("Camila", "Matias", "Juan")
class(my_vector1)
my_vector1
class(another_vector)
another_vector
# Creating vectors with R functions
my_name <- rep("Alejandra", times = 5)
my_name
my_seq <- rep(c(1,3,5), times = 3)
my_seq
my_seq <- rep(c(1,3,5), each = 3)
my_seq
my_seq <- -10:10
my_seq
my_seq <- seq(from = -10, to = 10, by = 2)
my_seq
A list is a "pocket" that can hold anything: numbers, text, vectors, data frames, even functions!
Think of a shopping list: Soap, Fabric softener, Apple, Orange, Biscuits, Bread, etc.
Notice that the items aren't part of the same categories (food, cleaning supplies, hygiene products), but they're all in the same list. It works similarly in R:
my_list <- list(City = "Salvador", Age = 476, Beaches = c("Barra", "Itapuã", "Piatã", "Ribeira")) # Text values need quotes
my_list
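Elements of a list can be retrieved by name or by position. This is a minimal sketch using a hypothetical list called `city_info` with the same kind of content.

```r
city_info <- list(
  City = "Salvador",
  Age = 476,
  Beaches = c("Barra", "Itapuã", "Piatã", "Ribeira")
)

city_info$City         # "Salvador"
city_info[["Age"]]     # 476
city_info$Beaches[1]   # "Barra": first element of the vector stored in the list
length(city_info)      # 3 elements, each of a different kind
names(city_info)       # "City" "Age" "Beaches"
```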
A factor is used to represent categorical variables, such as "sex," "marital status," or "yes/no response." Behind the scenes, R stores these values as integer codes, each paired with a text label (its level).
sex <- factor(c("F", "M", "F"))
R understands this as a variable with two levels: "F" and "M."
Factors are essential in statistical analysis because they treat categories as distinct levels, not just any text.
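The levels and the integer codes behind a factor can be inspected directly. A small sketch using the same `sex` example:

```r
sex <- factor(c("F", "M", "F"))

levels(sex)       # "F" "M"
as.integer(sex)   # 1 2 1: position of each value within the levels
table(sex)        # counts per level
```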
R is a language created by statisticians to solve statistical and mathematical problems. Therefore, performing operations in this language is quite similar to what we already know in traditional mathematics.
my_result <- 2 + 3
1 - my_result
6 * 4
2 ^ 3
10 / 5
log10(1000)
log2(32)
sqrt(144)
my_vector1
my_vector2
# Finding the length of a vector
length(my_vector1)
# Math
my_vector1 * my_vector2
my_vector1 + my_vector2
my_vector1 - my_vector2
my_vector1 / my_vector2
# Indexing is a way to select (include/exclude) particular elements from a variable
my_vector2
my_vector2[1] # First element (R counts from 1)
my_vector2[-c(1,3)] # Everything except the first and third
my_vector2[2:3] # Second to third
my_vector2[c(1,3)] # First and third only
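Besides positions, elements can also be selected with a condition. This is a minimal sketch of logical indexing; the vector name `scores` is hypothetical and simply reuses the values 4, 5, 6.

```r
scores <- c(4, 5, 6)   # hypothetical vector

scores > 4             # FALSE TRUE TRUE: one logical value per element
scores[scores > 4]     # 5 6: keeps only the elements where the test is TRUE
which(scores > 4)      # 2 3: the positions that pass the test
```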
# Create a vector with some example values
some_values <- c(8, 6, 1, 12, 3)
mean(some_values)
median(some_values)
sort(some_values)
# Quartiles (25% vs 75%)
quantile(some_values, 0.25)
quantile(some_values, 0.75)
# Interquartile range
IQR(some_values)
# Standard deviation
# In R, sd() and var() treat the data as a sample (dividing by n - 1)
sd(some_values)
# Variance
var(some_values)
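The n - 1 divisor can be checked by computing the variance and standard deviation by hand. A short sketch; `manual_var` and `manual_sd` are illustrative names.

```r
some_values <- c(8, 6, 1, 12, 3)
n <- length(some_values)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((some_values - mean(some_values))^2) / (n - 1)
manual_sd <- sqrt(manual_var)

manual_var                               # 18.5
all.equal(manual_var, var(some_values))  # TRUE
all.equal(manual_sd, sd(some_values))    # TRUE
```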
Packages are collections of ready-made functions that help you perform specific tasks. For example:
# 1. Install (only once):
install.packages("ggplot2")
# 2. Load (every time you use it):
library(ggplot2)
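Because installation is only needed once, scripts often check whether a package is already present before installing it. The helper below, `load_or_install`, is a hypothetical name for one possible way to do this; it is demonstrated on "stats", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only when it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs only when the package is not installed yet
  }
  library(pkg, character.only = TRUE)  # character.only lets us pass the name as text
}

load_or_install("stats")  # already installed, so this call only loads it
```

The same call with "ggplot2" would install the package on the first run and only load it on later runs.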
Functions are ready-made commands that do something for you. You've seen some above. They follow the structure:
function_name(argument1, argument2, ...)
# Example:
mean(c(1, 2, 3, 4))
When we use a function in R, we need to pass it the necessary information so it knows what to do. This information is called arguments—think of them as the steps in an experiment.
round(3.14159, digits = 2)
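Arguments can be matched by name or by position, and many have default values. The sketch below shows this with round() and with a small user-defined function; `bmi` and its parameter names are hypothetical, chosen only to illustrate how arguments and defaults are declared.

```r
round(3.14159, digits = 2)   # named argument: 3.14
round(3.14159, 2)            # same call, argument matched by position
round(3.14159)               # digits defaults to 0, giving 3

# Hypothetical function showing how arguments and defaults are declared
bmi <- function(weight_kg, height_m, digits = 1) {
  round(weight_kg / height_m^2, digits = digits)
}

bmi(70, 1.75)                                     # uses the default digits = 1
bmi(height_m = 1.75, weight_kg = 70, digits = 2)  # named arguments can come in any order
```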
There are several ways to create graphs in R, but in this lesson we'll focus on the ggplot2 package, a powerful and elegant tool for data visualization. It's based on the so-called grammar of graphics, which allows you to build graphs in a structured and intuitive way, as if you were writing meaningful sentences.
The grammar of graphics is a set of rules that defines how to build visualizations in a logical and standardized way. The great advantage is that, by following this grammar, you learn a single structure that serves to create different types of graphs, without having to memorize different arguments for each one.
This package stands out because it gives you fine-grained control over the graph: you decide every part (axes, colors, visualization types, titles, legends), and everything can be added and adjusted with layers. This makes your graphs more personalized and informative.
The three main components of the grammar of graphics are:
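A minimal sketch, assuming the three components meant here are the usual ggplot2 trio of data, aesthetic mappings, and geometries. It builds one plot step by step so that each piece is visible; the object name `p` is illustrative.

```r
library(ggplot2)

# 1. Data: the msleep dataset that ships with ggplot2
# 2. Aesthetics: which columns are mapped to the x and y axes
p <- ggplot(data = msleep, mapping = aes(x = sleep_total, y = sleep_rem))

# 3. Geometry: how each observation is drawn (here, as points)
p <- p + geom_point()

p  # printing the object draws the plot
```

Swapping geom_point() for another geometry changes the type of graph while the data and aesthetics stay the same.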
Useful Resources:
PDF of the documentation: https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
Here's the link to the website: https://ggplot2.tidyverse.org/
And here's the link to the book: https://ggplot2-book.org/
# Load the tidyverse package
library(tidyverse)
# Load the msleep dataset
data(msleep)
# Check the structure of the msleep dataset
str(msleep)
nrow(msleep)
ncol(msleep)
head(msleep)
# Are there NA values? How many?
any(is.na(msleep))
sum(is.na(msleep))
# Leave out any observation with NA values
msleep %>%
drop_na()
# Equivalent to:
# drop_na(msleep)
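Removing rows with missing values returns a new table; the original stays unchanged unless the result is assigned to a name. The sketch below illustrates this with a hypothetical toy table (`toy`) and base R's na.omit(), used here as a dependency-free stand-in for drop_na().

```r
# Hypothetical toy table with missing values
toy <- data.frame(
  animal = c("cat", "dog", "owl", "bat"),
  sleep_total = c(12.5, 10.1, NA, 19.7),
  sleep_rem = c(3.2, NA, NA, 2.0)
)

colSums(is.na(toy))            # NAs per column: animal 0, sleep_total 1, sleep_rem 2
toy_complete <- na.omit(toy)   # keep only the rows without any NA
nrow(toy_complete)             # 2 rows remain (cat and bat)
nrow(toy)                      # still 4: the original table is unchanged
```

With msleep, the equivalent step would be to store the result, for example `msleep_complete <- drop_na(msleep)`, where `msleep_complete` is an assumed name.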
# Is there a relationship between sleep total and sleep rem?
# How can we check that?
# Plotting maybe?
plot(msleep$sleep_total, msleep$sleep_rem)
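# The correlation coefficient can be computed with cor(); use = "complete.obs" drops pairs containing NA
cor(msleep$sleep_total, msleep$sleep_rem, method = "pearson", use = "complete.obs")
# cor.test() also tests significance; it drops incomplete pairs on its own, so it needs no `use` argument
cor.test(msleep$sleep_total, msleep$sleep_rem, method = "pearson")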
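Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. This can be verified on a small pair of vectors; `x` and `y` below are made-up measurements, not values from msleep.

```r
x <- c(2, 4, 6, 8, 10)   # hypothetical measurements
y <- c(1, 3, 2, 5, 4)

# Pearson's r: covariance scaled by both standard deviations
manual_r <- cov(x, y) / (sd(x) * sd(y))

manual_r                         # 0.8
all.equal(manual_r, cor(x, y))   # TRUE
```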
# Comparing means (t.test() runs a Welch two-sample test by default)
t.test(msleep$sleep_total, msleep$sleep_rem)
# Using tidyverse
msleep %>%
ggplot(aes(x = sleep_total, y = sleep_rem)) +
geom_point()
#Directly calling ggplot
ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) +
geom_point()
#Saving plots
g <- ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) +
geom_point()
ggsave(filename = "test.pdf", plot = g)
# Alternative: open a PDF device, print the plot, then close the device
pdf("test2.pdf")
print(g)
dev.off()
# Add custom x-axis and y-axis titles
msleep %>%
ggplot(aes(sleep_total, sleep_rem)) + # Implicit or explicit arguments?
geom_point() +
xlab("Total sleep time (h)") +
ylab("REM sleep time (h)")
msleep %>%
ggplot(aes(sleep_total, sleep_rem)) +
geom_point(color = "red") + # Visit color-hex.com to pick a color
xlab("Total sleep time (h)") +
ylab("REM sleep time (h)")
# How can we colour the points by vore?
str(msleep)
msleep %>%
ggplot(aes(sleep_total, sleep_rem, color = vore)) +
geom_point() +
xlab("Total sleep time (h)") +
ylab("REM sleep time (h)")
# Base R
boxplot(sleep_total ~ vore, data = msleep)
# ggplot2
msleep %>%
ggplot(aes(vore, sleep_total)) +
geom_boxplot()
msleep %>%
filter(!is.na(vore)) %>% # Refer to the column directly inside filter()
ggplot(aes(vore, sleep_total, fill = vore)) +
geom_boxplot()
msleep %>%
filter(!is.na(vore)) %>% # Refer to the column directly inside filter()
ggplot(aes(vore, sleep_total)) +
geom_boxplot() +
geom_jitter(aes(color=vore)) +
labs(x= "Diet", y = "Total sleep time (h)", color ="Diet type")