Introduction to R and RStudio

R is a powerful programming language for data analysis, graphics, and especially statistical analysis. It is freely available to the public at www.r-project.org with easy-to-install binaries for Linux, macOS, and Windows. This notebook provides an introduction to R, designed for students and researchers. No prior experience is required, although some familiarity with scripting languages such as Bash, Matlab, or Python is helpful.
A convenient way to program in R is to install an editor such as RStudio or Visual Studio Code. Here, we'll learn some of the fundamentals of R and how to load and process data.

A Brief Introduction

When starting out in R (or any programming language), we learn three fundamental skills:

1. Reading Code: Initially, you'll encounter pre-built code. Reading is the first step to familiarizing yourself with the logic, commands, and how R "speaks." Example: understanding what mean(c(1, 2, 3)) means.

2. Understanding Code: Here you begin to understand why that code is there, what it does, and how it connects to the data. Example: realizing that mean() calculates the mean and c(1, 2, 3) creates a vector (and then asking what a vector is).

3. Writing Code: The pinnacle of learning! With practice, you'll begin to write your own scripts and functions—whether for graphing, analysis, or automating tasks. Example: creating a graph with plot(x, y) and adjusting the parameters however you like.

Comments

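Comments are parts of a computer program that describe a piece of code; R ignores them when it runs. We can make a comment using the # symbol.
Comments serve three main purposes: making code readable, temporarily disabling a line of code, and sketching pseudocode before writing the real thing.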


# Anything that starts with # is a comment and is not executed
# This is used to explain what the code does
# or to leave notes or reminders
# or to disable a line of code
# or to make the code more readable or organized
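
As a small illustration of these uses, the sketch below explains a calculation, disables one line, and places a comment after code on the same line. The names measurements and average are made up for this example.

```r
# Compute the average of three measurements (this comment explains intent)
measurements <- c(1, 2, 3)
average <- mean(measurements)

# The next line is "commented out", so R skips it:
# average <- median(measurements)

average  # a comment can also follow code on the same line
```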

Variable

Different types of values, such as characters, numbers, or logical values, can be assigned to a variable.


x <- 10 # The most common form of assignment
y = 20 # Also works

Variable names cannot contain spaces. If you write something like patient age <- 25, R throws an error because it reads patient and age as two separate names instead of one.
You can use the underscore (_), the period (.), or capital letters (CamelCase) to separate words in variable names, like this:


patient_age <- 25
patient.age <- 25
PatientAge <- 25

In other words, whenever you create a variable, avoid spaces and use legal characters. Note that although each spelling is acceptable, each creates a different variable.

R differentiates EVERYTHING

When we speak or write with other people, we understand small errors, language variations, and misspellings. R is not like that.

In R, "age", "Age", "ages", and "Ages" are four completely different names. R is case-sensitive and matches names exactly, character by character.

Each name refers to one object within a workspace, so every time you spell a name differently you create a new object instead of reusing the old one. A typo therefore does not update your variable; it silently creates another.
Therefore, create variable names that make sense, are easy to remember, and are different enough for YOU to tell apart.
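
A minimal sketch of what this means in practice; the values 30 and 45 are arbitrary and chosen only for illustration:

```r
age <- 30   # lowercase name
Age <- 45   # capitalized name: a separate object

age                  # 30
Age                  # 45
identical(age, Age)  # FALSE: the two names hold different values
```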

Rules for naming variables:

A variable name can be created using letters, digits, periods, and underscores

Start with a letter or a period (never a digit or an underscore!)

If a variable name starts with a period, the next character cannot be a digit (.2x is invalid, .x2 is fine).

R (like other programming languages) is case-sensitive. This means that age and Age are different variables.

Avoid accents and special characters

Be descriptive, but don't overdo it.

Write simply and in a standardized format. The clearer your variable is, the easier it will be to reuse.

if_it_is_too_long_it_is_bad_to_reuse
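
One way to check these rules without trial and error is the base R function make.names(), which rewrites a name only when it is not syntactically valid. The sketch below relies on that behavior; the candidate names are examples made up for this check.

```r
candidates <- c("patient_age", "patient.age", "PatientAge",
                "patient age", "2nd_visit", ".2x", ".x2")

# make.names() changes a name only if it is invalid,
# so a name that comes back unchanged is a valid one
is_valid <- candidates == make.names(candidates)

data.frame(name = candidates, valid = is_valid)
```

The first three names and .x2 come back as valid; the name with a space, the name starting with a digit, and .2x do not.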

What does <- mean?

This arrow is used to assign a value to an object. It's like saying, "Put this value here inside this little box." Or, "x is now worth 10 and y is now worth 20."

Although = also works, the "<-" arrow is the traditional and most widely used form in the R community, especially in statistics and data science. It's like a language tradition: many say the arrow shows where the value comes from and where it goes.

To type "<-", press "<" followed by "-" with no space between them; R reads the pair as a single assignment operator.
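
A short sketch comparing the two assignment forms. The names a, b, and rounded are illustrative only.

```r
a <- 10          # arrow assignment, the conventional style
b = 10           # equals sign, also valid at the top level
identical(a, b)  # TRUE: both forms stored the same value

# Inside a function call, = names an argument rather than assigning a variable
rounded <- round(3.14159, digits = 2)
rounded          # 3.14
```

Keeping "<-" for assignment and "=" for function arguments makes the two roles easy to tell apart when reading code.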

Data and Data Types

In R, data is typically organized into tables (data frames), which resemble Excel spreadsheets. Each column represents a variable (age, name, height, etc.) and each row represents an observation (a person, an animal, an experiment, etc.).
You can also create data manually, one value at a time:


# There are several data types in R, but the most common are:
# Numbers
x <- 10.5 # Decimal numbers

# Integer
y <- 5L # Integer numbers

# Text (we call them "strings")
name <- "my name is:" # Text in quotes (fill in your name)

# Logical (boolean)
true <- TRUE # True
false <- FALSE # False

# Vectors
ages <- c(25, 30, 35) # Vector of numbers

c() is a function that stands for "combine" or "concatenate," used to create vectors.

All elements of a vector must be of the same type; if you mix types, R converts them all to a common type. If you don't use c(), R won't understand that you want to create a vector and will throw an error: ages <- 25, 30, 35 fails. When you use c(), R understands that you are combining these values into a single object, forming a vector.

Use c() whenever you want to create a vector with MULTIPLE values, whether numbers, text, or logical values.
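
The "same type" rule can be seen directly: when values of different types are combined, R converts them to one common type. A minimal sketch, with made-up values:

```r
numbers <- c(25, 30, 35)
mixed   <- c(25, "thirty", TRUE)

class(numbers)  # "numeric"
class(mixed)    # "character": every element was converted to text
mixed           # "25" "thirty" "TRUE"
```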

Data frame

A data frame is like an Excel spreadsheet: each column is a variable and each row is an observation. You can import your own data or build it within R.


my_dataframe <- data.frame(
  Name = c("Camila", "Matias", "Juan"),
  Age = c(21,33,19),
  Student = c(TRUE, FALSE, TRUE)
)

my_dataframe
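
Once a data frame exists, columns and rows can be pulled out of it. The sketch below rebuilds the same small table so that it runs on its own; mean_age is a name introduced just for this example.

```r
my_dataframe <- data.frame(
  Name = c("Camila", "Matias", "Juan"),
  Age = c(21, 33, 19),
  Student = c(TRUE, FALSE, TRUE)
)

my_dataframe$Age                            # one column, returned as a vector
my_dataframe[2, ]                           # second row (one observation)
my_dataframe[my_dataframe$Student, "Name"]  # names of the rows where Student is TRUE
mean_age <- mean(my_dataframe$Age)          # summary of a single column
dim(my_dataframe)                           # 3 rows, 3 columns
```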

Matrix

A matrix is a table of only numbers (or a single type of data), organized into rows and columns, as in mathematics.


m <- matrix(1:9, nrow = 3, ncol = 3)

m
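
Note that matrix() fills the table column by column unless told otherwise. A brief sketch of the difference, using the names by_column and by_row for illustration:

```r
by_column <- matrix(1:9, nrow = 3, ncol = 3)                # default: fill column by column
by_row    <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)  # fill row by row instead

by_column[1, ]   # first row: 1 4 7
by_row[1, ]      # first row: 1 2 3
by_column[2, 3]  # row 2, column 3: 8
dim(by_column)   # 3 3
```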

Strings

In programming, strings are sequences of characters—that is, any set of letters, numbers, symbols, or spaces. They are used to represent text.


# Define an example string (any text works here)
my_string1 <- "Welcome to R"

# Finding the length of a string
nchar(my_string1)

# Joining strings together
paste(my_string1, "2023")

# Splitting strings
txt <- "R is the statistical analysis language"
unlist(strsplit(txt, split = " "))
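
A few more base R string functions, shown as a sketch. The strings first and second are hypothetical examples.

```r
first <- "R"
second <- "language"

paste(first, second)             # "R language" (a space is inserted by default)
paste(first, second, sep = "-")  # "R-language"
paste0(first, second)            # "Rlanguage" (no separator)
toupper(second)                  # "LANGUAGE"
nchar(second)                    # 8
```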

Vector

A vector is a basic structure that stores a collection of elements of the same type: numbers, text (strings), logical values (TRUE/FALSE), etc. You can perform operations on vectors, such as arithmetic, filtering, and comparison.

A vector of numbers: c(1, 2, 3, 4)
A vector of text: c("apple", "banana", "orange")
A logical vector: c(TRUE, FALSE, TRUE)


# Vector
my_vector1 <- c(1,2,3)
my_vector2 <- c(4,5,6)
another_vector <- c("Camila", "Matias", "Juan")

class(my_vector1)
my_vector1


class(another_vector)
another_vector


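# Creating vectors with R functions
my_name <- rep("Alejandra", times = 5) # Repeat one value 5 times
my_name

my_seq <- rep(c(1, 3, 5), times = 3) # Repeat the whole vector: 1 3 5 1 3 5 1 3 5
my_seq

my_seq <- rep(c(1, 3, 5), each = 3) # Repeat each element: 1 1 1 3 3 3 5 5 5
my_seq

my_seq <- -10:10 # Every integer from -10 to 10
my_seq

my_seq <- seq(from = -10, to = 10, by = 2) # From -10 to 10 in steps of 2
my_seq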

List

A list is a "pocket" that can hold anything: numbers, text, vectors, data frames, even functions!
Think of a shopping list: soap, fabric softener, apples, oranges, biscuits, bread, etc.

Notice that the items aren't part of the same categories (food, cleaning supplies, hygiene products), but they're all in the same list. It works similarly in R:


# Text values must be quoted; without quotes R looks for objects named Barra, Itapuã, ...
my_list <- list(City = "Salvador", Age = 476, Beaches = c("Barra", "Itapuã", "Piatã", "Ribeira"))

my_list
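
Individual elements of a list can be retrieved by name. The sketch below rebuilds the list so that it runs on its own and shows the two usual access forms, $ and [[ ]].

```r
my_list <- list(
  City = "Salvador",
  Age = 476,
  Beaches = c("Barra", "Itapuã", "Piatã", "Ribeira")
)

my_list$City        # "Salvador"
my_list[["Age"]]    # 476
my_list$Beaches[2]  # second element of the Beaches vector
length(my_list)     # 3 elements in the list
```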

Factor

A factor is used to represent categorical variables, such as "sex," "marital status," or "yes/no response." Behind the scenes, R stores each value as an integer code and keeps a separate set of labels, called levels, for those codes.


sex <- factor(c("F", "M", "F"))

R understands this as a variable with two levels: "F" and "M."

Factors are essential in statistical analysis because they treat categories as distinct levels, not just any text.
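
The levels and the codes behind them can be inspected directly. A minimal sketch using the same factor:

```r
sex <- factor(c("F", "M", "F"))

levels(sex)      # "F" "M"
as.integer(sex)  # 1 2 1: the integer codes stored behind the labels
table(sex)       # number of observations in each level
```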

Operations in R

R is a language created by statisticians to solve statistical and mathematical problems. Therefore, performing operations in this language is quite similar to what we already know from traditional mathematics.


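my_result <- 2 + 3 # Addition, stored in my_result (5)

1 - my_result # Subtraction: -4

6 * 4 # Multiplication: 24

2 ^ 3 # Exponentiation: 8

10 / 5 # Division: 2

log10(1000) # Base-10 logarithm: 3

log2(32) # Base-2 logarithm: 5

sqrt(144) # Square root: 12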

Vector Operations


my_vector1
my_vector2


# Finding the length of a vector
length(my_vector1)


# Math
my_vector1 * my_vector2
my_vector1 + my_vector2
my_vector1 - my_vector2
my_vector1 / my_vector2


# Indexing selects (or excludes) particular elements of a vector; positions start at 1
my_vector2
my_vector2[1] # First element
my_vector2[-c(1,3)] # Everything except the first and third
my_vector2[2:3] # Second to third
my_vector2[c(1,3)] # First and third only
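
Indexing also accepts a logical condition, which is how vectors are filtered. The sketch below redefines my_vector2 so that it runs on its own; is_large is a name introduced for illustration.

```r
my_vector2 <- c(4, 5, 6)

is_large <- my_vector2 > 4   # FALSE TRUE TRUE
my_vector2[is_large]         # 5 6: keeps the elements where the condition is TRUE
my_vector2[my_vector2 != 5]  # 4 6
which(my_vector2 > 4)        # positions 2 and 3
```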


# Create a vector with some arbitrary values
some_values <- c(8, 6, 1, 12, 3)

mean(some_values)

median(some_values)

sort(some_values)

# Quartiles (25th and 75th percentiles)
quantile(some_values, 0.25)
quantile(some_values, 0.75)

# Interquartile range
IQR(some_values)

# Standard deviation
# R computes the standard deviation and the variance with the sample formula (dividing by n - 1)
sd(some_values)

# Variance
var(some_values)
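
To see what "dividing by n - 1" means, the sample variance and standard deviation can be recomputed by hand and compared with var() and sd(). This is a sketch for checking the formula, not a replacement for the built-in functions; the names n, squared_dev, manual_var, and manual_sd are introduced here.

```r
some_values <- c(8, 6, 1, 12, 3)
n <- length(some_values)

# Squared distance of each value from the mean
squared_dev <- (some_values - mean(some_values))^2

manual_var <- sum(squared_dev) / (n - 1)  # sample variance
manual_sd  <- sqrt(manual_var)            # sample standard deviation

all.equal(manual_var, var(some_values))   # TRUE
all.equal(manual_sd, sd(some_values))     # TRUE
manual_var                                # 18.5
```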

Packages (or libraries)

Packages are collections of ready-made functions that help you perform specific tasks. For example:

ggplot2 — for elegant graphs

dplyr — for manipulating data

readr — for reading files

Before using a package, you need to install it and then load it:


# 1. Install (only once):
install.packages("ggplot2")


# 2. Load (every time you use it):
library(ggplot2)
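
If you are not sure whether a package is already installed, one option is to ask R before installing. The sketch below uses the base function requireNamespace() and only prints a message; it does not install anything. The names pkg and is_installed are illustrative.

```r
pkg <- "ggplot2"

# requireNamespace() returns TRUE when the package is installed, without attaching it
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  message(pkg, " is installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is missing; run install.packages(\"", pkg, "\") once")
}
```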

What are functions?

Functions are ready-made commands that do something for you. You've seen some above. They follow the structure:


# General pattern (not runnable as written):
# function_name(argument1, argument2, ...)


# Example:
mean(c(1, 2, 3, 4))

mean is the name of the function.

c(1, 2, 3, 4) is the argument (a vector of numbers).

Functions save time and organize what you need to do. There are hundreds of them in R—and you can even create your own or use others'.
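
As a sketch of what creating your own function looks like: range_width below is a hypothetical helper written for this example, not a built-in R function.

```r
# range_width: distance between the largest and smallest value of a numeric vector
range_width <- function(values) {
  max(values) - min(values)
}

range_width(c(8, 6, 1, 12, 3))  # 11
```

The last expression evaluated inside the braces is the value the function returns.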

What are function arguments?

When we use a function in R, we need to pass it the necessary information so it knows what to do. This information is called arguments. Think of them as the inputs and settings of an experiment.


round(3.14159, digits = 2)
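
Arguments can be matched by position or by name. The sketch below shows both with round(), and uses seq() to show that named arguments may be written in any order; from_first and by_first are names introduced for this example.

```r
round(3.14159, digits = 2)  # named second argument: 3.14
round(3.14159, 2)           # same call, matched by position: 3.14
round(3.14159)              # digits defaults to 0, so the result is 3

# Named arguments can appear in any order
from_first <- seq(from = -10, to = 10, by = 5)
by_first   <- seq(by = 5, to = 10, from = -10)
identical(from_first, by_first)  # TRUE
```

Naming arguments costs a few keystrokes but makes the call easier to read later.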

Plotting Numerical Data

There are several ways to create graphs in R, but in this lesson we'll focus on the ggplot2 package, a powerful and elegant tool for data visualization. It's based on the so-called grammar of graphics, which lets you build graphs in a structured and intuitive way, as if you were writing meaningful sentences.

The grammar of graphics is a set of rules that defines how to build visualizations in a logical and standardized way. The great advantage is that, by following this grammar, you learn a single structure that serves to create different types of graphs, without having to memorize different arguments for each one.

This package stands out because it gives you fine-grained control over the graph: axes, colors, geometry types, titles, and legends can all be added and adjusted with layers. This makes your graphs more personalized and informative.

The three main components of the grammar of graphics are:

Data: the observations in our dataset.

Aesthetics: mappings of data to visual properties (such as axis positions, colors, and sizes of geometric objects).

Geometries: geometric objects, such as points, lines, or boxes, that represent what we see in the graph.

Useful Resources:

PDF of the documentation: https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

Here's the link to the website: https://ggplot2.tidyverse.org/

And here's the link to the book: https://ggplot2-book.org/


# Load the tidyverse package
library(tidyverse)


# Load the msleep dataset
data(msleep)

# Check the structure of the msleep dataset
str(msleep)


nrow(msleep)
ncol(msleep)
head(msleep)


# Are there NA values? How many?
any(is.na(msleep))
sum(is.na(msleep))


# Leave out any observation with NA values
# (this only prints the result; assign it to a name to keep it)
msleep %>%
  drop_na()

# Equivalent to:
# drop_na(msleep)
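
Missing values matter because many summary functions return NA as soon as one input is NA. The sketch below uses a small hypothetical vector (sleep_hours, not taken from msleep) to show the effect and two ways around it.

```r
sleep_hours <- c(12.1, NA, 8.7, 14.4)  # hypothetical values, one of them missing

is.na(sleep_hours)               # FALSE TRUE FALSE FALSE
sum(is.na(sleep_hours))          # 1 missing value
mean(sleep_hours)                # NA: the missing value propagates
mean(sleep_hours, na.rm = TRUE)  # mean of the three remaining values

complete <- sleep_hours[!is.na(sleep_hours)]  # keep only the non-missing values
length(complete)                 # 3
```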


# Is there a relationship between sleep total and sleep rem?
# How can we check that?
# Plotting maybe?

plot(msleep$sleep_total, msleep$sleep_rem)


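# cor() returns the correlation coefficient; cor.test() also tests its significance
# (cor.test() drops incomplete pairs itself, so it needs no `use` argument)
cor(msleep$sleep_total, msleep$sleep_rem, use = "complete.obs")
cor.test(msleep$sleep_total, msleep$sleep_rem, method = "pearson")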


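# Comparing means (Welch two-sample t-test by default)
# Both columns are measured on the same animals, so paired = TRUE may be more appropriate
t.test(msleep$sleep_total, msleep$sleep_rem)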

Applying the ggplot2 package


# Using the pipe (%>%) to pass the data to ggplot()
msleep %>%
  ggplot(aes(x = sleep_total, y = sleep_rem)) +
  geom_point()


# Directly calling ggplot() with the data as the first argument
ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) +
  geom_point()


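# Saving plots
g <- ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) +
  geom_point()
ggsave(filename = "test.pdf", plot = g)


# Alternative: open a PDF device, print the plot, then close the device
pdf("test2.pdf")
print(g) # print() is needed when this runs inside a script or function
dev.off()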


# Add custom axis titles
msleep %>%
  ggplot(aes(sleep_total, sleep_rem)) + # Positional arguments: x first, then y
  geom_point() +
  xlab("Total sleep time (h)") +
  ylab("REM sleep time (h)")


msleep %>%
  ggplot(aes(sleep_total, sleep_rem)) +
  geom_point(color = "red") + # Color names and hex codes both work (see color-hex.com)
  xlab("Total sleep time (h)") +
  ylab("REM sleep time (h)")


# How can we colour the points by vore?
str(msleep)


msleep %>%
  ggplot(aes(sleep_total, sleep_rem, color = vore)) +
  geom_point() +
  xlab("Total sleep time (h)") +
  ylab("REM sleep time (h)")

Boxplots


# Base R
boxplot(sleep_total ~ vore, data = msleep)


# ggplot2
msleep %>%
  ggplot(aes(vore, sleep_total)) +
  geom_boxplot()


msleep %>%
  filter(!is.na(vore)) %>% # Drop rows with a missing diet; columns are referenced by name inside filter()
  ggplot(aes(vore, sleep_total, fill = vore)) +
  geom_boxplot()


msleep %>%
  filter(!is.na(vore)) %>%
  ggplot(aes(vore, sleep_total)) +
  geom_boxplot() +
  geom_jitter(aes(color = vore)) +
  labs(x = "Diet", y = "Total sleep time (h)", color = "Diet type")