Introduction to R and RStudio

R is a powerful programming language for data analysis, graphics, and especially statistical analysis. It is freely available to the public at www.r-project.org with easy-to-install binaries for Linux, macOS, and Windows. This notebook provides an introduction to R, designed for students and researchers. No prior experience is required, although some familiarity with scripting languages such as Bash, Matlab, or Python is helpful.
A convenient way to program in R is to install an editor such as RStudio or Visual Studio Code. Here, we'll learn some of the fundamentals of R and how to load and process data.



A Brief Introduction

When starting out in R (or any programming language), we learn three fundamental skills:

  • 1. Reading Code: Initially, you'll encounter pre-built code. Reading is the first step to familiarizing yourself with the logic, commands, and how R "speaks." Example: understanding what mean(c(1, 2, 3)) means.
  • 2. Understanding Code: Here you begin to understand why that code is there, what it does, and how it connects to the data. Example: realizing that mean() calculates the mean and that c(1, 2, 3) creates a vector (and asking what a vector is).
  • 3. Writing Code: The pinnacle of learning! With practice, you'll begin to write your own scripts and functions—whether for graphing, analysis, or automating tasks. Example: creating a graph with plot(x, y) and adjusting the parameters however you like.

    Comments

    Comments are parts of a program that describe a piece of code; R ignores them when it runs. We make a comment with the # symbol.
    What are comments for? Making code more readable, temporarily disabling a line of code, and sketching pseudocode before writing the real thing.

    
    # Anything that starts with # is a comment and is not executed
    # This is used to explain what the code does
    # or to leave notes or reminders
    # or to disable a line of code
    # or to make the code more readable or organized
                    

    Variable

    Different types of values, such as characters, numbers, or logical values, can be assigned to a variable.

    
    x <- 10 # The most common form of assignment
    y = 20 # Also works                    
                    

    Variable names cannot contain spaces. If you type a name made of several words separated by spaces (for example, patient age <- 25), R will throw an error because it reads each word as a separate object or function instead of one name.
    You can use the underscore (_), the period (.), or capitalization (CamelCase) to separate words in variable names, like this:

    
    patient_age <- 25
    patient.age <- 25
    PatientAge <- 25                    
                    

    In other words, whenever you create a variable, avoid spaces and use legal characters. Note that although each spelling is acceptable, each creates a different variable.


    R differentiates EVERYTHING

    When we speak or write with other people, we understand small errors, language variations, and misspellings. R is not like that.


    In R, "age", "Age", "ages", and "Ages" are completely different names. This is because the language is case-sensitive and matches names exactly.


    Each name refers to exactly one object within a workspace, so every time you spell a name differently you create a new object rather than reusing the old one. Keeping this in mind avoids confusion and ensures that you know exactly which variable you're using.
    Therefore, create variable names that make sense, are easy to remember, and are different enough for YOU to work with.
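
    As a minimal sketch of case sensitivity, the snippet below assigns two names that differ only in capitalization and checks that a third spelling was never created. The names and values are illustrative only.

```r
# Two names that differ only in capitalization are two separate objects.
age <- 25
Age <- 40

age    # 25
Age    # 40

# exists() reports whether a name is defined in the workspace.
exists("age")                     # TRUE
exists("AGE_not_defined_here")    # FALSE: this spelling was never assigned
```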

    Rules for naming variables:

  • A variable name can contain letters, digits, periods, and underscores.
  • Start with a letter (never a digit or an underscore!).
  • A name may also start with a period, but then the period cannot be followed by a digit.
  • R (like many other programming languages) is case-sensitive. This means that age and Age are different variables.
  • Avoid accents and special characters.
  • Be descriptive, but don't overdo it: if_it_is_too_long_it_is_bad_to_reuse.
  • Write simply and in a standardized format. The clearer your variable name is, the easier it will be to reuse.
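
    One possible way to check the naming rules is base R's make.names(), which rewrites an invalid name into a valid one. In this sketch a name counts as valid when make.names() leaves it unchanged; the candidate names are made up for illustration.

```r
# Candidate names; the last three break the rules above.
candidates <- c("patient_age", "patient.age", "PatientAge",
                "2nd_patient", "patient age", "_patient")

# A name is syntactically valid when make.names() leaves it unchanged.
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates
is_valid

# make.names() also shows how R would repair an invalid name.
make.names("patient age")   # "patient.age"
```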

    What does <- mean?

    This arrow is used to assign a value to an object. It's like saying, "Put this value here inside this little box." Or, "x is now worth 10 and y is now worth 20."


    The equals sign (=) also assigns values, but the "<-" arrow is the traditional and most widely used form in the R community, especially in statistics and data science. It's like a language tradition: many say the arrow shows where the value comes from and where it goes.


    On the keyboard, to do "<-", you type "<" and then "-", and R automatically understands it as an assignment.

    Data and Data Types

    In R, data is typically organized into tables (data frames), which resemble Excel spreadsheets. Each column represents a variable (age, name, height, etc.) and each row represents an observation (a person, an animal, an experiment, etc.).
    You can also create data manually:


    
    # There are several data types in R, but the most common are:
    # Numbers
    x <- 10.5 # Decimal numbers
    
    # Integer
    y <- 5L # Integer numbers
    
    # Text (we call them "strings")
    name <- "my name is:" # Text in quotes (fill in your name)
    
    # Logical (boolean)
    true <- TRUE # True
    false <- FALSE # False
    
    # Vectors
    ages <- c(25, 30, 35) # Vector of numbers                    
                    
    
                    

    c() is a function that stands for "combine" or "concatenate," used to create vectors.


    All elements of a vector must be of the same type; if you mix types, R converts them to a common one. If you list several values without c(), R won't understand that you want to create a vector and will throw an error. When you use c(), R understands that you are combining these values into a single object, forming a vector.


    Use c() whenever you want to create a vector with MULTIPLE values, whether numbers, text, or logical values.
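
    The "same type" rule can be seen with a short sketch: when the values passed to c() have different types, R converts them all to one common type. The variable names mixed and flags are illustrative.

```r
# Elements of one type keep that type.
ages <- c(25, 30, 35)
class(ages)     # "numeric"

# Mixing numbers and text turns everything into text.
mixed <- c(25, "thirty", 35)
class(mixed)    # "character"
mixed           # "25" "thirty" "35"

# Mixing numbers and logical values turns TRUE/FALSE into 1/0.
flags <- c(10, TRUE, FALSE)
class(flags)    # "numeric"
flags           # 10 1 0
```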

    Data frame

    A data frame is like an Excel spreadsheet: each column is a variable and each row is an observation. You can import your own data or build it within R.

    
    my_dataframe <- data.frame(
      Name = c("Camila", "Matias", "Juan"),
      Age = c(21,33,19),
      Student = c(TRUE, FALSE, TRUE)
    )
    
    my_dataframe                    
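
    Once a data frame exists, its columns and rows can be read back out. The sketch below rebuilds the same small table so it runs on its own, then shows a few common ways to access it.

```r
my_dataframe <- data.frame(
  Name = c("Camila", "Matias", "Juan"),
  Age = c(21, 33, 19),
  Student = c(TRUE, FALSE, TRUE)
)

dim(my_dataframe)          # 3 rows, 3 columns
my_dataframe$Age           # 21 33 19
mean(my_dataframe$Age)     # about 24.33
my_dataframe[2, "Name"]    # "Matias"

# Keep only the rows where Student is TRUE.
students <- my_dataframe[my_dataframe$Student, ]
students
```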
                    

    Matrix

    A matrix is a table of only numbers (or a single type of data), organized into rows and columns, as in mathematics.

    
    m <- matrix(1:9, nrow = 3, ncol = 3)
    
    m                    
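
    By default matrix() fills the values column by column, which can be surprising. A short sketch comparing the default with byrow = TRUE (the name m_by_row is illustrative):

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
m[2, 3]           # 8: row 2, column 3 (values fill column by column)

m_by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
m_by_row[2, 3]    # 6: the same position when filling row by row

dim(m)            # 3 3

# Transposing the column-filled matrix gives the row-filled one.
identical(t(m), m_by_row)   # TRUE
```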
                    

    Strings

    In programming, strings are sequences of characters—that is, any set of letters, numbers, symbols, or spaces. They are used to represent text.

    
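    # Define an example string (this value is only a placeholder)
    my_string1 <- "Introduction to R"

    # Finding the length of a string
    nchar(my_string1)
    
    # Joining strings together
    paste(my_string1, "2023")
    
    # Splitting a string into words
    txt <- "R is the statistical analysis language"
    unlist(strsplit(txt, split = " "))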
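
    A few more base R string helpers, sketched with made-up values: paste() joins with a space by default, sep changes the separator, and paste0() joins with nothing in between.

```r
course <- "R"
year <- 2023

paste(course, year)               # "R 2023" (default separator is a space)
paste(course, year, sep = "-")    # "R-2023"
paste0(course, year)              # "R2023"  (no separator)

toupper("rstudio")                # "RSTUDIO"
nchar(paste0(course, year))       # 5
```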
                    

    Vector

    A vector is a basic structure that stores a collection of elements of the same type: numbers, text (strings), logical values (TRUE/FALSE), etc. You can perform operations on vectors, such as adding, filtering, and comparing.

    
    # Vector
    my_vector1 <- c(1,2,3)
    my_vector2 <- c(4,5,6)
    another_vector <- c("Camila", "Matias", "Juan")
    
    class(my_vector1)
    my_vector1                    
                    
    
    class(another_vector)
    another_vector                    
                    
    
    # Creating vectors with R functions
    my_name <- rep("Alejandra", times = 5)
    my_name
    
    my_seq <- rep(c(1,3,5), times = 3)
    my_seq
    
    my_seq <- rep(c(1,3,5), each = 3)
    my_seq
    
    my_seq <- -10:10
    my_seq
    
    my_seq <- seq(from = -10, to = 10, by = 2)
    my_seq                    
                    

    List

    A list is a "pocket" that can hold anything: numbers, text, vectors, data frames, even functions!
    Think of a shopping list: Soap, Fabric softener, Apple, Orange, Biscuits, Bread, etc.

    Notice that the items aren't part of the same categories (food, cleaning supplies, hygiene products), but they're all in the same list. It works similarly in R:

    
    my_list <- list(City = "Salvador", Age = 476, Beaches = c("Barra", "Itapuã", "Piatã", "Ribeira"))
    
    my_list                    
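
    Elements of a list can be pulled out by name. The sketch below uses a small illustrative list (trip is a made-up name) to show $, double brackets, and the difference from single brackets.

```r
# A small list mixing text, a number, and a character vector.
trip <- list(City = "Salvador", Days = 3, Beaches = c("Barra", "Ribeira"))

trip$City          # "Salvador"
trip[["Days"]]     # 3
trip$Beaches[2]    # "Ribeira"

# Single brackets return a shorter list, not the element itself.
class(trip["Days"])      # "list"
class(trip[["Days"]])    # "numeric"

length(trip)       # 3
```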
                    

    Factor

    A factor is used to represent categorical variables, such as "sex", "marital status", or "yes/no response". Behind the scenes, R stores these values as integer codes, each with a text label called a level.

    
    sex <- factor(c("F", "M", "F"))                    
                    

    R understands this as a variable with two levels: "F" and "M".

    Factors are essential in statistical analysis because they treat categories as distinct levels, not just any text.
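
    To see what a factor stores, this short sketch inspects the same three-value example with a few base R functions.

```r
sex <- factor(c("F", "M", "F"))

levels(sex)        # "F" "M"
nlevels(sex)       # 2
as.integer(sex)    # 1 2 1  (the codes stored behind the labels)
table(sex)         # counts per level: F = 2, M = 1
```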

    Operations in R

    R is a language created by statisticians to solve statistical and mathematical problems. Therefore, performing operations in this language is quite similar to what we already know in traditional mathematics.

    
    my_result <- 2 + 3
    
    1 - my_result
    
    6 * 4
    
    2 ^ 3
    
    10 / 5
    
    log10(1000)
    
    log2(32)
    
    sqrt(144)                    
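
    As in traditional mathematics, R follows an order of operations. A brief sketch, which also shows integer division and the remainder operator:

```r
2 + 3 * 4      # 14: multiplication happens before addition
(2 + 3) * 4    # 20: parentheses change the order
-2 ^ 2         # -4: the exponent is applied before the minus sign

7 %/% 2        # 3: integer division
7 %% 2         # 1: remainder of the division
```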
                    

    Vector Operations

    
    my_vector1
    my_vector2                    
                    
    
    # Finding the length of a vector
    length(my_vector1)                    
                    
    
    # Math
    my_vector1 * my_vector2
    my_vector1 + my_vector2
    my_vector1 - my_vector2
    my_vector1 / my_vector2                    
                    
    
    # Indexing is a way to select (include/exclude) particular elements from a variable
    my_vector2
    my_vector2[1]
    my_vector2[-c(1,3)]
    my_vector2[2:3]
    my_vector2[c(1,3)] # First and third only                    
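
    Besides positions, a condition can also be used to select elements. A minimal sketch of this logical indexing, re-creating my_vector2 so it runs on its own:

```r
my_vector2 <- c(4, 5, 6)

my_vector2 > 4                       # FALSE TRUE TRUE
my_vector2[my_vector2 > 4]           # 5 6
my_vector2[my_vector2 %% 2 == 0]     # 4 6 (even values only)
which(my_vector2 > 4)                # 2 3 (positions that satisfy the condition)
```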
                    
    
    # Create a vector with some random values
    some_values <- c(8, 6, 1, 12, 3)
    
    mean(some_values)
    
    median(some_values)
    
    sort(some_values)
    
    # Quartiles  (25% vs 75%)
    quantile(some_values, 0.25)
    quantile(some_values, 0.75)
    
    # Interquartile range
    IQR(some_values)
    
    # Standard deviation
    # In R, sd() and var() treat the data as a sample (denominator n - 1)
    sd(some_values)
    
    # Variance
    var(some_values)
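
    The n - 1 denominator can be checked by hand. This sketch recomputes the variance of the same values manually and compares it with var() and sd(); manual_var is an illustrative name.

```r
some_values <- c(8, 6, 1, 12, 3)
n <- length(some_values)

# Sample variance: squared deviations from the mean, divided by n - 1.
manual_var <- sum((some_values - mean(some_values))^2) / (n - 1)
manual_var          # 18.5
var(some_values)    # 18.5

# The standard deviation is the square root of the variance.
sqrt(manual_var)
sd(some_values)

# A population variance (dividing by n) needs a manual rescaling.
population_var <- manual_var * (n - 1) / n
population_var      # 14.8
```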
                    

    Packages (or libraries)

    Packages are collections of ready-made functions that help you perform specific tasks. For example:

  • ggplot2 — for elegant graphs
  • dplyr — for manipulating data
  • readr — for reading files

    Before using a package, you need to:

    # 1. Install it (only once):
    install.packages("ggplot2")

    # 2. Load it (every time you use it):
    library(ggplot2)
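
    Since installation is needed only once, it can help to check first whether a package is already there. The sketch below uses base R's requireNamespace(); is_installed is a hypothetical helper name, and the install line is commented out so nothing is downloaded.

```r
# Returns TRUE when the package is installed, without attaching it.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")    # TRUE: stats ships with R

# Install only when missing (commented out so nothing is downloaded here).
# if (!is_installed("ggplot2")) install.packages("ggplot2")

# Call one function without library(), using the package::function form.
stats::median(c(1, 5, 9))   # 5
```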
                    
    What are functions?

    Functions are ready-made commands that do something for you. You've seen some above. They follow the structure:

    
    function_name(argument1, argument2, ...)

    # Example:
    mean(c(1, 2, 3, 4))
                    
  • mean is the name of the function.
  • c(1, 2, 3, 4) is the argument (a vector of numbers).

    Functions save time and organize what you need to do. There are hundreds of them in R, and you can even create your own or use those written by others.
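
    Creating your own function uses the function keyword. A minimal sketch with a hypothetical helper, celsius_to_fahrenheit, which is not part of R:

```r
# A hypothetical helper that converts Celsius to Fahrenheit.
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

celsius_to_fahrenheit(100)         # 212
celsius_to_fahrenheit(c(0, 37))    # 32.0 98.6 (works on whole vectors too)
```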

    What are function arguments?

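    When we use a function in R, we need to pass it the necessary information so it knows what to do. These pieces of information are called arguments; think of them as the settings of an experiment. In the example below, 3.14159 is the value to round and digits = 2 says how many decimal places to keep.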

    
    round(3.14159, digits = 2)                    
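
    Arguments can be matched by name or by position, and many have default values. A short sketch using round() and seq():

```r
round(3.14159, digits = 2)    # 3.14 (named argument)
round(3.14159, 2)             # 3.14 (same argument, matched by position)
round(3.14159)                # 3    (digits defaults to 0)

# Named arguments can be given in any order.
seq(to = 10, from = 0, by = 5)    # 0 5 10

# args() prints the arguments a function accepts.
args(round)
```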
                    

    Plotting Numerical Data

    There are several ways to create graphs in R, but in this lesson we'll focus on the ggplot2 package, a powerful and elegant tool for data visualization. It's based on the so-called grammar of graphics, which allows you to build graphs in a structured and intuitive way, as if you were writing meaningful sentences.


    The grammar of graphics is a set of rules that defines how to build visualizations in a logical and standardized way. The great advantage is that, by following this grammar, you learn a single structure that serves to create different types of graphs, without having to memorize different arguments for each one.


    This package stands out because it gives you fine-grained control over the graph: you decide every part (axes, colors, visualization types, titles, legends), and everything can be added and adjusted with layers. This makes your graphs more personalized and informative.


    The three main components of the grammar of graphics are:


  • Data: the observations in our dataset.
  • Aesthetics: Mappings of data to visual properties (such as axes and sizes of geometric objects).
  • Geometries: Geometric objects, such as lines, that represent what we see in the graph.
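
    The three components map directly onto a ggplot2 call. The sketch below assumes ggplot2 is installed and uses a tiny made-up data frame (animals, with illustrative values only) so that each piece is easy to spot.

```r
library(ggplot2)

# 1. Data: a small made-up data frame (values are illustrative only).
animals <- data.frame(
  body_mass_kg = c(0.5, 4, 60, 500),
  sleep_hours  = c(14, 12, 8, 3)
)

# 2. Aesthetics: map columns to the x and y axes.
# 3. Geometry: draw one point per row.
p <- ggplot(data = animals, mapping = aes(x = body_mass_kg, y = sleep_hours)) +
  geom_point()

p
```

    Swapping geom_point() for another geometry, such as geom_line(), changes the type of graph while the data and aesthetics stay the same.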

    Useful Resources:


    PDF of the documentation: https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf


    Here's the link to the website: https://ggplot2.tidyverse.org/


    And here's the link to the book: https://ggplot2-book.org/

    
    # Load the tidyverse package
    library(tidyverse)                    
                    
    
    # Load the msleep dataset
    data(msleep)
    
    # Check the structure of the msleep dataset
    str(msleep)                    
                    
    
    nrow(msleep)
    ncol(msleep)
    head(msleep)                    
                    
    
    # Are there NA values? How many?
    any(is.na(msleep))
    sum(is.na(msleep))                    
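
    The total number of NA values does not say where they are. One possible follow-up, sketched on a small made-up table (sleep_data and its values are hypothetical), counts them per column and finds the complete rows.

```r
# A small made-up table with some missing values.
sleep_data <- data.frame(
  name        = c("A", "B", "C", "D"),
  sleep_total = c(12.5, 10.1, NA, 9.0),
  sleep_rem   = c(3.2, NA, NA, 1.9)
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(sleep_data))
na_per_column                        # name 0, sleep_total 1, sleep_rem 2

# Rows without any NA.
sum(complete.cases(sleep_data))      # 2
sleep_data[complete.cases(sleep_data), ]
```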
                    
    
    # Leave out any observation with NA values
    # (this only prints the result; assign it to a name to keep it)
    msleep %>%
      drop_na()
    
    # Equivalent to:
    # drop_na(msleep)
                    
    
    # Is there a relationship between sleep total and sleep rem?
    # How can we check that?
    # Plotting maybe?
    
    plot(msleep$sleep_total, msleep$sleep_rem)                    
                    
    
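    # cor() returns the coefficient; cor.test() adds a p-value and confidence interval
    # (cor.test() drops incomplete pairs itself, so it takes no use = argument)
    cor(msleep$sleep_total, msleep$sleep_rem, use = "complete.obs")
    cor.test(msleep$sleep_total, msleep$sleep_rem, method = "pearson")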
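
    The object returned by cor.test() can be stored and its parts read individually. A minimal sketch, assuming ggplot2 is installed (it provides msleep); result is an illustrative name.

```r
library(ggplot2)   # provides the msleep dataset

result <- cor.test(msleep$sleep_total, msleep$sleep_rem, method = "pearson")

result$estimate    # the correlation coefficient
result$p.value     # the p-value of the test
result$conf.int    # confidence interval for the coefficient

# cor() gives the same coefficient once incomplete pairs are dropped.
coefficient <- cor(msleep$sleep_total, msleep$sleep_rem, use = "complete.obs")
coefficient
```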
                    
    
    # Comparing means (by default, t.test() runs Welch's two-sample t-test)
    t.test(msleep$sleep_total, msleep$sleep_rem)
                    

    Applying the ggplot2 package

    
    # Using tidyverse
    msleep %>%
      ggplot(aes(x = sleep_total, y = sleep_rem)) +
      geom_point()                    
                    
    
    #Directly calling ggplot
    ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) +
      geom_point()                    
                    
    
    #Saving plots
    g <- ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) +
      geom_point()
    ggsave(filename = "test.pdf", plot = g)                    
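
    ggsave() also accepts the size of the output. The sketch below assumes ggplot2 is installed and writes to a temporary folder; the file name and dimensions are arbitrary choices for illustration.

```r
library(ggplot2)

g <- ggplot(msleep, aes(x = sleep_total, y = sleep_rem)) +
  geom_point()

# Write to a temporary folder so the example leaves nothing behind.
out_file <- file.path(tempdir(), "sleep_plot.pdf")

# width and height are interpreted in the given units.
ggsave(filename = out_file, plot = g, width = 6, height = 4, units = "in")

file.exists(out_file)   # TRUE if the plot was written
```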
                    
    
    # Alternative: open a PDF device, print the plot, and close the device
    pdf("test2.pdf")
    print(g)
    dev.off()
                    
    
    # Add custom axis titles
    msleep %>%
      ggplot(aes(sleep_total, sleep_rem)) + # Implicit or explicit arguments?
      geom_point() +
      xlab("Total sleep time (h)") +
      ylab("REM sleep time (h)")                    
                    
    
    msleep %>%
      ggplot(aes(sleep_total, sleep_rem)) +
      geom_point(color = "red") +  # color names or hex codes work; see color-hex.com
      xlab("Total sleep time (h)") +
      ylab("REM sleep time (h)")                    
                    
    
    # How can we colour the points by vore?
    str(msleep)                    
                    
    
    msleep %>%
      ggplot(aes(sleep_total, sleep_rem, color = vore)) +
      geom_point() +
      xlab("Total sleep time (h)") +
      ylab("REM sleep time (h)")                    
                    

    Boxplots

    
    # Base R
    boxplot(sleep_total ~ vore, data = msleep)                    
                    
    
    # ggplot2
    msleep %>%
      ggplot(aes(vore, sleep_total)) +
      geom_boxplot()
                    
    
    msleep %>%
      filter(!is.na(vore)) %>%
      ggplot(aes(vore, sleep_total, fill = vore)) +
      geom_boxplot()                    
                    
    
    msleep %>%
      filter(!is.na(vore)) %>%
      ggplot(aes(vore, sleep_total)) +
      geom_boxplot() +
      geom_jitter(aes(color=vore)) +
      labs(x= "Diet", y = "Total sleep time (h)", color ="Diet type")