An introduction to Single cell Assay for Transposase-Accessible Chromatin sequencing (scATAC-seq)

scATAC-seq is a technique used to study chromatin accessibility at the single-cell level. Unlike scRNA-seq, which focuses on gene expression, scATAC-seq identifies regions of the genome that are open and potentially active, meaning they can be bound by transcription factors to regulate gene activity.




For this work, we will use data from the article by Kumegawa et al. (2022) entitled: "GRHL2 motif is associated with intratumor heterogeneity of cis-regulatory elements in luminal breast cancer"


In this paper, Kumegawa et al. analyze chromatin accessibility profiles of more than 10.000 cells from 16 breast cancer patients including luminal, luminal-HER2, HER2+ and 3 triple-negative subtypes. Using this profiling process, they classified cells into cancer cells and tumor microenvironment, allowing to highlight the heterogeneity of disease-related pathways. Moreover, they identified the GRHL2 transcription factor which cooperated with FOXA1 to initiate endocrine resistance and that GRHL2 binding elements potentially regulate genes associated with endocrine resistance, metastasis, and poor prognosis in patients that received hormonal therapy.


The scATAC-seq libraries were prepared with the SureCell ATAC-seq Library Preparation kit (BioRad) and a SureCell ddSEQ Index Kit (Bio-Rad). Alignment was done with ATAC-Seq Analysis Toolkit (Bio-Rad).


For this work, we will explore two breast tumor samples (TNBC and Luminal-HER2), specifically T cells. To do so, we went to retrieve the file fragment of a sample from the GEO (Gene Expression Omnibus) website using the identifier provided by the author: GSE198639.


This course is done using ArchR. For more details on ArchR see here.


ArchR provides a comprehensive suite for scATAC-seq analysis tools from pre-processing data to results, offering several levels of information. Moreover, ArchR is fast and ask a reasonable ressource usage.


For these analyses, you need (if you will do in your computer):

  • Install python3.6 or more: https://www.python.org/downloads/
  • Install conda (miniconda or anaconda, it's a package manager for python, it allows to create python environment):https://conda.io/projects/conda/en/latest/user-guide/install/index.html
  • Install macs2 package (via the terminal):
  • 
    conda create -y -n MACS2 python=3.6
    
    conda activate MACS2
    
    conda install macs2 or conda install -c bioconda macs2
                    
  • Install of R4.1.3 or more: https://cran.r-project.org/
  • install these R packages (via R environment):
  • 
    install.packages(c("devtools","BiocManager","reticulate","clustree","Seurat"))
    
    devtools::install_github("GreenleafLab/ArchR", ref="master", repos = BiocManager::repositories())
    
    ArchR::installExtraPackages()
    
    BiocManager::install("BSgenome.Hsapiens.UCSC.hg19") (or other genome if you have data from another organism or another genome reference)
    
    devtools::install_github("GreenleafLab/chromVARmotifs")
    
    install.packages("hexbin")
                    

    Download preinstalled libraries and datasets

    
    # Download the installation script from GitHub
    download.file("https://github.com/eddelbuettel/r2u/raw/master/inst/scripts/add_cranapt_jammy.sh",
                  "add_cranapt_jammy.sh")
    
    # Change the script's permissions to make it executable
    Sys.chmod("add_cranapt_jammy.sh", "0755")
    
    # Execute the script in the terminal
    system("./add_cranapt_jammy.sh")
                    
    
    system("apt install libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libbz2-dev libgsl-dev gsl-bin -y")
    system("apt install  libfontconfig1-dev libharfbuzz-dev libfribidi-dev libcairo2-dev libgmp-dev -y")
    system("apt update")
    system("apt install libmagick++-dev -y")
                    
    
    # Define a helper function to run shell commands and print their output
    shell_call <- function(command, ...) {
      result <- system(command, intern = TRUE, ...)  # Execute the command and capture output
      cat(paste0(result, collapse = "\n"))  # Print output to console
    }
    
    # Download the MACS2 package (version 2.2.9.1) using wget
    shell_call("wget https://github.com/macs3-project/MACS/archive/refs/tags/v2.2.9.1.tar.gz -O MACS.tar.gz")
    
    # Extract the downloaded tar.gz file
    system("tar -xvf MACS.tar.gz")
    
    # Install MACS2 in editable mode using pip
    shell_call("pip install -e MACS-2.2.9.1/")
                    
    
    # Set a long timeout limit to avoid download failures
    options(timeout = 1000)
    
    # Install BiocManager if not already installed
    if (!require("BiocManager", quietly = TRUE))
        install.packages("BiocManager", quiet = TRUE)
    
    # Install ArchR from GitHub using devtools
    devtools::install_github("GreenleafLab/ArchR", ref = "master", repos = BiocManager::repositories(), upgrade = FALSE)
    
    # Install additional dependencies for ArchR
    ArchR::installExtraPackages()
    
    # Install other required R packages
    install.packages("clustree", quiet = TRUE)
    install.packages("hexbin")
    
    # Install a specific version of the Matrix package from CRAN archives
    install.packages("https://cran.r-project.org/src/contrib/Archive/Matrix/Matrix_1.5-3.tar.gz", repos = NULL, type = "source")
                    
    
    # Function to run shell commands and display the output
    shell_call <- function(command, ...) {
      result <- system(command, intern = TRUE, ...)  # Run command and store output
      cat(paste0(result, collapse = "\n"))  # Print the output
    }
                    
    
    # Set a timeout limit for downloads
    options(timeout = 300)
    
    # Download the scATAC-seq workshop dataset as a ZIP file
    download.file('https://iauchile-my.sharepoint.com/:u:/g/personal/adolfo_rh_postqyf_uchile_cl/ETPOTjhE9llEkT85F6XQfyQBdN4r9R2Jf4hvY1BicfTWSw?e=tQbDjt&download=1', 
                  'scATACseqWorkshop.zip')
    
    # List the downloaded files with detailed information
    shell_call("ls -lh")
    
    # Unzip the downloaded file
    system("unzip scATACseqWorkshop.zip")
                    

    NOTE: If you have errors you can do this to reanalyze the data

    
    # Define the working directory
    work_dir2 <- "/content/"
    setwd(work_dir2)
    
    # Remove any existing directory with the same name
    shell_call("rm -rf scATACseqWorkshop/")
    
    # Unzip the dataset again
    shell_call("unzip scATACseqWorkshop.zip")
                    

    Define libraries, parameters and directories

    Firstly, we define python library location and load R libraries. After, we define some parameters such as: 1) the number of threads we will use, 2) the working directory and 3) the location of fragment files.


    In fact, ArchR can utilize multiple input formats of scATAC-seq data (the fragment files and the BAM files are the more common scATAC-seq data).

    
    # Suppress package startup messages for cleaner output
    # Load libraries
    suppressPackageStartupMessages({
      library(ArchR)
      library(reticulate)
      library(clustree)
      library(Seurat)
      library(hexbin)
    })
                    
    
    # Set the Python environment path for Reticulate
    Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python")
    
    # Check the Python configuration
    py_config()
    
    # Test if MACS2 is installed and accessible
    findMacs2()
                    
    
    # Set a random seed for reproducibility
    set.seed(1)
    
    # Define the number of threads to use
    nb.threads = 2
    addArchRThreads(threads = nb.threads)
    
    # Set the working directory
    work_dir <- "/content/scATACseqWorkshop"
    setwd(work_dir)
    
    # List and name the input fragment files
    inputFiles <- list.files(file.path(work_dir, "fragments_data"), full.names = TRUE)
    names(inputFiles) <- gsub("^.+/", "", gsub("GSM[0-9]+_", "", gsub(".fragments.tsv.gz", "", inputFiles)))
    
    # Specify the reference genome for ArchR
    addArchRGenome("hg19")
                    

    Create Arrow file

    We create an HDF5-format Arrow file that stores all of the data associated with a sample (now and during all the analysis process). It will be updated with the additional layers of information.


    If we analyse some samples, an Arrow file will be generated for each sample.


    It's not a R language object and because of this, we will generate an ArchRProject object to associate the Arrow file(s) into a single analytical framework that will be rapidly accessible in R.


    During this step, ArchR computes a TileMatrix containing insertion counts across genome-wide 500-bp bins (default value) and a GeneScoreMatrix that stores predicted gene expression based on weighting insertion counts in tiles nearby a gene promoter.


    
    ArrowFiles <- createArrowFiles(
      inputFiles = inputFiles,   # Input files containing scATAC-seq data
      sampleNames = names(inputFiles),  # Assign sample names based on input file names
      minTSS = 0.1,   # Minimum TSS enrichment score to filter low-quality cells
      minFrags = 1,   # Minimum number of unique fragments per cell
      addTileMat = TRUE,  # Compute and store the tile matrix for accessibility analysis
      addGeneScoreMat = TRUE  # Compute and store the gene score matrix for gene activity analysis
    )
    
    # Create an ArchR project using the Arrow files
    project <- ArchRProject(
      ArrowFiles = ArrowFiles,  # Use the generated Arrow files
      outputDirectory = "Analysis_scATACseq_noFilter",  # Define output directory for the project
      copyArrows = TRUE  # Recommended to maintain an unaltered copy of Arrow files for future use
    )
                    

    Strict quality control (QC) of scATAC-seq data is essential to remove the contribution of low-quality cells.


    ArchR considers three characteristics of data:

  • The fragment size (DNA fragments cut by Tn5 transposases) distribution. Due to nucleosomal periodicity, we expect to see depletion of fragments that are the length of DNA wrapped around a nucleosome (approximately 147 bp).
  • The Transcription Start Site (TSS) enrichment (signal-to-background ratio). Low signal-to-background ratio is often attributed to dead or dying cells which have de-chromatized DNA which allows for random transposition genome-wide.
  • The number of unique nuclear fragments (i.e. not mapping to mitochondrial DNA).

  • We can appreciate the QC and the main metrics of samples using some plots: Plot QC metrics:

    
    # Extract TSS enrichment and fragment count data from the ArchR project
    df <- getCellColData(project, select = c("log10(nFrags)", "TSSEnrichment"))
    
    # Create a scatter plot of TSS Enrichment vs. Log10(Unique Fragments)
    plot.tss.frags <- ggPoint(
      x = df[,1], y = df[,2],  # Set x-axis as log10(Unique Fragments) and y-axis as TSS Enrichment
      colorDensity = TRUE,  # Color points based on density
      continuousSet = "sambaNight",  # Define color theme
      xlabel = "Log10 Unique Fragments", ylabel = "TSS Enrichment",  # Label axes
      xlim = c(0, quantile(df[,1], probs = 1) + 0.1),  # Set x-axis limits
      ylim = c(0, quantile(df[,2], probs = 1) + 0.1)   # Set y-axis limits
    )
    
    # Save the plot as a PDF in the project's "Plots" directory
    plotPDF(plot.tss.frags, name = "TSS-vs-Frags.pdf", ArchRProj = project, addDOC = FALSE)
    
    # Display the plot
    plot.tss.frags
                    

    Plot TSS metrics:

    
    # Create a ridge plot showing TSS enrichment distribution across samples
    plot.tss.v1 <- plotGroups(
      ArchRProj = project,
      groupBy = "Sample",  # Group by sample
      colorBy = "cellColData",  # Color based on cell metadata
      name = "TSSEnrichment",  # Use TSS Enrichment as the feature to plot
      plotAs = "ridges"  # Plot as ridge plot
    )
    
    # Create a violin plot with an overlaid box plot
    plot.tss.v2 <- plotGroups(
      ArchRProj = project,
      groupBy = "Sample",
      colorBy = "cellColData",
      name = "TSSEnrichment",
      plotAs = "violin",  # Plot as a violin plot
      alpha = 0.4,  # Set transparency level
      addBoxPlot = TRUE  # Overlay a box plot on top of the violin plot
    )
    
    # Display both plots side by side
    plot.tss.v1 | plot.tss.v2
                    

    Plot fragment metrics:

    
    # Create a ridge plot showing log10(Unique Fragments) distribution across samples
    plot.frags.v1 <- plotGroups(
      ArchRProj = project,
      groupBy = "Sample",
      colorBy = "cellColData",
      name = "log10(nFrags)",
      plotAs = "ridges"
    )
    
    # Create a violin plot with an overlaid box plot
    plot.frags.v2 <- plotGroups(
      ArchRProj = project,
      groupBy = "Sample",
      colorBy = "cellColData",
      name = "log10(nFrags)",
      plotAs = "violin",
      alpha = 0.4,
      addBoxPlot = TRUE
    )
    
    # Display both plots side by side
    plot.frags.v1 | plot.frags.v2
                    
    
    # Create Arrow Files with Quality Filters
    ArrowFiles <- createArrowFiles(
      inputFiles = inputFiles,  # List of fragment files for each sample
      sampleNames = names(inputFiles),  # Assign sample names based on file names
      minTSS = 4,  # Minimum Transcription Start Site (TSS) enrichment score to retain a cell
      minFrags = 1000,  # Minimum number of unique fragments per cell
      addTileMat = TRUE,  # Create a tile matrix for peak calling and other analyses
      addGeneScoreMat = TRUE  # Compute gene activity scores
    )
                    

    Doublet detection

    A source of trouble in single-cell data is the contribution of "doublets" to the analysis (a doublet refers to a single droplet that received more than one nucleus). To predict which “cells” are actually doublets, ArchR synthesizes in silico doublets from the data by mixing the reads from thousands of combinations of individual cells. It projects these synthetic doublets into the UMAP embedding and identify their nearest neighbor. By iterating this procedure thousands of times, it can identify “cells” in the data whose signal looks very similar to synthetic doublets. Here, we identify the doublets.

    
    # Compute Doublet Scores
    doubletScores <- addDoubletScores(
      input = ArrowFiles,  # Use Arrow files created in the previous step
      k = 10,  # Number of nearest neighbors to consider for doublet detection
      knnMethod = "UMAP",  # Use UMAP embedding for nearest neighbor search
      LSIMethod = 1  # Use Latent Semantic Indexing (LSI) method 1 for doublet estimation
    )
                    

    Creation of ArchR project

    As explain before, we generate a ArchR project to easily manipulate the scATAC-seq generated by ArchR.

    
    # Create an ArchR Project
    project <- ArchRProject(
      ArrowFiles = ArrowFiles,  # Use the Arrow files created earlier
      outputDirectory = "Analysis_scATACseq",  # Define output directory for the project
      copyArrows = TRUE  # Keep a copy of the Arrow files for future reference
    )
                    

    We can easily list the matrix items present in the project

    
    # List Available Matrices in the Project
    getAvailableMatrices(project)  # Check what matrices (e.g., Gene Score Matrix) are available
                    

    Strict quality control (QC) of scATAC-seq data is essential to remove the contribution of low-quality cells.


    ArchR consider three characteristics of data:

  • The fragment size distribution. Due to nucleosomal periodicity, we expect to see depletion of fragments that are the length of DNA wrapped around a nucleosome (approximately 147 bp).
  • The TSS enrichment (signal-to-background ratio). Low signal-to-background ratio is often attributed to dead or dying cells which have de-chromatized DNA which allows for random transposition genome-wide.
  • The number of unique nuclear fragments (i.e. not mapping to mitochondrial DNA). We can appreciate the QC and the main metrics of samples using some plots:

  • Plot QC metrics:

    
    # Plot TSS Enrichment vs. Fragment Counts
    df <- getCellColData(project, select = c("log10(nFrags)", "TSSEnrichment"))  # Extract metadata
    
    plot.tss.frags <- ggPoint(
      x = df[,1],  # Log10 number of unique fragments
      y = df[,2],  # TSS Enrichment score
      colorDensity = TRUE,  # Color by density
      continuousSet = "sambaNight",  # Use the "sambaNight" color scheme
      xlabel = "Log10 Unique Fragments",  # Label for X-axis
      ylabel = "TSS Enrichment",  # Label for Y-axis
      xlim = c(log10(450), quantile(df[,1], probs = 1) + 0.1),  # Set X-axis limits
      ylim = c(0, quantile(df[,2], probs = 1) + 0.1)  # Set Y-axis limits
    ) +
      geom_hline(yintercept = 4, lty = "dashed", col = "black") +  # Add horizontal line at TSS = 4
      geom_vline(xintercept = log10(1000), lty = "dashed", col = "black")  # Add vertical line at 1000 fragments
    
    # Save the plot as a PDF inside the project directory
    plotPDF(plot.tss.frags, name = "TSS-vs-Frags.pdf", ArchRProj = project, addDOC = FALSE)
    
    # Display the plot
    plot.tss.frags
                    

    Plot TSS metrics:

    
    # Plot TSS Enrichment Distributions
    
    # Ridge plot of TSS enrichment per sample
    plot.tss.v1 <- plotGroups(
      ArchRProj = project, 
      groupBy = "Sample",  # Group by sample
      colorBy = "cellColData",  # Use cell metadata for color
      name = "TSSEnrichment",  # Plot TSS enrichment scores
      plotAs = "ridges"  # Use a ridge plot
    )
    
    # Violin plot of TSS enrichment per sample with an overlaid boxplot
    plot.tss.v2 <- plotGroups(
      ArchRProj = project, 
      groupBy = "Sample",  # Group by sample
      colorBy = "cellColData",  # Use cell metadata for color
      name = "TSSEnrichment",  # Plot TSS enrichment scores
      plotAs = "violin",  # Use a violin plot
      alpha = 0.4,  # Set transparency
      addBoxPlot = TRUE  # Add a boxplot overlay
    )
    
    # Display the ridge and violin plots side by side
    plot.tss.v1 | plot.tss.v2
                    

    Plot fragment metrics:

    
    # Plot Fragment Count Distributions
    
    # Ridge plot of log10 fragment counts per sample
    plot.frags.v1 <- plotGroups(
      ArchRProj = project, 
      groupBy = "Sample",  # Group by sample
      colorBy = "cellColData",  # Use cell metadata for color
      name = "log10(nFrags)",  # Plot log10 fragment counts
      plotAs = "ridges"  # Use a ridge plot
    )
    
    # Violin plot of log10 fragment counts per sample with an overlaid boxplot
    plot.frags.v2 <- plotGroups(
      ArchRProj = project, 
      groupBy = "Sample",  # Group by sample
      colorBy = "cellColData",  # Use cell metadata for color
      name = "log10(nFrags)",  # Plot log10 fragment counts
      plotAs = "violin",  # Use a violin plot
      alpha = 0.4,  # Set transparency
      addBoxPlot = TRUE  # Add a boxplot overlay
    )
    
    # Display the ridge and violin plots side by side
    plot.frags.v1 | plot.frags.v2
                    

    Filter the doublets

    
    # Filter Out Doublets from the Dataset
    project <- filterDoublets(ArchRProj = project)  # Remove detected doublets to keep only single cells
                    

    Sample Fragment Size Distribution and TSS Enrichment Profiles

    
    # Plot the Transcription Start Site (TSS) enrichment profile
    plot.tss.v3 <- plotTSSEnrichment(ArchRProj = project)
    
    # Plot the fragment size distribution
    plot.frags.v3 <- plotFragmentSizes(ArchRProj = project)
    
    # Display both plots side by side
    plot.frags.v3 | plot.tss.v3
                    

    Normalization, dimensional reduction, batch effect correction, clustering and others step

    Normalization and dimensional reduction

    scATAC-seq generates a sparse insertion counts matrix (500-bp tiles; binary data of ~6 million of features) making it impossible to identify variable peaks for standard dimensionality reduction. To get around this issue, ArchR use LSI (Latent Semantic Indexing), a layered dimensionality reduction approach for sparse and noisy data.


    Rather than identifying the most variable peaks, ArchR tries using the most accessible features as input to LSI.


    However, when running multiple samples the results could shown high degrees of noise and low reproducibility.


    To remedy this, ArchR introduced the “iterative LSI” approach (Satpathy, Granja et al., 2019), which computes an initial LSI transformation on the most accessible tiles and identifies lower resolution clusters that are not batch confounded.

  • This approach computes an initial LSI transformation on the most accessible tiles and identifies lower resolution clusters that are not batch confounded.
  • ArchR computes the average accessibility for each of these clusters across all features. ArchR then identifies the most variable peaks across these clusters and uses these features for LSI again.
  • In this second iteration, the most variable peaks are more similar to the variable genes used in scRNA-seq LSI implementations.

  • This approach minimizes observed batch effects and allow dimensionality reduction operations on a more reasonably sized feature matrix.

    
    project_Normalized <- addIterativeLSI(ArchRProj = project, iterations = 2,
                                          # Number of iterations for LSI; more iterations refine clustering
                                          #sampleCellsPre = 50000, # Optional: Number of cells to use for iterations before the final one
                                          #clusterParams = list(resolution = 0.1, sampleCells = 50000, maxClusters = 6, n.start = 10), 
                                          # Cluster parameters can be adjusted to optimize clustering
                                          useMatrix = "TileMatrix", # Use TileMatrix for LSI
                                          name = "IterativeLSI", # Name of the reduced dimensions
                                          varFeatures = 25000) # Number of variable features to use for LSI
                    

    Batch effect correction


    Sometimes the iterative LSI approach isn’t enough to correct strong batch effect differences. For this reason, ArchR implements a commonly used batch effect correction tool called Harmony (Korsunsky et al., 2019) which was originally designed for scRNA-seq.

    
    # Perform batch correction using Harmony on the reduced dimensions from LSI
    project_Normalized <- addHarmony(ArchRProj = project_Normalized, reducedDims = "IterativeLSI",
                                     name = "Harmony", groupBy = "Sample")
                    

    UMAP


    Run UMAP in ArchR.

    
    # Compute UMAP embedding based on Iterative LSI dimensions
    project_Normalized <- addUMAP(ArchRProj = project_Normalized, reducedDims = "IterativeLSI", name = "UMAP")
    
    # Plot UMAP colored by sample identity
    plotEmbedding(ArchRProj = project_Normalized, colorBy = "cellColData", name = "Sample", embedding = "UMAP", size=0.1)
    
    # Compute UMAP embedding based on Harmony dimensions (after batch correction)
    project_Normalized <- addUMAP(ArchRProj = project_Normalized, reducedDims = "Harmony", name = "UMAP", force=TRUE)
    
    # Plot UMAP again to visualize batch-corrected embedding
    plotEmbedding(ArchRProj = project_Normalized, colorBy = "cellColData", name = "Sample", embedding = "UMAP", size=0.1)
                    

    Clustering


    To identify clusters, ArchR allows to use same method as Seurat or Scran. We have selected the same method describe the first day and used in Seurat package.

    
    # Iterate over different clustering resolutions from 0.0 to 0.9
    for(i in seq(0,0.9,0.1)){
      project_Normalized <- addClusters(input = project_Normalized, reducedDims = "Harmony",
                                        method = "Seurat", # Clustering method (Seurat-based)
                                        name = paste("Clusters.res",i,sep=""), # Naming clusters dynamically
                                        resolution = i, # Set clustering resolution
                                        verbose = FALSE) # Suppress verbose output
                    

    Save and load a project


    Save

    
    # Save the current state of the ArchR project to disk
    saveArchRProject(ArchRProj = project_Normalized,
                     outputDirectory = file.path(getwd(),"Analysis_scATACseq"))
                    

    Load

    
    # Load the saved ArchR project
    project_Normalized <- loadArchRProject(path = file.path(getwd(),"Analysis_scATACseq"),
                                           force = TRUE, showLogo = FALSE)
                    

    Exploration of data using Gene estimation

    Visualization of clustering using Clustree


    Clustree is an useful tool to explore the links between the clusters from different resolutions.

    
    # Extract clustering information from ArchR project
    tmp.clustree.datatable <- as.data.frame(project_Normalized@cellColData)
    
    # Plot a clustering tree to visualize clustering relationships across resolutions
    clustree(tmp.clustree.datatable, prefix="Clusters.res")
                    

    Visualization of clustering on UMAP


    
    # Iterate over different clustering resolutions and visualize UMAP embeddings
    for(i in seq(0,0.9,0.1)){
      # Plot UMAP with cluster labels
      plot.umap.resi <- plotEmbedding(ArchRProj = project_Normalized, 
                                      colorBy = "cellColData", 
                                      name = paste("Clusters.res",i,sep=""), 
                                      embedding = "UMAP", size=0.1)
      
      # Plot UMAP without cluster labels
      plot.umap.woLabel.resi <- plotEmbedding(ArchRProj = project_Normalized, 
                                              colorBy = "cellColData", 
                                              name = paste("Clusters.res",i,sep=""), 
                                              embedding = "UMAP", size=0.1, 
                                              labelMeans=FALSE)
    
      # Display both plots side by side
      print(plot.umap.resi | plot.umap.woLabel.resi)
    }
                    

    Characterization of clusters


    At this step, we select a specific resolution to explore in details the GeneScore to characterize the different clusters. For that, we will identify marker genes (based on gene scores, or estimation of gene expression) of clusters. In short, ArchR estimates gene scores using the local accessibility of the gene region that includes the promoter and gene body, but imposes an exponential weight that accounts for the activity of putative distal regulatory elements as a function of distance.


    Remarks: ArchR can used gene, peak or transcription factor motif features. Here, ArchR identify the genes that appear to be uniquely active in each cluster at the resolution 0.4.

    
    slct.res = "res0.7" # Select resolution for analysis
    
    # Identify marker genes using Gene Score Matrix
    markersGS.slctRes <- getMarkerFeatures(ArchRProj = project_Normalized,
                                           useMatrix = "GeneScoreMatrix",
                                           groupBy = paste("Clusters.",slct.res,sep=""),
                                           bias = c("TSSEnrichment", "log10(nFrags)"),
                                           testMethod = "wilcoxon") # Perform Wilcoxon test
    
    # Extract marker genes with FDR ≤ 0.05 and Log2 Fold Change ≥ 0.2
    markerList <- getMarkers(markersGS.slctRes, cutOff = "FDR <= 0.05 & Log2FC >= 0.2")
    
    # Display marker genes for the first cluster
    i = names(markerList)[1]
    markerList[[i]]
    
    # Save marker genes for each cluster
    for(i in names(markerList)){
      write.table(markerList[[i]], sep="\t", row.names=FALSE, col.names=TRUE, quote=FALSE,
                  file=file.path(work_dir, paste(i, ".res", slct.res, ".mGenesList.tsv", sep="")))
    }
                    

    To visualize the marker genes, we can produce a heatmap:

    
    # Define key marker genes
    markerGenes <- c("EPCAM", "VIM", "FLT4", "THY1", "CD3D", "PECAM1", "CD38", "PAX5",
                     "MS4A1", "CD14", "ITGAX", "CD4", "CD8A", "GZMA")
    
    # Generate heatmap of gene scores
    heatmapGS <- plotMarkerHeatmap(seMarker = markersGS.slctRes,
                                   cutOff = "FDR <= 0.05 & Log2FC >= 1",
                                   labelMarkers = markerGenes,
                                   transpose = FALSE)
    
    # Display heatmap
    heatmapGS
    
    # Retrieve heatmap matrix
    heatmapGSmatrix <- plotMarkerHeatmap(seMarker = markersGS.slctRes,
                                         cutOff = "FDR <= 0.05 & Log2FC >= 1",
                                         labelMarkers = markerGenes,
                                         returnMatrix = TRUE,
                                         transpose = FALSE)
    
    # Display first 10 rows of heatmap matrix
    head(heatmapGSmatrix, 10)
    
    # Save heatmap matrix
    write.table(cbind(Cluster=rownames(heatmapGSmatrix), heatmapGSmatrix), sep="\t",
                row.names=FALSE, col.names=TRUE, quote=FALSE,
                file=file.path(work_dir, paste("GeneScores-Marker-Heatmap", slct.res, sep=".")))
                    

    Or visualize GeneScore of marker genes in UMAP

    
    # Plot Gene Score UMAP without MAGIC imputation
    plot.GS.woMAGIC <- plotEmbedding(ArchRProj = project_Normalized, 
                                     colorBy = "GeneScoreMatrix", 
                                     name = markerGenes, embedding = "UMAP", 
                                     quantCut = c(0.01, 0.95), 
                                     imputeWeights = NULL)
    
    # Display selected genes
    plot.GS.woMAGIC$VIM | plot.GS.woMAGIC$EPCAM
                    

    However, scATAC-seq data is really sparse. Due to that, it is highly suggest to use MAGIC (van Dijk, et al., 2018), which add an imputation weight to the gene scores by smoothing signal across nearby cells.

    
    # Apply MAGIC for gene imputation
    project_Normalized <- addImputeWeights(project_Normalized)
    
    # Plot Gene Score UMAP with imputation
    plot.GS <- plotEmbedding(ArchRProj = project_Normalized, colorBy = "GeneScoreMatrix",
                             name = markerGenes,
                             embedding = "UMAP",
                             imputeWeights = getImputeWeights(project_Normalized))
    
    plot.GS$VIM | plot.GS$EPCAM
    plot.GS$FLT4 | plot.GS$THY1
    plot.GS$ITGAX | plot.GS$CD14
    plot.GS$MS4A1 | plot.GS$CD38
    plot.GS$CD3D | plot.GS$CD8A
    plot.GS$CD4 | plot.GS$GZMA
                    

    scATAC-scRNAseq integration

    ArchR enables integration with scRNA-seq, offers the possibility to use clusters called in scRNA-seq space or use the gene expression measurements after integration.


    The way this integration works is by directly aligning cells from scATAC-seq with cells from scRNA-seq by comparing the scATAC-seq gene score matrix with the scRNA-seq gene expression matrix. This alignment is performed using the FindTransferAnchors() function from the Seurat package which allows you to align data across two datasets.


    However, to appropriately scale this procedure for hundreds of thousands of cells ArchR provides, a parallelization of this procedure by dividing the total cells into smaller groups of cells and performing separate alignments.

    
    # Import scRNAseq data
    scRNA<-readRDS(file.path(work_dir,"scRNAseq.data.rds"))
    DefaultAssay(object = scRNA) <- "RNA"
    
    # Integrate scRNA-seq and scATAC-seq data
    project_Normalized <- addGeneIntegrationMatrix(ArchRProj = project_Normalized,
        useMatrix = "GeneScoreMatrix", matrixName = "GeneIntegrationMatrix",
        reducedDims = "Harmony", #Harmony , IterativeLSI
        seRNA = scRNA, addToArrow = TRUE, force= TRUE,
        groupRNA = "integrated_snn_res.0.5",
        nameCell = "predictedCell", nameGroup = "predictedGroup", nameScore = "predictedScore",
        sampleCellsATAC = 10000, sampleCellsRNA = 10000, scaleTo = 10000)
    project_Normalized <- addImputeWeights(project_Normalized)
    
    saveArchRProject(ArchRProj = project_Normalized, load = FALSE)
    
    plot_rna.woLabel <- plotEmbedding(project_Normalized,colorBy = "cellColData",name = "predictedGroup", embedding = "UMAP", size=1, labelMeans=FALSE)
    plot_rna <- plotEmbedding(project_Normalized,colorBy = "cellColData",name = "predictedGroup", embedding = "UMAP", size=1)
    
    # Cross table between scRNA-seq and scATAC-seq data
    cM <- as.matrix(confusionMatrix(project$Clusters.res0.7,                # Warning resolution
                                    project$predictedGroup))                # Warning resolution
    
                    

    Peak calling

    Calling peaks is one of the most fundamental processes in ATAC-seq data analysis. Because per-cell scATAC-seq data is essentially binary (accessible or not accessible), we perform calling peaks on groups of similar cells (or clusters) define previously.


    ArchR applies a Iterative Overlap Peak Merging Procedure with the recommanded MACS2 peak caller (Zhang et al., 2008).


    It uses a function to perform this iterative overlap peak merging procedure:

    • ArchR would call peaks for each pseudo-bulk replicate individually.
    • ArchR would analyze all of the pseudo-bulk replicates from a single cell type together, performing the first iteration of iterative overlap removal.
    • After the first iteration of iterative overlap removal, ArchR checks to see the reproducibility of each peak across pseudo-bulk replicates and only keeps peaks that pass a threshold indicated by the reproducibility parameter.
    • At the end of this process, we would have a single merged peak set for each cell types.

    Creation of the pseudo-bulk replicates


    
    # Determine the number of replicates to use for coverage calculation
    nbReplicates = ifelse(length(names(table(project_Normalized$Sample))) > 5, 
                          length(names(table(project_Normalized$Sample))), 5)
    
    # Add group coverage information to the ArchR project
    project_Peaks <- addGroupCoverages(
        ArchRProj = project_Normalized, 
        maxReplicates = nbReplicates, 
        groupBy = paste("Clusters", slct.res, sep = ".")
    )
                    

    Perform peak calling

    
    # Find the path to MACS2
    pathToMacs2 <- findMacs2()
    
    # Perform peak calling using MACS2
    project_Peaks <- addReproduciblePeakSet(
        ArchRProj = project_Peaks,
        groupBy = paste("Clusters", slct.res, sep = "."),
        pathToMacs2 = pathToMacs2
    )
    
    # Alternative peak calling method (use if MACS2 is unavailable)
    # project_Peaks <- addReproduciblePeakSet(
    #     ArchRProj = project_Peaks,
    #     groupBy = paste("Clusters", slct.res, sep = "."),
    #     peakMethod = "Tiles",
    #     method = "p"
    # )
    
    # Retrieve the peak set after calling peaks
    getPeakSet(project_Peaks)
    
    # Add imputation weights to improve downstream analyses
    project_Peaks <- addImputeWeights(project_Peaks)
    
    # Save the ArchR project with peak calling results
    saveArchRProject(
        ArchRProj = project_Peaks,
        outputDirectory = file.path(getwd(), "Analysis_scATACseq"), 
        load = TRUE
    )
                    

    Identification of marker peaks

    As explained before for the marker genes, ArchR can used gene, peak or transcription factor motif features. Here, ArchR identify the peaks that appear to be uniquely active in each cluster at selected resolution.

    
    # Generate a Peak Matrix for accessibility quantification
    project_Peaks <- addPeakMatrix(project_Peaks)
    
    # Identify marker peaks for each cluster using Wilcoxon test
    markersPeaks <- getMarkerFeatures(
        ArchRProj = project_Peaks,
        useMatrix = "PeakMatrix",
        groupBy = paste("Clusters", slct.res, sep = "."),
        bias = c("TSSEnrichment", "log10(nFrags)"),
        testMethod = "wilcoxon"
    )
    
    # Extract significantly different peaks (FDR <= 0.01, Log2FC >= 1)
    markerList <- getMarkers(markersPeaks, cutOff = "FDR <= 0.01 & Log2FC >= 1")
    
    # View marker peaks for cluster C9
    markerList[["C9"]]
                    

    To visualize the marker genes, we can produce a heatmap:

    
    # Generate a heatmap of marker peaks (less stringent threshold)
    heatmapPeaks <- plotMarkerHeatmap(
        seMarker = markersPeaks,
        cutOff = "FDR <= 0.1 & Log2FC >= 0.5",
        transpose = FALSE
    )
    
    # Display the heatmap
    heatmapPeaks
                    

    Or MA and volcano plots of marker peaks by cluster:

    
    # Generate MA and Volcano plots for cluster C9
    map <- plotMarkers(
        seMarker = markersPeaks, 
        name = "C9",
        cutOff = "FDR <= 0.1 & Log2FC >= 1",  # Default
        plotAs = "MA"
    )
    
    vp <- plotMarkers(
        seMarker = markersPeaks, 
        name = "C9",
        cutOff = "FDR <= 0.1 & Log2FC >= 1",  # Default
        plotAs = "Volcano"
    )
    
    # Combine both plots
    map | vp
                    

    Or visualize the marker peaks on a browser track:

    
    # Generate a browser track visualization for CD4
    plot.track1 <- plotBrowserTrack(
        ArchRProj = project_Peaks,
        groupBy = paste("Clusters", slct.res, sep = "."),
        geneSymbol = c("CD4"),
        features = getMarkers(markersPeaks, cutOff = "FDR <= 0.1 & Log2FC >= 1", returnGR = TRUE)["C9"],
        upstream = 50000, downstream = 50000
    )
    
    # Display the track
    grid::grid.newpage()
    grid::grid.draw(plot.track1$CD8A)
                    
    
    # Save the object and download it!!
    saveRDS(project_Peaks,"project_Peaks.rds")
    saveRDS(markersPeaks,"markersPeaks.rds")
                    

    Motif Enrichment

    After identified peak sets, the next step is to predict what transcription factors (TFs) may be mediating the binding events that create those accessible chromatin sites.


    ArchR allow to annotate the TF motifs that are enriched in peaks that are up or down in the different cell types.


    Firstly, we add motif annotations to our ArchRProject based on a reference database (for example: CIS-BP, JASPAR, Encode or Homer).


    Here, we have selected CIS-BP which contains > 300 TF families from > 700 species collecting from > 70 sources , including other databases: Transfac, JASPAR, Hocomoco, FactorBook, UniProbe, among others.


    Next, we test the set of significantly differential peaks for motif enrichment.

    
    # Download and reload data if needed
    shell_call("gdown 1SdSmF9R3yHNWacmFf22RxpcrrKmUHM_b")
    markersPeaks = readRDS("markersPeaks.rds")
    
    shell_call("gdown 1S6fRM7_KX4kjd9ankvzA5HloJJSIM4bn")
    project_Peaks = readRDS("project_Peaks.rds")
                    
    
    ## Motif Enrichment
    project_Peaks <- addMotifAnnotations(ArchRProj = project_Peaks, motifSet = "cisbp", name = "Motif", force = TRUE)
    
    # Motif enrichment in marker peaks
    enrichMotifs <- peakAnnoEnrichment(
        seMarker = markersPeaks, ArchRProj = project_Peaks,
        peakAnnotation = "Motif",cutOff = "FDR <= 0.1 & Log2FC >= 0.5")        # Default
    
    # You have two output, the Enrichment matrix and the pvalue matrix:
    head(enrichMotifs@assays@data$Enrichment,10)
                    

    We can display a heatmap to visualize the main motifs of each clusters.

    
    # Plot a heatmap of enriched motifs
    heatmapEM <- plotEnrichHeatmap(enrichMotifs, n = 7, transpose = TRUE)
    ComplexHeatmap::draw(heatmapEM, heatmap_legend_side = "bot", annotation_legend_side = "bot")
                    

    ChromVAR and visualization of motif deviation

    ChromVAR, developed by Greenleaf lab, is designed for predicting enrichment of TF activity on a per-cell basis from sparse chromatin accessibility data. ChromVAR computes:

  • A “deviation”, which is a bias-corrected measurement of how far the per-cell accessibility of a given feature (i.e motif) deviates from the expected accessibility based on the average of all cells or samples.
  • A “z-score” / a “deviation score”, which is the z-score for each bias-corrected deviation across all cells.
  • 
    # Add background peaks
    project_Peaks <- addBgdPeaks(project_Peaks)
    
    # Compute deviations matrix
    project_Peaks <- addDeviationsMatrix(ArchRProj = project_Peaks, peakAnnotation = "Motif",force = TRUE)
    
    # Plot variability in motif accessibility
    getVarDeviations(project_Peaks, name = "MotifMatrix", plot = TRUE)
    
    # Save project
    saveArchRProject(ArchRProj = project_Normalized,
                     outputDirectory = file.path(getwd(),"Analysis_scATACseq"))
                    

    We can display a distribution of markers

    
    # Define a list of motifs to analyze
    motifs <- c("FOS", "JUNB")
    
    # Retrieve motif features from the MotifMatrix that match the selected motifs
    markerMotifs <- getFeatures(
      project_Peaks,
      select = paste(motifs, "_", collapse = "|", sep = ""),
      useMatrix = "MotifMatrix"
    )
    
    # Filter motif features to include only those with the "z:" prefix
    markerMotifs <- grep("z:", markerMotifs, value = TRUE)
    
    # Add imputation weights to the ArchR project for smoothing data visualization
    project_Peaks <- addImputeWeights(project_Peaks)
    
    # Generate grouped motif enrichment plots
    cowp <- plotGroups(
      ArchRProj = project_Peaks,
      groupBy = paste("Clusters", slct.res, sep = "."),
      colorBy = "MotifMatrix",
      name = markerMotifs,
      imputeWeights = getImputeWeights(project_Peaks)
    )
    
    # Arrange plots in a grid layout with two columns
    do.call(cowplot::plot_grid, c(list(ncol = 2), cowp))
                    

    Or visualize the motif deviation in UMAP (and see if motif deviation correlates with TF gene score)

    
    # Plot motif enrichment on UMAP embedding
    motif.umap <- plotEmbedding(
      ArchRProj = project_Peaks,
      colorBy = "MotifMatrix",
      name = sort(markerMotifs),
      embedding = "UMAP",
      imputeWeights = getImputeWeights(project_Peaks)
    )
    
    # Display motif UMAP plots in a grid layout
    do.call(cowplot::plot_grid, c(list(ncol = 2), motif.umap))
    
    # Retrieve gene activity features related to the selected motifs
    markerRNA <- getFeatures(
      project_Peaks,
      select = paste(motifs, "$", collapse = "|", sep = ""),
      useMatrix = "GeneScoreMatrix"
    )
    
    # Plot gene score matrix enrichment on UMAP embedding
    gene.umap <- plotEmbedding(
      ArchRProj = project_Peaks,
      colorBy = "GeneScoreMatrix",
      name = sort(markerRNA),
      embedding = "UMAP",
      imputeWeights = getImputeWeights(project_Peaks)
    )
    
    # Display gene score UMAP plots in a grid layout
    do.call(cowplot::plot_grid, c(list(ncol = 2), gene.umap))
                    

    Pairwise test between clusters

    We can identify the motif enrichment between two clusters (based on differential accessibility of peaks between these two clusters).

    
    slct.Cl1="C9"
    slct.Cl2="C11"
    
    # Perform differential analysis between C9 and C11
    markerTest <- getMarkerFeatures(ArchRProj = project_Peaks,
                                    useMatrix = "PeakMatrix",
                                    groupBy = paste("Clusters",slct.res,sep="."),
                                    testMethod = "wilcoxon",
                                    bias = c("TSSEnrichment", "log10(nFrags)"),
                                    useGroups = slct.Cl1, bgdGroups = slct.Cl2)
    
    # Generate MA and Volcano plots
    map.Cl1vCl2 <- markerPlot(seMarker = markerTest, name = slct.Cl1,
                            cutOff = "FDR <= 0.1 & abs(Log2FC) >= 1",
                            plotAs = "MA")
    vp.Cl1vCl2 <- markerPlot(seMarker = markerTest, name = slct.Cl1,
                           cutOff = "FDR <= 0.1 & abs(Log2FC) >= 1",
                           plotAs = "Volcano")
    
    map.Cl1vCl2 | vp.Cl1vCl2
    
                    

    Motif Up-Enrich and motif Down-enrich (based on the pairwise test between clusters)

    
    # Identify significantly enriched motifs in peaks with increased accessibility
    motifsUp <- peakAnnoEnrichment(ArchRProj = project_Peaks,
                                   seMarker = markerTest,
                                   peakAnnotation = "Motif",
                                   cutOff = "FDR <= 0.1 & Log2FC >= 0.5") # Select motifs with significant FDR and Log2FC >= 0.5
    
    # Create a data frame with motif names and -log10 adjusted p-values
    df <- data.frame(TF = rownames(motifsUp), mlog10Padj = assay(motifsUp)[,1])
    df <- df[order(df$mlog10Padj, decreasing = TRUE),] # Sort by significance
    df$rank <- seq_len(nrow(df)) # Assign rank based on significance
    
    # Plot enriched TF motifs with labels
    ggUp <- ggplot(df, aes(rank, mlog10Padj, color = mlog10Padj)) +
        geom_point(size = 1) + ggrepel::geom_label_repel(
            data = df[rev(seq_len(30)), ], aes(x = rank, y = mlog10Padj, label = TF),
            size = 1.5, nudge_x = 2, color = "black") + theme_ArchR() +
            ylab("-log10(P-adj) Motif Enrichment") + xlab("Rank Sorted TFs Enriched") +
            scale_color_gradientn(colors = paletteContinuous(set = "comet"))
    
    # Identify significantly enriched motifs in peaks with decreased accessibility
    motifsDo <- peakAnnoEnrichment(ArchRProj = project_Peaks,
                                   seMarker = markerTest,
                                   peakAnnotation = "Motif",
                                   cutOff = "FDR <= 0.1 & Log2FC <= -0.5") # Select motifs with Log2FC <= -0.5
    
    df <- data.frame(TF = rownames(motifsDo), mlog10Padj = assay(motifsDo)[,1])
    df <- df[order(df$mlog10Padj, decreasing = TRUE),] # Sort by significance
    df$rank <- seq_len(nrow(df)) # Assign rank
    
    # Plot TF motifs that are lost in accessibility
    ggDo <- ggplot(df, aes(rank, mlog10Padj, color = mlog10Padj)) +
        geom_point(size = 1) + ggrepel::geom_label_repel(
            data = df[rev(seq_len(30)), ], aes(x = rank, y = mlog10Padj, label = TF),
            size = 1.5, nudge_x = 2, color = "black") + theme_ArchR() +
            ylab("-log10(FDR) Motif Enrichment") + xlab("Rank Sorted TFs Enriched") +
            scale_color_gradientn(colors = paletteContinuous(set = "comet"))
    
    # Combine both plots
    ggUp | ggDo
                    

    Identification of Positive TF-Regulators

    Although ATAC-seq allows unbiased identification of TFs, families of TFs share similar motifs when viewed in aggregate. This makes it difficult to identify specific TFs that may be responsible for the observed changes in chromatin accessibility to their predicted binding sites. To circumvent this problem, ArchR identifies TFs whose gene expression (Gene score) is positively correlated with changes in accessibility of their corresponding motif (Motif deviation obtained using ChromVAR).

    Step 1. Identify Deviant TF Motifs

    
    # Extract motif deviation scores grouped by clusters
    seGroupMotif <- getGroupSE(ArchRProj = project_Peaks, useMatrix = "MotifMatrix", groupBy = paste("Clusters", slct.res, sep="."))
    
    # Extract only Z-score deviations
    seZ <- seGroupMotif[rowData(seGroupMotif)$seqnames == "z",]
    
    # Compute the maximum delta in Z-score across all clusters
    rowData(seZ)$maxDelta <- lapply(seq_len(ncol(seZ)), function(x){
      rowMaxs(assay(seZ) - assay(seZ)[,x])
    }) %>% Reduce("cbind", .) %>% rowMaxs
                    

    Step 2. Identify Correlated TF Motifs and TF Gene Score/Expression

    
    # Compute correlations between gene scores and motif deviations
    corGSM_MM <- correlateMatrices(
        ArchRProj = project_Peaks,
        useMatrix1 = "GeneScoreMatrix",
        useMatrix2 = "MotifMatrix",
        reducedDims = "Harmony" # Can also use IterativeLSI
    )
    
    # Display top correlations
    head(corGSM_MM, 15)
                    

    Step 3. Add Maximum Delta Deviation to the Correlation Data Frame

    
    # Annotate motifs with the max delta observed between clusters
    corGSM_MM$maxDelta <- rowData(seZ)[match(corGSM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
                    

    Step 4. Identify Positive TF Regulators

    
    # Sort by absolute correlation and remove duplicate motifs
    corGSM_MM <- corGSM_MM[order(abs(corGSM_MM$cor), decreasing = TRUE), ]
    corGSM_MM <- corGSM_MM[which(!duplicated(gsub("\\-.*", "", corGSM_MM[,"MotifMatrix_name"]))), ]
    
    # Classify TFs as positive (PLUS), negative (NEG), or neutral (NO)
    corGSM_MM$TFRegulator <- "NO"
    corGSM_MM$TFRegulator[which(corGSM_MM$cor > 0.1 & corGSM_MM$padj < 0.01 & corGSM_MM$maxDelta > quantile(corGSM_MM$maxDelta, 0.75))] <- "PLUS"
    corGSM_MM$TFRegulator[which(corGSM_MM$cor < (-0.1) & corGSM_MM$padj < 0.01 & corGSM_MM$maxDelta > quantile(corGSM_MM$maxDelta, 0.75))] <- "NEG"
    
    # Scatter plot of correlation vs max delta
    ggplot(data.frame(corGSM_MM), aes(cor, maxDelta, color = TFRegulator)) +
      geom_point() +
      theme_ArchR() +
      geom_vline(xintercept = 0, lty = "dashed") +
      scale_color_manual(values = c("NO"="darkgrey", "PLUS"="firebrick3", "NEG"="royalblue1")) +
      xlab("Correlation To Gene Score") +
      ylab("Max TF Motif Delta") +
      scale_y_continuous(
        expand = c(0,0),
        limits = c(0, max(corGSM_MM$maxDelta)*1.05)
      )
    
    # Display top regulators
    head(as.matrix(sort(corGSM_MM[corGSM_MM$TFRegulator=="PLUS", c("GeneScoreMatrix_name", "MotifMatrix_name", "cor", "padj", "maxDelta")])), 15)
    head(as.matrix(sort(corGSM_MM[corGSM_MM$TFRegulator=="NEG", c("GeneScoreMatrix_name", "MotifMatrix_name", "cor", "padj", "maxDelta")])), 5)
                    

    Co-accessibility

    To study how the genes are regulated (promoter and enhancer links) ArchR proposes Co-accessibility analysis. Co-accessibility is a correlation in accessibility between two peaks across many single cells. Said another way, when Peak A is accessible in a single cell, Peak B is often also accessible. For example, co-accessibility allows to visualize the enhancer(s) linked to the gene promoter.


    Remarks:Co-accessibility analysis identify cell type-specific peaks. Although these peaks are often accessible together within a single cell type (and often all not accessible in all other cell types) does not necessarily explain a regulatory relationship between these peaks.

    
    # Add co-accessibility analysis to the ArchR project using Harmony dimensions
    project_Peaks <- addCoAccessibility(ArchRProj = project_Peaks, reducedDims = "Harmony")
    
    # Retrieve co-accessibility interactions with correlation cutoff and resolution
    cA <- getCoAccessibility(
      ArchRProj = project_Peaks,
      corCutOff = 0.5,
      resolution = 10000,
      returnLoops = TRUE
    )
    
    # Display the first 10 co-accessibility interactions
    head(cA$CoAccessibility, 10)
    
    # Generate a genome browser track visualization for selected marker genes
    p <- plotBrowserTrack(
      ArchRProj = project_Peaks,
      groupBy = paste("Clusters", slct.res, sep = "."),
      geneSymbol = markerGenes,
      upstream = 50000,  # Extend 50kb upstream
      downstream = 50000,  # Extend 50kb downstream
      loops = getCoAccessibility(project_Peaks)
    )
    
    # Render the genome browser plot
    grid::grid.newpage()
    grid::grid.draw(p$CD8A)
                    

    Footprinting

    Transcription factor (TF) footprinting allows for the prediction of the precise binding location of a TF at a particular locus. This is because the DNA bases that are directly bound by the TF are actually protected from transposition while the DNA bases immediately adjacent to TF binding are accessible.


    
    # Get motif positions
    motifPositions <- getPositions(project_Peaks)
    
    # Remove 'z:' prefix from motif names
    markerMotifs <- gsub("z:", "", markerMotifs)
    
    # Compute motif footprints
    seFoot <- getFootprints(
      ArchRProj = project_Peaks,
      positions = motifPositions[markerMotifs],
      groupBy = paste("Clusters", slct.res, sep=".")
    )
    
    # Plot footprints with bias correction
    plotFootprints(seFoot = seFoot,
                   ArchRProj = project_Peaks,
                   normMethod = "Subtract", # Options: Divide, None
                   plotName = paste("Footprints-Subtract-Bias", slct.res, "cisbp", sep="."),
                   addDOC = FALSE,
                   smoothWindow = 5)
                    

    Trajectory Analysis

    ArchR proposes to create a cellular trajectory that approximates the differentiation from a cell cluster to an other one. After the definition of the trajectory backbone, which consist of an ordered vector of cell group labels, ArchR identify a pseudo-time value for each cell in the trajectory. In the results, ArchR provides UMAPs to visualize the pseudo-temporal trajectory and heatmaps to track spike/pattern/gene signals as a function of pseudo-temporality.


    Construction of trajectory

    Firstly, ArchR produces a pseudo-time value for each cell in the trajectory, which can be visualize on UMAP and used to display an arrow approximating the trajectory path from the spline-fit.

    
    # Define a trajectory (e.g., C9 to C11)
    trajectory <- c("C9","C11")
    traj.name <- "TF.C9.C11"
    
    # Add trajectory to the project
    project_Peaks <- addTrajectory(
        ArchRProj = project_Peaks,
        name = traj.name,
        groupBy = paste("Clusters", slct.res, sep="."),
        trajectory = trajectory,
        embedding = "UMAP",
        force = TRUE
    )
    
    # Plot trajectory
    plotTraj <- plotTrajectory(project_Peaks, trajectory = traj.name, colorBy = "cellColData", name = traj.name, embedding = "UMAP")
    plotTraj[[1]]
                    

    Observation of specific genes

    It's possible to visualize this trajectory but color the cells by a specific gene score value.

    
    # Plot trajectory of the gene CD4 using the GeneScoreMatrix, visualized in UMAP embedding
    p_gene <- plotTrajectory(project_Peaks, trajectory = traj.name, colorBy = "GeneScoreMatrix", 
                             name = "CD4", continuousSet = "horizonExtra", embedding = "UMAP")
    
    # Display the first two trajectory plots side by side
    p_gene[[1]] | p_gene[[2]]
                    

    Pseudo-time heatmaps

    Finally, ArchR allow to perform heatmap to visualize changes in many features (peaks, gene scores or motifs) across pseudo-time.

    
    # Obtain the gene score trajectory (normalized with log2)
    trajGSM <- getTrajectory(ArchRProj = project_Peaks, name = traj.name, useMatrix = "GeneScoreMatrix", log2Norm = TRUE)
    
    # Plot heatmap for gene score trajectory with "horizonExtra" color palette
    p_trajGSM <- plotTrajectoryHeatmap(trajGSM, pal = paletteContinuous(set = "horizonExtra"))
    
    # Generate heatmap matrix for gene score trajectory
    p_trajGSM.matrix <- plotTrajectoryHeatmap(trajGSM, pal = paletteContinuous(set = "horizonExtra"), returnMatrix = TRUE)
    
    # Obtain the peak accessibility trajectory (normalized with log2)
    trajPM  <- getTrajectory(ArchRProj = project_Peaks, name = traj.name, useMatrix = "PeakMatrix", log2Norm = TRUE)
    
    # Plot heatmap for peak accessibility trajectory with "solarExtra" color palette
    p_trajPM <- plotTrajectoryHeatmap(trajPM, pal = paletteContinuous(set = "solarExtra"))
    
    # Generate heatmap matrix for peak accessibility trajectory
    p_trajPM.matrix <- plotTrajectoryHeatmap(trajPM, pal = paletteContinuous(set = "solarExtra"), returnMatrix = TRUE)
    
    # Obtain the motif activity trajectory (without log2 normalization)
    trajMM  <- getTrajectory(ArchRProj = project_Peaks, name = traj.name, useMatrix = "MotifMatrix", log2Norm = FALSE)
    
    # Plot heatmap for motif activity trajectory with "solarExtra" color palette
    p_trajMM <- plotTrajectoryHeatmap(trajMM, pal = paletteContinuous(set = "solarExtra"))
    
    # Generate heatmap matrix for motif activity trajectory
    p_trajMM.matrix <- plotTrajectoryHeatmap(trajMM, pal = paletteContinuous(set = "solarExtra"), returnMatrix = TRUE)
    
    # Display the heatmaps
    p_trajGSM
    p_trajPM
    p_trajMM
                    

    Integrative pseudo-time analysis

    As shown before, ArchR allows also to perform integrative analysis to identify positive TF using gene scores and motif accessibility across pseudo-time, follow their variability across pseudo-time and understand their role in this trajectory. To do it, ArchR proposes to use the correlateTrajectories() function which takes two SummarizedExperiment objects obtained from the getTrajectories() function which we have obtained before.

  • Step 1. Identify and select the motifs which gene score and TF motif accessibility is correlated:
  • 
    # Compute correlation between gene score trajectory (trajGSM) and motif trajectory (trajMM)
    # Using low stringency criteria: correlation cutoff of 0.2, variance cutoffs of 0.5 for both matrices
    corGSM_MM <- correlateTrajectories(trajGSM, trajMM, 
                                       corCutOff = 0.2, varCutOff1 = 0.5, varCutOff2 = 0.5)
    
    # Filter the gene score and motif trajectories based on correlation results
    flt_trajGSM <- trajGSM[corGSM_MM[[1]]$name1, ]
    flt_trajMM <- trajMM[corGSM_MM[[1]]$name2, ]
                    
  • Step 2. Create a new trajectory two visualize side by side the TF motif based on gene score and TF motif enrichment:
  • 
    # Create a combined trajectory object using the filtered gene score trajectory (flt_trajGSM)
    combinedTraj <- flt_trajGSM
    
    # Normalize and combine the gene score trajectory (flt_trajGSM) and motif trajectory (flt_trajMM)
    # - Scale each row (gene/motif) separately for both matrices
    # - Transpose the result and add them together to integrate information from both sources
    assay(combinedTraj, withDimnames=FALSE) <- t(apply(assay(flt_trajGSM), 1, scale)) + 
                                               t(apply(assay(flt_trajMM), 1, scale))
    
    # Generate a heatmap matrix from the combined trajectory
    # - returnMat = TRUE returns the matrix instead of plotting
    # - varCutOff = 0 ensures no variance-based filtering
    combinedMat <- plotTrajectoryHeatmap(combinedTraj, returnMat = TRUE, varCutOff = 0)
    
    # Determine the order of rows (genes/motifs) in the combined matrix
    rowOrder <- match(rownames(combinedMat), rownames(flt_trajGSM))
    
    # Plot heatmap for the gene score trajectory, keeping row order consistent with the combined matrix
    ht_GSM <- plotTrajectoryHeatmap(flt_trajGSM, pal = paletteContinuous(set = "horizonExtra"),  
                                    varCutOff = 0, rowOrder = rowOrder)
    
    # Plot heatmap for the motif trajectory, keeping row order consistent with the combined matrix
    ht_MM <- plotTrajectoryHeatmap(flt_trajMM, pal = paletteContinuous(set = "solarExtra"), 
                                   varCutOff = 0, rowOrder = rowOrder)
    
    # Display both heatmaps side by side for comparison
    ht_GSM + ht_MM
                    

    Session information

    
    # Display session info to track package versions
    sessionInfo()