This notebook introduces essential command-line operations in Linux, covering fundamental commands that are broadly applicable across programming languages with minimal adaptations. These foundational skills will support efficient data management and analysis in computational biology. Additionally, we will explore the key steps in processing raw sequencing reads into count matrices using Cell Ranger, discussing its main outputs and role in single-cell transcriptomics. Processing scRNA-seq data is a crucial step in single-cell analysis. The chosen library preparation method determines whether RNA sequences are captured from transcript ends (e.g., 10X Genomics, Drop-seq) or full-length transcripts (e.g., Smart-seq), directly influencing downstream analysis and biological insights.
Google Colab/Jupyter Noetbook using for default python as programming language. It's permite to use another languages, like shell script.
This is a marker to Google Colab understand this is Shell's code. For your personal use, "!" is not necessary.
%%bash
# A hashtag is a comment, this parte of the code is not using. However serve to importante annotations.
# It is a good and common practice in code as a reminder and as a mechanism for reproduction.
# We recommend to always comment your codes and script
echo "Hello, world!"
Along this Jupyter Notebook you will see diferents command in shell. They will be explained, but here is a small set of the most common commands.
%%bash
# make a folder.
# In programming, folder is called directory. We will use the name directory from here on.
# The name folder could be anything (e.g. folder1, folder_1, etc), try always to name directories that you remember what data are you saving on it
mkdir folder
%%bash
#list of files and directories
# In general, directories has a / in ending (e.g. /Documents/Files/scRNAseq_data/, here we have three directories
ls
%%bash
# A command could be follow for arguments. Arguments specify your main code. There are could be - or -- or positional.
# It's a importante thing to know about a software that you want to use.
ls -l
%%bash
# A command could be follow for arguments. Arguments specify your main code. There are could be - or -- or positional.
# It's a importante thing to know about a software that you want to use.
ls -l
%%bash
# The command cd is used to move into a directory or out of a directory
# If you want to move to another directory, you can use cd .. to move back forward or only cd to move forward
cd
%%bash
# The command mv is used to move your file to a specific directory
# You only need to specify the path to the directory you want to move your file
mv your_arquive local_to_move/
%%bash
# Also, mv could change names of directories or files. Try:
mv folder/ directory/
%%bash
# The command cat shows the entire contents of your file
cat
%%bash
# Also, you can concatenate two or more files using this command
cat file_1 file_2 > file_3
%%bash
# The command head shows the first 10 lines of your file
# Like a head of the "cat"
head your_file
%%bash
# The command tail shows the last 10 lines of your file
# Like a tail of the "cat"
tail
%%bash
# The command wget is used to download files
# You can put a link next to the command to download data (e.g. an specific dataset of a database, here is an example of a RNA-seq sample from SRA)
wget https://trace.ncbi.nlm.nih.gov/Traces/sra?run=SRR24765940
To begin, you must download SRAtoolkit in order to use the fastq-dump tool, which allows you to download SRA files.
The SRA Toolkit is a set of tools developed by the National Center for Biotechnology Information (NCBI) to access, download, and manipulate high-throughput data stored in the Sequence Read Archive (SRA)
The fastq-dump is one of the tools in the SRA Toolkit. It is used to extract data in FASTQ or FASTA format from SRA accessions. FASTQ is a common format for storing DNA and RNA sequencing data, which includes the sequence reads and their qualities. fastq-dump allows users to convert SRA files to FASTQ files, making it easier to process and analyze these data in other bioinformatics tools.
%%bash
# The -q is for quit mode, to don't provide all the output of wget, just run silently.
# The --output-document is to specify the name of the file where the downloaded content should be saved.
wget -q --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.1.0/sratoolkit.3.1.0-ubuntu64.tar.gz
# .tar.gz is a file that has been first packaged with tar and then compressed with gzip.
# tar command is used to pack or unpack files on Unix/Linux systems. The name "tar" comes from "tape archive"
# -xzf: Indicates that tar should extract the contents of a gzip-compressed archive.
tar -xzf sratoolkit.tar.gz
%%bash
# list of archives
ls
Cell Ranger is a software suite developed by 10x Genomics for analyzing single-cell RNA sequencing (scRNA-seq) data generated from their Chromium platform. It processes the raw sequencing data into meaningful insights, including gene expression matrices, cell clustering, and other downstream analyses.
Note:
Cell Ranger requires more than 12GB of RAM to run, it does not work with Colab.
Download:Update the link if it has expired and then the relevant genomic reference.
%%bash
# -O: name of output
wget -O cellranger-9.0.0.tar.gz "https://www.10xgenomics.com/support/software/cell-ranger/downloads"
%%bash
wget -O cellranger-9.0.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-9.0.0.tar.gz?Expires=1735019868&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=jr1wurwScX~6D75pHB8jDbGMyGRI28tiLJAe9LUo0Xz5hqUgQlVpaRBz50wZewTz9lp2ozBI91iEQkZ7s2ZbOGbugctslStBOCkILh3gfkIiv63YBJRSDm0kPJjAvLXHm6BUf20bPTJPiuwLvRWZuSrri0tV7vzn9iCY24I3~tjGCy4377Dm-1oCuiQCuHjXlyZjEZVXQZuS9ghX9ZmmldDnf6wE9hqIE80PhiSclOyFWfqLg8v5OBGP0JqrdI5oNhGiRKKmezIpT2eZ-pfzSCGMcQu1r6irmsz4pgZxl4ElWIpLtzB7xJb2H2OvL~j3rsBX6Ay-idAxDOF31tMqyQ__"
wget "https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz"
It merely has to be decompressed to be installed.
%%bash
tar -xzf cellranger-7.2.0.tar.gz
tar -xzf refdata-gex-GRCh38-2020-A.tar.gz
!cellranger-7.2.0/bin/cellranger mkref --genome=reference --fasta=/genome/genome.fa --genes=/rato/gencode.vM36.annotation.gtf
%%bash
#shows the full path where you are working
pwd
%%bash
#DO NOT RUN!
#This \ break a line and continue the code in another line
cellranger-7.2.0/bin/cellranger count --id=run_count_SRR11537950 \
--fastqs=SRR11537950 --sample=SRR11537950 \
--transcriptome=refdata-gex-GRCh38-2020-A
#--localcores=2 --localmem=12
The cellranger count pipeline outputs an interactive summary HTML file named web_summary.html that contains summary metrics and automated secondary analysis results. Let's check it out!
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/summary
Barcode Rank Plot:
A steep drop-off is indicative of good separation between the cell-associated barcodes and the barcodes associated with empty GEMs.