In this notebook, learners practice structuring and documenting metadata using consistent standards and assemble submission-ready packages for public repositories and data portals. The notebooks emphasize common file formats, required metadata fields and practical submission workflows, helping ensure that datasets remain discoverable, interoperable and readily reusable across platforms and studies.
This document provides guidance on how to prepare metadata and submit transcriptomic data to the NCBI (National Center for Biotechnology Information), the SCEA (Single Cell Expression Atlas), and the HCA Data Portal (Human Cell Atlas Data Portal). It applies to single-cell, bulk RNA-Seq, and spatial transcriptomics, and follows the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
Metadata is data that describes the data. In other words, it is descriptive information that accompanies the raw and processed data, allowing them to be understood, contextualized, and reused. It can take the form of tables with information about individuals or patients, a supplementary file with details of sample and file processing, and so on.
In transcriptomic studies, whether bulk RNA-seq, single-cell, or spatial, metadata must capture both biological and technical aspects. This includes:
Well-structured metadata is essential to ensure that other researchers can correctly interpret the data, validate results, and integrate them into comparative analyses or reference databases.
For metadata to fully fulfill its role, it must exhibit a series of characteristics that guarantee its usefulness and quality. In this sense, the FAIR principles offer a set of best practices that help organize and make metadata more effective.
Although not mandatory, following these principles is highly recommended to increase the visibility, accessibility, and reuse of data.
Before understanding the NCBI submission workflow, it's important to know the most common file formats. Each extension carries a type of information and has specific uses in RNA-seq analyses (bulk, single-cell, or spatial).
A file where fields are separated by commas. It is widely accepted in spreadsheet and analysis software and is frequently used for count files.
Caution: May cause conflicts if there are commas in the text.
A text file where each column is separated by tabs (TAB) and each line represents an entry (e.g., sample or gene).
Very commonly used for metadata and count matrices, as it avoids conflicts with commas in the text.
| .csv | .tsv |
|---|---|
| Uses commas between fields | Uses tabulation (TAB) |
| May cause conflicts with commas in the text | Safer for textual metadata |
| Extension: .csv | Extension: .tsv |
How to save:
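One way to save the same table in both formats is Python's standard `csv` module. This is a minimal sketch: the rows are hypothetical example data, and the description field contains a comma on purpose to show how each format handles it.

```python
import csv
import io

# Hypothetical example rows; the description deliberately contains a comma.
rows = [
    {"sample_name": "CHIKV_sc01", "tissue": "PBMC", "description": "PBMCs, 3 dpi"},
    {"sample_name": "CHIKV_sc02", "tissue": "PBMC", "description": "PBMCs, 5 dpi"},
]
fields = ["sample_name", "tissue", "description"]


def to_delimited(rows, fields, delimiter):
    """Serialize rows to delimited text; the csv module quotes fields when needed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_delimited(rows, fields, ",")   # comma-separated
tsv_text = to_delimited(rows, fields, "\t")  # tab-separated

print(csv_text)
print(tsv_text)
```

In the comma-separated output the description has to be wrapped in quotes, while the tab-separated output keeps it as plain text. To write real files, open them with `newline=""` and pass the file handle to `csv.DictWriter` instead of the in-memory buffer.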
A plain text file, typically used for sample counts or basic metadata. It can be structured as a table (with tabs or spaces) or as a list.
It is a binary format specific to R. It allows saving complex objects (such as normalized arrays, Seurat objects, or SingleCellExperiments) while maintaining structure and metadata. Ideal for direct reuse in R analyses.
A code file written in R (an R Script). Used to share analysis pipelines, including normalization, DESeq2, Seurat, etc.
A shell script (bash/shell) that runs in the terminal. Used to automate processing steps, such as alignment, format conversion, or job submission to servers.
Interactive file that combines code (Python, R, etc.), results, and documentation. Very useful for reproducibility, as it shows the analysis step by step. Accepted by GEO as a way to share complete pipelines.
[ BioProject ]
↓
[ BioSample ] → [ SRA (raw files: FASTQ) ]
↓
[ GEO (Processed data: matrices, metadata, scripts) ]
The NCBI uses a hierarchical structure that connects different levels of information:
BioProject groups all samples and data from a study. Create only one BioProject per study.
You need to create an NCBI login to do this.
Include title, description, organism, and data type.
Example:
project_title: Single-cell transcriptomic atlas of PBMCs during CHIKV infection
data_type: Transcriptome (bulk + single-cell + spatial)
Each biological sample receives a unique ID. For transcriptomic samples, describe:
sample_name, organism, tissue, cell_type, disease, treatment, time_point, geo_loc_name
Additional fields: sequencing_protocol, dissociation_method, library_prep, cell_capture_platform (e.g., 10x Genomics Chromium, Smart-seq2)
| sample_name | organism | tissue | cell_capture_platform | library_prep | disease | time_point | bioproject_accession |
|---|---|---|---|---|---|---|---|
| CHIKV_sc01 | Homo sapiens | PBMC | 10x Genomics | Chromium 10x 3’ v3 | Chikungunya fever | 3 dpi | PRJNA123456 |
Create BioSamples linked to your BioProject through the same portal (https://submit.ncbi.nlm.nih.gov/subs/bioproject), or prepare a batch upload with .tsv files.
Preferably use more complete tables, such as:
| sample_name | organism | tissue | cell_type | cell_capture_platform | library_prep | sequencing_protocol | dissociation_method | disease | time_point | geo_loc_name | age | sex | bioproject_accession | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHIKV_sc01 | Homo sapiens | PBMC | Lymphocytes (mixed) | 10x Genomics | Chromium 10x 3’ v3 | Illumina NovaSeq 6000, paired-end 2×75 bp | Ficoll gradient + RBC lysis | Chikungunya fever | 3 dpi | Brazil: Bahia | 35 | F | PRJNA123456 | PBMCs isolated 3 days post CHIKV infection, processed with 10x Genomics Chromium 3’ v3 |
This table can be saved as biosample_metadata.tsv
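Before uploading, it can help to confirm that no required column is empty. The sketch below is only an illustration: the `REQUIRED` list is an assumption based on fields named in this notebook (the BioSample package you select defines the real list), and the second row is a hypothetical sample with a missing `geo_loc_name`.

```python
import csv
import io

# Assumption: illustrative required-field list taken from fields named in this
# notebook; the BioSample package chosen at submission defines the real one.
REQUIRED = [
    "sample_name", "organism", "tissue", "disease",
    "time_point", "geo_loc_name", "bioproject_accession",
]

# In practice this text would be read from biosample_metadata.tsv.
TSV_TEXT = (
    "sample_name\torganism\ttissue\tdisease\ttime_point\tgeo_loc_name\tbioproject_accession\n"
    "CHIKV_sc01\tHomo sapiens\tPBMC\tChikungunya fever\t3 dpi\tBrazil: Bahia\tPRJNA123456\n"
    "CHIKV_sc02\tHomo sapiens\tPBMC\tChikungunya fever\t5 dpi\t\tPRJNA123456\n"
)


def find_missing(tsv_text, required):
    """Return {sample_name: [missing or empty required fields]} for problem rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    problems = {}
    for row in reader:
        missing = [field for field in required if not (row.get(field) or "").strip()]
        if missing:
            problems[row.get("sample_name") or "<unnamed>"] = missing
    return problems


print(find_missing(TSV_TEXT, REQUIRED))
```

An empty dictionary means every row has a value in each listed column.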
After submission, the system returns accessions for each sample as:
SAMN45678901
SAMN45678902
This ensures proper linking between the different levels of submission and allows the data to be navigated in an integrated manner. The BioProject serves as the primary identifier of the study, and without it, there is no way to relate samples, raw data, and processed data.
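Because every later table repeats these accessions, a simple format check can catch copy-and-paste mistakes early. The patterns below are an assumption inferred from the example accessions in this notebook (`PRJNA123456`, `SAMN45678901`); they are a sanity check, not an official grammar, and `looks_valid` is a hypothetical helper name.

```python
import re

# Assumption: patterns inferred from the example accessions in this notebook.
PATTERNS = {
    "bioproject": re.compile(r"PRJNA\d+"),
    "biosample": re.compile(r"SAMN\d+"),
}


def looks_valid(kind, accession):
    """True if the accession fully matches the assumed pattern for its kind."""
    return PATTERNS[kind].fullmatch(accession.strip()) is not None


for kind, accession in [
    ("bioproject", "PRJNA123456"),
    ("biosample", "SAMN45678901"),
    ("biosample", "PRJNA123456"),  # a project accession pasted into a sample column
]:
    print(kind, accession, looks_valid(kind, accession))
```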
The metadata table needs to describe the raw sequencing files and how they were generated. In addition to the basic fields, it is essential to include the BioProject and BioSamples codes to ensure correct linking.
| sample_name | biosample_accession | bioproject_accession | library_ID | title | time_point | library_strategy | library_source | library_selection | library_layout | platform | instrument_model | insert_size | filetype | filename | design_description | library_construction_protocol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHIKV_sc01 | SAMN45678901 | PRJNA123456 | LIB01 | scRNA-seq of PBMCs | 3 dpi | scRNA-Seq | TRANSCRIPTOMIC | RANDOM | PAIRED | ILLUMINA | NovaSeq 6000 | 280 | fastq | CHIKV_sc01_R1.fastq.gz;CHIKV_sc01_R2.fastq.gz | Single-cell RNA-seq of PBMCs infected with CHIKV, 3 days post infection | 10x Genomics Chromium 3’ v3 kit |
| CHIKV_sc02 | SAMN45678902 | PRJNA123456 | LIB02 | scRNA-seq of PBMCs | 5 dpi | scRNA-Seq | TRANSCRIPTOMIC | RANDOM | PAIRED | ILLUMINA | NovaSeq 6000 | 280 | fastq | CHIKV_sc02_R1.fastq.gz;CHIKV_sc02_R2.fastq.gz | Single-cell RNA-seq of PBMCs infected with CHIKV, 5 days post infection | 10x Genomics Chromium 3’ v3 kit |
Add other fields if you want more details (e.g., basecaller, alignment software). Save as sra_metadata.tsv
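A small consistency check on each row of the SRA table can be sketched as follows. It assumes, as in the example table above, that paired files are listed in `filename` separated by a semicolon; `check_sra_row` is a hypothetical helper, not part of any NCBI tool.

```python
def check_sra_row(row):
    """Return a list of problems found in one SRA metadata row (a dict)."""
    files = [name.strip() for name in row["filename"].split(";") if name.strip()]
    # Assumption: PAIRED libraries list two files, any other layout lists one.
    expected = 2 if row["library_layout"].strip().upper() == "PAIRED" else 1
    problems = []
    if len(files) != expected:
        problems.append(f"expected {expected} file(s), found {len(files)}")
    if not row["biosample_accession"].strip():
        problems.append("biosample_accession is empty")
    return problems


good = {
    "sample_name": "CHIKV_sc01",
    "biosample_accession": "SAMN45678901",
    "library_layout": "PAIRED",
    "filename": "CHIKV_sc01_R1.fastq.gz;CHIKV_sc01_R2.fastq.gz",
}
# Same row with the R2 file forgotten.
bad = dict(good, filename="CHIKV_sc01_R1.fastq.gz")

print(check_sra_row(good))
print(check_sra_row(bad))
```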
GEO is a public repository from the NCBI focused on processed gene expression data, including:
| File Type | Extension | Example |
|---|---|---|
| Count Matrix | .tsv, .csv | counts_matrix.tsv |
| Sample Counts | .tsv, .txt | counts_CHIKV_01.tsv |
| Normalized Files | .tsv, .rds | normalized_counts.rds |
| Scripts or pipelines | .R, .sh, .ipynb | deseq2_analysis.R |
| Sample Metadata | .tsv | geo_sample_metadata.tsv |
You can also include experimental flow diagrams, batch factors, and even RIN and RNA concentration.
| title | biosample_accession | source_name | organism | treatment | time_point | file_type | file_name | BioProject |
|---|---|---|---|---|---|---|---|---|
| Expression of PBMCs CHIKV 3dpi | SAMN45678901 | PBMC | Homo sapiens | CHIKV | 3dpi | Counts | counts_CHIKV_01.tsv | PRJNA123456 |
Prepare based on GEO submission templates: https://www.ncbi.nlm.nih.gov/geo/info/submission.html?form=MG0AV3
A more complete example:
| sample_title | biosample_accession | source_name | organism | characteristics_ch1 | time_point | treatment | protocol_ch1 | data_processing | file_name | file_type | BioProject |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CHIKV_01 | SAMN45678901 | PBMC | Homo sapiens | disease: Chikungunya fever | 3 dpi | CHIKV infection | rRNA depletion + TruSeq | alignment with HISAT2, counts with StringTie and prepDE | counts_CHIKV_01.tsv | TSV | PRJNA123456 |
| CHIKV_02 | SAMN45678902 | PBMC | Homo sapiens | disease: Chikungunya fever | 5 dpi | CHIKV infection | rRNA depletion + TruSeq | alignment with HISAT2, counts with StringTie and prepDE | counts_CHIKV_02.tsv | TSV | PRJNA123456 |
The characteristics_ch1 field in GEO is extremely flexible and powerful; it allows you to describe various biological, clinical, or technical characteristics of your sample, in addition to the disease.
Another example of how it can be more complete:
| sample_title | biosample_accession | source_name | organism | characteristics_ch1 | characteristics_ch1 | characteristics_ch1 | characteristics_ch1 | time_point | treatment | protocol_ch1 | data_processing | file_name | file_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHIKV_01_3dpi | SAMN45678901 | PBMC | Homo sapiens | disease: Chikungunya fever | sex: female | age: 35 | RIN: 8.5 | 3 dpi | CHIKV infection | rRNA depletion + TruSeq | alignment with HISAT2, counts with StringTie & prepDE | counts_CHIKV_01.tsv | TSV |
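Each `characteristics_ch1` cell in the tables above follows a `tag: value` pattern. As a rough sketch, the cells of one sample can be converted to and from a dictionary like this; both function names are hypothetical helpers for illustration.

```python
def parse_characteristics(cells):
    """Turn ['disease: Chikungunya fever', 'sex: female'] into a dict."""
    parsed = {}
    for cell in cells:
        tag, separator, value = cell.partition(":")
        if not separator:
            raise ValueError(f"expected 'tag: value', got {cell!r}")
        parsed[tag.strip()] = value.strip()
    return parsed


def format_characteristics(mapping):
    """Inverse operation: build one 'tag: value' cell per dictionary entry."""
    return [f"{tag}: {value}" for tag, value in mapping.items()]


cells = ["disease: Chikungunya fever", "sex: female", "age: 35", "RIN: 8.5"]
parsed = parse_characteristics(cells)
print(parsed)
print(format_characteristics(parsed))
```

Keeping the characteristics in a dictionary makes it easier to confirm that every sample carries the same set of tags before the table is written out.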
PRJNA123456 (BioProject)
├── SAMN45678901 (BioSample)
│ ├── SRRxxxxxxx (SRA - raw data)
│ └── GSMxxxxxxx (GEO - processed data)
├── SAMN45678902 (BioSample)
│ ├── SRRyyyyyyy (SRA - raw data)
│ └── GSMyyyyyyy (GEO - processed data)
1. Create your BioProject
2. Submit your BioSamples
When filling in each line (via form or .tsv), include the field:
| bioproject_accession |
|---|
| PRJNA123456 |
Each sample receives a code like: SAMN45678901
Each BioSample must have a unique name (e.g., CHIKV_01), and this same name will be used in the SRA and GEO metadata.
3. Submission to the SRA (raw data)
| sample_name | biosample_accession |
|---|---|
| CHIKV_01 | SAMN45678901 |
The SRA will use this to link your .fastq.gz file to the correct sample.
4. Submission to GEO
| BioSample | BioProject |
|---|---|
| SAMN45678901 | PRJNA123456 |
Check the submission and send it for review. After submission, you will receive a temporary GSE ID (e.g., GSE123456), and the NCBI team will curate it.
When everything is correctly linked, anyone (including reviewers) will be able to:
Log in to BioProject → View the BioSamples → Access the data in SRA → View the processed files in GEO, as if it were a single interconnected study.
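The linking described in the steps above depends on the same sample name and accession appearing in every table. A minimal cross-check between the BioSample and SRA tables might look like the sketch below; the rows are hypothetical, and `check_links` is an illustrative helper rather than an NCBI tool.

```python
def check_links(biosamples, sra_rows):
    """Compare SRA rows against the BioSample table and list mismatches."""
    accession_by_name = {
        sample["sample_name"]: sample["biosample_accession"] for sample in biosamples
    }
    problems = []
    for row in sra_rows:
        name = row["sample_name"]
        found = row["biosample_accession"]
        expected = accession_by_name.get(name)
        if expected is None:
            problems.append(f"{name}: no BioSample with this name")
        elif expected != found:
            problems.append(f"{name}: expected {expected}, found {found}")
    return problems


biosamples = [
    {"sample_name": "CHIKV_01", "biosample_accession": "SAMN45678901"},
    {"sample_name": "CHIKV_02", "biosample_accession": "SAMN45678902"},
]
sra_rows = [
    {"sample_name": "CHIKV_01", "biosample_accession": "SAMN45678901"},
    {"sample_name": "CHIKV_02", "biosample_accession": "SAMN45678901"},  # wrong accession
    {"sample_name": "CHIKV_03", "biosample_accession": "SAMN45678903"},  # unknown sample
]

for problem in check_links(biosamples, sra_rows):
    print(problem)
```

The same function can be reused for the GEO sample table, since it only relies on the two shared columns.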
To submit processed files (e.g., gene counts):
Go to: https://submit.ncbi.nlm.nih.gov/subs/geo/
1. Create a new submission
2. Choose: Processed Data Submission (GSE)
3. Upload:
4. Fill in the study description, protocol, objectives, etc.
The Single Cell Expression Atlas is a public repository of EMBL-EBI that brings together single-cell RNA-seq and spatial transcriptomics data, reprocessed with standardized pipelines and enriched with ontologies. Submission follows the MAGE-TAB standard (IDF and SDRF files) and undergoes curation before being integrated into the Atlas.
An official technical guide and additional instructions are available; here is a simplified version of the process. The workflow is similar to that of the NCBI:
[ ArrayExpress (input) ]
↓
[ ENA/SRA (raw data: FASTQ/BAM) ]
↓
[ Single Cell Expression Atlas (processed data and metadata) ]
Raw data: FASTQ or BAM files → submitted to ENA/SRA.
Processed data: Expression matrices (genes × cells), cell metadata (clusters, cell types, QC, mitochondrial percentage), normalized files.
Scripts/pipelines: .R, .ipynb, .sh files used for analysis.
Metadata: Complete tables describing samples, cells, and experimental conditions. Similar to BioSample.
Accepted formats: .tsv, .h5ad, .loom, plus supplementary files such as .R, .ipynb.
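As a convenience, the files in a submission folder can be screened against the extensions listed above. This sketch treats that list as the only source of truth, which is an assumption; the file names are hypothetical and `unsupported` is an illustrative helper.

```python
from pathlib import Path

# Extensions taken from the list above, lowercased for comparison.
ACCEPTED = {".tsv", ".h5ad", ".loom", ".r", ".ipynb"}


def unsupported(filenames):
    """Return the file names whose final extension is not in ACCEPTED."""
    return [name for name in filenames if Path(name).suffix.lower() not in ACCEPTED]


# Hypothetical package contents.
package = ["matrix.h5ad", "cells.tsv", "analysis.R", "reads.fastq.gz"]
print(unsupported(package))
```

Only the final extension is compared, so raw reads such as `reads.fastq.gz` are flagged here; they belong in the raw-data submission rather than in the processed package.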
SCEA uses the MAGE-TAB standard for metadata, which is mandatory and consists of two main files:
IDF (Investigation Description Format):
SDRF (Sample and Data Relationship Format):
Simplified example of an SDRF:
| Sample Name | Organism | Tissue | Cell Type | Library Prep | Sequencing Protocol | File Name |
|---|---|---|---|---|---|---|
| CHIKV_sc01 | Homo sapiens | PBMC | lymphocyte | 10x Genomics Chromium 3’ v3 | Illumina NovaSeq 6000 | CHIKV_sc01_R1.fastq.gz;CHIKV_sc01_R2.fastq.gz |
It should also include dissociation information, capture platform (e.g., 10x Genomics, Smart-seq2), and experimental conditions.
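A frequent problem in hand-edited SDRF tables is a row with more or fewer cells than the header. The check below is a minimal sketch using simplified column names like those in the example rather than full MAGE-TAB headers; the second data row is hypothetical and contains the error on purpose.

```python
import csv
import io


def ragged_rows(tsv_text):
    """Return (line_number, cell_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    return [
        (line_number, len(row))
        for line_number, row in enumerate(reader, start=2)
        if len(row) != len(header)
    ]


# Simplified, hypothetical SDRF content; line 3 has one cell too many.
sdrf_text = (
    "Sample Name\tOrganism\tTissue\tFile Name\n"
    "CHIKV_sc01\tHomo sapiens\tPBMC\tCHIKV_sc01_R1.fastq.gz\n"
    "CHIKV_sc02\tHomo sapiens\tPBMC\tCHIKV_sc02_R1.fastq.gz\tCHIKV_sc02_R2.fastq.gz\n"
)
print(ragged_rows(sdrf_text))
```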
ArrayExpress is the EMBL-EBI repository used as an entry point for transcriptomic data. All submissions of single-cell RNA-seq or spatial transcriptomics go through it before being integrated into SCEA.
[ ArrayExpress ]
├── ENA (raw data: FASTQ/BAM)
└── Expression Atlas / Single Cell Expression Atlas (processed data + metadata)
Raw data: FASTQ or BAM, sent to ENA.
Processed data: expression matrices (genes × cells), cell metadata (clusters, cell types, QC), normalized files.
Scripts/pipelines: .R, .ipynb, .sh to guarantee reproducibility.
Metadata: complete tables describing samples, cells, and experimental conditions.
This is the same as described previously.
1. Create an account on EMBL-EBI.
2. Prepare the MAGE-TAB files (IDF + SDRF).
3. Send raw data to ENA.
4. Link BioProject and BioSample IDs.
5. Submit metadata and processed data to ArrayExpress.
6. Upload the .tsv, .h5ad, .loom, and .rds files.
7. Upload the scripts/pipelines.
8. Curation: The Atlas team reviews the metadata, applies ontologies, and reprocesses the data.
9. Publication: The dataset receives a public identifier (e.g., E-MTAB-12345) and is integrated into the Single Cell Expression Atlas.
The Atlas team reprocesses the data using standardized pipelines (e.g., alignment, normalization, clustering).
The metadata is harmonized with ontologies (Cell Ontology, Uberon, Disease Ontology).
The dataset receives a public identifier (E-MTAB-12345) and becomes searchable on the portal.
The HCA Data Portal does not function as an "open repository for any data from anyone"; there are well-defined criteria for submission. It accepts single-cell and spatial transcriptomics data (scRNA-seq, ATAC-seq, multi-omics, spatial RNA-seq), including raw data (FASTQ/BAM) and processed data (AnnData .h5ad matrices). Submissions must be accompanied by structured metadata following the official schemas (Tier 1 and Tier 2).
The data must also come from high-quality studies, with clear protocols and sufficient documentation to allow reuse.
Restrictions:
Metadata is organized into two levels (tiers) to separate technical information from more sensitive information.
This ensures that the data is FAIR (findable, accessible, interoperable, and reusable). This data is publicly available through the HCA Data Portal and on platforms such as CellxGene Discover.
Examples of fields:
This can enrich biological interpretation while maintaining anonymization and privacy. This metadata has controlled access, where some information may be restricted or anonymized to protect donors.
Examples of fields:
Full documentation is available on the HCA Data Portal and in the data ingestion guide:
HCA Data Portal – Contribute
HCA Data Ingestion Instructions PDF
The HCA organizes data into different layers to ensure accessibility and protection of sensitive information:
[ HCA Data Coordination Platform ]
├── ENA/SRA (Raw data: FASTQ)
├── HCA Data Repository (Tier 2 metadata + sensitive data)
└── CellxGene Discover (matrices AnnData + Tier 1 metadata)
Raw data: FASTQ or BAM files to be submitted to the HCA Data Repository.
Processed data: Expression matrices (genes × cells) in AnnData format (.h5ad).
Tier 1 Metadata: Technical information
Tier 2 Metadata: More detailed information
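One way to keep the two tiers apart while preparing metadata is to split each record by field name. The field names below are hypothetical placeholders chosen only for illustration; the official HCA schemas define which fields actually belong to each tier.

```python
# Hypothetical field assignments, for illustration only.
TIER_1_FIELDS = {"sample_id", "organism", "tissue", "library_prep"}
TIER_2_FIELDS = {"sample_id", "donor_clinical_history"}


def split_tiers(record):
    """Split one record into (tier_1, tier_2) dicts; sample_id is kept in both."""
    unknown = set(record) - TIER_1_FIELDS - TIER_2_FIELDS
    if unknown:
        raise ValueError(f"fields not assigned to a tier: {sorted(unknown)}")
    tier_1 = {key: value for key, value in record.items() if key in TIER_1_FIELDS}
    tier_2 = {key: value for key, value in record.items() if key in TIER_2_FIELDS}
    return tier_1, tier_2


record = {
    "sample_id": "CHIKV_sc01",
    "organism": "Homo sapiens",
    "tissue": "PBMC",
    "library_prep": "10x Genomics Chromium 3' v3",
    "donor_clinical_history": "restricted",
}
tier_1, tier_2 = split_tiers(record)
print(tier_1)
print(tier_2)
```

Raising an error for unassigned fields is a deliberate choice in this sketch: a new column cannot slip into the public tier by accident.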
1. Register the project in the HCA Data Coordination Platform.
2. Prepare Tier 1 and Tier 2 metadata according to the official ingestion guide.
3. Submit raw data to the HCA Data Repository.
4. Submit processed matrices (AnnData .h5ad) to the HCA portal.
5. Receive accession ID and track curation.
6. Publication: the data is integrated into the portal, and the dataset receives a public identifier (e.g., HCA12345).