FAIR data and sharing data

In this notebook, learners practice structuring and documenting metadata using consistent standards and assemble submission-ready packages for public repositories and data portals. The notebook emphasizes common file formats, required metadata fields, and practical submission workflows, helping ensure that datasets remain discoverable, interoperable, and readily reusable across platforms and studies.

Transcriptomic Data Submission Guide

This document provides guidance on how to prepare metadata and submit transcriptomic data to the NCBI (National Center for Biotechnology Information), the SCEA (Single Cell Expression Atlas), and the HCA Data Portal (Human Cell Atlas Data Portal). It applies to single-cell, bulk RNA-seq, and spatial transcriptomics, and follows the FAIR (Findable, Accessible, Interoperable, Reusable) principles.

Metadata in transcriptomic data

Metadata is data that describes the data. In other words, it is descriptive information that accompanies the raw and processed data, allowing them to be understood, contextualized, and reused. It can take the form of tables with information about individuals or patients, a supplementary file with details of sample and file processing, and so on.

In transcriptomic studies, whether bulk RNA-seq, single-cell, or spatial, metadata must capture both biological and technical aspects. This includes:

Biological information: organism, tissue, cell type, experimental condition, treatment, collection time, clinical or environmental characteristics.
Technical information: sequencing platform, library protocol, capture parameters (e.g., 10x Genomics, Smart-seq2, Visium), RNA quality, software and versions used.
Experimental context: study design, comparison groups, replicates, batch factors, controls.
Processed data: output files such as count matrices, normalized files, spatial coordinates; as well as scripts and pipelines used.

Well-structured metadata is essential to ensure that other researchers can correctly interpret the data, validate results, and integrate them into comparative analyses or reference databases.

FAIR Metadata

For metadata to fully fulfill its role, it must exhibit a series of characteristics that guarantee its usefulness and quality. In this sense, the FAIR principles offer a set of best practices that help organize and make metadata more effective.

Findable: use of persistent identifiers. Articles use DOIs; data and metadata use identifiers such as BioProject, BioSample, SRA, and GEO accessions, together with standardized keywords that facilitate searching.
Accessible: availability in public repositories, in open and machine-readable formats, ensuring broad and transparent access.
Interoperable: adoption of recognized vocabularies and ontologies (e.g., Cell Ontology, Uberon, Disease Ontology), allowing integration between different databases and consistent data recognition.
Reusable: complete metadata, with clear protocols and sufficient documentation to allow reanalysis and comparison, while keeping the identities of individuals coded and anonymized.

Although not mandatory, following these principles is highly recommended to increase the visibility, accessibility, and reuse of data.

File formats for submission

Before understanding the NCBI submission workflow, it's important to know the most common file formats. Each extension carries a type of information and has specific uses in RNA-seq analyses (bulk, single-cell, or spatial).

File .csv (Comma-Separated Values)

A file where fields are separated by commas. It is widely accepted in spreadsheet and analysis software. Frequently used for count files.

Caution: commas inside a text field can break the column layout unless the field is quoted.

Excel: File > Save As > csv
Google Sheets: File > Download > csv

File .tsv (Tab-Separated Values)

A text file where each column is separated by tabs (TAB) and each line represents an entry (e.g., sample or gene).

Very commonly used for metadata and count matrices, as it avoids conflicts with commas in the text.

.csv	.tsv
Uses commas between fields	Uses tabulation (TAB)
May cause conflicts with commas in the text	Safer for textual metadata
Extension: .csv	Extension: .tsv

How to save:

Excel: File > Save As > Text (tab delimited) (.txt) → rename to .tsv (if necessary)
Google Sheets: File > Download > Tab delimited values (.tsv)
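
The difference between the two formats can also be handled programmatically. The following is a minimal sketch using Python's standard csv module; the function name csv_to_tsv and the sample values are illustrative only. It shows that a quoted field containing a comma stays in a single column once the delimiter becomes a tab.

```python
import csv
import io


def csv_to_tsv(csv_text: str) -> str:
    """Convert comma-separated text to tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        # Each row is a list of fields; quoted commas were already resolved
        # by the reader, so they end up inside one tab-separated column.
        writer.writerow(row)
    return out.getvalue()


# Illustrative input: the description contains a comma and is therefore quoted.
example_csv = 'sample_name,description\nCHIKV_sc01,"PBMCs, 3 dpi"\n'
example_tsv = csv_to_tsv(example_csv)
print(example_tsv)
```

To convert a file on disk, the same function can be applied to the text read from the .csv file, and the result written to a file with the .tsv extension.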

File .txt (Plain Text)

A plain text file, typically used for sample counts or basic metadata. It can be structured as a table (with tabs or spaces) or as a list.

File .rds (R Data Serialization)

It is a binary format specific to R. It allows saving complex objects (such as normalized arrays, Seurat objects, or SingleCellExperiments) while maintaining structure and metadata. Ideal for direct reuse in R analyses.

File .R (R Script)

A code file written in R (an R script). Used to share analysis pipelines, including normalization, DESeq2, Seurat, etc.

File .sh (Shell Script)

A script run in the terminal (bash/shell). Used to automate processing steps, such as alignment, format conversion, or job submission to servers.

File .ipynb (Jupyter Notebook)

Interactive file that combines code (Python, R, etc.), results, and documentation. Very useful for reproducibility, as it shows the analysis step by step. Accepted by GEO as a way to share complete pipelines.

NCBI Submission Workflow


[ BioProject ]
     ↓
[ BioSample ] → [ SRA (raw files: FASTQ) ]
     ↓
[ GEO (Processed data: matrices, metadata, scripts) ]

The NCBI uses a hierarchical structure that connects different levels of information:

BioProject: is the highest level, representing the study as a whole. Each project receives a unique identifier (e.g., PRJNA123456) and groups all related samples and experiments.
BioSample: describes each individual biological sample within the project. This includes information such as organism, tissue, cell type, experimental condition, treatment, and collection site. Each BioSample receives its own code (e.g., SAMN45678901).
SRA (Sequence Read Archive): stores raw sequencing data, such as FASTQ or BAM files. Each run receives an identifier of the form SRRxxxxxxx.
GEO (Gene Expression Omnibus): stores processed data, such as count matrices, normalized files, and expression metadata. Each series receives an identifier of the form GSExxxxxx, and each sample within it an identifier of the form GSMxxxxxxx.
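
Because each level has its own accession prefix, a metadata table can be checked for misplaced identifiers before submission. The sketch below is an illustration, not an NCBI tool: the patterns are inferred from the example accessions in this guide (a fixed prefix followed by digits), and the names ACCESSION_PATTERNS and classify_accession are hypothetical. Accessions issued by other archives use other prefixes and are not covered.

```python
import re
from typing import Optional

# Assumption: patterns inferred from the examples in this guide
# (PRJNA123456, SAMN45678901, SRRxxxxxxx, GSExxxxxx).
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJNA\d+"),
    "BioSample": re.compile(r"SAMN\d+"),
    "SRA run": re.compile(r"SRR\d+"),
    "GEO series": re.compile(r"GSE\d+"),
}


def classify_accession(accession: str) -> Optional[str]:
    """Return the NCBI level an accession looks like, or None if unrecognized."""
    for level, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return level
    return None


for value in ["PRJNA123456", "SAMN45678901", "not_an_accession"]:
    print(value, "->", classify_accession(value))
```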

Creating the BioProject

BioProject groups all samples and data from a study. Create only one BioProject per study.

You need to create an NCBI login to do this.

Include title, description, organism, and data type.

Example:

project_title: Single-cell transcriptomic atlas of PBMCs during CHIKV infection

data_type: Transcriptome (bulk + single-cell + spatial)

Create via the website: https://submit.ncbi.nlm.nih.gov/subs/bioproject
When submitted, it generates a PRJNAxxxxx code (e.g., PRJNA123456)

Creating the BioSample

Each biological sample receives a unique ID. For transcriptomic samples, describe:

sample_name, organism, tissue, cell_type, disease, treatment, time_point, geo_loc_name

Additional fields: sequencing_protocol, dissociation_method, library_prep, cell_capture_platform (e.g., 10x Genomics Chromium, Smart-seq2)

sample_name	organism	tissue	cell_capture_platform	library_prep	disease	time_point	bioproject_accession
CHIKV_sc01	Homo sapiens	PBMC	10x Genomics	Chromium 10x 3’ v3	Chikungunya fever	3 dpi	PRJNA123456

Create BioSamples linked to your BioProject through the same portal (https://submit.ncbi.nlm.nih.gov/subs/bioproject), or prepare a batch upload with .tsv files.

Preferably use more complete tables, such as:

sample_name	organism	tissue	cell_type	cell_capture_platform	library_prep	sequencing_protocol	dissociation_method	disease	time_point	geo_loc_name	age	sex	bioproject_accession	description
CHIKV_sc01	Homo sapiens	PBMC	Lymphocytes (mixed)	10x Genomics	Chromium 10x 3’ v3	Illumina NovaSeq 6000, paired-end 2×75 bp	Ficoll gradient + RBC lysis	Chikungunya fever	3 dpi	Brazil: Bahia	35	F	PRJNA123456	PBMCs isolated 3 days post CHIKV infection, processed with 10x Genomics Chromium 3’ v3

This table can be saved as biosample_metadata.tsv
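
A table like this can also be generated from code rather than typed by hand, which makes it easier to catch empty fields. The following is a minimal sketch under stated assumptions: the column list is copied from the shorter example table in this guide, not from an official BioSample package, and the helper name rows_to_tsv is hypothetical.

```python
import csv
import io

# Assumption: column order taken from the shorter example table in this guide.
COLUMNS = [
    "sample_name", "organism", "tissue", "cell_capture_platform",
    "library_prep", "disease", "time_point", "bioproject_accession",
]


def rows_to_tsv(rows, columns=COLUMNS):
    """Serialize sample dictionaries as TSV; fail if a column is missing or empty."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        missing = [c for c in columns if not row.get(c)]
        if missing:
            raise ValueError(f"{row.get('sample_name', '?')}: missing {missing}")
        writer.writerow(row)
    return out.getvalue()


samples = [{
    "sample_name": "CHIKV_sc01",
    "organism": "Homo sapiens",
    "tissue": "PBMC",
    "cell_capture_platform": "10x Genomics",
    "library_prep": "Chromium 10x 3' v3",
    "disease": "Chikungunya fever",
    "time_point": "3 dpi",
    "bioproject_accession": "PRJNA123456",
}]

tsv_text = rows_to_tsv(samples)
print(tsv_text)
```

The resulting text can then be written to biosample_metadata.tsv with a regular file write.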

After submission, the system returns accessions for each sample as:


SAMN45678901
SAMN45678902

All metadata, whether from BioSample, SRA, or GEO, must include the BioProject code to which they belong.

This ensures proper linking between the different levels of submission and allows the data to be navigated in an integrated manner. The BioProject serves as the primary identifier of the study, and without it, there is no way to relate samples, raw data, and processed data.

SRA Metadata

The metadata table needs to describe the raw sequencing files and how they were generated. In addition to the basic fields, it is essential to include the BioProject and BioSample accessions to ensure correct linking.

Bulk RNA-seq: paired or single-end .fastq.gz files or aligned .bam files.
Single-cell RNA-seq: .fastq.gz files per library/cell, with platform metadata.
Spatial transcriptomics: .fastq.gz per capture/library, accompanied by platform‑specific metadata (e.g., spot or coordinate matrices).

sample_name	biosample_accession	bioproject_accession	library_ID	title	time_point	library_strategy	library_source	library_selection	library_layout	platform	instrument_model	insert_size	filetype	filename	design_description	library_construction_protocol
CHIKV_sc01	SAMN45678901	PRJNA123456	LIB01	scRNA-seq of PBMCs	3 dpi	scRNA-Seq	TRANSCRIPTOMIC	RANDOM	PAIRED	ILLUMINA	NovaSeq 6000	280	fastq	CHIKV_sc01_R1.fastq.gz;CHIKV_sc01_R2.fastq.gz	Single-cell RNA-seq of PBMCs infected with CHIKV, 3 days post infection	10x Genomics Chromium 3’ v3 kit
CHIKV_sc02	SAMN45678902	PRJNA123456	LIB02	scRNA-seq of PBMCs	5 dpi	scRNA-Seq	TRANSCRIPTOMIC	RANDOM	PAIRED	ILLUMINA	NovaSeq 6000	280	fastq	CHIKV_sc02_R1.fastq.gz;CHIKV_sc02_R2.fastq.gz	Single-cell RNA-seq of PBMCs infected with CHIKV, 5 days post infection	10x Genomics Chromium 3’ v3 kit

Attention: the .fastq.gz file names listed in the filename field must follow a consistent naming scheme and match the uploaded files exactly.

Add other fields if you want more details (e.g., basecaller, alignment software). Save as sra_metadata.tsv
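
Before uploading, the SRA table can be cross-checked against the accessions returned by BioSample. The sketch below is one possible check, not an official validator. It assumes the column names of the example table above, file names joined with ";" as in that table, and a naming convention in which every file name starts with the sample name; that convention is inferred from the examples and is not an SRA rule.

```python
def check_sra_links(sra_rows, biosample_accessions, bioproject):
    """Return a list of human-readable problems found in SRA metadata rows."""
    problems = []
    for row in sra_rows:
        name = row["sample_name"]
        if row["biosample_accession"] not in biosample_accessions:
            problems.append(f"{name}: unknown BioSample {row['biosample_accession']}")
        if row["bioproject_accession"] != bioproject:
            problems.append(f"{name}: unexpected BioProject {row['bioproject_accession']}")
        # Assumption: files are named <sample_name>_R1.fastq.gz, etc.
        for filename in row["filename"].split(";"):
            if not filename.startswith(name + "_"):
                problems.append(f"{name}: file {filename} does not start with the sample name")
    return problems


sra_rows = [
    {"sample_name": "CHIKV_sc01", "biosample_accession": "SAMN45678901",
     "bioproject_accession": "PRJNA123456",
     "filename": "CHIKV_sc01_R1.fastq.gz;CHIKV_sc01_R2.fastq.gz"},
    {"sample_name": "CHIKV_sc02", "biosample_accession": "SAMN45678902",
     "bioproject_accession": "PRJNA123456",
     "filename": "CHIKV_sc02_R1.fastq.gz;CHIKV_sc02_R2.fastq.gz"},
]

problems = check_sra_links(sra_rows, {"SAMN45678901", "SAMN45678902"}, "PRJNA123456")
print(problems or "no problems found")
```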

GEO Metadata

GEO is a public repository from the NCBI focused on processed gene expression data, including:

File Type	Extension	Example
Count Matrix	.tsv, .csv	counts_matrix.tsv
Sample Counts	.tsv, .txt	counts_CHIKV_01.tsv
Normalized Files	.tsv, .rds	normalized_counts.rds
Scripts or pipelines	.R, .sh, .ipynb	deseq2_analysis.R
Sample Metadata	.tsv	geo_sample_metadata.tsv

You can also include experimental flow diagrams, batch factors, and even RIN and RNA concentration.

title	biosample_accession	source_name	organism	treatment	time_point	file_type	file_name	BioProject
Expression of PBMCs CHIKV 3dpi	SAMN45678901	PBMC	Homo sapiens	CHIKV	3dpi	Counts	counts_CHIKV_01.tsv	PRJNA123456

Prepare based on GEO submission templates: https://www.ncbi.nlm.nih.gov/geo/info/submission.html?form=MG0AV3

A more complete example

sample_title	biosample_accession	source_name	organism	characteristics_ch1	time_point	treatment	protocol_ch1	data_processing	file_name	file_type	BioProject
CHIKV_01	SAMN45678901	PBMC	Homo sapiens	disease: Chikungunya fever	3 dpi	CHIKV infection	rRNA depletion + TruSeq	alignment with HISAT2, counts with StringTie and prepDE	counts_CHIKV_01.tsv	TSV	PRJNA123456
CHIKV_02	SAMN45678902	PBMC	Homo sapiens	disease: Chikungunya fever	5 dpi	CHIKV infection	rRNA depletion + TruSeq	alignment with HISAT2, counts with StringTie and prepDE	counts_CHIKV_02.tsv	TSV	PRJNA123456

The characteristics_ch1 field in GEO is extremely flexible and powerful; it allows you to describe various biological, clinical, or technical characteristics of your sample, in addition to the disease.

Another example of how it can be more complete:

sample_title	biosample_accession	source_name	organism	characteristics_ch1	characteristics_ch1	characteristics_ch1	characteristics_ch1	time_point	treatment	protocol_ch1	data_processing	file_name	file_type
CHIKV_01_3dpi	SAMN45678901	PBMC	Homo sapiens	disease: Chikungunya fever	sex: female	age: 35	RIN: 8.5	3 dpi	CHIKV infection	rRNA depletion + TruSeq	alignment with HISAT2, counts with StringTie & prepDE	counts_CHIKV_01.tsv	TSV
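
Each characteristics_ch1 value in the examples above follows a "tag: value" pattern, which makes the column easy to check or reuse in code. The following is a small sketch that parses such values into a dictionary; the function name parse_characteristics is illustrative, and the "tag: value" layout is taken from the example rows in this guide.

```python
def parse_characteristics(values):
    """Turn ['disease: X', 'sex: female'] into {'disease': 'X', 'sex': 'female'}."""
    parsed = {}
    for value in values:
        # Split on the first colon only, so values may contain colons themselves.
        key, sep, text = value.partition(":")
        if not sep:
            raise ValueError(f"expected 'tag: value', got {value!r}")
        parsed[key.strip()] = text.strip()
    return parsed


characteristics = parse_characteristics([
    "disease: Chikungunya fever",
    "sex: female",
    "age: 35",
    "RIN: 8.5",
])
print(characteristics)
```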

Simplified submission flowchart


PRJNA123456   (BioProject)
   ├── SAMN45678901   (BioSample)
   │      ├── SRRxxxxxxx   (SRA - raw data)
   │      └── GSMxxxxxxx   (GEO - processed data)
   ├── SAMN45678902   (BioSample)
   │      ├── SRRyyyyyyy   (SRA - raw data)
   │      └── GSMyyyyyyy   (GEO - processed data)

1. Create your BioProject

Access: https://submit.ncbi.nlm.nih.gov/subs/bioproject
Fill in the study information (title, organism, type) as explained above
When submitted, it generates a code like: PRJNA123456

2. Submit your BioSamples

Go to:

When filling in each line (via form or .tsv), include the field:


bioproject_accession
PRJNA123456

Each sample receives a code like: SAMN45678901

Each BioSample must have a unique name (e.g., CHIKV_01), and this same name will be used in the SRA and GEO metadata.

3. Submission to the SRA (raw data)

Access: https://submit.ncbi.nlm.nih.gov/subs/sra
Upload the .fastq.gz files
In your .tsv file or form, include:

sample_name	biosample_accession
CHIKV_01	SAMN45678901

The SRA will use this to link your .fastq.gz file to the correct sample.

4. Submission to GEO

Go to: https://submit.ncbi.nlm.nih.gov/subs/geo
In sample_metadata.tsv, include:


BioSample         BioProject
SAMN45678901      PRJNA123456

Review and submit for review. After submission, you will receive a temporary GSE ID (e.g., GSE123456), and the NCBI team will curate it.

When everything is correctly linked, anyone (including reviewers) will be able to:

Open the BioProject → View the BioSamples → Access the data in SRA → View the processed files in GEO, as if it were a single interconnected study.
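
The same navigation can be previewed locally before submitting, by grouping the SRA and GEO tables under the BioSample accession they share. This is a minimal sketch, assuming the column names used in the example tables of this guide (biosample_accession, filename, file_name); the function name link_by_biosample is hypothetical.

```python
def link_by_biosample(sra_rows, geo_rows):
    """Map each BioSample accession to its raw and processed file names."""
    linked = {}
    for row in sra_rows:
        entry = linked.setdefault(row["biosample_accession"],
                                  {"raw": [], "processed": []})
        entry["raw"].extend(row["filename"].split(";"))
    for row in geo_rows:
        entry = linked.setdefault(row["biosample_accession"],
                                  {"raw": [], "processed": []})
        entry["processed"].append(row["file_name"])
    return linked


sra_rows = [{"biosample_accession": "SAMN45678901",
             "filename": "CHIKV_sc01_R1.fastq.gz;CHIKV_sc01_R2.fastq.gz"}]
geo_rows = [{"biosample_accession": "SAMN45678901",
             "file_name": "counts_CHIKV_01.tsv"}]

linked = link_by_biosample(sra_rows, geo_rows)
for accession, files in linked.items():
    # A sample with an empty "raw" or "processed" list is not fully linked yet.
    print(accession, files)
```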

To submit processed files (e.g., gene counts):

Go to: https://submit.ncbi.nlm.nih.gov/subs/geo/

1. Create a new submission

2. Choose: Processed Data Submission (GSE)

3. Upload:

The processed files (.tsv, .rds, etc.)
The metadata spreadsheet
The scripts or supplementary materials

4. Fill in the study description, protocol, objectives, etc.

Workflow for submission to the Single Cell Expression Atlas (SCEA)

The Single Cell Expression Atlas is a public repository of EMBL-EBI that brings together single-cell RNA-seq and spatial transcriptomics data, reprocessed with standardized pipelines and enriched with ontologies. Submission follows the MAGE-TAB standard (IDF and SDRF files) and undergoes curation before being integrated into the Atlas.

An official technical guide and additional instructions are available; what follows is a simplified version of the process. The workflow is similar to that of the NCBI:


[ ArrayExpress (input) ]
       ↓
[ ENA/SRA (raw data: FASTQ/BAM) ]
       ↓
[ Single Cell Expression Atlas (processed data and metadata) ]

ArrayExpress: submission entry point.
ENA (European Nucleotide Archive)/SRA: stores raw sequencing data.
SCEA: receives processed data, metadata, and pipelines, and integrates them into the portal.

Required Data

Raw data: FASTQ or BAM files → submitted to ENA/SRA.

Processed data: Expression matrices (genes × cells), cell metadata (clusters, cell types, QC, mitochondrial percentage), normalized files.

Scripts/pipelines: .R, .ipynb, .sh files used for analysis.

Metadata: Complete tables describing samples, cells, and experimental conditions. Similar to BioSample.

Accepted formats: .tsv, .h5ad, .loom, plus supplementary files such as .R, .ipynb.

Creating MAGE-TAB Files

SCEA uses the MAGE-TAB standard for metadata, which is mandatory and consists of two main files:

IDF (Investigation Description File):

describes the study: title, abstract, contacts, associated publications.
Example fields: Investigation Title, Experiment Description, Submitter Email.

SDRF (Sample and Data Relationship File):

detailed table relating samples, cells, files, and biological/technical characteristics.
Important fields: Sample Name, Organism, Cell Type, Library Prep, Sequencing Protocol, File Name (FASTQ, count matrix, etc.).

Simplified example of an SDRF:

Sample Name	Organism	Tissue	Cell Type	Library Prep	Sequencing Protocol	File Name
CHIKV_sc01	Homo sapiens	PBMC	lymphocyte	10x Genomics Chromium 3’ v3	Illumina NovaSeq 6000	CHIKV_sc01_R1.fastq.gz;CHIKV_sc01_R2.fastq.gz

It should also include dissociation information, capture platform (e.g., 10x Genomics, Smart-seq2), and experimental conditions.
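
The two MAGE-TAB files differ in layout: the IDF lists one field per line (field name, then value), while the SDRF is a table with one row per sample. The sketch below illustrates only that difference. The field and column names are the simplified ones used in this guide, the e-mail address and description are placeholders, and the helper names build_idf and build_sdrf are hypothetical; the official templates define the exact headers to use.

```python
def build_idf(fields):
    """IDF layout: one line per field, name and value separated by a tab."""
    return "".join(f"{tag}\t{value}\n" for tag, value in fields.items())


def build_sdrf(columns, rows):
    """SDRF layout: a header line, then one tab-separated line per sample."""
    lines = ["\t".join(columns)]
    for row in rows:
        lines.append("\t".join(row[c] for c in columns))
    return "\n".join(lines) + "\n"


idf_text = build_idf({
    "Investigation Title": "Single-cell transcriptomic atlas of PBMCs during CHIKV infection",
    "Experiment Description": "Placeholder description of the study",
    "Submitter Email": "submitter@example.org",  # placeholder address
})

sdrf_columns = ["Sample Name", "Organism", "Cell Type", "File Name"]
sdrf_text = build_sdrf(sdrf_columns, [{
    "Sample Name": "CHIKV_sc01",
    "Organism": "Homo sapiens",
    "Cell Type": "lymphocyte",
    "File Name": "CHIKV_sc01_R1.fastq.gz",
}])

print(idf_text)
print(sdrf_text)
```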

Submission to ArrayExpress

ArrayExpress is the EMBL-EBI repository used as an entry point for transcriptomic data. All submissions of single-cell RNA-seq or spatial transcriptomics go through it before being integrated into SCEA.


[ ArrayExpress ]
   ├── ENA (raw data: FASTQ/BAM)
   └── Expression Atlas / Single Cell Expression Atlas (processed data + metadata)

Required Data

Raw data: FASTQ or BAM, sent to ENA.

Processed data: expression matrices (genes × cells), cell metadata (clusters, cell types, QC), normalized files.

Scripts/pipelines: .R, .ipynb, .sh to guarantee reproducibility.

Metadata: complete tables describing samples, cells, and experimental conditions.

Creating MAGE-TAB files

Same as described above (IDF + SDRF).

Workflow

1. Create an account on EMBL-EBI.

2. Prepare the MAGE-TAB files (IDF + SDRF).

3. Send raw data to ENA.

4. Link BioProject and BioSample IDs.

5. Submit metadata and processed data to ArrayExpress.

6. Upload the .tsv, .h5ad, .loom, and .rds files.

7. Upload the scripts/pipelines.

8. Curation: The Atlas team reviews the metadata, applies ontologies, and reprocesses the data.

9. Publication: The dataset receives a public identifier (e.g., E-MTAB-12345) and is integrated into the Single Cell Expression Atlas.

Curation and Integration

The Atlas team reprocesses the data using standardized pipelines (e.g., alignment, normalization, clustering).

The metadata is harmonized with ontologies (Cell Ontology, Uberon, Disease Ontology).

The dataset receives a public identifier (E-MTAB-12345) and becomes searchable on the portal.

Workflow for submission to the Human Cell Atlas Data Portal (HCA Data Portal)

The HCA Data Portal does not function as an "open repository for any data from anyone"; there are well-defined criteria for submission. It accepts single-cell and spatial transcriptomics data (scRNA-seq, ATAC-seq, multi-omics, spatial RNA-seq), including raw data (FASTQ/BAM) and processed data (AnnData .h5ad matrices). Submissions must be accompanied by structured metadata following the official schemas (Tier 1 and Tier 2).

The data must also come from high-quality studies, with clear protocols and sufficient documentation to allow reuse.

Restrictions:

Curation: uncurated data is not accepted; datasets undergo technical review to ensure consistency and quality.
Privacy: human data must be anonymized. Sensitive information (Tier 2, such as age, sex, clinical condition) is controlled and only accessible in secure environments.
Format: only standardized formats are accepted (FASTQ, BAM, AnnData .h5ad, metadata in structured tables).
Scope: the focus is on single-cell and spatial data. Bulk RNA-seq data, for example, are not included in the HCA Data Portal.
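
The format restriction can be screened locally before starting a submission. The following is a minimal sketch: the suffix list is an assumption about how the formats named above (FASTQ, BAM, AnnData .h5ad) appear as file extensions, and it covers data files only, not the metadata tables. The names ACCEPTED_SUFFIXES and rejected_files are hypothetical.

```python
# Assumption: extensions corresponding to the formats listed in this guide.
ACCEPTED_SUFFIXES = (".fastq", ".fastq.gz", ".bam", ".h5ad")


def rejected_files(filenames):
    """Return the file names whose extension is not in the accepted list."""
    return [f for f in filenames if not f.lower().endswith(ACCEPTED_SUFFIXES)]


candidates = ["CHIKV_sc01_R1.fastq.gz", "pbmc_atlas.h5ad", "normalized_counts.rds"]
print(rejected_files(candidates))
```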

Tiers

Metadata is organized into two levels (tiers) to separate technical information from more sensitive information.

Tier 1 Metadata: technical and experimental information necessary to interpret the data.

This ensures that the data is FAIR (findable, accessible, interoperable, and reusable). This data is publicly available through the HCA Data Portal and on platforms such as CellxGene Discover.

Examples of fields:

Organism
Tissue / Organ
Cell type (with ontologies such as Cell Ontology, Uberon)
Library preparation method (10x Genomics, Smart-seq2, etc.)
Sequencing protocol (Illumina NovaSeq, etc.)
File names (FASTQ, BAM, AnnData .h5ad)

Tier 2 Metadata: Additional information that may include sensitive or clinical data

This can enrich biological interpretation while maintaining anonymization and privacy. This metadata has controlled access, where some information may be restricted or anonymized to protect donors.

Examples of fields:

Donor age
Sex
Ethnicity
Clinical condition / disease status
Treatment history
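
The separation between tiers can be applied to a metadata record in code, so that sensitive fields are never written to the public table by accident. This is a sketch only: the field names are snake_case versions of the examples listed above, chosen for illustration, and do not reproduce the official HCA schema. The names TIER2_FIELDS and split_tiers are hypothetical.

```python
# Assumption: illustrative field names based on the Tier 2 examples in this guide.
TIER2_FIELDS = {"donor_age", "sex", "ethnicity", "disease_status", "treatment_history"}


def split_tiers(record):
    """Split one metadata record into (tier1, tier2) dictionaries."""
    tier1 = {k: v for k, v in record.items() if k not in TIER2_FIELDS}
    tier2 = {k: v for k, v in record.items() if k in TIER2_FIELDS}
    return tier1, tier2


record = {
    "organism": "Homo sapiens",
    "tissue": "PBMC",
    "library_preparation": "10x Genomics",
    "donor_age": "35",
    "sex": "female",
}

tier1, tier2 = split_tiers(record)
print("Tier 1 (public):", tier1)
print("Tier 2 (controlled):", tier2)
```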

Full documentation is available on the HCA Data Portal and in the data ingestion guide:

HCA Data Portal – Contribute

HCA Metadata Schema

HCA Data Ingestion Instructions PDF

Structure

The HCA organizes data into different layers to ensure accessibility and protection of sensitive information:


[ HCA Data Coordination Platform ]
       ├── ENA/SRA (Raw data: FASTQ)
       ├── HCA Data Repository (Tier 2 metadata + sensitive data)
       └── CellxGene Discover (AnnData matrices + Tier 1 metadata)

HCA Data Repository: stores raw files (FASTQ) and Tier 2 metadata (may contain personal or sensitive information).
CellxGene Discover: stores expression matrices in AnnData format (.h5ad) and Tier 1 metadata (technical information such as capture protocol, cell enrichment, QC).
Data Coordination Platform (DCP): entry point for submission, where you register the project, submit datasets, and receive an accession ID.

Required Data

Raw data: FASTQ or BAM files to be submitted to the HCA Data Repository.

Processed data: Expression matrices (genes × cells) in AnnData format (.h5ad).

Tier 1 Metadata: Technical information

Capture platform
Dissociation method
Library protocol
QC

Tier 2 Metadata: More detailed information

Clinical data
Age
Sex
Clinical condition

Submission process

1. Register the project in the HCA Data Coordination Platform

Create a new project and provide a title, summary, and contact information.
Link the raw data (FASTQ/BAM) already submitted to ENA/SRA.

2. Prepare Tier 1 and Tier 2 metadata according to the official ingestion guide.

3. Submit raw data to the HCA Data Repository.

4. Submit processed matrices (AnnData .h5ad) to the HCA portal.

The file will be validated against the official schema (checking for required fields).
After curation, the dataset will be integrated into CellxGene Discover.

5. Receive accession ID and track curation.

6. Publication: the data is integrated into the portal, and the dataset receives a public identifier (e.g., HCA12345).

It becomes searchable and viewable on CellxGene Discover, with filters by organism, tissue, cell type, disease, etc.