FAIR data and sharing data

In this notebook, learners practice structuring and documenting metadata using consistent standards and assemble submission-ready packages for public repositories and data portals. The notebook emphasizes common file formats, required metadata fields and practical submission workflows, helping ensure that datasets remain discoverable, interoperable and readily reusable across platforms and studies.


Transcriptomic Data Submission Guide


This document provides guidance on how to prepare metadata and submit transcriptomic data to the NCBI (National Center for Biotechnology Information), the SCEA (Single Cell Expression Atlas), and the HCA Data Portal (Human Cell Atlas Data Portal), following the FAIR (Findable, Accessible, Interoperable, Reusable) principles. It applies to single-cell, bulk RNA-seq, and spatial transcriptomics.


Metadata in transcriptomic data


Metadata is data that describes the data. In other words, it is descriptive information that accompanies the raw and processed data, allowing them to be understood, contextualized, and reused. It can take the form of tables with information about individuals or patients, a supplementary file with details of sample and file processing, etc.

In transcriptomic studies, whether bulk RNA-seq, single-cell, or spatial, metadata must capture both biological and technical aspects. This includes:

  • Biological information: organism, tissue, cell type, experimental condition, treatment, collection time, clinical or environmental characteristics.
  • Technical information: sequencing platform, library protocol, capture parameters (e.g., 10x Genomics, Smart-seq2, Visium), RNA quality, software and versions used.
  • Experimental context: study design, comparison groups, replicates, batch factors, controls.
  • Processed data: output files such as count matrices, normalized files, spatial coordinates; as well as scripts and pipelines used.

Well-structured metadata is essential to ensure that other researchers can correctly interpret the data, validate results, and integrate them into comparative analyses or reference databases.
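As a minimal sketch of how these categories can be checked before submission, the Python snippet below groups a few of the fields listed above and reports which ones are missing from a sample record. The field names, the grouping, and the helper missing_fields are illustrative assumptions, not a schema required by any repository.

```python
# Hypothetical field groups based on the categories listed above;
# real repositories define their own required fields.
REQUIRED_FIELDS = {
    "biological": ["organism", "tissue", "cell_type", "treatment"],
    "technical": ["sequencing_platform", "library_protocol"],
    "experimental": ["study_design", "replicate", "batch"],
}


def missing_fields(record):
    """Return {category: [fields that are absent or empty]} for one sample."""
    report = {}
    for category, fields in REQUIRED_FIELDS.items():
        absent = [f for f in fields if not str(record.get(f, "")).strip()]
        if absent:
            report[category] = absent
    return report


sample = {
    "organism": "Homo sapiens",
    "tissue": "PBMC",
    "cell_type": "lymphocyte",
    "treatment": "CHIKV infection",
    "sequencing_platform": "Illumina NovaSeq 6000",
    "library_protocol": "",  # left empty on purpose
    "study_design": "infected vs. control",
    "replicate": "1",
}

print(missing_fields(sample))
```

An empty result means every field in the assumed lists is filled in; it says nothing about whether the values themselves are valid.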


FAIR Metadata


For metadata to fully fulfill its role, it must exhibit a series of characteristics that guarantee its usefulness and quality. In this sense, the FAIR principles offer a set of best practices that help organize and make metadata more effective.

  • Findable: use of persistent identifiers. Articles use DOIs; data and metadata use accessions such as those issued by BioProject, BioSample, SRA, and GEO, together with standardized keywords that facilitate searching.
  • Accessible: availability in public repositories, in open and machine-readable formats, ensuring broad and transparent access.
  • Interoperable: adoption of recognized vocabularies and ontologies (e.g., Cell Ontology, Uberon, Disease Ontology), allowing integration between different databases and consistent data recognition.
  • Reusable: complete metadata, with clear protocols and sufficient documentation to allow reanalysis and comparison, while keeping individuals coded and anonymized.

Although not mandatory, following these principles is highly recommended to increase the visibility, accessibility, and reuse of data.


File formats for submission


Before understanding the NCBI submission workflow, it's important to know the most common file formats. Each extension carries a type of information and has specific uses in RNA-seq analyses (bulk, single-cell, or spatial).


File .csv (Comma-Separated Values)


A file where fields are separated by commas. It is widely accepted by spreadsheet and analysis software and is frequently used for count files.

Caution: commas inside text fields can cause parsing conflicts unless the fields are quoted.

  • Excel: File > Save As > csv
  • Google Sheets: File > Download > csv

File .tsv (Tab-Separated Values)

A text file where each column is separated by tabs (TAB) and each line represents an entry (e.g., sample or gene).

Very commonly used for metadata and count matrices, as it avoids conflicts with commas in the text.


.csv | .tsv
Uses commas between fields | Uses tabulation (TAB)
May cause conflicts with commas in the text | Safer for textual metadata
Extension: .csv | Extension: .tsv

How to save:

  • Excel: File > Save As > Text (tab delimited) (.txt) → rename to .tsv (if necessary)
  • Google Sheets: File > Download > Tab-separated values (.tsv)
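The difference between the two formats can be illustrated with Python's standard csv module. The sketch below is only an illustration with made-up sample rows: it serializes the same rows as CSV and as TSV and shows why a free-text field containing a comma is safer in a tab-separated file.

```python
import csv
import io

rows = [
    {"sample_name": "CHIKV_sc01", "description": "PBMCs, 3 days post infection"},
    {"sample_name": "CHIKV_sc02", "description": "PBMCs, 5 days post infection"},
]
fields = ["sample_name", "description"]


def to_text(rows, delimiter):
    """Serialize rows to delimited text in memory."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter=delimiter,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_text(rows, ",")
tsv_text = to_text(rows, "\t")

# Naive splitting on the delimiter breaks the CSV row but not the TSV row.
print(len(csv_text.splitlines()[1].split(",")))   # more pieces than columns
print(len(tsv_text.splitlines()[1].split("\t")))  # one piece per column

# A proper CSV parser honors the quotes and recovers both columns.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[0]["description"])
```

The CSV is still valid because the writer quotes the field, but any tool that splits on commas without handling quotes will misread it. This is the conflict the table above refers to.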

File .txt (Plain Text)

A plain text file, typically used for sample counts or basic metadata. It can be structured as a table (with tabs or spaces) or as a list.


File .rds (R Data Serialization)

It is a binary format specific to R. It allows saving complex objects (such as normalized arrays, Seurat objects, or SingleCellExperiments) while maintaining structure and metadata. Ideal for direct reuse in R analyses.


File .R (R Script)

A code file written in R (an R script). Used to share analysis pipelines, including normalization, DESeq2, Seurat, etc.


File .sh (Shell Script)

A shell script (bash/shell) that runs in the terminal. Used to automate processing steps, such as alignment, format conversion, or job submission to servers.


File .ipynb (Jupyter Notebook)

Interactive file that combines code (Python, R, etc.), results, and documentation. Very useful for reproducibility, as it shows the analysis step by step. Accepted by GEO as a way to share complete pipelines.


NCBI Submission Workflow


[ BioProject ]
     ↓
[ BioSample ] → [ SRA (raw files: FASTQ) ]
     ↓
[ GEO (Processed data: matrices, metadata, scripts) ]
    

The NCBI uses a hierarchical structure that connects different levels of information:

  • BioProject: is the highest level, representing the study as a whole. Each project receives a unique identifier (e.g., PRJNA123456) and groups all related samples and experiments.
  • BioSample: describes each individual biological sample within the project. This includes information such as organism, tissue, cell type, experimental condition, treatment, and collection site. Each BioSample receives its own code (e.g., SAMN45678901).
  • SRA (Sequence Read Archive): stores raw sequencing data, such as FASTQ or BAM files. Each submission generates an identifier SRRxxxxxxx.
  • GEO (Gene Expression Omnibus): stores processed data, such as count matrices, normalized files, and expression metadata. Each submission receives an identifier GSExxxxxx.
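Because each level has its own accession prefix, a quick format check can catch values pasted into the wrong column. The sketch below assumes simplified patterns (the NCBI prefix followed by digits) and a hypothetical helper classify_accession. It does not query NCBI and cannot tell whether an accession actually exists.

```python
import re

# Simplified patterns for NCBI-issued accessions (assumption: prefix + digits).
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJNA\d+"),
    "BioSample": re.compile(r"SAMN\d+"),
    "SRA run": re.compile(r"SRR\d+"),
    "GEO series": re.compile(r"GSE\d+"),
}


def classify_accession(accession):
    """Return the level an accession belongs to, or None if unrecognized."""
    for level, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession.strip()):
            return level
    return None


for acc in ["PRJNA123456", "SAMN45678901", "GSE123456", "SRRxxxxxxx"]:
    print(acc, "->", classify_accession(acc))
```

Placeholder values such as SRRxxxxxxx are reported as unrecognized, which makes it easy to spot template text that was never replaced.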

Creating the BioProject

BioProject groups all samples and data from a study. Create only one BioProject per study.


You need to create an NCBI login to do this.


Include title, description, organism, and data type.


Example:

project_title: Single-cell transcriptomic atlas of PBMCs during CHIKV infection

data_type: Transcriptome (bulk + single-cell + spatial)


Creating the BioSample


Each biological sample receives a unique ID. For transcriptomic samples, describe:


sample_name, organism, tissue, cell_type, disease, treatment, time_point, geo_loc_name


Additional fields: sequencing_protocol, dissociation_method, library_prep, cell_capture_platform (e.g., 10x Genomics Chromium, Smart-seq2)

sample_name | organism | tissue | cell_capture_platform | library_prep | disease | time_point | bioproject_accession
CHIKV_sc01 | Homo sapiens | PBMC | 10x Genomics Chromium | 10x 3’ v3 | Chikungunya fever | 3 dpi | PRJNA123456

Create BioSamples linked to your BioProject through the same portal, https://submit.ncbi.nlm.nih.gov/subs/bioproject, or prepare a batch upload with .tsv files.


Preferably use more complete tables, such as:

sample_name | organism | tissue | cell_type | cell_capture_platform | library_prep | sequencing_protocol | dissociation_method | disease | time_point | geo_loc_name | age | sex | bioproject_accession | description
CHIKV_sc01 | Homo sapiens | PBMC | Lymphocytes (mixed) | 10x Genomics Chromium | 10x 3’ v3 | Illumina NovaSeq 6000, paired-end 2×75 bp | Ficoll gradient + RBC lysis | Chikungunya fever | 3 dpi | Brazil: Bahia | 35 | F | PRJNA123456 | PBMCs isolated 3 days post CHIKV infection, processed with 10x Genomics Chromium 3’ v3

This table can be saved as biosample_metadata.tsv
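One possible way to produce this file from a script rather than a spreadsheet is sketched below with Python's csv module. The column list is a subset of the table above, chosen for brevity, and write_tsv is a hypothetical helper. The example writes to the system temporary directory only so that it can be run anywhere.

```python
import csv
import tempfile
from pathlib import Path

# Subset of the BioSample columns shown above (chosen for brevity).
COLUMNS = [
    "sample_name", "organism", "tissue", "cell_type",
    "cell_capture_platform", "library_prep", "disease",
    "time_point", "geo_loc_name", "bioproject_accession",
]

samples = [
    {
        "sample_name": "CHIKV_sc01", "organism": "Homo sapiens",
        "tissue": "PBMC", "cell_type": "Lymphocytes (mixed)",
        "cell_capture_platform": "10x Genomics Chromium",
        "library_prep": "10x 3' v3", "disease": "Chikungunya fever",
        "time_point": "3 dpi", "geo_loc_name": "Brazil: Bahia",
        "bioproject_accession": "PRJNA123456",
    },
]


def write_tsv(rows, path, columns=COLUMNS):
    """Write rows to a tab-separated file with a fixed column order."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


# Written to the system temp directory here; use your project folder in practice.
out_path = Path(tempfile.gettempdir()) / "biosample_metadata.tsv"
write_tsv(samples, out_path)
print(out_path.read_text(encoding="utf-8"))
```

Fixing the column order in one place keeps every export consistent, and DictWriter raises an error if a row contains a key that is not in the column list.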


After submission, the system returns an accession for each sample, such as:


SAMN45678901
SAMN45678902
    
All metadata, whether from BioSample, SRA, or GEO, must include the BioProject code to which they belong.

This ensures proper linking between the different levels of submission and allows the data to be navigated in an integrated manner. The BioProject serves as the primary identifier of the study, and without it, there is no way to relate samples, raw data, and processed data.
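The linking rule can be checked before upload. The sketch below assumes that the BioSample, SRA, and GEO tables have already been loaded as lists of dictionaries and that the BioProject column is named bioproject_accession in all of them (rename it first if a table uses another header). The function check_links is a hypothetical helper, not part of any NCBI tool.

```python
def check_links(biosamples, sra_rows, geo_rows):
    """Return a list of human-readable linking problems (empty if none)."""
    known = {b["biosample_accession"]: b["bioproject_accession"] for b in biosamples}
    problems = []
    for table_name, rows in (("SRA", sra_rows), ("GEO", geo_rows)):
        for row in rows:
            sample = row.get("biosample_accession", "")
            project = row.get("bioproject_accession", "")
            if sample not in known:
                problems.append(f"{table_name}: unknown BioSample {sample!r}")
            elif project != known[sample]:
                problems.append(
                    f"{table_name}: {sample} lists BioProject {project!r}, "
                    f"expected {known[sample]!r}"
                )
    return problems


biosamples = [
    {"biosample_accession": "SAMN45678901", "bioproject_accession": "PRJNA123456"},
    {"biosample_accession": "SAMN45678902", "bioproject_accession": "PRJNA123456"},
]
sra_rows = [
    {"biosample_accession": "SAMN45678901", "bioproject_accession": "PRJNA123456"},
]
geo_rows = [
    {"biosample_accession": "SAMN45678902", "bioproject_accession": ""},
    {"biosample_accession": "SAMN00000000", "bioproject_accession": "PRJNA123456"},
]

for problem in check_links(biosamples, sra_rows, geo_rows):
    print(problem)
```

In this made-up example the SRA table passes, while the GEO table has one row with a missing BioProject and one row pointing to a BioSample that was never registered.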


SRA Metadata


The metadata table needs to describe the raw sequencing files and how they were generated. In addition to the basic fields, it is essential to include the BioProject and BioSample accessions to ensure correct linking.

  • Bulk RNA-seq: paired or single-end .fastq.gz files or aligned .bam files.
  • Single-cell RNA-seq: .fastq.gz files per library/cell, with platform metadata.
  • Spatial transcriptomics: .fastq.gz per capture/library, accompanied by platform‑specific metadata (e.g., spot or coordinate matrices).

sample_name | biosample_accession | bioproject_accession | library_ID | title | time_point | library_strategy | library_source | library_selection | library_layout | platform | instrument_model | insert_size | filetype | filename | design_description | library_construction_protocol
CHIKV_sc01 | SAMN45678901 | PRJNA123456 | LIB01 | scRNA-seq of PBMCs | 3 dpi | scRNA-Seq | TRANSCRIPTOMIC | RANDOM | PAIRED | ILLUMINA | NovaSeq 6000 | 280 | fastq | CHIKV_sc01_R1.fastq.gz;CHIKV_sc01_R2.fastq.gz | Single-cell RNA-seq of PBMCs infected with CHIKV, 3 days post infection | 10x Genomics Chromium 3’ v3 kit
CHIKV_sc02 | SAMN45678902 | PRJNA123456 | LIB02 | scRNA-seq of PBMCs | 5 dpi | scRNA-Seq | TRANSCRIPTOMIC | RANDOM | PAIRED | ILLUMINA | NovaSeq 6000 | 280 | fastq | CHIKV_sc02_R1.fastq.gz;CHIKV_sc02_R2.fastq.gz | Single-cell RNA-seq of PBMCs infected with CHIKV, 5 days post infection | 10x Genomics Chromium 3’ v3 kit

Attention: the names of the uploaded .fastq.gz files must match the entries in the filename field exactly.

Add other fields if you want more details (e.g., basecaller, alignment software). Save as sra_metadata.tsv
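A small consistency check on the filename column can catch common mistakes before upload. The sketch below makes three assumptions drawn from the example table rather than from SRA rules: multiple files are separated by semicolons, a PAIRED layout lists exactly two files, and file names start with the sample name. The function check_sra_row is a hypothetical helper.

```python
def check_sra_row(row):
    """Return a list of problems found in one row of the SRA metadata table."""
    problems = []
    files = [name for name in row["filename"].split(";") if name]
    # Assumption: paired-end libraries list two files, single-end one.
    expected = 2 if row["library_layout"].upper() == "PAIRED" else 1
    if len(files) != expected:
        problems.append(f"expected {expected} file(s), found {len(files)}")
    for name in files:
        if not name.startswith(row["sample_name"]):
            problems.append(f"{name} does not start with {row['sample_name']}")
        if not name.endswith((".fastq.gz", ".bam")):
            problems.append(f"{name} has an unexpected extension")
    return problems


good_row = {
    "sample_name": "CHIKV_sc01",
    "library_layout": "PAIRED",
    "filename": "CHIKV_sc01_R1.fastq.gz;CHIKV_sc01_R2.fastq.gz",
}
bad_row = {
    "sample_name": "CHIKV_sc02",
    "library_layout": "PAIRED",
    "filename": "CHIKV_sc01_R1.fastq.gz",
}

print(check_sra_row(good_row))
print(check_sra_row(bad_row))
```

The second row is flagged twice: one file is missing, and the remaining file belongs to a different sample, a typical copy-and-paste error.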


GEO Metadata


GEO is a public repository from the NCBI focused on processed gene expression data, including:

File Type | Extension | Example
Count Matrix | .tsv, .csv | counts_matrix.tsv
Sample Counts | .tsv, .txt | counts_CHIKV_01.tsv
Normalized Files | .tsv, .rds | normalized_counts.rds
Scripts or pipelines | .R, .sh, .ipynb | deseq2_analysis.R
Sample Metadata | .tsv | geo_sample_metadata.tsv

You can also include experimental flow diagrams, batch factors, and even RIN and RNA concentration.


title | biosample_accession | source_name | organism | treatment | time_point | file_type | file_name | BioProject
Expression of PBMCs CHIKV 3dpi | SAMN45678901 | PBMC | Homo sapiens | CHIKV | 3dpi | Counts | counts_CHIKV_01.tsv | PRJNA123456

Prepare based on GEO submission templates: https://www.ncbi.nlm.nih.gov/geo/info/submission.html?form=MG0AV3


A more complete example

sample_title | biosample_accession | source_name | organism | characteristics_ch1 | time_point | treatment | protocol_ch1 | data_processing | file_name | file_type | BioProject
CHIKV_01 | SAMN45678901 | PBMC | Homo sapiens | disease: Chikungunya fever | 3 dpi | CHIKV infection | rRNA depletion + TruSeq | alignment with HISAT2, counts with StringTie and prepDE | counts_CHIKV_01.tsv | TSV | PRJNA123456
CHIKV_02 | SAMN45678902 | PBMC | Homo sapiens | disease: Chikungunya fever | 5 dpi | CHIKV infection | rRNA depletion + TruSeq | alignment with HISAT2, counts with StringTie and prepDE | counts_CHIKV_02.tsv | TSV | PRJNA123456

The characteristics_ch1 field in GEO is extremely flexible and powerful; it allows you to describe various biological, clinical, or technical characteristics of your sample, in addition to the disease.


Another example of how it can be more complete:

sample_title | biosample_accession | source_name | organism | characteristics_ch1 | characteristics_ch1 | characteristics_ch1 | characteristics_ch1 | time_point | treatment | protocol_ch1 | data_processing | file_name | file_type
CHIKV_01_3dpi | SAMN45678901 | PBMC | Homo sapiens | disease: Chikungunya fever | sex: female | age: 35 | RIN: 8.5 | 3 dpi | CHIKV infection | rRNA depletion + TruSeq | alignment with HISAT2, counts with StringTie & prepDE | counts_CHIKV_01.tsv | TSV
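The repeated characteristics_ch1 columns follow a simple "key: value" pattern, which makes them easy to generate from a dictionary and to read back. The sketch below is illustrative only. Both helper functions are hypothetical and assume that keys do not contain a colon.

```python
def build_characteristics(attributes):
    """Turn a dict of attributes into 'key: value' cells, one per column."""
    return [f"{key}: {value}" for key, value in attributes.items()]


def parse_characteristics(cells):
    """Inverse of build_characteristics: split each cell at the first colon."""
    parsed = {}
    for cell in cells:
        key, _, value = cell.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


cells = build_characteristics(
    {"disease": "Chikungunya fever", "sex": "female", "age": 35, "RIN": 8.5}
)
print(cells)
print(parse_characteristics(cells))
```

Values come back as text after parsing, so numeric fields such as age or RIN need to be converted again if they are used in calculations.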

Simplified submission flowchart


PRJNA123456   (BioProject)
   ├── SAMN45678901   (BioSample)
   │      ├── SRRxxxxxxx   (SRA - raw data)
   │      └── GSMxxxxxxx   (GEO - processed data)
   ├── SAMN45678902   (BioSample)
   │      ├── SRRyyyyyyy   (SRA - raw data)
   │      └── GSMyyyyyyy   (GEO - processed data)
    

1. Create your BioProject

2. Submit your BioSamples

  • Go to:

When filling in each line (via form or .tsv), include the field:


bioproject_accession
PRJNA123456
    

Each sample receives a code like: SAMN45678901


Each BioSample must have a unique name (e.g., CHIKV_01), and this same name will be used in the SRA and GEO metadata.

3. Submission to the SRA (raw data)


sample_name biosample_accession
CHIKV_01 SAMN45678901

The SRA will use this to link your .fastq.gz file to the correct sample.


4. Submission to GEO



BioSample         BioProject
SAMN45678901      PRJNA123456
    

Check the entries and submit them for review. After submission, you will receive a temporary GSE ID (e.g., GSE123456), and the NCBI team will curate it.


When everything is correctly linked, anyone (including a reviewer!) will be able to:


Open the BioProject → View the BioSamples → Access the data in SRA → View the processed files in GEO, as if it were a single interconnected study.


To submit processed files (e.g., gene counts):


Go to: https://submit.ncbi.nlm.nih.gov/subs/geo/

1. Create a new submission

2. Choose: Processed Data Submission (GSE)

3. Upload:

  • The processed files (.tsv, .rds, etc.)
  • The metadata spreadsheet
  • The scripts or supplementary materials

4. Fill in the study description, protocol, objectives, etc.


Workflow for submission to the Single Cell Expression Atlas (SCEA)


The Single Cell Expression Atlas is a public repository of EMBL-EBI that brings together single-cell RNA-seq and spatial transcriptomics data, reprocessed with standardized pipelines and enriched with ontologies. Submission follows the MAGE-TAB standard (IDF and SDRF files) and undergoes curation before being integrated into the Atlas.


An official technical guide and additional instructions are available; here is a simplified version of the process. The workflow is similar to that of NCBI:


[ ArrayExpress (input) ]
       ↓
[ ENA/SRA (raw data: FASTQ/BAM) ]
       ↓
[ Single Cell Expression Atlas (processed data and metadata) ]
    
  • ArrayExpress: submission entry point.
  • ENA (European Nucleotide Archive)/SRA: stores raw sequencing data.
  • SCEA: receives processed data, metadata, and pipelines, and integrates them into the portal.

Required Data

Raw data: FASTQ or BAM files → submitted to ENA/SRA.


Processed data: Expression matrices (genes × cells), cell metadata (clusters, cell types, QC, mitochondrial percentage), normalized files.


Scripts/pipelines: .R, .ipynb, .sh files used for analysis.


Metadata: Complete tables describing samples, cells, and experimental conditions. Similar to BioSample.


Accepted formats: .tsv, .h5ad, .loom, plus supplementary files such as .R, .ipynb.


Creating MAGE-TAB Files


SCEA uses the MAGE-TAB standard for metadata, which is mandatory and consists of two main files:

IDF (Investigation Description File):

  • describes the study: title, abstract, contacts, associated publications.
  • Example fields: Investigation Title, Experiment Description, Submitter Email.

SDRF (Sample and Data Relationship File):

  • detailed table relating samples, cells, files, and biological/technical characteristics.
  • Important fields: Sample Name, Organism, Cell Type, Library Prep, Sequencing Protocol, File Name (FASTQ, count matrix, etc.).

Simplified example of an SDRF:

Sample Name | Organism | Tissue | Cell Type | Library Prep | Sequencing Protocol | File Name
CHIKV_sc01 | Homo sapiens | PBMC | lymphocyte | 10x Genomics Chromium 3’ v3 | Illumina NovaSeq 6000 | CHIKV_sc01_R1.fastq.gz CHIKV_sc01_R2.fastq.gz

It should also include dissociation information, capture platform (e.g., 10x Genomics, Smart-seq2), and experimental conditions.
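As a rough sketch of how an SDRF-like table could be assembled, the snippet below repeats the sample annotation once per FASTQ file and writes the result as tab-separated text. It reuses the simplified column headers from the example above. Official MAGE-TAB templates use more specific header names, and listing one row per file is an assumption made here for illustration.

```python
import csv
import io

# Simplified headers copied from the example table, not the official template.
SDRF_COLUMNS = ["Sample Name", "Organism", "Tissue", "Cell Type",
                "Library Prep", "Sequencing Protocol", "File Name"]

sample = {
    "Sample Name": "CHIKV_sc01", "Organism": "Homo sapiens",
    "Tissue": "PBMC", "Cell Type": "lymphocyte",
    "Library Prep": "10x Genomics Chromium 3' v3",
    "Sequencing Protocol": "Illumina NovaSeq 6000",
}
fastq_files = ["CHIKV_sc01_R1.fastq.gz", "CHIKV_sc01_R2.fastq.gz"]


def sdrf_rows(sample, files):
    """Repeat the sample annotation once per data file."""
    return [{**sample, "File Name": name} for name in files]


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=SDRF_COLUMNS, delimiter="\t",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(sdrf_rows(sample, fastq_files))
sdrf_text = buffer.getvalue()
print(sdrf_text)
```

Generating the rows from a single sample record avoids the annotation drifting between the R1 and R2 lines when the table is edited by hand.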


Submission to ArrayExpress


ArrayExpress is the EMBL-EBI repository used as an entry point for transcriptomic data. All submissions of single-cell RNA-seq or spatial transcriptomics go through it before being integrated into SCEA.


[ ArrayExpress ]
   ├── ENA (raw data: FASTQ/BAM)
   └── Expression Atlas / Single Cell Expression Atlas (processed data + metadata)
    

Required Data

Raw data: FASTQ or BAM, sent to ENA.


Processed data: expression matrices (genes × cells), cell metadata (clusters, cell types, QC), normalized files.


Scripts/pipelines: .R, .ipynb, .sh to guarantee reproducibility.


Metadata: complete tables describing samples, cells, and experimental conditions.


Creating MAGE-TAB files


The same as described above.


Workflow

1. Create an account on EMBL-EBI.

2. Prepare the MAGE-TAB files (IDF + SDRF).

3. Send raw data to ENA.

4. Link BioProject and BioSample IDs.

5. Submit metadata and processed data to ArrayExpress.

6. Upload the .tsv, .h5ad, .loom, and .rds files.

7. Upload the scripts/pipelines.

8. Curation: The Atlas team reviews the metadata, applies ontologies, and reprocesses the data.

9. Publication: The dataset receives a public identifier (e.g., E-MTAB-12345) and is integrated into the Single Cell Expression Atlas.


Curation and Integration


The Atlas team reprocesses the data using standardized pipelines (e.g., alignment, normalization, clustering).


The metadata is harmonized with ontologies (Cell Ontology, Uberon, Disease Ontology).


The dataset receives a public identifier (E-MTAB-12345) and becomes searchable on the portal.


Workflow for submission to the Human Cell Atlas Data Portal (HCA Data Portal)


The HCA Data Portal does not function as an "open repository for any data from anyone"; there are well-defined criteria for submission. It accepts single-cell and spatial transcriptomics data (scRNA-seq, ATAC-seq, multi-omics, spatial RNA-seq), including raw data (FASTQ/BAM) and processed data (AnnData .h5ad matrices). Submissions must be accompanied by structured metadata following the official schemas (Tier 1 and Tier 2).


The data must also come from high-quality studies, with clear protocols and sufficient documentation to allow reuse.




Restrictions:


  • Curation: uncurated data are not accepted; datasets undergo technical review to ensure consistency and quality.
  • Privacy: human data must be anonymized. Sensitive information (Tier 2, such as age, sex, clinical condition) is controlled and only accessible in secure environments.
  • Format: only standardized formats are accepted (FASTQ, BAM, AnnData .h5ad, metadata in structured tables).
  • Scope: the focus is on single-cell and spatial data. Bulk RNA-seq data, for example, are not included in the HCA Data Portal.

Tiers

Metadata is organized into two levels (tiers) to separate technical information from more sensitive information.


Tier 1 Metadata: technical and experimental information necessary to interpret the data.

This ensures that the data is FAIR (findable, accessible, interoperable, and reusable). This data is publicly available through the HCA Data Portal and on platforms such as CellxGene Discover.

Examples of fields:

  • Organism
  • Tissue / Organ
  • Cell type (with ontologies such as Cell Ontology, Uberon)
  • Library preparation method (10x Genomics, Smart-seq2, etc.)
  • Sequencing protocol (Illumina NovaSeq, etc.)
  • File names (FASTQ, BAM, AnnData .h5ad)

Tier 2 Metadata: Additional information that may include sensitive or clinical data

This can enrich biological interpretation while maintaining anonymization and privacy. This metadata has controlled access, where some information may be restricted or anonymized to protect donors.

Examples of fields:

  • Donor age
  • Sex
  • Ethnicity
  • Clinical condition / disease status
  • Treatment history
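To make the separation concrete, the sketch below splits one metadata record into a public Tier 1 part and a controlled Tier 2 part. The field names and their tier assignment are assumptions taken from the example lists above; the official HCA schemas define the authoritative assignment. The function split_tiers is a hypothetical helper.

```python
# Assumed field lists, taken from the examples above; the official HCA
# schemas define the authoritative tier assignment.
TIER1_FIELDS = {"organism", "tissue", "cell_type", "library_preparation_method",
                "sequencing_protocol", "file_name"}
TIER2_FIELDS = {"donor_age", "sex", "ethnicity", "disease_status",
                "treatment_history"}


def split_tiers(record):
    """Split one record into (tier1, tier2, unassigned) dictionaries."""
    tier1 = {k: v for k, v in record.items() if k in TIER1_FIELDS}
    tier2 = {k: v for k, v in record.items() if k in TIER2_FIELDS}
    unassigned = {k: v for k, v in record.items()
                  if k not in TIER1_FIELDS | TIER2_FIELDS}
    return tier1, tier2, unassigned


record = {
    "organism": "Homo sapiens",
    "tissue": "PBMC",
    "cell_type": "lymphocyte",
    "donor_age": 35,
    "sex": "female",
    "batch": "B1",  # not in either list, so it is reported as unassigned
}

tier1, tier2, unassigned = split_tiers(record)
print("Tier 1:", tier1)
print("Tier 2:", tier2)
print("Unassigned:", unassigned)
```

Returning the unassigned fields explicitly, instead of dropping them silently, forces a decision about where each remaining field belongs before anything is made public.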

Full documentation is available on the HCA Data Portal and in the data ingestion guide:


HCA Data Portal – Contribute

HCA Metadata Schema

HCA Data Ingestion Instructions PDF


Structure

The HCA organizes data into different layers to ensure accessibility and protection of sensitive information:


[ HCA Data Coordination Platform ]
       ├── ENA/SRA (Raw data: FASTQ)
       ├── HCA Data Repository (Tier 2 metadata + sensitive data)
       └── CellxGene Discover (AnnData matrices + Tier 1 metadata)
    
  • HCA Data Repository: stores raw files (FASTQ) and Tier 2 metadata (may contain personal or sensitive information).
  • CellxGene Discover: stores expression matrices in AnnData format (.h5ad) and Tier 1 metadata (technical information such as capture protocol, cell enrichment, QC).
  • Data Coordination Platform (DCP): entry point for submission, where you register the project, submit datasets, and receive an accession ID.

Required Data


Raw data: FASTQ or BAM files to be submitted to the HCA Data Repository.

Processed data: Expression matrices (genes × cells) in AnnData format (.h5ad).

Tier 1 Metadata: Technical information

  • Capture platform
  • Dissociation method
  • Library protocol
  • QC

Tier 2 Metadata: More detailed information

  • Clinical data
  • Age
  • Sex
  • Clinical condition

Submission process


1. Register the project in the HCA Data Coordination Platform

  • Create a new project and provide a title, summary, and contact information.
  • Link the raw data (FASTQ/BAM) already submitted to ENA/SRA.

2. Prepare Tier 1 and Tier 2 metadata according to the official ingestion guide.


3. Submit raw data to the HCA Data Repository.


4. Submit processed matrices (AnnData .h5ad) to the HCA portal.

  • The file will be validated against the official schema (checking for required fields).
  • After curation, the dataset will be integrated into CellxGene Discover.

5. Receive accession ID and track curation.


6. Publication: the data is integrated into the portal, and the dataset receives a public identifier (e.g., HCA12345).

  • It becomes searchable and viewable on CellxGene Discover, with filters by organism, tissue, cell type, disease, etc.