Table 1.

Commonly used file formats

| Name | Common versions | Type of data | Binary or plain text | Extension | References | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| FASTQ | fastq-sanger, fastq-solexa, fastq-illumina | Sequencing reads | Plain text | .fastq, .fq | Cock et al. (2010); http://en.wikipedia.org/wiki/FASTQ_format | Illumina pipelines ≥1.8 generate fastq-sanger. |
| SAM | v1.0–v1.4 | Sequencing reads aligned to a reference genome | Plain text | .sam | Li et al. (2009); http://samtools.sourceforge.net/SAM1.pdf | New versions are backward compatible. |
| BAM | v1.0–v1.4 | Sequencing reads aligned to a reference genome | Binary | .bam | Li et al. (2009); http://samtools.sourceforge.net/SAM1.pdf | New versions are backward compatible. Compressible and indexable. |
| Variant Call Format (VCF) | 4.2, 4.1, 4.0 | Variants and genotypes across a reference genome | Plain text | .vcf | Danecek et al. (2011); http://vcftools.sourceforge.net/specs.html | Can be compressed and then indexed with Tabix. |
| FASTA | | Sequence data, including reference genomes | Plain text | .fa, .fasta, .fsa, .fas, .seq, .fna | http://en.wikipedia.org/wiki/FASTA_format | FASTA can also store non-DNA-based sequences. |
| Browser extensible data (BED) | | Genomic regions and features | Plain text | .bed | http://genome.ucsc.edu/FAQ/FAQformat.html | Zero-based start, half-open (the base at the position given in the end column is not included). |
| Generic feature format (GFF) | GFF3, GFF2, GFF1 (no formal specification) | Genomic regions and features | Plain text | .gff | http://www.sequenceontology.org/gff3.shtml | GFF3 is strongly preferred, but many tools are still only able to work with GFF2. One-based, fully closed intervals. |
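
The two coordinate conventions noted for BED (zero-based start, half-open) and GFF (one-based, fully closed) are a frequent source of off-by-one errors when converting between the formats. The following is a minimal Python sketch of the conversion, not part of the original table; the `Interval` type and the function names are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval; the coordinate convention depends on the source format."""
    chrom: str
    start: int
    end: int


def bed_to_gff(bed: Interval) -> Interval:
    """Convert zero-based, half-open (BED) to one-based, fully closed (GFF).

    Only the start shifts by one; the end value is numerically unchanged
    because the excluded BED end equals the included GFF end.
    """
    return Interval(bed.chrom, bed.start + 1, bed.end)


def gff_to_bed(gff: Interval) -> Interval:
    """Convert one-based, fully closed (GFF) to zero-based, half-open (BED)."""
    return Interval(gff.chrom, gff.start - 1, gff.end)


def bed_length(bed: Interval) -> int:
    """Number of bases covered by a BED-style interval."""
    return bed.end - bed.start


def gff_length(gff: Interval) -> int:
    """Number of bases covered by a GFF-style interval."""
    return gff.end - gff.start + 1


# The first 100 bases of a chromosome in each convention.
first_100_bed = Interval("chr1", 0, 100)
first_100_gff = bed_to_gff(first_100_bed)
```

Under these assumptions, the first 100 bases of a chromosome are written as start 0, end 100 in BED and as start 1, end 100 in GFF, and both describe an interval of length 100.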

This Article

  1. Cold Spring Harb Protoc 2015: pdb.top083667-