Table 1.
Commonly used file formats
| Name | Common versions | Type of data | Binary or plain text | Extension | References | Comments |
|---|---|---|---|---|---|---|
| FASTQ | fastq-sanger fastq-solexa fastq-illumina | Sequencing reads | Plain text | .fastq .fq | Cock et al. (2010) http://en.wikipedia.org/wiki/FASTQ_format | Illumina pipelines ≥1.8 generate fastq-sanger |
| SAM | v1.0–v1.4 | Sequencing reads aligned to a reference genome | Plain text | .sam | Li et al. (2009) http://samtools.sourceforge.net/SAM1.pdf | New versions are backward compatible |
| BAM | v1.0–v1.4 | Sequencing reads aligned to a reference genome | Binary | .bam | Li et al. (2009) http://samtools.sourceforge.net/SAM1.pdf | New versions are backward compatible. Compressed and indexable. |
| Variant Call Format (VCF) | 4.2 4.1 4.0 | Variants and genotypes across a reference genome | Plain text | .vcf | Danecek et al. (2011) http://vcftools.sourceforge.net/specs.html | Can be compressed and then indexed with Tabix |
| FASTA | | Sequence data, including reference genomes | Plain text | .fa .fasta .fsa .fas .seq .fna | http://en.wikipedia.org/wiki/FASTA_format | FASTA can also store non-DNA sequences, such as protein |
| Browser extensible data (BED) | | Genomic regions and features | Plain text | .bed | http://genome.ucsc.edu/FAQ/FAQformat.html | Zero-based start, half-open (the base at the position given in the end column is not included) |
| Generic feature format (GFF) | GFF3 GFF2 GFF1 (no formal specification) | Genomic regions and features | Plain text | .gff | http://www.sequenceontology.org/gff3.shtml | GFF3 is strongly preferred, but many tools can still only work with GFF2. One-based, fully closed intervals. |
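
The differing coordinate conventions of BED (zero-based, half-open) and GFF (one-based, fully closed) are a common source of off-by-one errors. The conversion can be sketched as below. This is a minimal illustration, not part of any format specification or tool; the function names `bed_to_gff`, `gff_to_bed`, `bed_length` and `gff_length` are hypothetical.

```python
from typing import Tuple


def bed_to_gff(start: int, end: int) -> Tuple[int, int]:
    """Convert a BED interval (zero-based, half-open) to GFF (one-based, closed).

    Only the start shifts: the excluded BED end position has the same
    number as the included one-based GFF end position.
    """
    return start + 1, end


def gff_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert a GFF interval (one-based, closed) to BED (zero-based, half-open)."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Number of bases covered by a BED interval."""
    return end - start


def gff_length(start: int, end: int) -> int:
    """Number of bases covered by a GFF interval."""
    return end - start + 1


# The first 100 bases of a sequence in each convention.
print(bed_to_gff(0, 100))   # (1, 100)
print(gff_to_bed(1, 100))   # (0, 100)
print(bed_length(0, 100), gff_length(1, 100))   # 100 100
```

Both conventions describe the same bases, so the interval length is unchanged by the conversion even though the start coordinate differs by one.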










