fasta format

fasta or fasta.gz (compressed version) is a generic text format used to store sequence data. A fasta file has at least one record, and each record consists of at least two lines:
  1. ID, starts with >
  2. Sequence, typically wrapped across multiple lines at a fixed maximum line width.
>gi|30212|emb|X56692.1| H.sapiens mRNA for C-reactive protein
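The record structure described above can be sketched in a few lines of Python. This is only a minimal illustration, not a replacement for an established parser: the function names read_fasta and open_maybe_gzip and the toy sequences are made up for this example.

```python
import gzip
import io


def open_maybe_gzip(path):
    """Open a fasta or fasta.gz file in text mode (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_fasta(handle):
    """Yield (id_line, sequence) tuples from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new ID line closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are wrapped, so they are joined back together.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input with two records; the first sequence is wrapped over two lines.
example = io.StringIO(">seq1 toy record\nACGTAC\nGTTA\n>seq2\nGGCC\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

For a real file, the same generator could be fed with open_maybe_gzip("reads.fasta.gz"), where the file name is a placeholder.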

  1. Galaxy Wiki

sam/bam format

sam, or bam (its compressed binary version), is used to store NGS alignment data. The tab-separated fields of an alignment line include:

QNAME  the sequence/read name
FLAG   bitwise flag
RNAME  the reference sequence name
POS    the position in the reference sequence (1-based indices)
MAPQ   the mapping quality
CIGAR  the CIGAR string describing the alignment
RNEXT  reference name of the mate/next read
PNEXT  position of the mate/next read
TLEN   the template length for paired-end reads
SEQ    the read sequence
QUAL   the base call qualities of the read sequence (same as in FASTQ format)
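To make the field layout more concrete, here is a minimal Python sketch that splits one alignment line into its mandatory columns and decodes the bitwise flag. It assumes the column order of the SAMv1 specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The names SamRecord, parse_sam_line and describe_flag, the short flag labels, and the example line are invented for illustration.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based position in the reference
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


# Flag bits as defined in the SAMv1 specification; the labels are shorthand.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line):
    """Parse the 11 mandatory columns of one alignment line."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def describe_flag(flag):
    """Return the labels of all bits that are set in the flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Made-up alignment line; header lines (starting with @) are not handled here.
line = "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n"
rec = parse_sam_line(line)
print(rec.qname, rec.pos, describe_flag(rec.flag))
```

Optional fields after the eleventh column are ignored in this sketch, and bam files would first need to be decoded by a dedicated library or tool.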

  1. SAMv1 specs
  2. Galaxy Wiki

fastq format

fastq or fastq.gz (compressed version) is used to store NGS data. A fastq file has at least one record, and each record consists of four lines:
  1. ID, starts with @
  2. sequence
  3. Separator marking the end of the sequence, starts with +
  4. Sequencing quality information. One ASCII encoded quality score per base.
A record’s sequence is called a read.
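The four-line record layout can be read with a short Python generator. The sketch below assumes that sequences and qualities are not wrapped over several lines, which is the common case but not guaranteed for every file. The name read_fastq and the toy reads are made up for this example.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from an open text handle."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record near " + repr(header))
        if len(seq) != len(qual):
            # One quality character is expected per base.
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


# Two toy reads; the + line may optionally repeat the ID.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+read2\n!5\n")
reads = list(read_fastq(example))
for read_id, seq, qual in reads:
    print(read_id, seq, qual)
```

For fastq.gz input, the handle could come from gzip.open(path, "rt") instead of io.StringIO.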
Quality scores can be represented using three different encodings, each of which uses a different range of ASCII characters:
Name                      ASCII character range
Sanger, Illumina >= v1.8  33-126
Solexa, Illumina < v1.3   59-126
Illumina v1.3 - v1.7      64-126
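As a small worked example of these encodings, the Python sketch below converts a quality string into numeric scores by subtracting an offset from each character code. It assumes an offset of 33 for the Sanger range and 64 for the two older ranges, and it uses the Phred relation Q = -10 * log10(p) to turn a score into an error probability. Solexa scores below v1.3 follow a slightly different formula that is not covered here. The function names are invented, and guess_offset is only a heuristic: a file of uniformly high-quality Sanger reads could be misclassified.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string into a list of integer scores."""
    return [ord(c) - offset for c in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def guess_offset(quality_strings):
    """Heuristically guess the offset from the lowest character seen."""
    lowest = min(ord(c) for q in quality_strings for c in q)
    if lowest < 33:
        raise ValueError("character below the printable quality range")
    # Characters below 59 only occur in the 33-126 (Sanger) range.
    return 33 if lowest < 59 else 64


sanger = phred_scores("!5I")         # offset 33
old_illumina = phred_scores("@h", offset=64)
print(sanger, old_illumina, error_probability(20))
```

In practice the encoding is better taken from the sequencing platform's documentation than guessed from the data.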

  1. Wikipedia
  2. Bioinformatics Data Skills
  3. Galaxy Wiki