gene expression analysis using TopHat and Cufflinks

  1. Install latest Bowtie2. See here.
    $ bowtie2 --version
    /opt/bi/bowtie2-2.2.6/bowtie2-align-s version 2.2.6
    64-bit
    Built on localhost.localdomain
    Wed Jul 22 16:18:32 EDT 2015
    Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)
    Options: -O3 -m64 -msse2  -funroll-loops -g3 -DPOPCNT_CAPABILITY
    Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
    
  2. Install latest TopHat. See here.
    $ tophat2 -v
    TopHat v2.1.0
    
  3. Install latest Cufflinks. See here.
    $ cufflinks
    cufflinks v2.2.1
    linked against Boost version 104700
    
  4. Prepare the reference genome. See here.
  5. Check your data.
  6. Build the transcriptome index. See here.
  7. Protect your reference data.
  8. Prepare your working directory.
  9. Download data (lung).
  10. Download data (stomach).
  11. Check your data.
  12. Read alignment with TopHat.
    Map the reads for each sample to the reference genome:
  13. Quantification with Cuffquant.
    Compute the gene expression profiles which are used subsequently by Cuffdiff:
  14. Optional:
    1. Delete data.
    2. Check your data.
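Steps 12 and 13 above can be sketched as a dry run that prints one tophat2 and one cuffquant command per sample. This is only an illustration: the annotation file, transcriptome index prefix, input file names and output directories are hypothetical placeholders, and the thread count is an assumption. Only the Bowtie 2 index prefix reference/human/hg19 is taken from the alignment example in these notes.

```shell
# Dry-run sketch: commands are printed, not executed, unless DRY_RUN=0 is set.
# All paths except the hg19 index prefix are hypothetical placeholders.
threads=8
genome_index=reference/human/hg19
annotation=reference/human/genes.gtf
transcriptome_index=reference/human/transcriptome/genes

run() {
  # Print the command line; execute it too when DRY_RUN=0.
  echo "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

process_samples() {
  for sample in "$@"; do
    # Step 12: map the paired-end reads of this sample to the reference genome.
    run tophat2 -p "$threads" \
      --transcriptome-index="$transcriptome_index" \
      -o "tophat_$sample" \
      "$genome_index" "data/${sample}_1.fastq.gz" "data/${sample}_2.fastq.gz"
    # Step 13: quantify expression from the alignments written by TopHat.
    run cuffquant -p "$threads" -o "cuffquant_$sample" \
      "$annotation" "tophat_$sample/accepted_hits.bam"
  done
}

process_samples lung stomach
```

Printing the commands first makes it easy to check paths before committing to a run that can take hours. Setting DRY_RUN=0 executes the same commands.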

fastq format

fastq or fastq.gz (the compressed version) is used to store NGS data. A fastq file has at least one record, and each record consists of four lines:
  1. ID, starts with @
  2. sequence
  3. Separator marking the end of the sequence, starts with + (optionally followed by the ID again)
  4. Sequencing quality information. One ASCII encoded quality score per base.
A record’s sequence is called a read. For example:
@ERR315326.7031172/1
TGGCACCACACCCCTCTAAGACGCAGCAAT
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIII
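Because every record occupies exactly four lines, the number of reads is the line count divided by four, and the sequence is always the second line of a record. A minimal sketch, assuming an uncompressed file with no wrapped lines (the temporary file here only holds the example record above):

```shell
# Write the example record to a temporary file.
fastq=$(mktemp)
printf '%s\n' \
  '@ERR315326.7031172/1' \
  'TGGCACCACACCCCTCTAAGACGCAGCAAT' \
  '+' \
  'BBBFFFFFFFFFFIIIIIIIIIIIIIIIII' > "$fastq"

# Four lines per record, so reads = lines / 4.
reads=$(awk 'END { print NR / 4 }' "$fastq")

# Line 2 of each record is the sequence; report the length of the first read.
read_length=$(awk 'NR % 4 == 2 { print length($0); exit }' "$fastq")

echo "reads=$reads read_length=$read_length"
rm -f "$fastq"
```

For a fastq.gz file the same awk programs can read from a decompressing pipe, for example gzip -dc file.fastq.gz.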
Quality scores can be represented in three different encodings, each of which uses a different range of ASCII characters:
Name                        ASCII character range
Sanger, Illumina >= v1.8    33-126
Solexa, Illumina < v1.3     59-126
Illumina v1.3 - v1.7        64-126
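To turn a quality character back into a number, subtract the encoding's offset from its ASCII code. The sketch below assumes the Sanger / Illumina >= v1.8 encoding, whose offset is 33. The function name phred_scores is made up for this example.

```shell
# Print the quality score of every character in a quality string.
# $1: quality string, $2: ASCII offset of the encoding (33 assumed below).
phred_scores() {
  printf '%s' "$1" | od -An -tu1 | awk -v off="$2" '
    { for (i = 1; i <= NF; i++) printf "%s%d", (n++ ? " " : ""), $i - off }
    END { print "" }'
}

# Decode the quality line of the example record above.
phred_scores 'BBBFFFFFFFFFFIIIIIIIIIIIIIIIII' 33
```

With offset 33, B decodes to 33, F to 37 and I to 40. For Illumina v1.3 - v1.7 data the offset would be 64 instead.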


bowtie2 paired-end alignment

Bowtie 2 version 2.2.6 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: 
  bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
$ time bowtie2 -p 8 -x reference/human/hg19 -1 input_1.fastq -2 input_2.fastq > output.sam
48755614 reads; of these:
  48755614 (100.00%) were paired; of these:
    24164921 (49.56%) aligned concordantly 0 times
    14767863 (30.29%) aligned concordantly exactly 1 time
    9822830 (20.15%) aligned concordantly >1 times
    ----
    24164921 pairs aligned concordantly 0 times; of these:
      4355288 (18.02%) aligned discordantly 1 time
    ----
    19809633 pairs aligned 0 times concordantly or discordantly; of these:
      39619266 mates make up the pairs; of these:
        27256949 (68.80%) aligned 0 times
        8927823 (22.53%) aligned exactly 1 time
        3434494 (8.67%) aligned >1 times
72.05% overall alignment rate

real	44m49.082s
product: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
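The overall alignment rate in the summary above is counted per mate, not per pair: every pair that aligned concordantly or discordantly contributes two mates, and the mates that aligned on their own are added on top. A small sketch of that arithmetic, using the numbers from this run (variable names are chosen for the example):

```shell
# Counts copied from the bowtie2 summary above.
total_pairs=48755614
conc_once=14767863     # pairs aligned concordantly exactly 1 time
conc_multi=9822830     # pairs aligned concordantly >1 times
disc_once=4355288      # pairs aligned discordantly 1 time
mate_once=8927823      # single mates aligned exactly 1 time
mate_multi=3434494     # single mates aligned >1 times

# Aligned pairs count twice (two mates each); singly aligned mates count once.
aligned_mates=$(( 2 * (conc_once + conc_multi + disc_once) + mate_once + mate_multi ))
total_mates=$(( 2 * total_pairs ))

rate=$(awk -v a="$aligned_mates" -v t="$total_mates" 'BEGIN { printf "%.2f", 100 * a / t }')
echo "${rate}% overall alignment rate"
```

This gives 70254279 aligned mates out of 97511228, which reproduces the 72.05% reported by bowtie2.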