cluster and plot (gene expression) data

  1. Open RStudio or an R terminal.
  2. Assuming your files are stored in bi/fancy-project/, set the correct working directory:
  3. Read the data and convert to a matrix Note that a txt file is read (column separator is tab). You can export a tab-delimited text file from your favorite spreadsheet software. If you choose to export to a CSV, make sure you change the sep="\t" param to sep=",". See here.
  4. Cluster column and row data
  5. Create a heatmap and plot to a PDF file

gene expression analysis using TopHat and Cufflinks

  1. Install latest Bowtie2. See here.
    $ bowtie2 --version
    /opt/bi/bowtie2-2.2.6/bowtie2-align-s version 2.2.6
    64-bit
    Built on localhost.localdomain
    Wed Jul 22 16:18:32 EDT 2015
    Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)
    Options: -O3 -m64 -msse2  -funroll-loops -g3 -DPOPCNT_CAPABILITY
    Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
    
  2. Install latest TopHat. See here.
    $ tophat2 -v
    TopHat v2.1.0
    
  3. Install latest Cufflinks. See here.
    $ cufflinks
    cufflinks v2.2.1
    linked against Boost version 104700
    
  4. Prepare the reference genome. See here.
  5. Check your data
  6. Build the transcriptome index. See here.
  7. Protect your reference data.
  8. Prepare your working directory.
  9. Download data (lung).
  10. Download data (stomach).
  11. Check your data.
  12. Read alignment with TopHat.
    Map the reads for each sample to the reference genome:
  13. Quantification with Cuffquant.
    Compute the gene expression profiles which are used subsequently by Cuffdiff:
  14. Optional:
    1. Delete data.
    2. Check your data.

sam/ bam bitwise FLAGs

Flag (bin)Flag (hex)Meaning
10x1Paired-end sequence (or multiple-segment, as in stroble sequencing)
20x2Aligned in proper pair (according to the aligner)
40x4Unmapped
80x8Mate pair unmapped (or next segment, if multipe-segment)
160x10Sequence is reverse complemented
320x20Sequence of mate pair is reversed
640x40The first read in the pair (or segment, if multiple-segment)
1280x80The second read in the pair (or segment, if multiple-segment)
2560x100Secondary alignment
5120x200QC failure
10240x400PRC or optical duplicate
20480x800Supplementary alignment

See SAMv1 specs

fastq format

fastq or fastq.gz (compressed version) is used to store NGS data. A fastq file has at least one record, each record consists of four lines.
  1. ID, starts with @
  2. sequence
  3. End of sequence, starts with +
  4. Sequencing quality information. One ASCII encoded quality score per base.
A record’s sequence is called read.
@ERR315326.7031172/1
TGGCACCACACCCCTCTAAGACGCAGCAAT
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIII
Quality scores can be represented using three different encodings which use a different range of ASCII characters:
NameASCII character range
Sanger, Illumina >= v1.833-126
Solexa, Illumina < v1.359-126
Illumina v1.3 - v1.764-126

References:
  1. Wikipedia
  2. Bioinformatics Data Skills
  3. Galaxy Wiki

Count number of reads in a SAM/ BAM file

  1. Assuming each read consists of four lines:
    my.sam | wc -l
    
    Divide by four Note be aware of headers!
    samtools view my.bam | wc -l
    
    Divide by four, e.g. echo $((`samtools view my.bam | wc -l` / 4))
  2. Using samtools and awk:
    samtools idxstats my.bam | awk '{c+=$3+$4} END {print c}'
    

SAM/ BAM flags

Flag (bin)Flag (hex)Meaning
10x1Paired-end sequence (or multiple-segment, as in stroble sequencing)
20x2Aligned in proper pair (according to the aligner)
40x4Unmapped
80x8Mate pair unmapped (or next segment, if multipe-segment)
160x10Sequence is reverse complemented
320x20Sequence of mate pair is reversed
640x40The first read in the pair (or segment, if multiple-segment)
1280x80The second read in the pair (or segment, if multiple-segment)
2560x100Secondary alignment
5120x200QC failure
10240x400PRC or optical duplicate
20480x800Supplementary alignment

Source: Buffalo2015

See also: broadinstitute

Tophat transcriptome index

When runnning TopHat with the -G option, it will build a Bowtie index from the provided reference annotation each time. TopHat will output the following lines when doing so:
[..] Building transcriptome data files tophat_out/tmp/genes
[..] Building Bowtie index from genes.fa
Building the index can take quite some time. You can pre-build this index once and use it afterwards each time you run TopHat. To do so, invoke TopHat just with the -G and the --transcriptome-index parameters. TopHat will build an index in the provided location.
$ tophat -G /var/data/bi/reference/prebuild/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf --transcriptome-index /var/data/bi/reference/prebuild/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/transciptome_index/genes /var/data/bi/reference/prebuild/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome
Note that you still have to provide the <bowtie_index> argument. Read the TopHat manual as well. Output:
Building transcriptome files with TopHat v2.1.0
-----------------------------------------------
Checking for Bowtie
Bowtie version:	 2.2.6.0
Checking for Bowtie index files (genome)..
Checking for reference FASTA file
Building transcriptome data files /var/data/bi/reference/prebuild/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/transciptome_index/genes
Building Bowtie index from genes.fa
-----------------------------------------------
Transcriptome files prepared. This was the only task requested.
Files created:
$ ls -lh /var/data/bi/reference/prebuild/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/transciptome_index/
total 1,7G
-rw-rw-r-- 1 root root 125M Jan  3 17:08 genes.1.bt2
-rw-rw-r-- 1 root root  69M Jan  3 17:08 genes.2.bt2
-rw-rw-r-- 1 root root 1,7M Jan  3 16:56 genes.3.bt2
-rw-rw-r-- 1 root root  69M Jan  3 16:56 genes.4.bt2
-rw-rw-r-- 1 root root 305M Jan  3 16:56 genes.fa
-rw-rw-r-- 1 root root  26M Jan  3 16:56 genes.fa.tlst
-rwxrwxr-x 1 root root 874M Jan  3 16:55 genes.gff
-rw-rw-r-- 1 root root 125M Jan  3 17:19 genes.rev.1.bt2
-rw-rw-r-- 1 root root  69M Jan  3 17:19 genes.rev.2.bt2
-rw-rw-r-- 1 root root   24 Jan  3 16:56 genes.ver