cluster and plot (gene expression) data

  1. Open RStudio or an R terminal.
  2. Assuming your files are stored in bi/fancy-project/, set the correct working directory:
  3. Read the data and convert to a matrix Note that a txt file is read (column separator is tab). You can export a tab-delimited text file from your favorite spreadsheet software. If you choose to export to a CSV, make sure you change the sep="\t" param to sep=",". See here.
  4. Cluster column and row data
  5. Create a heatmap and plot to a PDF file

gene expression analysis using TopHat and Cufflinks

  1. Install latest Bowtie2. See here.
    $ bowtie2 --version
    /opt/bi/bowtie2-2.2.6/bowtie2-align-s version 2.2.6
    64-bit
    Built on localhost.localdomain
    Wed Jul 22 16:18:32 EDT 2015
    Compiler: gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)
    Options: -O3 -m64 -msse2  -funroll-loops -g3 -DPOPCNT_CAPABILITY
    Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
    
  2. Install latest TopHat. See here.
    $ tophat2 -v
    TopHat v2.1.0
    
  3. Install latest Cufflinks. See here.
    $ cufflinks
    cufflinks v2.2.1
    linked against Boost version 104700
    
  4. Prepare the reference genome. See here.
  5. Check your data
  6. Build the transcriptome index. See here.
  7. Protect your reference data.
  8. Prepare your working directory.
  9. Download data (lung).
  10. Download data (stomach).
  11. Check your data.
  12. Read alignment with TopHat.
    Map the reads for each sample to the reference genome:
  13. Quantification with Cuffquant.
    Compute the gene expression profiles which are used subsequently by Cuffdiff:
  14. Optional:
    1. Delete data.
    2. Check your data.

fasta format

fasta or fasta.gz (compressed version) is a very generic text format that is used to store sequence data. A fasta file has at least one record, each record consists of a minimum of two lines.
  1. ID, starts with >
  2. Sequence, typically wrapped to multiple lines at a fixed maximum line witdh.
>gi|30212|emb|X56692.1| H.sapiens mRNA for C-reactive protein
GGACTTCTAGCCCCTGAACTTTCAGCCGAATACATCTTTTCCAAAGGAGTGAATTCAGGCCCTTGT
CTGGCAGCAGGACGTGACC

References:
  1. Galaxy Wiki

sam/ bam format

sam or bam (compressed version) is used to store NGS alignment data.

QNAMEthe sequence/ read name
FLAGbitwise flag
RNAMEthe reference sequence name
POSthe position in the reference sequence (1-based indices)
MAPQthe mapping quaility
CIGARCIGAR strings
RNEXTRef. name of the mate/next read
PNEXTPosition of the mate/next read
TLENthe template length for paired-end reads
SEQthe read sequence
QUALthe base call qualities of the read sequence (same as in FASTQ format)

References:
  1. SAMv1 specs
  2. Galaxy Wiki

sam/ bam bitwise FLAGs

Flag (bin)Flag (hex)Meaning
10x1Paired-end sequence (or multiple-segment, as in stroble sequencing)
20x2Aligned in proper pair (according to the aligner)
40x4Unmapped
80x8Mate pair unmapped (or next segment, if multipe-segment)
160x10Sequence is reverse complemented
320x20Sequence of mate pair is reversed
640x40The first read in the pair (or segment, if multiple-segment)
1280x80The second read in the pair (or segment, if multiple-segment)
2560x100Secondary alignment
5120x200QC failure
10240x400PRC or optical duplicate
20480x800Supplementary alignment

See SAMv1 specs

fastq format

fastq or fastq.gz (compressed version) is used to store NGS data. A fastq file has at least one record, each record consists of four lines.
  1. ID, starts with @
  2. sequence
  3. End of sequence, starts with +
  4. Sequencing quality information. One ASCII encoded quality score per base.
A record’s sequence is called read.
@ERR315326.7031172/1
TGGCACCACACCCCTCTAAGACGCAGCAAT
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIII
Quality scores can be represented using three different encodings which use a different range of ASCII characters:
NameASCII character range
Sanger, Illumina >= v1.833-126
Solexa, Illumina < v1.359-126
Illumina v1.3 - v1.764-126

References:
  1. Wikipedia
  2. Bioinformatics Data Skills
  3. Galaxy Wiki

Count number of reads in a SAM/ BAM file

  1. Assuming each read consists of four lines:
    my.sam | wc -l
    
    Divide by four Note be aware of headers!
    samtools view my.bam | wc -l
    
    Divide by four, e.g. echo $((`samtools view my.bam | wc -l` / 4))
  2. Using samtools and awk:
    samtools idxstats my.bam | awk '{c+=$3+$4} END {print c}'