Differential Expression Analysis of RNA Sequencing Data using Cufflinks

Author : InsideDNA Time : 17 January 2017 Read time : 11 min

 This tutorial illustrates how to perform basic differential expression test between 2 groups of samples (for instance: “treated” vs “control”, “disease” vs “control”, “tissue A” vs “tissue B” etc.). The result of this test gives researcher information on how different factors influence expression of numerous genes thus predicting biological mechanism. The main aim of this tutorial is to illustrate how to use Cufflinks tools for differential expression analysis.

Firstly, let’s move to the command line window of InsideDNA platform (click on TERMINAL button at the top of main page).

Differential expression analysis of RNA sequencing data using Cufflinks screen 1

Then type 2 commands: “mkdirsra” to create working directory and “cd sra” to move into it.

Now we need to download RNA-Seq datasets for analysis. In our case we will use RNA-Seq data from Sequence Read Archive (SRA) NCBI database. Follow the link (http://www.ncbi.nlm.nih.gov/sra) and then type study identifier “SRP026387” to see the description of dataset that will be used in the further steps.

Differential expression analysis of RNA sequencing data using Cufflinks screen 2

You will see the list of 14 samples which correspond to specified study. Choose any sample, click on its id and you will see study description and detailed experimental protocol.

Differential expression analysis of RNA sequencing data using Cufflinks screen 3

In the nutshell, there are 14 RNA-Seq samples originating from Illumina HiSeq platform. Total RNA of all 14 samples was extracted from prostate tumor cells. 7 samples were taken before and 7 after androgen ablation therapy treatment.Androgen ablation therapy (AAT) is a standard treatment for locally-advanced/metastatic prostate cancer. One may download samples from this HTML page to the local computer and then to upload them to InsideDNA platform, however, it will be faster to download all samples at once with wget command from SRA FTP server (ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP026/SRP026387/).

wget –m ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP026/SRP026387/

Since SRA files of our samples are located in subdirectories of specified FTP directory we need to download recursively all files from it (–m parameter is used for that purpose).

After the completion of download step you will see that samples are saved in ftp-trace.ncbi.nih.gov directory, moreover, subdirectories tree has the same structure as in FTP server. To make our further analysis more comfortable it is better to find all sample SRA files and to copy them into another directory.Type “mkdir samples” to create a new directory. Then type the command which finds recursively all sample files and copies them into samples directory:

find . -name "*.sra" -exec cp -t ./samples/ {} +

where«. -name "*.sra"»part means that we look for SRA files in the working directory; -execargument defines what to do with the files it will find; “cp -t ./samples/” - copies files that were found “{}” into samples directory (all at once “+”).

You may check files in the new directory by typing “ls –l ./samples” and if everything is fine with your files you can delete recursively ftp-trace.ncbi.nih.gov directory with the command “rm –rftp-trace.ncbi.nih.gov” (-r argument means recursive removal).

Differential expression analysis of RNA sequencing data using Cufflinks screen 4

Also we need to download reference genome file. One may download fasta file for Homo sapiens GRCh38 build reference genome and index it, but in this tutorial we will download pre-build reference genome index files from this page (https://ccb.jhu.edu/software/hisat2/index.shtml).

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38_snp.tar.gz

After completion of download step we need to unpack grch38_snp.tar.gz archive

tar –xvzfgrch38_snp.tar.gz

You will see 8 index files with prefix “genome_snp” in the grch38_snp directory “ls grch38_snp”.

Differential expression analysis of RNA sequencing data using Cufflinks screen 5

Now we need to convert our data files into fastaq format. For that purpose we will use fastq-dump command from SRA Toolkit. We have 14 SRA files, so it will be more convenient to write a script which will help us to run fastq-dump command for all our files. To create this script type “nano script.sh” and copy the following lines into the new script.sh file:

for i in `ls -d /data/userXXX/sra/samples/*sra; do
          isub -t tes -c 8 -r 30 -e "idna_fastq-dump -F -O /data/userXXX/sra/fa_samples $i"

For each file with .sra extension in the samples directory we call the command idna_fastq-dump which converts SRA file into fastaq file and saves it in the output directory (-O argument); #!/bin/bash - path to bash interpreter; userXXX – your user name.                                                                                                                                                                         

To save changes and exit nano text editor use the following key combination “Ctrl^X + Y”.Create output directory before we run this script “mkdirfa_samples”. To run script.sh type “bash ./script.sh”. You will see that there will be 14 tasks in the TASKS window with the same identifiers “tes” (each task corresponds to one sample file). We need to wait until completion of all 14 tasks before moving to the next step.

Now we need to map the reads from fastq files on the reference genome. HISAT2 aligner will be used for that purpose. We will use the similar script to run 14 mapping commands in parallel. You may modify script.sh file with nanoeditor or create a new script with the following content:

for i in `ls /data/userXXX/sra/fa_samples/; do
          isub -t hisas -c 16 -r 60 -e "/srv/dna_tools/hisat2-2.0.4/hisat2 -- dta-cufflinks -x
          /data/userXXX/sra/grch38_snp/genome_snp -U /data/userXXX/sra/fa_samples/$i -S

For each file in the fa_samples directory we call hisat2 command, where --dta-cufflinks parameter makes hisat2 report alignment tailored specifically for Cufflinks; -x parameter defines reference genome directory; -U – input fastq file; -S – output SAM file.

In the same manner we will modify script.sh file to convertoutput SAM files into BAM format. Change the input directory for ls command to /data/userXXX/sra/hisout/and write another command in the for loop:

isub -t sab -c 8 -r 30 -e "idna_samtools_view -bh /data/userXXX/sra/hisout/$i> /data/userXXX/sra/hisout2/$i.bam"

where–b means that output is written in BAM format; -h – includes SAM header in the output.

To sort BAM files change the input directory for ls command to /data/userXXX/sra/hisout2/and use another command:

isub -t sort -c 16 -r 60 -e "idna_samtools_sort -O bam -T /data/userXXX/sra/hisout3/$i_2 -o /data/userXXX/sra/hisout3/$i.sort /data/userXXX/sra/hisout2/$i"

where–O bam defines output file format; -T specifies prefix name and location; -o – output file name where sorted BAM file is written.

We sorted our BAM files (and used --dta-cufflinks parameter for hisat2) to prepare our alignment files for further analysis by different tools of Cufflink. Firstly, we will use cufflinks to create annotation GTF files with all transcripts which were observed in each sample. Once again we will use BASH script to run cufflinks for several samples separately.Use the following script:

for i in `ls /data/userXXX/sra/hisout3/
*sort`; do
         IFS='/' read -r -a array <<< "$i"
         IFS='.' read -r -a array2 <<< "${array[5]}"
         isub -t cufflinks1 -c 8 -r 30 -e “idna_cufflinks -g
         /data/userXXX/sra/Homo_sapiens.GRCh38.85.chr.gtf -o
         /data/userXXX/sra/hisout3/${array2[0]}/ $i"

where3rd and 4th line are used to extract sample id from file name. This sample is then used to create directory in hisout3 where output files will be written (-o directory). Parameter –g tells cufflinks to use reference annotation GFF file for guidance.

Differential expression analysis of RNA sequencing data using Cufflinks screen 7

Each output directory contains 4 files:

Differential expression analysis of RNA sequencing data using Cufflinks screen 7

isoforms.fpkm_trackingfile contains the estimated isoform-level expression values in the generic FPKM Tracking Format, while genes.fpkm_tracking file contains the estimated gene-level expression values in FPKM Tracking Format.skipped.gtffile contains the skipped loci of reference genome.transcripts.gtf is the most important file for further analysis among other output files. This GTF file contains cufflinks assembled isoforms. There is one GTF record per row, and each record represents either a transcript or an exon within a transcript.The first 7 columns are standard for GTF format (see decription: http://www.ensembl.org/info/website/upload/gff.html), further columns containadditional fields, such as: gene id, transcript id, FPKM (Fragments PerKilobase of exon model per Million mapped fragments) etc.

Differential expression analysis of RNA sequencing data using Cufflinks screen 8

To compare expression levels between different samples we need to merge all annotation files into 1 GTF file which contains description of all transcripts which were observed in all our samples. We will use cuffmerge command. Cuffmerge takes the list of all GTF files that we want to merge as an input, so our next step will let us to create such file.

find /data/userXXX/sra/ -type f | grep 'transcripts' >/data/userXXX/sra/list.txt

Last command means that we take all the files from working directory (-type foption allows us to get rid of directory names) and write all GTF file names with their paths into list.txt file.

Differential expression analysis of RNA sequencing data using Cufflinks screen 9

Now let’s call cuffmerge command itself:

isub -t merge -c 8 -r 30 -e "idna_cuffmerge–o /data/userXXX/sra/merged /data/userXXX/sra/list.txt"

where–o specifies output directory. There will be merged GTF annotation file merged.gtf in /data/userXXX/sra/merged/ directory. Now we will use this annotation file to predict differentially expressed genes between treated and untreated samples with сuffdiff command.Cuffdiff command takes 2 sets of BAM/SAM files as an input.

cuffdiff [options]*

This command tells cuffdiff to look for differentially expressed genes between k control samples and m samples after treatment. However, it is not very convenient to write all our 14 files with their paths manually, so we will use the following script.

array ()
function join_by { local IFS="$1"; shift; echo "$*";}
for i in `ls /data/userXXX/sra/hisout3/*sort`; do
join_by ', ' ${array[@]:0:7}
join_by ', ' ${array[@]:7:7}

In the second line of the script we define the function join_by which merges several strings by specified delimiter. In the for loop we add all sorted BAM file names to array. Finally, we join 7 sample file names by ‘,’ and print the resulting line. We also print the same line for 7 control samples.

Differential expression analysis of RNA sequencing data using Cufflinks screen 10

Now we will use these lines to run cuffdiff command

isub -t cuffdiff -c 8 -r 30 -e “idna_cuffdiff –u –L control,treated–o /data/userXXX/sra/difex/data/userXXX/sra/merged/merged.gtf”

where –u option is used for multiple reads correction; -L option gives labels ‘control’ or ‘treated’ to our 2 sets of samples; -o – specifies output directory; and are 2 lines output of last BASH script. Eventually we have 14 output files in difex directory.

Differential expression analysis of RNA sequencing data using Cufflinks screen 11

For detailed description of all of these files read cuffdiff manual: http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/. In the nutshell, there are 3 categories of output files:

  1. FPKM tracking files (“.fpkm_tracking” extension) contain FPKM values of each transcript, primary transcript or gene in each sample;

  2. Count tracking files(“.count_tracking” extension) contain number of reads that originated from each transcript, primary transcript or gene in each sample;

  3. Differential feature files (“.diff” extension) contain list of features which differ between 2 sets of samples (“treated” and “control”). For example, splicing.diff file lists all differential splicing sites between isoforms processed from a single primary transcript. cds.diff file describes differential CDS between samples; promoters.diff describes differential promoter use between samples. There are also 4 differential expression files (with “_exp.diff” suffix in their names) with results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences. Differential features files are the most interesting among all output files because based on these results we can make conclusions on biological effect of a treatment. I won’t go into detailed description of each of these files (better read here: http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/), however, all differential features files have similar structure: gene name, locus (coordinates), samples, [other fields], p-value, q-value (FDR-adjusted p-value), significance field (“yes” or “no” values).

You may also interested in

Analyzing differential expression of de novo assembled transcriptomes using RSEM

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials