Analysing transcription factors with ChIP-Sequencing data: a full bioinformatics pipeline and an example of Candida fungal pathogen

Author : InsideDNA Time : 02 April 2016 Read time : 10 min

The phenotype of an organism depends not only on its genomic sequences, but also on the activity of its genes or, in other words, on gene expression levels. Gene expression, in turn, is determined by the structure of chromatin – a complex aggregate of DNA and proteins that forms chromosomes. Various methods allow us to evaluate levels of gene expression (e.g. RNA-sequencing), but when the goal of the study is to investigate protein-DNA interactions and, for example, identify transcription factor binding sites, ChIP-sequencing (ChIP-seq) is the best strategy. In this detailed tutorial, we work with ChIP-seq data for pathogenetic fungus Candida albicans and explain an entire pipeline of how to analyse ChIP-seq data from A to Z. 

ChIP stands for chromatin immunoprecipitation – the recognition of certain molecules by specialized antibodies. The method works as follows: antibodies to certain proteins are added to the sample of chromatin. The added antibodies bind to their target proteins, and these proteins are crosslinked to DNA. DNA with linked proteins is then sheared (i.e. cut) into fragments by sonication, and fragments that do not bind to the protein of interest (and nor consequently to the antibodies) are discarded. The remaining fragments of DNA contain the binding sites of the protein factors. This remaining DNA is sequenced, and that’s where the second part of the method’s name (“-seq”) comes from. Sequenced reads are then aligned to a reference genome and yield enriched peaks, describing the distribution of studied protein on DNA.

In this tutorial we will process the data from an exciting paper that describes heat-shock transcription regulation factors that enable the pathogenic microorganism Candida albicans to survive in stress conditions. In fact, Candida albicans can be both a harmless colonizer of the body’s organs and an invasive pathogen. The switch between these two states is linked to the ability of Candida albicans to go through a yeast-to-filament shift. Such a shift is regulated by different environmental factors, including temperature. The morphogenesis of Candida albicans is regulated by two transcription factors - Sfl1p and Sfl2p - which work antagonistically: Sfl1p represses the transition from yeast-to-filament, whereas Sfl2p activates it. The paper shows how these factors bind to the DNA motifs on the chromosomes and activate key determinants of morphogenesis and virulence in Candida albicans.

1. Download raw data

To start a new project, log into InsideDNA, navigate to the Files tab, and create a new folder called “ChipSeq”.

Raw ChIP-Seq data are the short fragment reads of a genome. Many researchers deposit their sequencing data into the Sequence Read Archive, or SRA database, in the form of .sra files. For example, this page describes and provides access to the data of one of the samples used in the Candida research paper. To obtain the data from this database and convert them into the fastq-format in the same run, we will use a fastq-dump tool. You can also download the data from here and upload them into the ChipSeq folder.

Navigate to the Files tab and enter into the Console:

then run the fastq-dump tool with the following command:

isub –t fastq-dump –c 8 –r 7.2 –e “idna_fastq-dump –F –O /data/userXXX/ChipSeq/ SRR630885” 

SRR630885 stands for the name of a data sample we will use for processing. It can be found in the header of the SRA database record.

Using this code, fastq-dump will find the file in the database and convert it into the fastq-format.

Here and from this point on, XXX should be replaced by your userID and you can find this userID in the headers of the Console tabs. Submit the task by pressing the Enter key.

You can monitor the progress of the task in the Tasks bar.

When the task is completed, you will find the fastq file with raw reads in the ChipSeq folder.

2. Process raw reads

We discussed quality control and filtering of fastq reads data in our Tutorial “Improving the quality of raw fastq data using Trimmomatic and RemoveBadTiles”. Now, we won’t dive deep into details, but will instead just give you a short recipe optimised for processing of our raw data.

isub –t candida_trimmomatic –c 8 –r 7.2 –e “idna_trimmomatic SE -phred33 /data/userXXX/ChipSeq/SRR630885.fastq /data/userXXX/ChipSeq/Filtered.fastq ILLUMINACLIP:/data/userXXX/ChipSeq/TruSeq3-SE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:30”

This command will run the Trimmomatic tool, filter any reads with low quality, and trim the remaining adapter sequences. If you used your own adapters in the library preparation, you should upload a file with the adapter sequences. In our case, the data were sequenced with a standard Illumina library preparation and the reads have standard Illumina adapters for single end reads. It is thus enough to specify TruSeq3-SE.fa (the file itself is in the archive attached to this tutorial), in order to remove these adapters. Submit the task and navigate to the Tasks bar to monitor its progress.

The resulting reads will be written in the file Filtered.fastq.

3. Align reads to reference genome

To discover the locations of genome sites revealed by the ChIP-Seq method, we need to align filtered reads to a reference genome. We will use the same version of the Candida albicans genome, as authors of the original research. You can find and download the genome on this page or download it from the InsideDNA website.

Download the sequence of chromosomes and upload them into the ChipSeq folder.

To unzip the archive with the reads, use the command:

gunzip C_albicans_SC5314_A21_current_chromosomes.fasta.gz

To align ChIP-Seq reads into this reference, we will use the bowtie tool. We discussed this process in our tutorial “Mapping sequencing reads to a reference genome with Bowtie2: a step-by-step guide”. Briefly, you need to create an “index” from the reference sequence using the command below:

isub –t candida_bowtie-build –c 8 –r 7.2 –e “idna_bowtie2-build /data/userXXX/ChipSeq/C_albicans_SC5314_A21_current_chromosomes.fasta /data/userXXX/ChipSeq/candida_index”

As usual, you can follow the progress of your task in the Tasks bar. 

When this task is complete, the set of files with the prefix “candida_index” will be created. Using these files as a reference, the alignment of filtered reads can be performed using the following command:

isub –t candida_bowtie –c 8 –r 7.2 –e “idna_bowtie2 –x /data/userXXX/ChipSeq/candida_index –U /data/userXXX/ChipSeq/Filtered.fastq –S /data/userXXX/ChipSeq/candida.sam”

As a result, bowtie2 will create the file candida.sam, containing information about aligned reads.

4. Filter unmapped reads, sort and compress alignment file

The next processing steps were discussed in-depth in the tutorial “SAM files processing and variants calling in bacterial genomes”. Again, in this tutorial we will just give the necessary commands and briefly describe their meaning.

We will filter out the reads that weren’t aligned to a reference genome, using the samtools view command. In a single run we will also create the compressed version of a SAM file, called a BAM file:

isub –t candida_view –c 8 –r 7.2 –e “idna_samtools_view –F 4 –Sb /data/userXXX/ChipSeq/candida.sam>/data/userXXX/ChipSeq/candida.bam”

After this, we need to sort reads in alignment in the order of coordinates in the reference genome. For this we will use the samtools sort tool:

isub –t candida_sort –c 8 –r 7.2 –e “idna_samtools_sort /data/userXXX/ChipSeq/candida.bam /data/userXXX/ChipSeq/candida_sorted”

5. Adjust format of data

Next we need to convert the BAM file into a BED file. BED file is similar to the SAM format, but contains less information and has a different order of columns. In fact, homer tools for peaks calling can take BAM files as input, but since BED files do not contain any unnecessary information, it’s better to convert files into the BED format, in order to avoid data misinterpretation. We will do this using a useful reformatting tool called bamToBed:

isub –t candida_bamtobed –c 8 –r 7.2 –e “idna_bamToBed -i /data/userXXX/ChipSeq/candida_sorted.bam > /data/userXXX/ChipSeq/candida.bed"

Finally, you have a file with a proper structure for exploring ChIP-Seq peaks.

6. Finding enriched peaks using the homer toolset

The goal of the ChIP-Seq analysis is to find regions of a genome where we have more sequenced reads than would be expected by chance. For this purpose, we will use the homer findPeaks tool. It takes as input the so-called ‘Tag’ folder. We will create this Tag folder using an auxiliary tool called makeTagDirectory. This tool will not only place our file into a special folder, but also will analyse input data to ensure that it is suitable for peaks finding.

Type the following command into the Console:

isub –t candida_make_tag_directory –c 8 –r 7.2 –e “idna_makeTagDirectory /data/userXXX/ChipSeq/homer_tag_directory /data/userXXX/ChipSeq/candida.bed –format bed –forceBED”

The program automatically tries to guess the format of the input data, but you can specify format with the –format option to ensure that processing will be performed correctly. To ignore the 5th column of the input file (it contains information about the quality of alignment) we need to use an option –forceBED. This option is also not obligatory, but is recommended to ensure correct recognition of various types of data in the input file.

As soon as processing is done you can use the created homer_tag_directory as input for the findPeaks tool. Enter this command into the Console:

isub –t candida_find_peaks –c 8 –r 7.2 –e “idna_findPeaks /data/userXXX/ChipSeq/homer_tag_directory -o /data/userXXX/ChipSeq/homer_peaks”

When this task is finished, you will see the file homer_peaks in the ChIP-Seq folder. You can get acquainted with its structure using the ‘less’ command:

less /data/user410/ChipSeq/homer_peaks 

We will discuss the analysis and graphic representation of peaks data in the upcoming tutorials. Stay tunned!

Congratulations! You are now familiar with the entire bioinformatics ChIP-seq pipeline and can explore the DNA-protein interactions on your own data. 

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials