Information on genetic variants in a sample – meaning the differences between a sample and a reference genome – are generally stored in a form of VCF files. Unfortunately, the structure of VCF files is not standardized; VCF files can include various characteristics of genetic variants; files can be sorted in different ways; and their headers can be slightly different. Such differences may complicate analysis of genetic variants, especially when you use VCF files derived from your colleagues or from a web database, and you don’t have detailed information about how this VCF file was produced. In this tutorial, we will discuss some of the major headaches of working with VCF files and how to resolve these headaches with GATK and Piccard. We will filter variants in files downloaded from the European Nucleotide Archive, which contain information on genetic variants of human tumor.
Information on genetic variants in a sample – meaning the differences between a sample and a reference genome – are generally stored in the form of VCF files. We already discussed the structure of such files in our tutorial “SAM files processing and variants calling in bacterial genomes”. Briefly, a VCF file consists of many lines, each describing one genetic polymorphism. Respective columns within a VCF file must include the following information: the name of a chromosome, where the polymorphism was discovered, the variant’s position in the chromosome, variant nucleotides present in the reference and the sample genomes, quality of a polymorphism, and any additional information on its (polymorphism) characteristics. The header of the VCF file should include information about its fields (especially on fields containing any additional information, because these fields and their content may vary), names of chromosomes or contigs, and the source of the file.
Figure is from the GATK website
Unfortunately, the structure of VCF files is not standardized; VCF files can include various characteristics of genetic variants; files can be sorted in different ways; and their headers can be slightly different. Such differences may complicate analysis of genetic variants, especially when you use VCF files derived from your colleagues or from a web database, and you don’t have detailed information about how this VCF file was produced. Additional complication comes from the difference between the structure of a file with explored genetic variants and that of a file containing a reference genome. In this tutorial, we will discuss some of the major headaches of working with VCF files. We will filter variants in files downloaded from the European Nucleotide Archive, which contain information on genetic variants of human tumor.
Plenty of files are stored in web databases waiting to be processed. We will use a file containing the genetic variants of a human tumor sample. You can download the file on this page.
Log in to the InsideDNA application, navigate to File Manager, and create a folder called “Tumor” for the new project.
Upload the VCF file into this folder. You can also download the source file from the EBI ftp using 'wget' command in a console:
Let’s have a brief look at our source file. Open console:
Enter cd command into the Console to navigate to the Tumor folder:
To extract the VCF file from archive, type command:
and use the ‘less’ command to view the VCF file:
You will see header lines which begins with ## characters. Use Ctrl + Up and Down arrows to scroll through a file. File name (combined.sorted) suggests that this file is sorted, and if you scroll down a bit, you will see that first comes the line describing variants in chr1 (first chromosome), then follows variants in chr2, chr3, and so on. If you navigate upwards, back to the header, you can discover a list of all chromosomes mentioned in a file. Note that this list of chromosomes in a header is sorted in different order; first comes the line: “##contig=”, describing the first chromosome, and then follows the line describing the 10th chromosome: ##contig=. So, instead of being sorted as numbers 1, 2, 3, they are sorted as characters (1, 10, 11, 12, etc.).
For our task, the order of lines with chromosomes descriptions in the header of the VCF file is important, so we should fix this and reorder the lines according the normal numeric order (i.e. 1,2,3, etc.). You can reorder these lines using Notepad editor on your computer, or you can simply use a file with a fixed header if you download source files for this tutorial. Alternatively, if you are familiar with a Vim editor (an editor accessible via UNIX terminal) you can do all modification in the web browser in the InsideDNA Terminal like you would on UNIX/Mac.
To quit from ‘less’ mode in Console, press the Q key.
Now we need to know the correct reference for our VCF file, since several assemblies of human genome exist. Information about this can be found in the original study. To figure out this, we need to check the Study accession section on this page (also see Figure 2 above). It tells us that our file was produced during this research, and if you dive into the methods of the article, you will discover that authors used the hg19 assembly of the human genome as a reference.
You can download the archive with the hg19 assembly on this page (select file chromFa.tar.gz) or use wget command in a Terminal to get data directly. Upload the chromFa.tar.gz file into the Tumor folder. To unzip the files, type the following commands into Console:
tar –zxvf chromFa.tar.gz
Now, if you turn to the visual File manager or use ‘ls’ command in a terminal, you will discover plenty of files with chromosome sequences in the Tumor folder. We will need only those which were mentioned in the header of our VCF file. Delete all of the chr.fa files, except the ones from chr1.fa to chr22.fa, chrM, chrX and chrY.
We need to produce a single file with a reference human genome from the files with chromosomes, using the command ‘cat’. The sequence of chromosomes in our resulting files is important for the downstream analysis, so we will specify it explicitly:
cat $(ls chr* -v) > hg19.fa
Command ls –v here lists all files with names that begin with “chr”, in a lexicographic order, and the command ‘cat’ concatenate these files into a hg19.fa file.
After this, remove the files with chromosome sequences using the ‘rm’ command:
Only files containing the variants and a reference genome sequence will remain in the Tumor folder.
We will use GATK tools for the variants filtering. These tools can use a fasta file containing a reference sequence only if provided with an auxiliary index and dictionary files. These auxiliary and dictionary files can be made from the source reference fasta. To produce these files, we will use the samtools faidx and picard CreateSequenceDictionary tools.
To run faidx, enter the following command into Console:
isub -t select_variants -c 8 -r 7.2 -e "idna_samtools_faidx /data/userXXX/Tumor/hg19.fa"
XXX here should be replaced by your own userID, which you can find in the header of the Console tab:
To run CreateSequence dictionary, type the following command into Console:
isub -t create_dictionary -c 8 -r 7.2 -e "idna_CreateSequenceDictionary R=/data/userXXX/Tumor/hg19.fa O=/data/userXXX/Tumor/hg19.dict"
When these tasks are completed, you will find files hg19.fai.fa and hg19.dict in the Tumor folder. Now we finally have all the necessary components to filter variants in our VCF file.
There are several types of genetic variants in raw VCF files; they can include SNPs, indels and so-called structural variations – rearrangements of chromosomes parts. The GATK SelectVariants tool allows you to filter certain types of variants. Moreover, using this tool, you can discard variants of a low quality. Let’s explore the columns of the VCF file. This can help you to understand which variants you should keep for the downstream analysis.
You can find a description of other useful parameters in the VCF file header. Of most interest are: the depth of coverage (number of reads, containing a specific polymorphism); allele frequency and alternate allele observation (allowing us to judge if an organism is homo- or heterozygous by specific locus); and biases of certain strains (which can indicate the artefacts of the sequencing method).
Let’s filter our VCF file to leave only SNPs with both high quality and read depth.
Run the following command in the Console:
isub -t select_variants_t7 -c 8 -r 7.2 -e "idna_SelectVariants -R /data/userXXX/Tumor/hg19.fa -V /data/userXXX/Tumor/P1-CA2.combined.sorted_red.vcf -selectType SNP -select \"QUAL > 30.0 && DP > 10\" -o /data/userXXX/Tumor/filtered_snps.vcf"
Here, P1-CA2.combined.sorted_red.vcf stands for a file with ordered lines of chromosomes description in the header. -selectType SNP will discard indels and other types of polymorphisms to write only SNPs to an output file “filtered_snps.vcf”.
The command -select \"QUAL > 30.0 && DP > 10\" will filter out SNPs with low quality or low read depth (important: reverse slashes here prevent the shell from interpreting quotes as the end of the command).
When the task is finished, you will discover a new VCF file with filtered variants in the Tumor folder (note that the resulting file is much smaller than the source one). Congratulations, you have now mastered filtering of VCF files and can further explore only high-quality genetic variants.
Follow us on Facebook and Twitter to be the first to read our new tutorials!Run this tool More tutorials