Variant Annotation with SnpEff for Human Tumor Samples

Author : InsideDNA Time : 13 November 2016 Read time : 7 min

DNA sequencing often aims to reveal genetic features that distinguish a sample from the reference genome. These genetic variants are commonly stored in VCF files, where every line describes one variant: its position on the chromosome, the reference and sample nucleotides, quality, sequencing depth and so on. But the aim of an experiment is rarely limited to obtaining such a set of variants; we also need to interpret them. We need to find which of the discovered variants fall into genes and which into intergenic regions. Among the variants in coding sequences, we need to find which change the encoded amino acids, create new stop codons or cause a frame shift that results in loss of protein function. In this tutorial on variant annotation with SnpEff we will annotate genetic variants to answer these questions.

1. Upload source data

For this tutorial we will use a file with filtered SNPs from a human tumor tissue sample. If you followed our previous tutorial, Variant filtration, you obtained this file while processing the data there. Alternatively, you can download all files for this tutorial using this link.

Log into the InsideDNA application, navigate to File Manager and create a folder called “Annotation” for the new project.

Variant Annotation with SnpEff screen 1

Upload the source files into this folder.

2. Add annotation to the VCF file

To annotate the variants in our source file we will use the SnpEff tool. SnpEff relies on genome annotation databases, and InsideDNA has the most popular ones installed (if you need to annotate variants of a less common organism, contact our team to install the necessary database). In this tutorial we will use the database for the human genome build hg19.

To open the virtual console, click the Terminal button on the left.

Variant Annotation with SnpEff screen 2

Enter the following command into Terminal:

idna_submit.py -t annotate -c 8 -r 7.2 -e "idna_ann hg19 /data/userXXX/Annotation/filtered_snps.vcf > /data/userXXX/Annotation/snps_ann.vcf"

Variant Annotation with SnpEff screen 3

Press Enter to submit your task. You can monitor its progress in the Task bar.

Variant Annotation with SnpEff screen 4

When the task is finished, a file called “snps_ann.vcf” will appear in the Annotation folder. If you explore the new file, you will find additional information in the INFO column that was not present in the original file: the annotation of the genetic variants, stored under the ANN key. You can view the contents of the file in Terminal, for example with this command:

less /data/userXXX/Annotation/snps_ann.vcf

Variant Annotation with SnpEff screen 5

Navigate the file using Ctrl + Up or Down arrow keys. When you have finished exploring the VCF file, press the Q key to quit the viewer.

Variant Annotation with SnpEff screen 6

(In the screenshot, lines describing one SNP are marked in blue and lines with the annotation for this SNP are marked in red.)
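If you prefer a quick numeric check over scrolling through the file, the sketch below counts the variant records and how many of them carry an annotation. It assumes a standard tab-separated VCF in which INFO is the 8th column and SnpEff writes its results under the ANN key; the function names count_variants and count_annotated are our own illustrative helpers, not part of SnpEff or InsideDNA.

```shell
# Sketch: summarise an annotated VCF from the command line.
# count_variants and count_annotated are hypothetical helper names.

# Print the number of variant records (lines that do not start with '#').
count_variants() {
    grep -vc '^#' "$1"
}

# Print the number of variant records whose INFO column (8th, tab-separated)
# contains an ANN entry.
count_annotated() {
    awk -F'\t' '!/^#/ && $8 ~ /(^|;)ANN=/ {n++} END {print n+0}' "$1"
}
```

For example, after defining the functions you could run count_variants /data/userXXX/Annotation/snps_ann.vcf and count_annotated /data/userXXX/Annotation/snps_ann.vcf; if annotation worked for every record, the two numbers should match.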

3. Adjust the structure of annotated file

Our file now holds plenty of information in a very complex structure. Let’s discard some of it to make the file more readable. Since our source file was already filtered by quality parameters, we can consider the quality of all SNPs in the file acceptable. Let’s create a new file containing only the most essential information: the position of each genetic variant, the reference and sample nucleotides, and the annotation we just obtained. To create this file, we will use the GATK VariantsToTable tool. To run it, enter the following line in Terminal:

idna_submit.py -t variants_to_table -c 1 -r 0.6 -e "idna_VariantsToTable -R /data/userXXX/Annotation/hg19.fa -V /data/userXXX/Annotation/snps_ann.vcf -F CHROM -F POS -F REF -F ALT -F ANN -o /data/userXXX/Annotation/table.vcf"

Variant Annotation with SnpEff screen 7

As you can see, in this command we list every field we want to write to the output file table.vcf (-F CHROM -F POS -F REF -F ALT -F ANN).

Variant Annotation with SnpEff screen 8

When the task is finished, you can explore the structure of the new file, table.vcf:

less /data/userXXX/Annotation/table.vcf

Variant Annotation with SnpEff screen 9

Variant Annotation with SnpEff screen 10

Now it is slightly easier to view the variant annotations. Each annotation has the following structure:

Allele | Effect | Putative Impact | Gene Name | Gene ID | Feature Type | Feature ID | Transcript biotype | Rank/Total | HGVS.c | HGVS.p | cDNA_position / cDNA_len | CDS_position / CDS_len | Protein_position / Protein_len | Distance to feature

1) Allele: There can be several variants in each position, so this field helps to identify which ALT we are referring to.

2) Effect: This field describes which part of the genome the variant falls into and whether it affects the coding sequence. Possible values include, for instance, intron variant, upstream gene variant or missense variant (a missense variant lies in a coding sequence and changes the encoded amino acid).

3) Putative_impact: A simple estimation of putative impact and deleteriousness: HIGH, MODERATE, LOW or MODIFIER. You can see which variants fall into each category on this page.

4) Gene Name: Common gene name (HGNC). Optional: use closest gene when the variant is “intergenic”.

5) Gene ID: Gene ID (usually ENSEMBL)

6) Feature type: The type of feature which is in the next field (e.g. transcript, motif, miRNA, etc.).

7) Feature ID: Depending on the annotation, this may be: Transcript ID (preferably using version number), Motif ID, miRNA, ChipSeq peak, Histone mark, etc. Note: Some features may not have ID (e.g. histone marks from custom Chip-Seq experiments may not have a unique ID).

8) Transcript biotype: At a minimum, this states whether the transcript is {“Coding”, “Noncoding”}. Whenever possible, ENSEMBL biotypes are used.

9) Rank / total: Exon or Intron rank / total number of exons or introns.

10) HGVS.c: Variant using HGVS notation (DNA level)

11) HGVS.p: If variant is coding, this field describes the variant using HGVS notation (Protein level). Since transcript ID is already mentioned in ‘feature ID’, it may be omitted here.

12) cDNA_position / cDNA_len: Position in cDNA and the transcript’s cDNA length (one based).

13) CDS_position / CDS_len: Position and number of coding bases (one based, includes START and STOP codons).

14) Protein_position / Protein_len: Position and number of amino acids (one based, including START, but not STOP).

15) Distance to feature: Optional field that can include (depending on the feature type) the distance to the first / last codon, the distance to the closest gene, the distance to the closest intron boundary and so on.
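To make the field list above concrete, here is a minimal sketch that splits one annotation entry on the “|” character and prints each value next to its field name. The helper name split_ann and the label spellings are our own; the sketch assumes you pass it a single annotation entry (when a variant has several annotations, SnpEff separates the entries with commas, so split on commas first).

```shell
# Sketch: label the sub-fields of one SnpEff annotation entry.
# split_ann is a hypothetical helper name; the labels follow the list above.
split_ann() {
    printf '%s\n' "$1" | awk -F'|' '
        BEGIN {
            # Field names in the order described in the tutorial.
            n = split("Allele|Effect|Putative_impact|Gene_name|Gene_ID|Feature_type|Feature_ID|Transcript_biotype|Rank/total|HGVS.c|HGVS.p|cDNA_position/cDNA_len|CDS_position/CDS_len|Protein_position/Protein_len|Distance_to_feature", name, "|")
        }
        {
            # Print "label<TAB>value" for every sub-field that has a label.
            for (i = 1; i <= NF && i <= n; i++)
                printf "%s\t%s\n", name[i], $i
        }'
}
```

You can copy one annotation entry from the ANN column of table.vcf and pass it in single quotes, for example split_ann 'G|missense_variant|MODERATE|...', where the dots stand for the remaining sub-fields of your entry.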

4. Filter variants using annotation information

In the previous step we brought the annotation into a form ready for interpretation. Now that you know the structure of the annotation, you can filter it further, depending on the type of information you are most interested in. For example, you can write only variants with high or moderate impact to a new file, using the following command:

awk '$5 ~ /HIGH/ || $5 ~ /MODERATE/ {print $1,$2,$3,$4,$5}' /data/userXXX/Annotation/table.vcf > /data/userXXX/Annotation/table_filtered.vcf

In this case awk searches for the keywords “HIGH” and “MODERATE” in the 5th field of table.vcf (the ANN column) and writes a line to the new file table_filtered.vcf only if one of these words is present.
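The command above drops the header row and matches the keywords anywhere in the annotation text. A slightly stricter variant, sketched below, keeps the header and only accepts HIGH or MODERATE when it appears as a complete “|”-delimited sub-field. It assumes table.vcf is tab-separated with a single header line, which is how we understand VariantsToTable output; filter_by_impact is a hypothetical helper name.

```shell
# Sketch: keep the header plus rows whose ANN column (5th) has a
# HIGH or MODERATE impact sub-field. filter_by_impact is a hypothetical name.
filter_by_impact() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { print; next }                 # keep the header row
        $5 ~ /\|(HIGH|MODERATE)\|/ { print }    # impact as a whole sub-field
    ' "$1"
}
```

Usage would look like filter_by_impact /data/userXXX/Annotation/table.vcf > /data/userXXX/Annotation/table_filtered.vcf. Unlike the one-liner, this prints whole rows separated by tabs, so the result can be filtered again with the same column numbers.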

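You can write other awk commands with a similar structure. For example, the following command extracts variants in a particular gene, PIK3R3:

awk '$5 ~ /PIK3R3/ {print $1,$2,$3,$4,$5}' /data/userXXX/Annotation/table.vcf > /data/userXXX/Annotation/table_filtered.vcf

Note that it writes to the same table_filtered.vcf, so it overwrites the result of the previous command; change the output name if you want to keep both files.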
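A plain pattern such as /PIK3R3/ also matches any other text in the annotation that merely contains those letters. If that matters for your gene of interest, the sketch below compares the gene name sub-field (the 4th one in each annotation entry) exactly. As before, it assumes a tab-separated table with one header line and comma-separated annotation entries; filter_by_gene is a hypothetical helper name.

```shell
# Sketch: keep the header plus rows where any annotation entry names
# exactly the requested gene. filter_by_gene is a hypothetical name.
# Usage: filter_by_gene GENE table_file
filter_by_gene() {
    awk -F'\t' -v OFS='\t' -v gene="$1" '
        NR == 1 { print; next }                 # keep the header row
        {
            n = split($5, ann, ",")             # one entry per annotation
            for (i = 1; i <= n; i++) {
                split(ann[i], f, "|")           # sub-fields of this entry
                if (f[4] == gene) { print; next }
            }
        }
    ' "$2"
}
```

For example, filter_by_gene PIK3R3 /data/userXXX/Annotation/table.vcf > /data/userXXX/Annotation/table_PIK3R3.vcf, where the output file name is only a suggestion.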

Now you are ready to select interesting information from VCF files and to interpret it using annotations. Stay tuned!
