This tutorial is a second tutorial about VDJtools. VDJtools framework includes different tools for repertoire sequencing (Rep-Seq) data processing. In our previous tutorial we studied how immune repertoire changed after HSCT. We looked at different statistics of samples, frequencies of V and J regions etc. However, in this tutorial we will use different tools. We will use the correct way to compare sample diversities and calculate the distance between samples immune repertoire. We will study the difference in immune repertoire of Multiple sclerosis samples and control ones.
First of all, we need to open the terminal of InsideDNA platform (“Show terminal” button at the bottom-right corner of the screen) and type the following commands:
mkdir vdj2 – to create working directory;
cd vdj2 – to move into it;
Now we need to download the metadata file:
This file includes information on all samples in the analysis: filename, path to the file, sex, age and medical condition (control or multiple sclerosis). The paths to the files are not correct in our case, so we need to modify it. The easiest way to change all the paths is to type the following command:
grep -v '#' metadata.txt | while read p; do echo `pwd`'/'`echo $p | cut -d "/" -f 3`; done > metadata2.txt
with this command we take all non header lines, extract the filenames and add the correct path (pwd). Resulting output is written in metadata2.txt file. We also need to paste the header line to this file (use any text editor, for example, nano). We also advise to reduce the number of samples in our analysis: remove several samples from the file list.
To download all the samples we will create the file with all samples URL addresses. Use the following command:
grep -v '#' metadata.txt | while read p; do echo 'https://github.com/mikessh/vdjtools-examples/raw/master/samples/'`echo $p | cut -d "/" -f 3 | cut -d' ' -f 1`; done > list_url.txt
We extract the filenames, add the URL address prefixes and write them into list_url.txt. Now we can download all the samples from the list in the following way:
wget -i list_url.txt
ls - to see the content of the directory;
Finally, create the directory for output files:
Now we can study the diversity of immune repertoires in our samples. In the previous tutorial we used different routines that calculated the diversities. However, these values have not that much of biological sense. The point is that observed diversity depends on the sample size (number of reads). That's why the following tools approximate diversity values for different sample sizes. This approximations will allow one to compare the samples diversity. CalcDiversity routine makes diversity estimates computed on original and down-sampled datasets:
isub -c 4 -r 3.6 -t calcstat -e '/srv/dna_tools/vdjtools-1.1.1/vdjtools CalcDiversityStats -m /data/userXXXX/vdj2/metadata2.txt /data/userXXXX/vdj2/out/'
where userXXXX - your username, /data/userXXXX/vdj2/out/ - path to the output directory.
You will see 2 output files diversity.strict.exact.txt and diversity.strict.resampled.txt with similar structure.
As you can see diversity.strict.exact.txt file includes number of reads (6th column), diversity (7th column) values and approximation of diversity (11th column) for the number of reads equal to the maximal read number observed among our samples (977264 in this case). diversity.strict.resampled.txt includes the same information, but approximations are made for down-sampled read counts.
We can also study repertoire diversity in more detailed way. For example you can run RarefactionPlot routine to make diversity approximations for numerous sample sizes. This data can be used to draw approximated diversity curves for the samples. Run RarefractionPlot routine:
isub -c 4 -r 3.6 -t rare -e '/srv/dna_tools/vdjtools-1.1.1/vdjtools RarefactionPlot -m /data/userXXXX/vdj2/metadata2.txt -s 20 /data/userXXXX/vdj2/out/'
where -m specifies the metadata file, /data/userXXXX/vdj2/out/ - output directory, -s 20 - the number of steps (points) in the rarefaction curve (default = 101).
As a result we will see rarefaction.strict.txt output file in the working directory. This file includes extrapolated diversity values for different sample sizes (rarefaction curves). Those curves are interpolated from 0 to the current sample size and then extrapolated up to the size of the largest of samples, allowing comparison of diversity estimates. There are4 main columns in the output file that include sample size, mean diversity estimate and confidence interval.
Another question that we didn't cover in the previous tutorial is estimation of similarity between samples. There are a lot of metrics that give us a hint about how similar our samples are. For example, Pearson correlation coefficient of clonotype frequencies between two samples (R), relative overlap diversity (D) - the fraction of shared clonotypes, geometric mean of relative overlap frequencies (F) etc. You may read more about these metrics here: http://vdjtools-doc.readthedocs.io/en/latest/overlap.html#calcpairwisedistances. Run the following command:
isub -c 8 -r 30 -t pd -e '/srv/dna_tools/vdjtools-1.1.1/vdjtools CalcPairwiseDistances -i aa\!nt -m /data/userXXXX/vdj2/metadata2.txt /data/userXXXX/vdj2/out/'
where -m specifies metadata file, /data/userXXXX/vdj2/out/ - output directory and -i aa\!nt (optional) is used to discard CDR3 nucleotide sequence matches (take only V and J regions into consideration). Backslash is used to protect "!" symbol.
As a result you will see intersect.batch.aa!nt.txt file. Each row is a pairwise distance between any 2 samples. The most important rows that characterize the distance between samples are those with R, D and F column names (see above).
Finally, we can use one more routine to study the frequencies of each V and J segments interactions. There were several tools in our previous tutorial, but we didn't use PlotFancyVJUsage. This one is usually used to draw fancy plots, but on the platform we can get only the text output. You may specify any input sample.
isub -c 8 -r 7.2 -t pfv -e '/srv/dna_tools/vdjtools-1.1.1/vdjtools PlotFancyVJUsage /data/userXXXX/vdj2/MS1.txt.gz /data/userXXXX/vdj2/out/table'
where table is an output file prefix.
As a result you will see table.fancyvj.wt.txt file with a simple structure. Each row corresponds to each J segment, each column corresponds to V segment. Each cell of this table is a frequency of this V+J combination in the input sample.
Follow us on Facebook and Twitter to be the first to read our new tutorials!