When working with a sample of individuals, it is often important to consider their ancestry. Individuals in a sample differ in their genetic background, which may result in, for instance, varying predisposition to disease or varying effective doses of drugs. In this tutorial we will infer the population structure of a sample of individuals using 1000 Genomes Project data.
To start, download the VCF file with genotypes using this link. This file is a multi-sample VCF containing the genotypes of 100 individuals. Log in to the InsideDNA application, navigate to the Files tab and create a folder called 1000_Genomes.
Upload the VCF file into this folder.
You can view the source file using the Terminal. Click the “show terminal” button in the lower left corner and enter the following command:
Here and in all later commands, replace XXX with your own user ID, which is shown in the header of the Terminal tab.
Press Enter and scroll through the file using the Up and Down arrow keys. The file starts with a long header describing its content, followed by lines that each describe one genetic variant and its presence in each individual.
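To check this structure without paging through the whole file, the header and variant lines can be counted directly. The sketch below assumes an uncompressed VCF with genotype columns; the function name vcf_summary and the file name in the usage comment are illustrative and not part of the tutorial's own commands.

```shell
# Summarize a VCF: header lines start with "#", every other line is one variant.
vcf_summary() {
    vcf_header=$(grep -c '^#' "$1")
    vcf_variants=$(grep -vc '^#' "$1")
    # The "#CHROM" line names the columns: 9 fixed ones (CHROM ... FORMAT),
    # followed by one column per individual.
    vcf_samples=$(grep -m1 '^#CHROM' "$1" | awk -F'\t' '{ print NF - 9 }')
    echo "header_lines=$vcf_header variants=$vcf_variants samples=$vcf_samples"
}

# Usage (hypothetical file name):
# vcf_summary input.vcf
```

For the file used in this tutorial, the reported number of samples should match the 100 individuals mentioned above.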
To explore population structure we will use the program ADMIXTURE, which takes files in PLINK binary (BED) format as input. To convert our data to this format, we will use VCFtools and PLINK.
Type the following command in the Terminal:
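A minimal sketch of this conversion step, assuming VCFtools is available on the command line as vcftools. The wrapper function vcf_to_ped and the input name input.vcf are hypothetical, and any job-submission wrapper that InsideDNA may require around the command is not shown.

```shell
# Convert a VCF file to PLINK text format.
#   $1 = input VCF (uncompressed), $2 = output prefix
vcf_to_ped() {
    # --vcf   : read an uncompressed VCF file
    # --plink : write genotypes as <prefix>.ped and <prefix>.map
    # --out   : output prefix
    vcftools --vcf "$1" --plink --out "$2"
}

# Usage (input.vcf is a placeholder for the file you uploaded):
# vcf_to_ped input.vcf pvariants
```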
Press Enter and wait until your task is done. You can monitor the progress of your jobs in the Tasks tab.
When the task is complete, you will find the new files pvariants.ped and pvariants.map in the 1000_Genomes folder. Most of the data is in the .ped file, whose structure differs from that of the VCF: the .ped file has one line per individual, describing that individual's genotypes. You can view this file using the less command.
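As a quick sanity check of the one-line-per-individual layout, the sketch below counts individuals and variants in a .ped file. It assumes the standard PLINK text layout of six leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by two allele columns per variant; the function name ped_summary is illustrative.

```shell
# Count individuals (lines) and variants (allele pairs) in a PLINK .ped file.
ped_summary() {
    awk '
        NR == 1 { v = (NF - 6) / 2 }   # 6 fixed columns, then 2 alleles per variant
        { n++ }
        END { print "individuals=" (n + 0), "variants=" (v + 0) }
    ' "$1"
}

# Usage:
# ped_summary pvariants.ped
```

The number of variants should equal the number of lines in pvariants.map, since the .map file lists one variant per line.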
We will convert the .ped file into the binary .bed format, which is more compact. Type this command in the Terminal:
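A minimal sketch of this step, assuming PLINK is available on the command line as plink and that the pvariants prefix from the previous step is reused for the output. The wrapper function ped_to_bed is hypothetical, and any InsideDNA-specific submission syntax is not shown.

```shell
# Convert PLINK text files (<prefix>.ped / <prefix>.map) to binary format.
#   $1 = file prefix, used for both input and output
ped_to_bed() {
    # --file     : prefix of the .ped/.map pair to read
    # --make-bed : write binary <prefix>.bed, plus <prefix>.bim and <prefix>.fam
    # --out      : output prefix
    plink --file "$1" --make-bed --out "$1"
}

# Usage:
# ped_to_bed pvariants
```

The .bim and .fam files written next to the .bed file hold the variant and individual information, so all three should be kept in the same folder.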
When this task is complete, we can run the ADMIXTURE tool to explore the population structure of our sample.
ADMIXTURE estimates, for each individual, the fraction of ancestry derived from each of K populations. But how do we know how many populations there are in our sample? In general we do not, so we need to run ADMIXTURE with different values of K to determine which value fits our sample best. To do this, we will write a small script that runs ADMIXTURE several times with different values of the parameter K.
To create the script, type the following command in the Terminal:
This command will open an empty file called script.sh in the text editor vim. To enter editing mode, press the “i” key.
Type the following text in the editor:
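A minimal sketch of such a script, assuming ADMIXTURE is available on the command line as admixture and that pvariants.bed (with its .bim and .fam files) is in the current folder. The function name run_admixture_cv and the log file names log1.out to log7.out are illustrative choices, not necessarily those of the original script.

```shell
#!/bin/bash
# Run ADMIXTURE with cross-validation for K = 1 .. max_k.
#   $1 = PLINK .bed file, $2 = largest K to try
run_admixture_cv() {
    bed="$1"
    max_k="$2"
    k=1
    while [ "$k" -le "$max_k" ]; do
        # --cv enables cross-validation (5-fold by default);
        # tee shows the output and also saves it to a per-K log file.
        admixture --cv "$bed" "$k" | tee "log${k}.out"
        k=$((k + 1))
    done
}

# Only start the runs if the input file is actually present.
if [ -f pvariants.bed ]; then
    run_admixture_cv pvariants.bed 7
fi
```

Saving each run's output to its own log file is what makes it possible to compare the cross-validation errors afterwards.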
After typing the script, press Esc to exit editing mode, then type :wq and press Enter to save the file and exit the editor.
Now we have a script that runs ADMIXTURE 7 times, with K values from 1 to 7, and uses cross-validation to test which number of populations fits our sample best.
To run this script, type the following command in the Terminal:
When your task is done, many new files will appear in the 1000_Genomes folder. Files with the .Q extension contain the ancestry fractions of each individual for each of the K populations. Files with the .P extension contain the allele frequencies for each of the K populations. There is one pair of these files for each value of K from 1 to 7. To see which number of populations from 1 to 7 describes our sample best, let’s compare the cross-validation errors. For this purpose we can summarize the log files using the grep command.
Type the following lines in the Terminal:
You will see a few lines with the cross-validation errors for the various values of K.
As you can see, K=1 has the lowest CV error, which means that we have no reason to split our sample into populations based on the genotypes at the given loci. If the lowest CV error corresponded to a different K, we would conclude that our sample contains representatives of that many populations.
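Picking the K with the lowest CV error can also be automated. The sketch below assumes that the output of each ADMIXTURE run was saved to its own log file (the names log1.out to log7.out are hypothetical) and that each log contains a line of the form "CV error (K=3): 0.52305", as ADMIXTURE prints when run with cross-validation. The function name best_k is illustrative.

```shell
# Print the K whose cross-validation error is lowest.
#   arguments: one or more ADMIXTURE log files
best_k() {
    # Matching lines split into fields: "CV" "error" "(K=3):" "0.52305"
    grep -h 'CV error' "$@" | awk '
        {
            k = $3
            gsub(/[^0-9]/, "", k)      # keep only the digits of "(K=3):"
            err = $4 + 0               # force numeric comparison
            if (best == "" || err < min) { min = err; best = k }
        }
        END { print best }
    '
}

# Usage (hypothetical log names):
# best_k log*.out
```

If two values of K have the same error, this sketch reports the first one it encounters.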
Congratulations, you have now learned the basics of inferring population structure!