Running approximate Bayesian computation inferences about population history for RAD-seq data

Author : InsideDNA Time : 15 September 2015 Read time : 7 min

Approximate Bayesian Computation (ABC) is one of the most powerful frameworks for inferring population genetic or genomic scenarios. Compared with conventional approaches such as likelihood-based methods, an ABC-based strategy makes it possible to model complex population history scenarios and offers a flexible way to assess the fit of alternative hypotheses. One popular tool for ABC modeling is DIY-ABC. It has an easy-to-use interface but, unfortunately, generating the thousands of datasets needed for ABC evaluation requires a powerful cluster. Here we describe a simple two-step wrapper pipeline for simulating large numbers of ABC datasets. Our wrapper will be particularly useful for those who don’t have UNIX skills or access to a cluster but still need a large number of ABC simulations.

In essence, ABC is based on the idea that, instead of relying on an explicit model with a well-defined likelihood, one can simply generate many datasets under different parameter values and then compare the simulated datasets with the real dataset. The comparison is done with so-called summary statistics: values that summarize each dataset (simulated and real). Simulated datasets whose summary statistics are close enough to those of the real data are retained, and the scenarios and parameter values that produced them are taken as the ones best supported by the data.
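The idea can be illustrated with a minimal rejection-ABC sketch in Python. This is a toy example and not what DIY-ABC runs internally: the model (a single allele frequency with a uniform prior), the function names and the tolerance value are all assumptions made for illustration.

```python
import random
import statistics


def simulate_allele_count(freq, n_samples, rng):
    """Simulate how many of n_samples gene copies carry the allele."""
    return sum(1 for _ in range(n_samples) if rng.random() < freq)


def summary_statistic(allele_count, n_samples):
    """Summarize a dataset by its sample allele frequency."""
    return allele_count / n_samples


def abc_rejection(observed_stat, n_samples, n_simulations, tolerance, seed=0):
    """Keep the prior draws whose simulated summary is close to the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_simulations):
        freq = rng.random()  # draw the parameter from the prior, Uniform(0, 1)
        simulated = simulate_allele_count(freq, n_samples, rng)
        distance = abs(summary_statistic(simulated, n_samples) - observed_stat)
        if distance <= tolerance:
            accepted.append(freq)  # this draw explains the data well enough
    return accepted


if __name__ == "__main__":
    # Toy "real" data: 30 of 100 sampled gene copies carry the allele.
    observed = summary_statistic(30, 100)
    posterior = abc_rejection(observed, n_samples=100,
                              n_simulations=20000, tolerance=0.02)
    print(len(posterior), round(statistics.mean(posterior), 3))
```

The accepted draws approximate the posterior distribution of the parameter. Only a small fraction of simulations pass the tolerance check, which is why realistic analyses need so many simulated datasets.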

There is, however, a big limitation: to compare different scenarios with confidence, the ABC approach requires a large number of simulated datasets (thousands or even hundreds of thousands). ABC is therefore computationally costly and normally requires a large cluster.

DIY-ABC approaches this problem in two steps:

  1. Create a special file (header.txt) via the graphical user interface of DIY-ABC. The header file, always named header.txt, contains all the information necessary to compute a reference table associated with the data, i.e. the scenarios, the scenario parameter priors, the characteristics of the loci, the loci parameter priors and the summary statistics to compute.
  2. Use header.txt and the data file to simulate many datasets on a cluster. The problem with this second step is that it requires a cluster and a particular type of task distribution system.

In InsideDNA we have added two simple tools that allow users who don’t have a cluster or good UNIX skills to quickly and easily simulate a large number of ABC datasets.

We are going to use a simulated dataset and header.txt prepared by the DIY-ABC authors (downloaded from here). You can learn more about the DIY-ABC approach in the associated publication here.

1. Upload SNP matrix and header.txt to InsideDNA

Log in (or sign up if you have not yet done so) to the InsideDNA application and read the Introduction Tutorial to get familiar with the different options available on the website. Once you have learned the basics, navigate to the Files tab. Create a new folder called diy_abc.

Upload the SNP matrix (simulated_dataset.snp) and header.txt into this folder. Details on creating both files can be found in this manual.
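Before uploading, it can be worth checking locally that both inputs are present. The sketch below is a hypothetical helper, not part of InsideDNA or DIY-ABC. It relies on the header file always being named header.txt, and it assumes that the data file ends in .snp, as simulated_dataset.snp does in this tutorial.

```python
import os

REQUIRED_HEADER = "header.txt"  # DIY-ABC always uses this file name
ASSUMED_DATA_SUFFIX = ".snp"    # assumption, taken from simulated_dataset.snp


def missing_inputs(filenames):
    """Return the inputs that are missing from a list of file names."""
    names = list(filenames)
    missing = []
    if REQUIRED_HEADER not in names:
        missing.append(REQUIRED_HEADER)
    if not any(name.endswith(ASSUMED_DATA_SUFFIX) for name in names):
        missing.append("*" + ASSUMED_DATA_SUFFIX)
    return missing


def missing_inputs_in_folder(folder):
    """Same check, applied to the contents of a local folder."""
    return missing_inputs(os.listdir(folder))
```

Calling missing_inputs_in_folder on the local folder you plan to upload returns an empty list when both files are in place.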

2. Add DIYABC_rndgen and DIYABC_sim to your diy_abc project

Navigate to the My Tools tab. Create a new project by clicking on + Add new project. Name it diy_abc.

Now type “abc” into the search field. Three tools will be returned. Click on the + button on DIYABC_rndgen and choose the diy_abc project in the dropdown list. DIYABC_rndgen should appear in your diy_abc project.

 

Repeat this operation for DIYABC_sim. You should now have two tools in your diy_abc project.

3. Initialize a task with DIYABC_rndgen (click to run)

Click on the Run tool button for DIYABC_rndgen. This tool generates one or more random number generator (RNG) files needed to simulate datasets. For now we will generate only one RNG file, prepared for multithreading on 16 cores. In principle, if you’d like to parallelize computations not only across cores but also across nodes, more RNG files can be generated.

The Tool Settings menu for DIYABC_rndgen will open. Here you need to specify the task name, the tool parameters and the computing settings, then preview the task and submit it. Choose a task name that will be easy to recognize later on (for example, abc_rnd).

Specify the directory that contains your input data, root/diy_abc.

Specify the number of cores to use (16 in our case) and the number of computers (1 in our case). The more cores you choose, the faster the ABC datasets will be simulated later on. The number of computers matters only when you want to use multiple nodes, which can be tricky, so keep it equal to 1. Generating the RNG file itself doesn’t require much computing power, so keep the core number and RAM low in the computing settings for this task.

Preview the task and submit it.

4. Monitoring task progress.

Monitor the progress of your task. It should finish within a couple of minutes; until then it is listed in the Running group. Once done, it will be moved to the Completed group, and we can verify that nothing went wrong by looking at the error log in the right panel.

5. Obtaining the files

Now, let’s move to the File Manager (FM). Click on Files in the top menu and navigate to the root/diy_abc directory. Here you will find a single RNG file. If you had chosen multiple computers (nodes), you would have more RNG files (one per computer).

6. Initialize a task with DIYABC_sim (click to run)

Now, let’s move back to the Tools and launch DIYABC_sim. This is the tool where the most intensive computation is going to happen.

First, remember that we have only one RNG file, with the suffix *_0000.bin, and that we selected 16 cores for this RNG file. We will therefore specify the settings as follows:

  • input directory: the directory with header.txt, the data file and RNG…0000.bin
  • number of required datasets: as many as you need (e.g. 10000), but the more you choose, the longer the computation will take
  • required number of threads: 16 (it must be the same as the number chosen in DIYABC_rndgen)
  • computer number: 0, because our RNG file has the suffix 0000.bin
  • most importantly, select 16 cores (!) for this task
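The settings above have to agree with one another, and the following Python sketch shows one way to check them before submitting. It is a hypothetical helper, not part of InsideDNA or DIY-ABC. It assumes that the digits before .bin in the RNG file name encode the computer number (the tutorial only shows 0000 mapping to 0), and the file name example_0000.bin in the usage lines is made up.

```python
import re


def computer_number_from_rng_file(filename):
    """Read the zero-padded index at the end of an RNG file name (*_0000.bin -> 0)."""
    match = re.search(r"_(\d+)\.bin$", filename)
    if match is None:
        raise ValueError(f"{filename!r} does not end in _<digits>.bin")
    return int(match.group(1))


def check_sim_settings(rng_filename, rndgen_cores, sim_threads, sim_cores,
                       computer_number):
    """Return a list of problems; an empty list means the settings agree."""
    problems = []
    expected = computer_number_from_rng_file(rng_filename)
    if computer_number != expected:
        problems.append(
            f"computer number is {computer_number}, RNG file suffix implies {expected}")
    if sim_threads != rndgen_cores:
        problems.append(
            f"threads ({sim_threads}) differ from DIYABC_rndgen cores ({rndgen_cores})")
    if sim_cores != sim_threads:
        problems.append(
            f"task cores ({sim_cores}) differ from requested threads ({sim_threads})")
    return problems


if __name__ == "__main__":
    # Hypothetical file name; only the _0000.bin suffix matters here.
    print(check_sim_settings("example_0000.bin", rndgen_cores=16,
                             sim_threads=16, sim_cores=16, computer_number=0))
```

With the values used in this tutorial the function returns an empty list; any mismatch is reported as a short message.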

Preview the task and submit it.

 

7. Monitoring task progress.

Check how things are going in the Task section. Remember, it may take quite a while for the simulations to finish; this is ABC, after all.

8. Obtaining the files

When your task is completed, click on Files and navigate to the root/diy_abc directory. Here you will find all the resulting simulation files.

 

You should now transfer these files back to your local machine and analyze them with the DIY-ABC GUI. Summarizing simulated datasets is a computationally simple operation that doesn’t require a powerful cluster, so you should be able to do it easily on your own machine.
