Processing population genomics RAD-seq data with pyRAD

Author : InsideDNA Time : 01 September 2015 Read time : 9 min

Restriction-site-associated genomic markers (RAD markers) sequencing is one of the cheapest and most convenient ways to obtain large number of loci for multiple species or individual samples. Typically, RAD-seq allows to generate relatively long SNP profiles across species or individuals and these profiles can then be used for admixture analysis, genetics map generation, phylogenetic analysis or population comparative genomics. Here we present one of several tools currently available for RAD-seq data processing and analysis – pyRAD. pyRAD is a great pipeline because it incorporates different stages of RAD-seq processing: from initial raw data de-multiplexing and filtering to generation of VCF, nexus, phylip, and other files. However, usage of VSEARCH for ortholog clustering within pyRAD pipeline requires powerful enough computing machines especially for large and high coverage datasets.

RAD-seq has now become a dominant method for evolutionary studies at the population level. Moreover, thanks to recent bioinformatics advances, it is now also possible to use RAD-seq data for between-species (phylogenetic level) comparisons. pyRAD is one the tools available for RAD data processing. The most important difference of pyRAD compared to, for instance, Stacks is its ability to handle indels. This ability is possible because clustering of reads is done via sequence similarity by global alignment, whereas in Stacks reads are grouped by counting base differences between reads.

pyRAD pipeline consists of seven steps:

  1. De-multiplexing (separation by barcodes of input fastq files)
  2. Quality filtering and removal of barcodes, cut sites and adapters
  3. Clustering within samples and alignment
  4. Joint estimation of error rate and heterozygosity with maximum likelihood
  5. Consensus base calling and paralog detection
  6. Clustering across samples
  7. Alignment across samples, filtering and formatting

In InsideDNA implementation of the pyRAD pipeline is split into two steps:

  1. generation of parameter file (params.txt)
  2. execution of pyRAD pipeline

pyRAD manuscript by D. Eaton is available here. Complete description of the pyRAD is available on its dedicated website.

Analyzing RAD-seq data with pyRAD pipeline

We are now going to use simulated dataset (download dataset). The dataset represents a simulated single-end RADseq library with sequence data being simulated on a species tree with 12 ingroup samples and one outgroup (details of simulation can be found here).

1. Upload RAD-seq dataset to InsideDNA

Log in into InsideDNA application and navigate into Files tab. Go to your root directory and create a new folder called “pyrad”.

Upload unrared (!) simulated RAD-seq dataset to InsideDNA project and place it in root/pyrad/ folder.

2. Add tools to your Pyrad project

Navigate to My Tools tab and create a new project called Pyrad by clicking on + Add new project

Now, search in the search field for a keyword pyrad. Two tools will appear – pyRAD_make and pyRAD. Click on add button and choose Pyrad project in the dropdown list for both tools. They should appear in your Pyrad project.

3. Initialize pyRAD_make (click to run)

Now we are going to initialize pyRAD_make. This tool will create a special config file (normally called params.txt) which will be run with the pyRAD pipeline itself. First, click on Run tool button.

You will have a Tool Settings menu opened for pyRAD_make. Here you need to specify the Task name, tool parameters and CPU/RAM settings. Then you will need to preview the task and submit it. 

Specify the task name which is easy for you to recognize later on. For instance, mypyrad_make. Now we need to select an output file which will store our pyRAD settings. Click on Browse and navigate into root/pyrad folder and name the file params.txt

 

 

For the rest of parameters we are going to set following choices:

  • Specify extension of non-demultiplexed files  - *.fastq.gz

Here you will need to specify extension (!) of non-demultiplexed files from your libraries. For instance, if you have fastq files – select *.fastq. If files are gzipped – specify gz option. In our case files are gzipped.

  • Specify extension of demultiplexed files - <leave empty>

Leave this option empty because files are not demultiplexed. If you already de-multiplexed (split by barcode) your source fastq files – then choose this option instead of previous and select an <empty option> in the field "Specify extension of non-demultiplexed files".

  • Specify name or extension of the barcode file - *.barcodes

If you have a single barcode file – type its name, if you have many – specify their extension

  • Prefix name for final output files - c88d6m4p3

Select prefix for the output files

  • Specify sequence of the restriction recognition site overhang - TGCAG

Type here you restriction recognition sequence

  • Specify number of processors - 4

Select number of processors – for instance 16 or 32 if your dataset is large. Here dataset is small so we need only 4 cores.

  • The minimum depth of coverage to make a statistical base call at each site in a cluster - 6

Specify how many times you want a base to be covered by reads so it is considered for the analysis. This option will vary depending on your original coverage depth (e.g. how many each read is repeated), but a minimum of 5 is recommended.

  • The maximum number of low quality, undetermined ("N") sites in step 2 filtered sequences - 4

Specify how many N (at most) you allow in each read.

  • The similarity threshold used by vsearch for sequence clustering (e.g. 0.98) – 0.88

Lower threshold means that more different reads will be clustered together. Beware that higher thresholds (closer to 1) would normally be more computationally intensive. During clustering the cutsite bases remain attached to the reads, thus all loci will have an invariable set of ~4-6 bases on one side. For 100 bp reads containing a 5-bp cutter and with have a 5 bp barcode trimmed off this results in 95 bp of data, 5 of which are invariable. Thus, if you want to allow 5 base differences between reads (90/95 = .947) you should use a setting of .94.

  • The type of Reduced representation library - rad

Select type of the library

  • Minimum taxon coverage - 4

The minimum number of (ingroup) samples with data for a given locus to be retained in the final data set. 

  • Maximum number (or proportion) of shared polymorphic sites in a locus - 3

Enter a number, or a decimal with the prefix “p” (e.g., p.10 for 10%). This option is used to detect potential paralogs.

We also will select all formats for our output datasets. To set this option, click on Show Secondary parameters and scroll to the Output formats field. In the dropdown menu select * (all):

Select 2 cores (important – for pyRAD_make tool you don’t need a lot of resources – it is a simple task, most resources should go into pyRAD pipeline itself) and click on Task preview button and then on Submit. Go to Tasks to monitor the progress.

4. Monitoring task progress.

Monitor the progress of your task. It will be done in a couple of minutes. Once done – task is moved to the Completed group and we can verify that nothing went wrong by looking at the error log in the right panel.

5. Obtaining the file

Now let’s move to the File Manager (FM). Click on Files and navigate into root/pyrad directory. Here you will have params.txt file.

You can verify its content by clicking on preview button

6. Initialize pyRAD (click to run)

Now we need to launch pyRAD itself. Go back to Tools > Pyrad project. Click on pyRAD.

In tool settings select your params.txt file and also all the files (fastq + barcodes) needed for analysis.

Select a powerful CPU options if needed, click on task preview and submit.

7. Monitor task progress

Go to Tasks and monitor task progress. As pyRAD is computationally intense tool it may take a while on real dataset. When the task is finished and moved to Completed group, go to Files

8. Verify output of pyRAD

You will find all the resulting files in your root/pyrad/ directory.

9. Understand output files of pyRAD

There are number of output files in the outfiles directory

Practically speaking, you are most likely going to look into vcf, phy and nex files. VCF file is a typical snp-calling format that can be used for example in admixture analysis and for calculation of other convetional population genetics statistics. Phylip and nexus formats are useful for phylogenetics analysis and tree building. Specifically, you can run tree reconstruction with RAXMLPHYML, and BEAST straightaway on InsideDNA website.

You can also follow our tutorial on phylogeny reconstruction with Phyml here.

In our next tutorials we are going to show what you can do with the resulting pyRAD files. So, stay tuned!

Follow us on Facebook and Twitter to be the first to read our new tutorials!

Run this tool More tutorials