Restriction-site-associated genomic markers (RAD markers) sequencing is one of the cheapest and most convenient ways to obtain large number of loci for multiple species or individual samples. Typically, RAD-seq allows to generate relatively long SNP profiles across species or individuals and these profiles can then be used for admixture analysis, genetics map generation, phylogenetic analysis or population comparative genomics. Here we present one of several tools currently available for RAD-seq data processing and analysis – pyRAD. pyRAD is a great pipeline because it incorporates different stages of RAD-seq processing: from initial raw data de-multiplexing and filtering to generation of VCF, nexus, phylip, and other files. However, usage of VSEARCH for ortholog clustering within pyRAD pipeline requires powerful enough computing machines especially for large and high coverage datasets.
RAD-seq has now become a dominant method for evolutionary studies at the population level. Moreover, thanks to recent bioinformatics advances, it is now also possible to use RAD-seq data for between-species (phylogenetic level) comparisons. pyRAD is one the tools available for RAD data processing. The most important difference of pyRAD compared to, for instance, Stacks is its ability to handle indels. This ability is possible because clustering of reads is done via sequence similarity by global alignment, whereas in Stacks reads are grouped by counting base differences between reads.
pyRAD pipeline consists of seven steps:
In InsideDNA implementation of the pyRAD pipeline is split into two steps:
Analyzing RAD-seq data with pyRAD pipeline
We are now going to use simulated dataset (download dataset). The dataset represents a simulated single-end RADseq library with sequence data being simulated on a species tree with 12 ingroup samples and one outgroup (details of simulation can be found here).
Log in into InsideDNA application and navigate into Files tab. Go to your root directory and create a new folder called “pyrad”.
Upload unrared (!) simulated RAD-seq dataset to InsideDNA project and place it in root/pyrad/ folder.
Navigate to My Tools tab and create a new project called Pyrad by clicking on + Add new project
Now, search in the search field for a keyword pyrad. Two tools will appear – pyRAD_make and pyRAD. Click on add button and choose Pyrad project in the dropdown list for both tools. They should appear in your Pyrad project.
Now we are going to initialize pyRAD_make. This tool will create a special config file (normally called params.txt) which will be run with the pyRAD pipeline itself. First, click on Run tool button.
You will have a Tool Settings menu opened for pyRAD_make. Here you need to specify the Task name, tool parameters and CPU/RAM settings. Then you will need to preview the task and submit it.
Specify the task name which is easy for you to recognize later on. For instance, mypyrad_make. Now we need to select an output file which will store our pyRAD settings. Click on Browse and navigate into root/pyrad folder and name the file params.txt
For the rest of parameters we are going to set following choices:
Here you will need to specify extension (!) of non-demultiplexed files from your libraries. For instance, if you have fastq files – select *.fastq. If files are gzipped – specify gz option. In our case files are gzipped.
Leave this option empty because files are not demultiplexed. If you already de-multiplexed (split by barcode) your source fastq files – then choose this option instead of previous and select an <empty option> in the field "Specify extension of non-demultiplexed files".
If you have a single barcode file – type its name, if you have many – specify their extension
Select prefix for the output files
Type here you restriction recognition sequence
Select number of processors – for instance 16 or 32 if your dataset is large. Here dataset is small so we need only 4 cores.
Specify how many times you want a base to be covered by reads so it is considered for the analysis. This option will vary depending on your original coverage depth (e.g. how many each read is repeated), but a minimum of 5 is recommended.
Specify how many N (at most) you allow in each read.
Lower threshold means that more different reads will be clustered together. Beware that higher thresholds (closer to 1) would normally be more computationally intensive. During clustering the cutsite bases remain attached to the reads, thus all loci will have an invariable set of ~4-6 bases on one side. For 100 bp reads containing a 5-bp cutter and with have a 5 bp barcode trimmed off this results in 95 bp of data, 5 of which are invariable. Thus, if you want to allow 5 base differences between reads (90/95 = .947) you should use a setting of .94.
Select type of the library
The minimum number of (ingroup) samples with data for a given locus to be retained in the final data set.
Enter a number, or a decimal with the prefix “p” (e.g., p.10 for 10%). This option is used to detect potential paralogs.
We also will select all formats for our output datasets. To set this option, click on Show Secondary parameters and scroll to the Output formats field. In the dropdown menu select * (all):
Select 2 cores (important – for pyRAD_make tool you don’t need a lot of resources – it is a simple task, most resources should go into pyRAD pipeline itself) and click on Task preview button and then on Submit. Go to Tasks to monitor the progress.
Monitor the progress of your task. It will be done in a couple of minutes. Once done – task is moved to the Completed group and we can verify that nothing went wrong by looking at the error log in the right panel.
Now let’s move to the File Manager (FM). Click on Files and navigate into root/pyrad directory. Here you will have params.txt file.
You can verify its content by clicking on preview button
Now we need to launch pyRAD itself. Go back to Tools > Pyrad project. Click on pyRAD.
In tool settings select your params.txt file and also all the files (fastq + barcodes) needed for analysis.
Select a powerful CPU options if needed, click on task preview and submit.
Go to Tasks and monitor task progress. As pyRAD is computationally intense tool it may take a while on real dataset. When the task is finished and moved to Completed group, go to Files
You will find all the resulting files in your root/pyrad/ directory.
There are number of output files in the outfiles directory
Practically speaking, you are most likely going to look into vcf, phy and nex files. VCF file is a typical snp-calling format that can be used for example in admixture analysis and for calculation of other convetional population genetics statistics. Phylip and nexus formats are useful for phylogenetics analysis and tree building. Specifically, you can run tree reconstruction with RAXML, PHYML, and BEAST straightaway on InsideDNA website.
You can also follow our tutorial on phylogeny reconstruction with Phyml here.
In our next tutorials we are going to show what you can do with the resulting pyRAD files. So, stay tuned!
Follow us on Facebook and Twitter to be the first to read our new tutorials!Run this tool More tutorials