Bioinformatics

GitHub

I am a largely self-taught (scary) bioinformatics practitioner. I code primarily in R and have a lot of fully functional (messy) scripts here. What I lack in beautiful and efficient scripting I make up for by generously offering help and explanations for my analyses and code. I strongly support open science and make all my data and code available to everyone. Don't hesitate to reach out about getting any of these scripts or tools to work. Science is a process; eDNA is still a young and growing field, and getting the code and software organized takes a lot of time and care. Hoping that one day I will get my act together...

rCRUX

eDNA metabarcoding is increasingly used to survey biological communities using common universal and novel genetic loci. There is a need for an easy-to-implement computational tool that can generate specific and comprehensive metabarcoding reference libraries for any locus. We have reimagined CRUX (Curd et al. 2019) and developed the rCRUX package for the R system for statistical computing (R Core Team 2021) to fit this need by generating taxonomy and fasta files for any user-defined locus. The typical workflow starts with get_seeds_local() or get_seeds_remote(), which simulate in silico PCR (e.g. Ye et al. 2012) to acquire a set of sequences analogous to PCR products containing metabarcode primer sequences. The sequences, or "seeds", recovered from the in silico PCR step are then used to search databases for complementary sequences that lack one or both primers. This search step, blast_seeds(), iteratively aligns seed sequences against a local NCBI database using a taxonomic-rank-based stratified random sampling approach. The result is a comprehensive database of primer-specific reference barcode sequences from NCBI. Using derep_and_clean_db(), the database is then de-replicated by DNA sequence, so that identical sequences are collapsed into a single representative read. If there are multiple possible taxonomic paths for a read, the taxonomic path is collapsed to the lowest taxonomic agreement.
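The de-replication idea described above can be illustrated with a small, language-neutral sketch. This is not rCRUX's implementation and does not call derep_and_clean_db(); the function names, the seven-rank layout, the "NA" placeholder, and the toy accessions and taxa are all assumptions made for illustration. It groups records by identical sequence and keeps each taxonomic rank only as long as every record sharing that sequence agrees on it.

```python
from collections import defaultdict

# Assumed rank order for every taxonomic path (hypothetical layout).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lowest_agreement(paths):
    """Collapse equal-length taxonomic paths to their lowest shared rank.

    Ranks are kept from the top down while all paths agree; from the first
    disagreement onward every rank is replaced with "NA".
    """
    consensus = []
    still_agree = True
    for values_at_rank in zip(*paths):
        if still_agree and len(set(values_at_rank)) == 1:
            consensus.append(values_at_rank[0])
        else:
            still_agree = False
            consensus.append("NA")
    return tuple(consensus)


def dereplicate(records):
    """Group (accession, sequence, path) records by identical sequence."""
    by_sequence = defaultdict(list)
    for accession, sequence, path in records:
        by_sequence[sequence.upper()].append((accession, tuple(path)))

    collapsed = {}
    for sequence, hits in by_sequence.items():
        collapsed[sequence] = {
            "accessions": [accession for accession, _ in hits],
            "taxonomy": lowest_agreement([path for _, path in hits]),
        }
    return collapsed


# Toy data: two hypothetical species share an identical barcode sequence.
shared = ("Eukaryota", "Chordata", "Class_a", "Order_a", "Family_a", "Genus_a")
records = [
    ("acc_1", "ACGT", shared + ("Genus_a species_1",)),
    ("acc_2", "acgt", shared + ("Genus_a species_2",)),
    ("acc_3", "TTGA", shared + ("Genus_a species_1",)),
]
database = dereplicate(records)
```

In this toy example the two records that share the sequence ACGT are collapsed into one entry resolved only to genus, while the unique sequence TTGA keeps its full species-level path.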

Link to rCRUX here

Anacapa Toolkit

Anacapa is an eDNA toolkit that allows users to build comprehensive reference databases and assign taxonomy to raw multilocus metabarcode sequence data. It addresses longstanding needs of the eDNA community for modular informatics tools, comprehensive and customizable reference databases, flexibility across high-throughput sequencing platforms, fast multilocus metabarcode processing, and accurate taxonomic assignment. The Anacapa Toolkit processes eDNA reads and assigns taxonomy using existing software or modifications to existing software. This modular toolkit is designed to analyze multiple samples and metabarcodes simultaneously from any Illumina sequencing platform. A significant advantage of the Anacapa Toolkit is that it does not require that paired reads overlap, or that both reads in a pair pass QC. Taxonomy results are generated for all read types, and the user can decide which read types to retain for downstream analysis. Check out the media coverage from UCLA, the MEE methods blog, and Mongabay.

Anacapa is being phased out over time as we can no longer maintain the software. Fortunately, there are great alternatives. Currently I am directing folks towards Tourmaline, developed by NOAA AOML. Recent developments have also made the Anacapa Toolkit's modified BLCA classifier easier to use.

ranacapa

Environmental DNA (eDNA) metabarcoding is becoming a core tool in ecology and conservation biology, and is being used in a growing number of education, biodiversity monitoring, and public outreach programs in which professional research scientists engage community partners in primary research. Results from eDNA analyses can engage and educate natural resource managers, students, community scientists, and naturalists, but without significant training in bioinformatics, it can be difficult for this diverse audience to interact with eDNA results. Here we present the R package ranacapa, at the core of which is a Shiny web app that helps perform exploratory biodiversity analyses and visualizations of eDNA results. The app requires a taxonomy-by-sample matrix and a simple metadata file with descriptive information about each sample. The app enables users to explore the data with interactive figures and presents results from simple community ecology analyses. We demonstrate the value of ranacapa to two groups of community partners engaging with eDNA metabarcoding results.
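To make the two inputs concrete, here is a minimal sketch of a taxonomy-by-sample matrix, a matching metadata table, and two simple per-sample summaries (taxon richness and the Shannon index). It is an illustration only and not ranacapa's code; the data layout, the function names, and the sample and taxon labels are assumptions.

```python
import math


def richness(counts):
    """Number of taxa with at least one read."""
    return sum(1 for count in counts if count > 0)


def shannon(counts):
    """Shannon diversity index (natural log) from raw read counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log(count / total) for count in counts if count > 0
    )


def summarize(table, metadata):
    """Join per-sample diversity summaries with the sample metadata.

    table:    {taxon: {sample: reads}}   (taxonomy-by-sample matrix)
    metadata: {sample: {field: value}}   (one descriptive row per sample)
    """
    summary = {}
    for sample in sorted(metadata):
        counts = [row.get(sample, 0) for row in table.values()]
        summary[sample] = {
            **metadata[sample],
            "richness": richness(counts),
            "shannon": shannon(counts),
        }
    return summary


# Toy inputs with hypothetical taxa, samples, and sites.
table = {
    "Taxon_a": {"sample_1": 10, "sample_2": 0},
    "Taxon_b": {"sample_1": 10, "sample_2": 5},
    "Taxon_c": {"sample_1": 0, "sample_2": 7},
}
metadata = {
    "sample_1": {"site": "Site_A"},
    "sample_2": {"site": "Site_B"},
}
summary = summarize(table, metadata)
```

Keeping the metadata fields next to each summary value is what lets results be grouped or colored by site, date, or any other descriptive column.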

Gruinard Decon

An eDNA metabarcoding decontamination pipeline. This R script runs a modular six-step decontamination protocol on the community tables output by the Anacapa Toolkit. The script is meant to fit in between the Anacapa Toolkit and ranacapa, providing a series of cleaning and pre-processing steps that remove contaminant ASVs and poorly sequenced samples prior to data analysis and exploration.
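As a rough illustration of the two goals named above (removing contaminant ASVs and poorly sequenced samples), here is a minimal sketch. It does not reproduce the six steps of the actual script, which are not detailed here. The two rules it uses (drop ASVs with at least as many reads in negative controls as in field samples, then drop samples below a minimum read depth), along with the function names, the threshold, and the toy table, are all assumptions chosen for simplicity.

```python
def drop_control_dominated_asvs(table, control_samples):
    """Remove ASVs that look like contaminants, then remove the controls.

    Assumed rule: an ASV is dropped when its reads in negative controls
    are greater than or equal to its reads in all other samples combined.
    table: {asv: {sample: reads}}
    """
    cleaned = {}
    for asv, counts in table.items():
        control_reads = sum(r for s, r in counts.items() if s in control_samples)
        field_reads = sum(r for s, r in counts.items() if s not in control_samples)
        if control_reads >= field_reads:
            continue  # treated as a likely contaminant under the assumed rule
        cleaned[asv] = {s: r for s, r in counts.items() if s not in control_samples}
    return cleaned


def drop_low_depth_samples(table, min_reads):
    """Remove samples whose total read depth is below min_reads."""
    depth = {}
    for counts in table.values():
        for sample, reads in counts.items():
            depth[sample] = depth.get(sample, 0) + reads
    keep = {sample for sample, total in depth.items() if total >= min_reads}
    return {
        asv: {s: r for s, r in counts.items() if s in keep}
        for asv, counts in table.items()
    }


# Toy community table with one hypothetical negative control ("neg").
table = {
    "ASV_1": {"s1": 500, "s2": 3, "neg": 0},
    "ASV_2": {"s1": 20, "s2": 1, "neg": 40},
    "ASV_3": {"s1": 300, "s2": 2, "neg": 1},
}
without_contaminants = drop_control_dominated_asvs(table, {"neg"})
decontaminated = drop_low_depth_samples(without_contaminants, min_reads=100)
```

Here ASV_2 is removed because most of its reads come from the negative control, and sample s2 is removed because only 5 reads remain after cleaning.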

NOTE: This script is currently under development and is at version 0.0. However, I decided to share this code in case any of the pieces may be of interest to the larger eDNA metabarcoding community, even at this early stage. Please feel free to contact me with suggestions for changes, utilities, functions, etc.

I want to acknowledge that this script is built off of the work of many other great metabarcoding scientists. I rely heavily on code from Ramon Gallego, Ryan Kelly, and the microDecon package. Thank you for your dedication to open access software and for providing coding resources to all.

Validating Taxonomic Assignments

DNA metabarcoding is an important tool for molecular ecology. However, its effectiveness hinges on the quality of the reference sequence databases and classification parameters employed. We are therefore developing tools for metabarcode and reference database validation. For example, we recently evaluated the performance of MiFish 12S taxonomic assignments using a case study of California Current Large Marine Ecosystem fishes to determine best practices for metabarcoding. Our results demonstrate the importance of comprehensive and curated reference databases for effective metabarcoding and the need for locus-specific validation efforts. Specifically, we employ a taxonomy cross-validation by identity framework to compare classification performance across classification parameters and reference databases. By combining CRUX with TAXXI, researchers can validate any metabarcode of interest. We are currently developing these tools to help researchers decide on the best metabarcoding loci for their taxa of interest. Feel free to reach out if you are interested!

Bioinformatics and Data Analysis Tools for eDNA Metabarcoding

Zack Gold