Elevada's Curator includes new functionality for packaging remote data-transformation services, which can be configured to point at public or private APIs. This enables the construction of Data Automations that enrich processed datasets with information from external sources.

Using these capabilities, our Bioinformatics team was able to quickly set up and automate processing on an informal public collection of disease-associated SNPs, enriching our data with genetic information from biomedical databases. We then visualized the enriched data using the popular analytics tool Tableau. Below is a walkthrough of this method and our results.

As a lightweight and simple demonstration, we visited Eupedia’s genetics page (https://www.eupedia.com/genetics/medical_dna_test.shtml) and scraped the site for SNP ID charts with the following information related to autoimmune disease:

  • Description [of disease]
  • Chromosome
  • Gene
  • SNP
  • Risk Alleles
  • Tested by
  • Links


We quickly saw issues with the Chromosome column, which sporadically featured alphabetic characters rather than the standard chromosome numbers. Gene values were also incomplete (and some were perhaps inaccurate), which would lead to problems at the analytical stage. Lastly, the Tested by column contained only a code, which would not reveal anything useful during analysis.
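The kind of check that surfaces these problems can be sketched in a few lines of Python. This is only an illustration, not how Curator does it: the function name, the sample rows, and the decision to treat X, Y, and MT as valid chromosome labels are all assumptions.

```python
# Assumption: valid labels are 1-22 plus X, Y, and MT (mitochondrial).
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def flag_suspect_rows(rows):
    """Return rows whose Chromosome is not a known label or whose Gene is empty."""
    suspect = []
    for row in rows:
        chromosome = row.get("Chromosome", "").strip().upper()
        gene = row.get("Gene", "").strip()
        if chromosome not in VALID_CHROMOSOMES or not gene:
            suspect.append(row)
    return suspect


# Hypothetical sample rows, not taken from the scraped table.
sample = [
    {"Chromosome": "6", "Gene": "GENE_A", "SNP": "rs0000001"},
    {"Chromosome": "q", "Gene": "GENE_B", "SNP": "rs0000002"},
    {"Chromosome": "1", "Gene": "", "SNP": "rs0000003"},
]
suspects = flag_suspect_rows(sample)
print([row["SNP"] for row in suspects])
```

Rows flagged this way are candidates for correction from an authoritative source rather than for deletion.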

Curator

The scraped data were uploaded into Curator, where the following additional columns were added: [Group 1: Global MAF, Chromosome Position (Annotation 108), Method, Function Class, Intron-Variant (Binary), Missense (Binary), NC-Transcript-Variant (Binary), Reference (Binary), Allele], [Group 2: Top Associated Disease], [Group 3: Population of MAF Study].

Data from Group 1 were extracted from the NCBI SNP Database through Curator using NCBI’s Eutils API. Group 2 data came from an in-house API built as an AWS Lambda function. The data behind this service were downloaded from the Comparative Toxicogenomics Database (http://ctdbase.org), which provided gene names as well as a quantitative breakdown of their most common disease associations. Curator accessed these data through its Remote Services feature. Lastly, Group 3 data were calculated from the Global MAF score: Curator ran mathematical calculations (using an Arithmetic Transformation) on each MAF score to determine the population of the study for analysis.
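For readers who want a feel for the two lookups, here is a minimal Python sketch. It is not Curator's configuration or our production service. The ESummary endpoint and its db, id, and retmode parameters are part of NCBI's public E-utilities; the function names, the event shape passed to the handler, and the placeholder gene-to-disease table are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esummary_url(rs_ids):
    """Build an E-utilities ESummary URL for a batch of SNP IDs."""
    # The snp database is keyed by the numeric part of the rs number.
    numeric_ids = [rs[2:] if rs.lower().startswith("rs") else rs for rs in rs_ids]
    query = urlencode({"db": "snp", "id": ",".join(numeric_ids), "retmode": "json"})
    return f"{EUTILS_BASE}/esummary.fcgi?{query}"


# Hypothetical placeholder table standing in for the CTD download.
TOP_DISEASE_BY_GENE = {"GENE_A": "Disease X", "GENE_B": "Disease Y"}


def lambda_handler(event, context):
    """Lambda-style entry point; assumes the event looks like {"gene": "<symbol>"}."""
    gene = str(event.get("gene", "")).strip().upper()
    disease = TOP_DISEASE_BY_GENE.get(gene)
    status = 200 if disease is not None else 404
    return {"statusCode": status, "body": json.dumps({"gene": gene, "top_disease": disease})}


print(esummary_url(["rs123", "rs456"]))
print(lambda_handler({"gene": "gene_a"}, None))
```

A remote service configured in Curator would call endpoints like these once per row and write the returned fields into the new columns.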

Our exported data appeared as the following:


Tableau

After transforming our data in Curator, we set up a Project Automation to automatically deliver the curated form to our Tableau analysis application's Project.

The following graphs illustrate our light analysis of the SNPs that Eupedia lists for autoimmune disease:

Figure 1. Autoimmune-related SNPs from Eupedia most dominantly favored Chromosomes 1 and 6, followed by Chromosomes 2 and 5.

Figure 2. This line graph offers another way to visualize the prevalence of SNPs in Eupedia’s Autoimmune Disease charts. A rough up-and-down pattern is visible, with the number of SNPs per chromosome slowly decreasing as the chromosome number increases.

Figure 3. This is a visual representation of Eupedia’s report of SNPs related to Autoimmune Disease by Gene. Intergenic SNPs seem to be the most common type, followed by SNPs in the IL23R gene and those in CTLA4 and IL2RA.


Figure 4. Here, we looked at four function classes of SNPs available in NCBI. Note that some SNPs carry several function classes, depending on what has been discovered about each SNP. Additionally, some function classes are not represented in this graph; it covers only intron-variant, missense, NC-transcript-variant, and reference SNPs. The combined counts of these SNP function classes for each chromosome are shown here.
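The aggregation behind a chart like this can be approximated outside Tableau. The sketch below sums the binary function-class flags for each chromosome; it assumes the exported rows use the column names listed earlier and that each flag is 0 or 1. The function name and sample rows are hypothetical.

```python
from collections import defaultdict

# Assumption: the export uses these column names for the binary flags.
FUNCTION_CLASS_COLUMNS = [
    "Intron-Variant (Binary)",
    "Missense (Binary)",
    "NC-Transcript-Variant (Binary)",
    "Reference (Binary)",
]


def function_class_counts(rows):
    """Sum each binary function-class flag per chromosome."""
    totals = defaultdict(lambda: dict.fromkeys(FUNCTION_CLASS_COLUMNS, 0))
    for row in rows:
        per_chromosome = totals[row["Chromosome"]]
        for column in FUNCTION_CLASS_COLUMNS:
            # A SNP may set several flags, so it can count toward several classes.
            per_chromosome[column] += int(row.get(column, 0))
    return dict(totals)


# Hypothetical rows for illustration only.
rows = [
    {"Chromosome": "1", "Intron-Variant (Binary)": 1, "Missense (Binary)": 0},
    {"Chromosome": "1", "Intron-Variant (Binary)": 1, "Missense (Binary)": 1},
    {"Chromosome": "6", "Reference (Binary)": 1},
]
counts = function_class_counts(rows)
print(counts["1"])
```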

Figure 5. This treemap was created with data from the Comparative Toxicogenomics Database (CTD). First, we identified the unique genes associated with SNPs in Eupedia’s list of Autoimmune Disease-related SNPs. Then, we used CTD data to find the top-correlated disease for each gene, which was sometimes the listed autoimmune disease and sometimes not. Shown here is an ordered representation of the most common top diseases associated with those genes.
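The two steps described for this treemap, deduplicating genes and then ranking their top diseases, can be sketched as follows. The function name, the lookup table, and the sample rows are hypothetical placeholders, not CTD data.

```python
from collections import Counter


def top_disease_ranking(snp_rows, top_disease_by_gene):
    """Rank top diseases by how many unique genes map to each one."""
    # Collapse to unique genes so a gene with many SNPs is counted once.
    genes = {row["Gene"] for row in snp_rows if row.get("Gene")}
    # Genes missing from the lookup (for example intergenic SNPs) are skipped.
    diseases = (top_disease_by_gene[g] for g in genes if g in top_disease_by_gene)
    return Counter(diseases).most_common()


# Hypothetical inputs for illustration only.
snp_rows = [
    {"Gene": "GENE_A"},
    {"Gene": "GENE_A"},
    {"Gene": "GENE_B"},
    {"Gene": "GENE_C"},
    {"Gene": "Intergenic"},
]
lookup = {"GENE_A": "Disease X", "GENE_B": "Disease X", "GENE_C": "Disease Y"}
print(top_disease_ranking(snp_rows, lookup))
```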

Sources: