DNA Analysis with Node.js

Have you ever wondered why we are the way we are? I certainly have, many times. One time I got really curious and started googling about it, and I stumbled upon DNA and, from there, the genome.

If you didn’t already know what a Genome is, take a look at this

This video introduced me to a lot of new things. Then I started wondering how one sequences a genome, and I stumbled upon

Fascinating, right? I fell in love with genome sequencing and wanted to see if anything could be done in Node.js.

Upon some quick googling I found genome.js written by Eric Schoffstall – code name contra. He is a big contributor to the Gulp community among others.

He has written 2 major libraries for Genome processing

  1. DNA2JSON
  2. GQL (Genome Query Language)

He has also written boilerplate code to work with genosets, which we will be using in this post.

A big thanks to Contra for his contributions!

In this post, we will take a look at genomes, SNPs and genosets. We will see how you can get sample genome data, build a genome JSON from it and then process that JSON against a genoset to detect a particular feature.

Lots of buzz words right? I know! Just hang around for a while and you will definitely be fascinated.

So, let us get started

The concepts

Since I am not from a biotech background, it took me quite some time to wrap my head around the concepts (and I am still not sure I have understood them correctly).

Googling the buzzwords from the above videos gave me the following definitions.

Genome

A genome is an organism’s complete set of DNA, including all of its genes. Each genome contains all of the information needed to build and maintain that organism. In humans, a copy of the entire genome—more than 3 billion DNA base pairs—is contained in all cells that have a nucleus.

More info here.

SNP (pronounced as Snip)

A single nucleotide polymorphism, or SNP (pronounced “snip”), is a variation at a single position in a DNA sequence among individuals. Recall that the DNA sequence is formed from a chain of four nucleotide bases: A, C, G, and T. If more than 1% of a population does not carry the same nucleotide at a specific position in the DNA sequence, then this variation can be classified as a SNP. If a SNP occurs within a gene, then the gene is described as having more than one allele. In these cases, SNPs may lead to variations in the amino acid sequence. SNPs, however, are not just associated with genes; they can also occur in noncoding regions of DNA.

More info here.

Genoset

A genoset refers to a defined set of genotypes; we have coined the term to represent the combination of alleles at 2 or more loci, especially when they are not contiguous along a chromosome. Typically, these are then related to a medical (or ancestral) consequence that arises only when a particular combination of SNPs is inherited.

More info here.

I ended up understanding that when a DNA analysis is done, SNP data is generated. This data is the blueprint of that particular living organism. When we compare the SNP data against a genoset for a particular feature, we can find out whether that feature is present in that organism.

For example, when you get your DNA test done, the raw data consists of SNPs. When this data is compared against a genoset like GS144, it will tell you whether you are male or not (female). Interesting, right?

Once you have the raw data in hand, you can run your SNPs through any genoset and check for diseases, ancestral history, genetic disorders, etc.

Getting started with Genome.js

genome.js is a fully open source platform built on Node.js that utilizes streams for high-performance analysis of DNA SNPs.

To work with Genome.js, you need to

  1. Get your DNA Sequenced by a supported vendor
    1. 23andMe
    2. ancestryDNA
    3. FamilyTree
    4. Alternatively, you can get some sample data from SNPedia
  2. Convert your SNP file to SNP-JSON using the dna2json library
  3. Feed your SNP-JSON into any genoset

So, if you want to work with your own DNA and run tests (which I do not recommend at all), you can get it tested and use the raw data. Otherwise, you can head to SNPedia and get sample data from there. Here is a spreadsheet that consists of links to some raw data. Or you can head to genomejs/demo-genomes to get the demo data (the txt files are the data).

Before we proceed, let us set up a new Node.js project. If you have not installed Node.js on your machine, you can do so by following this. Next, create a new folder named dnaAnalysis. Open a new terminal/prompt there and run

npm init

Next, we will download the data from genomejs/demo-genomes. Run

git clone https://github.com/genomejs/demo-genomes.git data

Note: If you do not have git set up, refer to this.

This will clone the repo into the data folder. This process may take up to 5 minutes depending on your internet connection (as the files are big).

Once the clone is completed, you will find a bunch of files. The txt files here are the raw data.

Alternatively, you can also download the raw data from the above spreadsheet.

Once you have the txt file/raw data, which will be > 15 MB and will look like

we will convert it to a JSON file, so that we can run GQL (Genome Query Language) queries on that data.
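To make the shape of the raw data concrete, here is a minimal sketch of what a 23andMe-style export looks like and how it could be turned into objects. It is not the dna2json implementation: the sample rows are made up, `parseRawGenome` is a hypothetical helper, and the object shape it returns is my own choice rather than the format dna2json writes. The only assumptions about the file are that comment lines start with `#` and that each data row holds an rsid, chromosome, position and genotype separated by tabs.

```javascript
// Made-up sample rows in the 23andMe-style layout (illustrative only).
const sample = [
  '# rsid\tchromosome\tposition\tgenotype',
  'rs0000001\t1\t12345\tAA',
  'rs0000002\t1\t67890\tCT',
  'i0000003\tX\t13579\t--', // "--" marks a position that could not be read
].join('\n');

// Hypothetical helper: turn raw text into an array of SNP objects.
function parseRawGenome(text) {
  return text
    .split(/\r?\n/)
    // Skip blank lines and "#" comment lines.
    .filter((line) => line.trim() !== '' && !line.startsWith('#'))
    .map((line) => {
      const [id, chromosome, position, genotype] = line.trim().split(/\s+/);
      return { id, chromosome, position: Number(position), genotype };
    });
}

const snps = parseRawGenome(sample);
console.log(JSON.stringify(snps, null, 2));
```

A real file has hundreds of thousands of rows, which is why a streaming parser is a better fit than reading everything into memory as this sketch does.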

Create a new file at the root of the project named RawtoJSON.js. Here we will write the logic to convert the raw data to JSON using the dna2json module.

Before we proceed, we will install dna2json and JSONStream modules. Run

npm install dna2json JSONStream --save

Next, update your RawtoJSON.js with the below contents

On line 4, we provide the path to the txt file. Do remember to delete the 23andme-male.json file if you want to regenerate it.

Now run

node RawtoJSON.js

This will take 30-45 minutes, or possibly longer, depending on your machine. So please be patient.

Genoset 144 – Male

Now that we have JSON data to work with, we will check the gender of the person using GS144.

GS144 provides us with criteria that specify how male DNA should be structured. If this genoset matches our sample data, then the person is male.

GS144 criteria looks like

You can take a look at this to understand the criteria syntax above.

For example, rs1234(A;T) means true if both alleles are observed, rs1234(T) means true if at least one T allele is observed, and so on.
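Here is a small sketch of how a single criterion could be checked in plain JavaScript. It assumes the SNPedia convention as I understand it: a two-allele form such as rs1234(A;T) requires both alleles, and a single-allele form such as rs1234(T) requires at least one T. `matchesCriterion` is a hypothetical helper and the genotypes are placeholders; it ignores the and/or style operators that real genoset criteria also use.

```javascript
// Hypothetical helper: test one criterion such as "rs1234(A;T)" or
// "rs1234(T)" against a map of SNP id -> genotype string.
function matchesCriterion(criterion, genotypes) {
  const m = /^(\w+)\(([ACGT])(?:;([ACGT]))?\)$/.exec(criterion.trim());
  if (!m) throw new Error('Unsupported criterion: ' + criterion);
  const [, id, first, second] = m;

  const observed = genotypes[id];
  if (!observed) return false; // SNP not present in the data

  if (second === undefined) {
    // Single-allele form: at least one copy of the allele.
    return observed.includes(first);
  }
  // Two-allele form: both alleles must be observed, in either order.
  const sorted = (s) => s.split('').sort().join('');
  return sorted(observed) === sorted(first + second);
}

// Placeholder genotypes, for illustration only.
const genotypes = { rs1234: 'TA', rs5678: 'GG' };

console.log(matchesCriterion('rs1234(A;T)', genotypes)); // true
console.log(matchesCriterion('rs1234(T)', genotypes));   // true
console.log(matchesCriterion('rs5678(A)', genotypes));   // false
```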

Now, we will use a genoset module named genoset-male, which implements the GS144 genoset to test our sample data. Install the module by running

npm install genoset-male event-stream --save

We also need event-stream for processing the file.

Create a new file named checkMale.js at the root of the project and update it as

Next, run

node checkMale.js

After a few minutes of processing, you should see

Not very convincing, but there is a 27.27% chance that this person is a male.

Next, I downloaded the data for Mark Davis,

built the JSON from the same

node RawtoJSON.js

and ran it through genoset-male to cross-check the results, and it says

So, how did we convert our GS144 criteria into a programmatic representation? If you take a quick peek at index.js from genoset-male, you will see

The above is written in GQL (Genome Query Language). Using GQL, you can express the criteria in JavaScript. Simple, right?
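The idea of turning criteria into code is easier to see in miniature, so here is a sketch in plain JavaScript. It does not use the gql module and is not meant to mirror its API: `exact`, `has`, `and`, `or` and `atLeast` are hypothetical helpers written for this example, and the rs numbers are placeholders, not the GS144 criteria. Each helper returns a function that takes a map of SNP id to genotype and answers true or false, so criteria can be nested freely.

```javascript
// Hypothetical mini-combinators (NOT the gql module's API).
const exact = (id, genotype) => (data) => data[id] === genotype;
const has = (id, allele) => (data) => (data[id] || '').includes(allele);
const and = (...checks) => (data) => checks.every((check) => check(data));
const or = (...checks) => (data) => checks.some((check) => check(data));
const atLeast = (n, ...checks) => (data) =>
  checks.filter((check) => check(data)).length >= n;

// Placeholder criteria, for illustration only.
const query = and(
  has('rs0000001', 'A'),
  or(exact('rs0000002', 'CT'), exact('rs0000002', 'TT')),
  atLeast(1, has('rs0000003', 'G'), has('rs0000004', 'G'))
);

// Placeholder genotype data keyed by SNP id.
const person = {
  rs0000001: 'AG',
  rs0000002: 'TT',
  rs0000003: 'CC',
  rs0000004: 'GT',
};

console.log(query(person)); // true
console.log(query({}));     // false
```

Building the query out of small functions keeps each genoset module down to a single declarative expression, which is what makes a boilerplate-style repo practical.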

Genoset-*

With the GQL architecture, it becomes very easy for us to create our own genoset-* node modules to test against various diseases and features. Contra has included a repo named genoset-boilerplate, so that you can create your own node module around a genoset.

To get started with this, first you need to pick a genoset which you would like to test against. You can find various genosets here.

For example, say you want to build a module to check if a person is affected by sickle cell anemia. We will clone the genoset-boilerplate code and then tweak it to test for sickle cell anemia.

Step 1 : Clone the repo

git clone https://github.com/genomejs/genoset-boilerplate.git genoset-sickel-cell-anemia

Step 2 : Install dependencies. cd into the genoset-sickel-cell-anemia folder and run

npm install

Step 3 : Build the logic. As you can see from here, the GS228 criteria are

And when we convert this to GQL, our index.js would become

That is it, we have created our own genome test set.

To test this, create a new file named checkSCA.js at the root of the project and update it as below

And you can run

node checkSCA.js

And you will know whether 23andme-male has sickle cell anemia. And 23andme-male does not have sickle cell anemia!

Simple, right?

Now you can either push this module to NPM or request Contra to add it to the existing genome.js repo.

You can find the above code here.

Hope this post gave you an idea of how to start DNA analysis with Node.js and genome.js.


Thanks for reading! Do comment.
@arvindr21

  • Johan Olsen

    Hi Arvind, thank you for the excellent tutorial.

    I’m trying to follow along with your tutorial but the code is not compiling:

    Here is my code:

    var dna = require('dna2json');
    var JSONStream = require('JSONStream');
    var fs = require('fs');

    fs.createReadStream("genome_Lilly_Mendel_Mom__Full_20120818003901.txt'")
    .pipe(dna.createParser())
    .pipe(JSONStream.stringify())
    .pipe(fs.createWriteStream("genome_Lilly_Mendel_Mom__Full_20120818003901.json"));

    is giving this error:

    .pipe(dna.createParser())
    ^
    TypeError: undefined is not a function
    at Object.<anonymous> (/app/app.js:41:11)
    at Module._compile (module.js:460:26)
    at Object.Module._extensions..js (module.js:478:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:501:10)
    at startup (node.js:129:16)
    at node.js:814:3

    When I look at the source code on GitHub, there is no createParser() function

    https://github.com/genomejs/dna2json/blob/master/index.js

    I also tried an alternative way:

    var txt = fs.readFileSync('genome_Lilly_Mendel_Mom__Full_20120818003901.txt');
    dna.parse(txt, function(err, snps){
    fs.writeFileSync(path.join(__dirname, 'genome_Lilly_Mendel_Mom__Full_20120818003901.json'), JSON.stringify(snps));
    });

    …But this is also giving an error:

    throw new Error('DNA must be a string');
    ^
    Error: DNA must be a string
    at Object.module.exports [as parse] (/app/node_modules/dna2json/lib/parse.js:9:11)
    at Object.<anonymous> (/app/app.js:34:5)
    at Module._compile (module.js:460:26)
    at Object.Module._extensions..js (module.js:478:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Function.Module.runMain (module.js:501:10)
    at startup (node.js:129:16)
    at node.js:814:3

    This exception is being thrown by this function in parse.js:

    module.exports = function(dna, cb){
    if (typeof dna !== 'string') {
    throw new Error('DNA must be a string');
    }

    }

    Can you kindly help with this problem? Thanks

  • Lio

    Amazing. I’m new to Node.js and know nothing about DNA analysis – but your article was fascinating to read. Love how sci-fi this sounds: git clone genotype-boilerplate. Thanks, and I’ll be checking out more of your articles!

    • http://thejackalofjavascript.com/ Arvind Ravulavaru

      Great! Thanks!