
[Project Log #3] Pipeline for annotating DNA contigs


This is the set of steps I have followed in an attempt to annotate the data I have been given and put it into a database. At the moment it is almost entirely a manual task, but if the scope of my project allows I would like to turn this into a fully automated process that is compatible with the web front end that I produce.

Raw contigs

The first set of data is the reads from the sequencer, assembled into many contigs.

This produces candida_boidinii.fa .

Diamond blast against NCBI non-redundant database

The second set of data is generated by using diamond to compare the contigs against the nr database to find known genes.

./diamond makedb --in nr -d nr_ref
./diamond blastx -d nr_ref.dmnd -q candida_boidinii.fa -o blastx_nr_boidinii.m8 -p 8

This produces blastx_nr_boidinii.m8 .
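The .m8 file is diamond's tab-separated BLAST tabular output (the classic -outfmt 6 layout). As a sketch of how later automation could read it, here is a minimal parser; parseM8 is a hypothetical helper, not part of the pipeline itself:

```javascript
// Parse one line of diamond's tabular (BLAST -m8 / -outfmt 6) output.
// The twelve columns are: query id, subject id, % identity, alignment
// length, mismatches, gap opens, query start/end, subject start/end,
// e-value and bit score.
function parseM8(line) {
  const f = line.split('\t')
  return {
    qseqid: f[0], sseqid: f[1],
    pident: parseFloat(f[2]), length: parseInt(f[3], 10),
    mismatch: parseInt(f[4], 10), gapopen: parseInt(f[5], 10),
    qstart: parseInt(f[6], 10), qend: parseInt(f[7], 10),
    sstart: parseInt(f[8], 10), send: parseInt(f[9], 10),
    evalue: parseFloat(f[10]), bitscore: parseFloat(f[11])
  }
}
```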

Select best matches from blast

Using awk, pull out only the best match for each contig from the diamond output. I need a proper bioinformatician to tell me if this is okay or not.

awk '!_[$1]++ {print}' blastx_nr_boidinii.m8 > boidinii_best.m8

This produces boidinii_best.m8 .
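For a future automated version, the same filter could live in the ingestion script. This sketch mirrors the awk one-liner: keep only the first line seen for each query ID, which (like the awk filter) assumes diamond lists the best-scoring hit for each query first:

```javascript
// Equivalent of  awk '!_[$1]++'  over .m8 lines: keep the first line
// per query ID (column 1), dropping later, lower-ranked hits.
function bestHits(lines) {
  const seen = new Set()
  return lines.filter(line => {
    const query = line.split('\t')[0]
    if (seen.has(query)) return false
    seen.add(query)
    return true
  })
}
```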

Get uniprot & Kegg IDs and proteins

Use awk to get the list of RefSeq protein IDs from the blast results

awk '{print $2}' boidinii_best.m8 | xclip

Then paste the list into the UniProt ID mapping tool and map from RefSeq Protein -> UniProtKB.

Download the Mapping Table file, to get a list of the IDs that correspond to uniprot IDs.

Also download the fasta format of all the proteins found.

This produces boidinii_uniprot.fa & boidinii_uniprot_ids.lst .

Take the uniprot IDs and get the conversion to KEGG IDs in the same way. This produces boidinii_kegg_ids.lst .
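If this step is automated later, the two downloaded mapping tables can be chained into a single lookup. A sketch, assuming the tables are the two-column, tab-separated "From / To" files that UniProt's mapping download produces (buildIdMap is a hypothetical helper):

```javascript
// Chain RefSeq -> UniProt and UniProt -> KEGG mapping tables into one
// lookup: RefSeq protein ID -> { uniprot, kegg }.
function buildIdMap(refseqToUniprot, uniprotToKegg) {
  const keggById = new Map(uniprotToKegg.map(l => l.split('\t')))
  const map = new Map()
  for (const line of refseqToUniprot) {
    const [refseq, uniprot] = line.split('\t')
    // kegg is null when the UniProt ID had no KEGG counterpart
    map.set(refseq, { uniprot, kegg: keggById.get(uniprot) || null })
  }
  return map
}
```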

Merge the data into a working file

Using Vim macros it is simple to combine the found data into a copy of the original fasta file.

The gene is now annotated with the original scaffold, the blast result, and the uniprot & kegg ID.

>C76265 63.0 | XP_002553495.1 23.7 219 154 5 696 1331 25 237 4.1e-05 58.9 | C5DFV2 | lth:KLTH0D18194g AAAAAAAAAAAAATGTTCAGTCAAAAATAAGCTAATTTACCGTACAATGGCATGCATATGCGACAAGGTTCTTTTTTTCTGTTGTTTAGCAAATGCAGTA AACCAGTGGTTATACATTCATCATTAGGTGGTACTCTAAATCTGTCTTTATAATCCATCTTTTATCCATAAGTGAAGCTGAAAAGGCTGAAAGTCCTTTT

This produces boidinii_working.fa .
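To take the Vim macros out of the loop eventually, the annotated header could be built in code. A sketch that produces the same "contig | blast | uniprot | kegg" header shown above; annotateHeader is a hypothetical helper, and the fallback strings match the ones the ingestion script expects:

```javascript
// Build a boidinii_working.fa header from one contig's annotations,
// using the same placeholder text as the MongoDB ingestion script
// when a field is missing.
function annotateHeader(contigHead, blastHit, uniprotId, keggId) {
  return ['>' + contigHead,
          blastHit || 'No blast result',
          uniprotId || 'No uniprot match',
          keggId || 'No kegg match'].join(' | ')
}
```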

Create a database with the found data

Now we have two files that need to be ingested into a database. The boidinii_working.fa & boidinii_uniprot.fa files should contain all the data that the website needs.

I have created a quick and nasty script to put this data into MongoDB.

const fasta2json = require('fasta2json')
    , MongoClient = require('mongodb').MongoClient
    , url = 'mongodb://localhost:27017/candida'

let proteins = fasta2json.ReadFasta('../data/boidinii/boidinii_uniprot.fa')
  , species = []

// Override the library's parser so the "contig | blast | uniprot | kegg"
// headers in the working file are split into separate fields.
fasta2json.ParseFasta = (str) => {
  let fasta = []
    , seqs = str.split('>')
  for (let i = 1; i < seqs.length; i++) {
    seqs[i] = seqs[i].replace(/\r/g, '')
    let seq = {}
      , fas = seqs[i].split('\n')
      , head = fas[0]
    seq.contig = head.split('|')[0].trim()
    seq.blast = (head.split('|')[1] || 'No blast result').trim()
    seq.uniprot = (head.split('|')[2] || 'No uniprot match').trim()
    seq.kegg = (head.split('|')[3] || 'No kegg match').trim()
    // Re-join the wrapped sequence lines into one string.
    seq.sequence = ''
    for (let j = 1; j < fas.length; j++) {
      seq.sequence += fas[j]
    }
    // Attach the matching protein record from the uniprot fasta, if any.
    seq.protein = proteins.filter((obj) => {
      let proteinId = obj.head.split('|')[1]
      return proteinId === seq.uniprot
    })[0] || 'No uniprot match'
    fasta.push(seq)
  }
  return fasta
}

species = fasta2json.ReadFasta('../data/boidinii/boidinii_working.fa')

MongoClient.connect(url, (err, db) => {
  if (err) console.log('ERROR: ' + err)
  let collection = db.collection('boidinii')
  collection.drop()
  collection.insertMany(species, (err, result) => {
    if (err) console.log('ERROR: ' + err)
    db.close()
  })
})

This provides the following data:

> db.boidinii.findOne({ uniprot: "K4AC16" })
{
    "_id" : ObjectId("58b96c298660061466e9be53"),
    "contig" : "C69229 4.0",
    "blast" : "XP_004983205.1 91.7 36 3 0 146 39 311 346 6.1e-09 68.2",
    "uniprot" : "K4AC16",
    "kegg" : "sita:101760397",
    "sequence" : "GAGGATCCTAACAATCTAGTAAGGCTAG...",
    "protein" : {
        "head" : "tr|K4AC16|K4AC16_SETIT Uncharacterized protein...",
        "seq" : "MNIASAALVFLAHCLLLHRCMGSEA..."
    }
}

Build a web front end

Then I just have to make a web front end to make this data accessible, including viewing, searching and comparing genes across the species.
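As a first thought on the search part, the front end could turn a free-text term into a MongoDB filter over the annotated fields. This is a hypothetical sketch, not part of anything built yet; searchFilter and the field choice are my assumptions:

```javascript
// Build a MongoDB filter that matches a search term (case-insensitively)
// against the contig, uniprot or kegg fields of a boidinii document.
function searchFilter(term) {
  // Escape regex metacharacters so user input is treated literally.
  const re = new RegExp(term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), 'i')
  return { $or: [{ contig: re }, { uniprot: re }, { kegg: re }] }
}
```

The filter would be passed straight to collection.find() when a search request comes in.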

This is my next big challenge. I will create a list of stories for this task.

