
[Project Log #3] Pipeline for annotating DNA contigs


This is the set of steps I have followed in an attempt to annotate the data I have been given and put it into a database. At the moment it is almost entirely a manual task, but if the scope of my project allows I would like to turn this into a fully automated process that is compatible with the web front end that I produce.

Raw contigs

The first set of data is the reads from the sequencer, assembled into many contigs.

This produces candida_boidinii.fa .

Diamond blast against NCBI non-redundant database

The second set of data is generated by using diamond to compare the contigs against the nr database to find known genes.

./diamond makedb --in nr -d nr_ref
./diamond blastx -d nr_ref.dmnd -q candida_boidinii.fa -o blastx_nr_boidinii.m8 -p 8

This produces blastx_nr_boidinii.m8 .
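The .m8 file is diamond's tab-separated BLAST tabular output (the classic -outfmt 6 layout). As a sketch of how later automation could read it, here is a minimal parser; parseM8 is a hypothetical helper, not part of the pipeline itself:

```javascript
// Parse one line of diamond's tabular (BLAST -m8 / -outfmt 6) output.
// The twelve columns are: query id, subject id, % identity, alignment
// length, mismatches, gap opens, query start/end, subject start/end,
// e-value and bit score.
function parseM8(line) {
  const f = line.split('\t')
  return {
    qseqid: f[0], sseqid: f[1],
    pident: parseFloat(f[2]), length: parseInt(f[3], 10),
    mismatch: parseInt(f[4], 10), gapopen: parseInt(f[5], 10),
    qstart: parseInt(f[6], 10), qend: parseInt(f[7], 10),
    sstart: parseInt(f[8], 10), send: parseInt(f[9], 10),
    evalue: parseFloat(f[10]), bitscore: parseFloat(f[11])
  }
}
```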

Select best matches from blast

Using awk, pull out only the best match for each contig from the diamond output. I need a proper bioinformatician to tell me if this is okay or not.

awk '!_[$1]++ {print}' blastx_nr_boidinii.m8 > boidinii_best.m8

This produces boidinii_best.m8 .
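For a future automated version, the same filter could live in the ingestion script. This sketch mirrors the awk one-liner: keep only the first line seen for each query ID, which (like the awk filter) assumes diamond lists the best-scoring hit for each query first:

```javascript
// Equivalent of  awk '!_[$1]++'  over .m8 lines: keep the first line
// per query ID (column 1), dropping later, lower-ranked hits.
function bestHits(lines) {
  const seen = new Set()
  return lines.filter(line => {
    const query = line.split('\t')[0]
    if (seen.has(query)) return false
    seen.add(query)
    return true
  })
}
```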

Get uniprot & Kegg IDs and proteins

Use awk to get the list of RefSeq protein IDs from the blast results

awk '{print $2}' boidinii_best.m8 | xclip

Then paste the list into the UniProt ID mapping tool and map from RefSeq Protein -> UniProtKB.

Download the Mapping Table file, to get a list of the IDs that correspond to uniprot IDs.

Also download the fasta format of all the proteins found.

This produces boidinii_uniprot.fa & boidinii_uniprot_ids.lst .

Take the uniprot IDs and get the conversion to KEGG IDs in the same way. This produces boidinii_kegg_ids.lst .
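If this step is automated later, the two downloaded mapping tables can be chained into a single lookup. A sketch, assuming the tables are the two-column, tab-separated "From / To" files that UniProt's mapping download produces (buildIdMap is a hypothetical helper):

```javascript
// Chain RefSeq -> UniProt and UniProt -> KEGG mapping tables into one
// lookup: RefSeq protein ID -> { uniprot, kegg }.
function buildIdMap(refseqToUniprot, uniprotToKegg) {
  const keggById = new Map(uniprotToKegg.map(l => l.split('\t')))
  const map = new Map()
  for (const line of refseqToUniprot) {
    const [refseq, uniprot] = line.split('\t')
    // kegg is null when the UniProt ID had no KEGG counterpart
    map.set(refseq, { uniprot, kegg: keggById.get(uniprot) || null })
  }
  return map
}
```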

Merge the data into a working file

Using Vim macros it is simple to combine the found data into a copy of the original fasta file.

The gene is now annotated with the original scaffold, the blast result, and the uniprot & kegg ID.

>C76265 63.0 | XP_002553495.1 23.7 219 154 5 696 1331 25 237 4.1e-05 58.9 | C5DFV2 | lth:KLTH0D18194g AAAAAAAAAAAAATGTTCAGTCAAAAATAAGCTAATTTACCGTACAATGGCATGCATATGCGACAAGGTTCTTTTTTTCTGTTGTTTAGCAAATGCAGTA AACCAGTGGTTATACATTCATCATTAGGTGGTACTCTAAATCTGTCTTTATAATCCATCTTTTATCCATAAGTGAAGCTGAAAAGGCTGAAAGTCCTTTT

This produces boidinii_working.fa .
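To take the Vim macros out of the loop eventually, the annotated header could be built in code. A sketch that produces the same "contig | blast | uniprot | kegg" header shown above; annotateHeader is a hypothetical helper, and the fallback strings match the ones the ingestion script expects:

```javascript
// Build a boidinii_working.fa header from one contig's annotations,
// using the same placeholder text as the MongoDB ingestion script
// when a field is missing.
function annotateHeader(contigHead, blastHit, uniprotId, keggId) {
  return ['>' + contigHead,
          blastHit || 'No blast result',
          uniprotId || 'No uniprot match',
          keggId || 'No kegg match'].join(' | ')
}
```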

Create a database with the found data

Now we have two files that need to be ingested into a database. The boidinii_working.fa & boidinii_uniprot.fa files should contain all the data that the website needs.

I have created a quick and nasty script to put this data into MongoDB.

const fasta2json = require('fasta2json')
    , MongoClient = require('mongodb').MongoClient
    , url = 'mongodb://localhost:27017/candida'

let proteins = fasta2json.ReadFasta('../data/boidinii/boidinii_uniprot.fa')
  , species = []

// Override the library's parser so the "contig | blast | uniprot | kegg"
// headers in the working file are split into separate fields.
fasta2json.ParseFasta = (str) => {
  let fasta = []
    , seqs = str.split('>')
  for (let i = 1; i < seqs.length; i++) {
    seqs[i] = seqs[i].replace(/\r/g, '')
    let seq = {}
      , fas = seqs[i].split('\n')
      , head = fas[0]
    seq.contig = head.split('|')[0].trim()
    seq.blast = (head.split('|')[1] || 'No blast result').trim()
    seq.uniprot = (head.split('|')[2] || 'No uniprot match').trim()
    seq.kegg = (head.split('|')[3] || 'No kegg match').trim()
    // Re-join the wrapped sequence lines into one string.
    seq.sequence = ''
    for (let j = 1; j < fas.length; j++) {
      seq.sequence += fas[j]
    }
    // Attach the matching protein record from the uniprot fasta, if any.
    seq.protein = proteins.filter((obj) => {
      let proteinId = obj.head.split('|')[1]
      return proteinId === seq.uniprot
    })[0] || 'No uniprot match'
    fasta.push(seq)
  }
  return fasta
}

species = fasta2json.ReadFasta('../data/boidinii/boidinii_working.fa')

MongoClient.connect(url, (err, db) => {
  if (err) console.log('ERROR: ' + err)
  let collection = db.collection('boidinii')
  collection.drop()
  collection.insertMany(species, (err, result) => {
    if (err) console.log('ERROR: ' + err)
    db.close()
  })
})

This provides the following data:

> db.boidinii.findOne({ uniprot: "K4AC16" })
{
    "_id" : ObjectId("58b96c298660061466e9be53"),
    "contig" : "C69229 4.0",
    "blast" : "XP_004983205.1 91.7 36 3 0 146 39 311 346 6.1e-09 68.2",
    "uniprot" : "K4AC16",
    "kegg" : "sita:101760397",
    "sequence" : "GAGGATCCTAACAATCTAGTAAGGCTAG...",
    "protein" : {
        "head" : "tr|K4AC16|K4AC16_SETIT Uncharacterized protein...",
        "seq" : "MNIASAALVFLAHCLLLHRCMGSEA..."
    }
}

Build a web front end

Then I just have to make a web front end to make this data accessible, including viewing, searching and comparing genes across the species.
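As a first thought on the search part, the front end could turn a free-text term into a MongoDB filter over the annotated fields. This is a hypothetical sketch, not part of anything built yet; searchFilter and the field choice are my assumptions:

```javascript
// Build a MongoDB filter that matches a search term (case-insensitively)
// against the contig, uniprot or kegg fields of a boidinii document.
function searchFilter(term) {
  // Escape regex metacharacters so user input is treated literally.
  const re = new RegExp(term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), 'i')
  return { $or: [{ contig: re }, { uniprot: re }, { kegg: re }] }
}
```

The filter would be passed straight to collection.find() when a search request comes in.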

This is my next big challenge. I will create a list of stories for this task.

