How to work with BAM files in Bioinformatics pipelines?

zhangzhen · July 8, 2018, 7:34am

I want to use GenStage to rewrite Bioinformatics pipelines. Processing BAM files is one of the most important tasks in such pipelines. BAM files are compressed using BGZF. Is there a package for dealing with BAM files in Elixir? If not, could you please give me some hints about how to work with them?

patrickdm · July 8, 2018, 2:00pm

I suspect you would need to write the relevant code to handle reading/writing BAM files yourself. I just started looking into The SAM/BAM Format Specification, which describes the binary structure of both BGZF and BAM. It seems it is all in there… but… well… there is so much of it.
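To give a feel for what that binary structure looks like in Elixir, here is a minimal sketch that splits one BGZF block off the front of a binary using pattern matching. The field layout (gzip magic bytes, the `BC` extra subfield carrying `BSIZE`, and the trailing `CRC32` and `ISIZE`) follows the BGZF section of the specification; the module and function names are hypothetical, and the sketch assumes the `BC` subfield is the first extra subfield, which is the usual case. It does not inflate the payload or parse BAM records.

```elixir
defmodule BgzfSketch do
  # Hypothetical helper, not an existing package.
  # Splits one BGZF block off the front of `data` without decompressing it.
  # All multi-byte integers in BGZF are little-endian.
  def parse_block(
        <<31, 139, 8, 4, _mtime::little-32, _xfl, _os, xlen::little-16,
          ?B, ?C, 2::little-16, bsize::little-16, rest::binary>>
      ) do
    # Assumption: "BC" is the first extra subfield; skip any others after it.
    skip = xlen - 6
    # BSIZE is the total block size minus 1, so the deflate payload is
    # BSIZE - XLEN - 19 bytes long.
    cdata_len = bsize - xlen - 19

    case rest do
      <<_extra::binary-size(skip), cdata::binary-size(cdata_len),
        crc32::little-32, isize::little-32, tail::binary>> ->
        block = %{block_size: bsize + 1, cdata: cdata, crc32: crc32, isize: isize}
        {:ok, block, tail}

      _ ->
        {:error, :truncated}
    end
  end

  def parse_block(_other), do: {:error, :not_bgzf}
end
```

Calling `BgzfSketch.parse_block/1` repeatedly on the returned tail walks through the file block by block; each `cdata` is a raw deflate stream that still has to be inflated before the BAM records inside it can be read.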

idi527 · July 8, 2018, 3:41pm

Have you started working on it? If so, do you have a github repo?

@zhangzhen, maybe you can call into some C code via a port to decode it until a library in Erlang or Elixir appears?

patrickdm · July 8, 2018, 4:14pm

No, I have not; I have only skimmed the specification document so far…
I got intrigued by the question because, in the distant past, I wrote some glue scripts for processing sequencing data, multiple alignments, primer search…
Anyway, I feel it is a big effort for what would (unfortunately for me) be just a pet project, even though I understand Elixir is very well suited to the task and I’d learn a lot from it!
Moreover, I feel that Elixir has a lot of potential in bioinformatics, but this still seems to be an almost unexplored domain.

patrickdm · July 8, 2018, 10:21pm

@zhangzhen, following idi527’s suggestion to use a port with existing code, I did a little searching and found SAMtools, which looks like a good starting point and may be useful for your goals. Hope it helps.
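As a rough illustration of the port idea, the sketch below runs `samtools view` on a BAM file through an Erlang port and hands each decoded SAM record to a callback. It assumes the `samtools` executable is installed and on the PATH; the module name `SamView` and its functions are hypothetical, and only the eleven mandatory SAM columns are parsed (optional tags are kept as raw strings).

```elixir
defmodule SamView do
  # Hypothetical helper, not an existing package.

  # Parse the 11 mandatory tab-separated SAM columns of one alignment line.
  def parse_line(line) do
    [qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual | tags] =
      String.split(line, "\t")

    %{
      qname: qname,
      flag: String.to_integer(flag),
      rname: rname,
      pos: String.to_integer(pos),
      mapq: String.to_integer(mapq),
      cigar: cigar,
      rnext: rnext,
      pnext: String.to_integer(pnext),
      tlen: String.to_integer(tlen),
      seq: seq,
      qual: qual,
      tags: tags
    }
  end

  # Run `samtools view <bam_path>` and call `fun` with each parsed record.
  # Returns the exit status of samtools.
  def each_record(bam_path, fun) do
    exe = System.find_executable("samtools") || raise "samtools not found on PATH"

    port =
      Port.open({:spawn_executable, exe}, [
        :binary,
        :exit_status,
        {:line, 1_048_576},
        args: ["view", bam_path]
      ])

    loop(port, "", fun)
  end

  # In :line mode the port delivers complete lines as {:eol, data} and
  # over-long line fragments as {:noeol, data}, so fragments are buffered.
  defp loop(port, acc, fun) do
    receive do
      {^port, {:data, {:noeol, chunk}}} ->
        loop(port, acc <> chunk, fun)

      {^port, {:data, {:eol, chunk}}} ->
        fun.(parse_line(acc <> chunk))
        loop(port, "", fun)

      {^port, {:exit_status, status}} ->
        status
    end
  end
end
```

A call such as `SamView.each_record("sample.bam", &IO.inspect/1)` (with a placeholder file name) would print one map per alignment. In a GenStage pipeline the same receive loop could live inside a producer that emits records on demand instead of invoking a callback.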