Please help me improve my solution to the "Computing GC Content" challenge on the Rosalind site (bioinformatics topic)

krlsdu · April 23, 2018, 11:23pm

Hi guys!
I would like your help to improve my solution.
my test
I solved the code challenge, but my code is not clear.

I think it could be much better with help from other people.

OvermindDL1 · April 24, 2018, 3:04pm

Eh, it could be shortened a bit and made a bit more efficient but not really that much more readable either. I’d probably have the actual calculation be something like this though (as it is significantly faster):

iex(1)> dna = "AGCTATAG"
"AGCTATAG"
iex(2)> Enum.reduce(to_charlist(dna), 0, &if(&1==?C or &1==?G, do: &2+1, else: &2))/byte_size(dna)
0.375

If that were wrapped in a case do ... end, the whole thing could be pipelined into just a dozen lines or so.
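
As a rough sketch of how that calculation might slot into a pipeline over a whole FASTA dataset: the module and function names below (GcPipelineSketch, gc_ratio, highest_gc) are hypothetical. The sketch assumes each record starts with ">" followed by a label line, that every sequence is non-empty, and it leaves out the case wrapper mentioned above.

```elixir
defmodule GcPipelineSketch do
  # Fraction of G/C characters in a DNA string, using the reduce shown above.
  # Assumes a non-empty ASCII string; an empty string would divide by zero.
  def gc_ratio(dna) do
    Enum.reduce(to_charlist(dna), 0, &if(&1 == ?C or &1 == ?G, do: &2 + 1, else: &2)) /
      byte_size(dna)
  end

  # Returns {label, gc_percentage} for the record with the highest GC content.
  def highest_gc(fasta) do
    fasta
    |> String.trim()
    |> String.split(">", trim: true)
    |> Enum.map(fn record ->
      # First line is the label, the remaining lines are the sequence.
      [label | seq_lines] = String.split(record, ["\r\n", "\n"], trim: true)
      {label, 100 * gc_ratio(Enum.join(seq_lines))}
    end)
    |> Enum.max_by(fn {_label, percent} -> percent end)
  end
end
```

Calling GcPipelineSketch.highest_gc/1 on the dataset text returns a tuple, so formatting the answer the way Rosalind expects would be a separate step.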

mbuhot · April 24, 2018, 9:28pm

Here’s a version with some helper functions split out, using Stream to eliminate intermediate lists, and binary matching to count the G and C characters:

defmodule Gc do
  def gc_content(dataset) do
    {key, gc_percent} =
      dataset
      |> parse_lines()
      |> Stream.map(fn {k, v} -> {k, gc_percent(v)} end)
      |> Enum.max_by(&elem(&1, 1))

    "#{key}\n#{gc_percent}"
  end

  @spec parse_lines(String.t()) :: Enumerable.t()
  def parse_lines(dataset) do
    dataset
    |> String.replace("\n", "")
    |> String.split(">", trim: true)
    |> Stream.map(&String.split_at(&1, 13))
  end

  @spec gc_percent(String.t()) :: float
  def gc_percent(val), do: Float.round(100 * gc_count(val) / String.length(val), 7)

  @spec gc_count(String.t(), integer) :: integer
  def gc_count(val, n \\ 0)
  def gc_count("", n), do: n
  def gc_count("G" <> rest, n), do: gc_count(rest, n + 1)
  def gc_count("C" <> rest, n), do: gc_count(rest, n + 1)
  def gc_count(<<_::utf8>> <> rest, n), do: gc_count(rest, n)
end

krlsdu · April 24, 2018, 11:42pm

Cool solution!
You are right that the code is not very readable,
but sometimes performance matters more.
Since bioinformatics usually works with large files and data, that is really important.

Thinking about your solution and looking to improve its readability,
the “if” could be turned into functions.
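
A minimal sketch of that idea, assuming the reduce-based calculation from earlier in the thread: each branch of the “if” becomes one clause of a small helper function. The names GcClauses, count_gc and gc_ratio are made up for illustration.

```elixir
defmodule GcClauses do
  # Fraction of G/C characters in a DNA string.
  # Assumes a non-empty ASCII string; an empty string would divide by zero.
  def gc_ratio(dna) do
    gc =
      dna
      |> to_charlist()
      |> Enum.reduce(0, &count_gc/2)

    gc / byte_size(dna)
  end

  # One clause per branch of the original `if`:
  # G and C increment the counter, anything else leaves it unchanged.
  defp count_gc(?G, acc), do: acc + 1
  defp count_gc(?C, acc), do: acc + 1
  defp count_gc(_other, acc), do: acc
end
```

With this, GcClauses.gc_ratio("AGCTATAG") gives the same 0.375 as the one-liner.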

What do you think?