Single BLOB with many JSON documents, how to parse?

I have a huge binary containing multiple JSON documents. Is there a way to parse them using Jason (or any other JSON parser) into a list?

E.g., can I make Jason.decode("{}{}") return the same result as Jason.decode("[{},{}]")?


You asked specifically about using a parser to do it, but I’m not aware of any that can. You probably know better than I do how to work with binaries and charlists, but anyway, I would use something like this:

defmodule Test do
  # Wraps a blob of concatenated top-level JSON objects in [ ] and inserts
  # commas between them by counting brace depth.
  # Caveat: braces inside string values will still throw the count off.
  def enlist_json(a), do: enlist(a, 0, "")
  # Skip whitespace between documents only (depth 0), so that spaces inside
  # string values are preserved.
  def enlist(<<c, t::binary>>, 0, acc) when c in [?\s, ?\t, ?\n, ?\r], do: enlist(t, 0, acc)
  def enlist("{" <> t, 0, acc), do: enlist(t, 1, acc <> "{")
  # A top-level object closes: emit "}," to separate it from the next one.
  def enlist("}" <> t, 1, acc), do: enlist(t, 0, acc <> "},")
  def enlist("{" <> t, n, acc), do: enlist(t, n + 1, acc <> "{")
  def enlist("}" <> t, n, acc), do: enlist(t, n - 1, acc <> "}")
  def enlist(<<h::binary-size(1), t::binary>>, n, acc), do: enlist(t, n, acc <> h)
  # Input exhausted: drop the trailing comma and close the list.
  def enlist("", 0, acc), do: "[" <> String.trim_trailing(acc, ",") <> "]"

  # Same approach on a charlist; ?x are character literals (?\s is a space).
  def charlist_json(a), do: charlist(a, 0, [?[])
  def charlist([c | t], 0, acc) when c in [?\s, ?\t, ?\n, ?\r], do: charlist(t, 0, acc)
  def charlist([?{ | t], 0, acc), do: charlist(t, 1, [?{ | acc])
  def charlist([?} | t], 1, acc), do: charlist(t, 0, [?,, ?} | acc])
  def charlist([?{ | t], n, acc), do: charlist(t, n + 1, [?{ | acc])
  def charlist([?} | t], n, acc), do: charlist(t, n - 1, [?} | acc])
  def charlist([h | t], n, acc), do: charlist(t, n, [h | acc])
  # Input exhausted: drop the trailing comma (head of the reversed acc) and close.
  def charlist([], 0, [?, | acc]), do: List.to_string(Enum.reverse([?] | acc]))
  def charlist([], 0, acc), do: List.to_string(Enum.reverse([?] | acc]))
end
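
For example, once the module is compiled:

Test.enlist_json(~s({"a": 1}{"b": 2}))
#=> "[{\"a\": 1},{\"b\": 2}]"

Test.enlist_json(~s({"a": 1}{"b": 2})) |> Jason.decode!()
#=> [%{"a" => 1}, %{"b" => 2}]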

str = "{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}"

Benchee.run(
  %{"string" => fn -> Jason.decode!(Test.enlist_json(str)) end,
    "charlist" => fn ->
      charlist_str = String.to_charlist(str)
      Jason.decode!(Test.charlist_json(charlist_str))
    end},
  time: 20
)

I don’t know, though, whether charlists are still faster when the blob itself is large rather than small like this one, or how the two compare in memory (a charlist lives on the process heap, while a large binary is reference-counted and may be shared). I think you would need to benchmark with some realistic blobs and see for yourself. With this setup, charlists were faster:

Name               ips        average  deviation         median         99th %
charlist       45.36 K       22.05 μs   ±116.55%          20 μs          67 μs
string         28.64 K       34.92 μs    ±45.31%          32 μs          74 μs

Comparison: 
charlist       45.36 K
string         28.64 K - 1.58x slower

Yeah, I went this route as well (though splitting the blob into many strings that I Enum.map/2 over), but it annoys me that I have to do pretty much the same thing the parser does anyway (counting braces that are not inside a string).

I’d be happy with something that would roughly do this: {:ok, %{}, "{}"} = imaginary_parser("{}{}"), such that I can consume the remainder until it becomes "" and therefore know I have parsed the full input stream.
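
In other words, a driver loop like this (imaginary_parser/1 being, well, imaginary; it stands for any parser that hands back the unconsumed rest):

def decode_all(""), do: []
def decode_all(rest) do
  # Parse one document, then recurse on whatever input is left.
  {:ok, doc, rest} = imaginary_parser(rest)
  [doc | decode_all(rest)]
end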


Have you looked at Jaxon? Seems like it might do what you want.
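
For what it’s worth, it is event-based rather than whole-document-based. A sketch, under the assumption that Jaxon.Parser.parse/1 handles concatenated input the same way its README shows for a single document (the event names below are taken from that README):

Jaxon.Parser.parse(~s({"a":1}{"b":2}))
#=> [:start_object, {:string, "a"}, :colon, {:integer, 1}, :end_object,
#    :start_object, {:string, "b"}, :colon, {:integer, 2}, :end_object]

Counting :start_object / :end_object events would then let you split the stream into separate documents.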


@NobbZ Here is my suggestion:

defmodule Example do
  import Jason, only: [decode: 1]

  @doc """
  Example results:

      > Example.sample("{}")
      {:ok, %{}}

      > Example.sample("{}{")
      {:error, %Jason.DecodeError{data: "[{},{", position: 5, token: nil}}

      > Example.sample("{}{}")
      {:ok, [%{}, %{}]}

      > Example.sample("{\"key\": \"value}")
      {:error, %Jason.DecodeError{data: "[{\"key\": \"value}", position: 16, token: nil}}

      > Example.sample("{\"key\": \"value\"}")
      {:ok, %{"key" => "value"}}

      > Example.sample("{\"key\": \"value\"}{}")
      {:ok, [%{"key" => "value"}, %{}]}

      > Example.sample("{}{\"key\": \"value\"}")
      {:ok, [%{}], %{"key" => "value"}}

      > Example.sample("{}{\"key\": \"value\"}{}")
      {:ok, [%{}], %{"key" => "value"}, %{}}
  """
  def sample(data, old \\ nil), do: data |> decode() |> do_sample(data, old)

  # Decoded cleanly - done.
  defp do_sample({:ok, result}, _data, _old), do: {:ok, result}
  # First failure: try wrapping the whole blob in a list.
  defp do_sample({:error, %{position: new}}, data, nil), do: sample("[" <> data, new)
  # The error position did not move - give up and report it.
  defp do_sample({:error, %{position: new} = error}, _, old) when old == new, do: {:error, error}
  # The error moved forward - keep repairing at the new position.
  defp do_sample({:error, %{position: new}}, data, _old), do: do_sample(data, new)

  # The error is at the very end of the input: peel off the last character.
  defp do_sample({left, ""}, _old), do: left |> String.split_at(-1) |> do_sample()
  # Two adjacent objects: insert the missing comma and retry.
  defp do_sample({left, "{" <> right}, old), do: sample(left <> ",{" <> right, old)
  defp do_sample({left, right}, _old), do: decode(left <> right)
  # Split the data at the error position and dispatch on what follows.
  defp do_sample(data, new), do: data |> String.split_at(new) |> do_sample(new)

  # Already wrapped and terminated - decode as-is.
  defp do_sample({"[" <> left, "]"}), do: decode("[" <> left <> "]")
  # Perhaps a closing ] is also missing - try as-is first, then with ] appended.
  defp do_sample({"[" <> _ = left, "}"}), do: (left <> "}") |> decode() |> check(left <> "}")
  defp do_sample({left, "}"}), do: decode(left <> "}")
  defp do_sample({left, right}), do: decode(left <> right)

  defp check({:ok, result}, _), do: {:ok, result}
  defp check({:error, _error}, %{} = old_error), do: {:error, old_error}
  defp check({:error, error}, left), do: (left <> "]") |> decode() |> check(error)
end

This module automatically tries to fix:

  1. a missing [ at the start of the JSON
  2. a missing , between two JSON objects
  3. a missing ] at the end of the JSON

Unfortunately, automatically fixing missing [ and ] characters deeper in the structure is a bit hard for me based on Jason’s errors.

Let me know if it helps.


@AstonJ Is it my bad, or do we have a problem with multi-line text highlighting?


What do you mean, Tomasz? The code block above? It’s all there, you just have to scroll 🙂

Not scroll; by text highlighting I mean syntax highlighting. Just take a look at it:

test = """
This is a "test"
"""

Ah, I see what you mean. Discourse uses highlight.js, so changes would need to be made there, and then Discourse should pick up the new version 🙂

@knewter is the author, perhaps he could help?


Are you able to find a way to reliably split the separate documents by newline in the string blob? Replacing "}{" with "}\n{" might work depending on the input data (but it is sadly likely to be a very fragile solution).
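
For example (fragile as noted: any string value that happens to contain "}{" would be corrupted too, and documents separated by whitespace would not be split):

String.replace(~s({"a": 1}{"b": 2}), "}{", "}\n{")
#=> "{\"a\": 1}\n{\"b\": 2}"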

If you manage to achieve separation, then parsing the JSON documents line by line should be extremely easy:

string_blob
|> magic_way_to_split_the_blob()
|> StringIO.open()   # emulate a device that reads from a string
|> elem(1)           # get the device from the {:ok, device} tuple
|> IO.stream(:line)  # wrap in a Stream-friendly object
|> Stream.map(&Jason.decode/1)
#|>  ...do any other filtering or transformation by using Stream functions here...
|> Enum.to_list()

Which would give you a bunch of {:ok, document} or {:error, reason} tuples in a list.
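
If only the successful decodes matter, one option is to flat-map over that list (a sketch; results stands for the list produced by the pipeline above):

results
|> Enum.flat_map(fn
  {:ok, doc} -> [doc]       # keep the decoded document
  {:error, _reason} -> []   # silently drop anything that failed to parse
end)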
