Single BLOB with many JSON documents, how to parse?

I have a huge binary containing multiple JSON documents. Is there a way to parse them using Jason (or any other JSON parser) into a list?

E.g., can I make Jason.decode("{}{}") return the same result as Jason.decode("[{},{}]")?


You asked specifically about using a parser to do it, but I’m not aware of any that can. You probably know better than I do how to work with binaries and charlists, but anyway, I would use something like this:

defmodule Test do
  # Wraps a blob of concatenated top-level JSON objects in [ ] and inserts
  # commas between them by counting brace depth.
  # Caveat: braces inside string values will still throw the count off.
  def enlist_json(a), do: enlist(a, 0, "")
  # Skip whitespace between documents only (depth 0), so that spaces inside
  # string values are preserved.
  def enlist(<<c, t::binary>>, 0, acc) when c in [?\s, ?\t, ?\n, ?\r], do: enlist(t, 0, acc)
  def enlist("{" <> t, 0, acc), do: enlist(t, 1, acc <> "{")
  # A top-level object closes: emit "}," to separate it from the next one.
  def enlist("}" <> t, 1, acc), do: enlist(t, 0, acc <> "},")
  def enlist("{" <> t, n, acc), do: enlist(t, n + 1, acc <> "{")
  def enlist("}" <> t, n, acc), do: enlist(t, n - 1, acc <> "}")
  def enlist(<<h::binary-size(1), t::binary>>, n, acc), do: enlist(t, n, acc <> h)
  # Input exhausted: drop the trailing comma and close the list.
  def enlist("", 0, acc), do: "[" <> String.trim_trailing(acc, ",") <> "]"

  # Same approach on a charlist; ?x are character literals (?\s is a space).
  def charlist_json(a), do: charlist(a, 0, [?[])
  def charlist([c | t], 0, acc) when c in [?\s, ?\t, ?\n, ?\r], do: charlist(t, 0, acc)
  def charlist([?{ | t], 0, acc), do: charlist(t, 1, [?{ | acc])
  def charlist([?} | t], 1, acc), do: charlist(t, 0, [?,, ?} | acc])
  def charlist([?{ | t], n, acc), do: charlist(t, n + 1, [?{ | acc])
  def charlist([?} | t], n, acc), do: charlist(t, n - 1, [?} | acc])
  def charlist([h | t], n, acc), do: charlist(t, n, [h | acc])
  # Input exhausted: drop the trailing comma (head of the reversed acc) and close.
  def charlist([], 0, [?, | acc]), do: List.to_string(Enum.reverse([?] | acc]))
  def charlist([], 0, acc), do: List.to_string(Enum.reverse([?] | acc]))
end
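
For example, once the module is compiled:

Test.enlist_json(~s({"a": 1}{"b": 2}))
#=> "[{\"a\": 1},{\"b\": 2}]"

Test.enlist_json(~s({"a": 1}{"b": 2})) |> Jason.decode!()
#=> [%{"a" => 1}, %{"b" => 2}]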

str = "{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}{ \"a\": { \"b\": 2, \"c\": { \"d\": 1 } } }{\"a\": 2, \"b\": { \"c\": 4 }}"

Benchee.run(
  %{"string" => fn -> Jason.decode!(Test.enlist_json(str)) end,
    "charlist" => fn ->
      charlist_str = String.to_charlist(str)
      Jason.decode!(Test.charlist_json(charlist_str))
    end},
  time: 20
)

I don’t know, though, whether charlists are still faster when the blob itself is large rather than small like this one, or how the two compare in memory (a charlist lives on the process heap, while a large binary is reference-counted and may be shared). I think you would need to benchmark with some realistic blobs and see for yourself. With this setup, charlists were faster:

Name               ips        average  deviation         median         99th %
charlist       45.36 K       22.05 μs   ±116.55%          20 μs          67 μs
string         28.64 K       34.92 μs    ±45.31%          32 μs          74 μs

Comparison: 
charlist       45.36 K
string         28.64 K - 1.58x slower

Yeah, I went this route as well (though splitting the blob into many strings that I Enum.map/2 over), but it annoys me that I have to do pretty much the same thing the parser does anyway (counting braces that are not inside a string).

I’d be happy with something that would roughly do this: {:ok, %{}, "{}"} = imaginary_parser("{}{}"), such that I can consume the remainder until it becomes "" and therefore know I have parsed the full input stream.
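
In other words, a driver loop like this (imaginary_parser/1 being, well, imaginary; it stands for any parser that hands back the unconsumed rest):

def decode_all(""), do: []
def decode_all(rest) do
  # Parse one document, then recurse on whatever input is left.
  {:ok, doc, rest} = imaginary_parser(rest)
  [doc | decode_all(rest)]
end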


Have you looked at Jaxon? Seems like it might do what you want.
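
For what it’s worth, it is event-based rather than whole-document-based. A sketch, under the assumption that Jaxon.Parser.parse/1 handles concatenated input the same way its README shows for a single document (the event names below are taken from that README):

Jaxon.Parser.parse(~s({"a":1}{"b":2}))
#=> [:start_object, {:string, "a"}, :colon, {:integer, 1}, :end_object,
#    :start_object, {:string, "b"}, :colon, {:integer, 2}, :end_object]

Counting :start_object / :end_object events would then let you split the stream into separate documents.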


@NobbZ Here is my suggestion:

defmodule Example do
  import Jason, only: [decode: 1]

  @doc """
  Example results:

      > Example.sample("{}")
      {:ok, %{}}

      > Example.sample("{}{")
      {:error, %Jason.DecodeError{data: "[{},{", position: 5, token: nil}}

      > Example.sample("{}{}")
      {:ok, [%{}, %{}]}

      > Example.sample("{\"key\": \"value}")
      {:error, %Jason.DecodeError{data: "[{\"key\": \"value}", position: 16, token: nil}}

      > Example.sample("{\"key\": \"value\"}")
      {:ok, %{"key" => "value"}}

      > Example.sample("{\"key\": \"value\"}{}")
      {:ok, [%{"key" => "value"}, %{}]}

      > Example.sample("{}{\"key\": \"value\"}")
      {:ok, [%{}], %{"key" => "value"}}

      > Example.sample("{}{\"key\": \"value\"}{}")
      {:ok, [%{}], %{"key" => "value"}, %{}}
  """
  def sample(data, old \\ nil), do: data |> decode() |> do_sample(data, old)

  # Decoded cleanly - done.
  defp do_sample({:ok, result}, _data, _old), do: {:ok, result}
  # First failure: try wrapping the whole blob in a list.
  defp do_sample({:error, %{position: new}}, data, nil), do: sample("[" <> data, new)
  # The error position did not move - give up and report it.
  defp do_sample({:error, %{position: new} = error}, _, old) when old == new, do: {:error, error}
  # The error moved forward - keep repairing at the new position.
  defp do_sample({:error, %{position: new}}, data, _old), do: do_sample(data, new)

  # The error is at the very end of the input: peel off the last character.
  defp do_sample({left, ""}, _old), do: left |> String.split_at(-1) |> do_sample()
  # Two adjacent objects: insert the missing comma and retry.
  defp do_sample({left, "{" <> right}, old), do: sample(left <> ",{" <> right, old)
  defp do_sample({left, right}, _old), do: decode(left <> right)
  # Split the data at the error position and dispatch on what follows.
  defp do_sample(data, new), do: data |> String.split_at(new) |> do_sample(new)

  # Already wrapped and terminated - decode as-is.
  defp do_sample({"[" <> left, "]"}), do: decode("[" <> left <> "]")
  # Perhaps a closing ] is also missing - try as-is first, then with ] appended.
  defp do_sample({"[" <> _ = left, "}"}), do: (left <> "}") |> decode() |> check(left <> "}")
  defp do_sample({left, "}"}), do: decode(left <> "}")
  defp do_sample({left, right}), do: decode(left <> right)

  defp check({:ok, result}, _), do: {:ok, result}
  defp check({:error, _error}, %{} = old_error), do: {:error, old_error}
  defp check({:error, error}, left), do: (left <> "]") |> decode() |> check(error)
end

This module automatically tries to fix:

  1. a missing [ at the start of the JSON
  2. a missing , between two JSON objects
  3. a missing ] at the end of the JSON

Unfortunately, automatically fixing missing [ and ] characters deeper in the structure is a bit hard for me based on Jason’s errors.

Let me know if it helps.


@AstonJ Is it my bad, or do we have a problem with multi-line text highlighting?


What do you mean, Tomasz? The code block above? It’s all there, you just have to scroll 🙂

Not scroll; by text highlighting I mean syntax highlighting. Just take a look at it:

test = """
This is a "test"
"""

Ah, I see what you mean. Discourse uses highlight.js, so changes would need to be made there, and then Discourse should pick up the new version 🙂

@knewter is the author, perhaps he could help?


Are you able to find a way to reliably split the separate documents by newline in the string blob? Replacing "}{" with "}\n{" might work depending on the input data (but it is sadly likely to be a very fragile solution).
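
For example (fragile as noted: any string value that happens to contain "}{" would be corrupted too, and documents separated by whitespace would not be split):

String.replace(~s({"a": 1}{"b": 2}), "}{", "}\n{")
#=> "{\"a\": 1}\n{\"b\": 2}"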

If you manage to achieve separation, then parsing the JSON documents line by line should be extremely easy:

string_blob
|> magic_way_to_split_the_blob()
|> StringIO.open()   # emulate a device that reads from a string
|> elem(1)           # get the device from the {:ok, device} tuple
|> IO.stream(:line)  # wrap in a Stream-friendly object
|> Stream.map(&Jason.decode/1)
#|>  ...do any other filtering or transformation by using Stream functions here...
|> Enum.to_list()

Which would give you a bunch of {:ok, document} or {:error, reason} tuples in a list.
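
If only the successful decodes matter, one option is to flat-map over that list (a sketch; results stands for the list produced by the pipeline above):

results
|> Enum.flat_map(fn
  {:ok, doc} -> [doc]       # keep the decoded document
  {:error, _reason} -> []   # silently drop anything that failed to parse
end)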
