How to split string into multiple chunks by size

What is the fastest way to split a large string into multiple chunks by size, e.g. breaking a 10MB long string into multiple chunks of 5KB each?

Use binary pattern matching together with a list comprehension:

for <<chunk::size(chunk_size)-binary <- input>>, do: chunk
8 Likes

This could produce invalid Unicode strings though?

@hauleth I did a quick test and it seems like this method will throw away the leftover chunk.

for <<chunk::size(5)-binary <- "hello world">>, do: chunk
["hello", " worl"]

1 Like

I found an answer from StackOverflow which goes like this:

 "hello world" 
|> String.codepoints
|> Enum.chunk_every(5)
|> Enum.map(&Enum.join/1)
# ["hello", " worl", "d"]

However, I'm not sure whether there are any performance implications here that I should take into consideration.

Well, you could use Stream.chunk_every, so it would chunk your string lazily instead of eagerly. For long strings like the ones you said you are working with, that alone should already improve performance a lot.
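
For reference, a minimal sketch of what that lazy variant could look like (note that String.codepoints/1 still builds the full list eagerly, so the laziness only applies to the chunking step itself):

"hello world"
|> String.codepoints()
|> Stream.chunk_every(5)
|> Enum.map(&Enum.join/1)
# ["hello", " worl", "d"]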

One question though, where are you getting this string from?

2 Likes

The StackOverflow version is likely to be quite a bit slower. Handling UTF-8 is slow, but if it's not needed, here is a version based on the list comprehension that takes the leftover into account:

defmodule Chunker do

  # Recursive binary pattern matching; keeps the leftover chunk.
  def chunk(string, size \\ 5), do: chunk(string, size, [])

  defp chunk(<<>>, _size, acc), do: Enum.reverse(acc)

  defp chunk(string, size, acc) when byte_size(string) > size do
    <<c::size(size)-binary, rest::binary>> = string
    chunk(rest, size, [c | acc])
  end

  # Whatever remains is `size` bytes or fewer: keep it as the final chunk.
  defp chunk(leftover, size, acc) do
    chunk(<<>>, size, [leftover | acc])
  end

  # The StackOverflow answer: eager codepoint chunking.
  def stackoverflow(string, size \\ 5) do
    string
    |> String.codepoints()
    |> Enum.chunk_every(size)
    |> Enum.map(&Enum.join/1)
  end

  # Same, but chunking lazily via Stream.
  def withstream(string, size \\ 5) do
    string
    |> String.codepoints()
    |> Stream.chunk_every(size)
    |> Enum.map(&Enum.join/1)
  end
end
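
For completeness, here is what it returns (note the trailing leftover is kept):

iex> Chunker.chunk("hello world", 5)
["hello", " worl", "d"]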

And here is the benchee run:

{:ok, data} = File.read("./data/alice29.txt")
Benchee.run(
  %{
    "chunk" => fn -> Chunker.chunk(data, 5000) end,
    "overflow" => fn -> Chunker.stackoverflow(data, 5000) end,
    "withstream" => fn -> Chunker.withstream(data, 5000) end
  })

Here is the result:

Operating System: Linux
CPU Information: Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz
Number of Available Cores: 8
Available memory: 15.55 GB
Elixir 1.10.4
Erlang 22.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 21 s

Benchmarking chunk...
Benchmarking overflow...
Benchmarking withstream...

Name                 ips        average  deviation         median         99th %
chunk          586726.39     0.00170 ms  ±1225.64%     0.00145 ms     0.00350 ms
withstream         17.50       57.15 ms    ±14.33%       54.27 ms       81.87 ms
overflow           15.71       63.64 ms    ±16.49%       61.00 ms       94.93 ms

Comparison:
chunk          586726.39
withstream         17.50 - 33528.78x slower +57.14 ms
overflow           15.71 - 37339.36x slower +63.64 ms
4 Likes

Hi @kelvinst, I tried to port the createFileFromText method from azure-storage npm to my #ex_azure_storage lib. So to be honest I don't know where the string comes from :slight_smile:

1 Like

Random thought here, but if UTF-8's variable number of bytes per character bothers you or causes issues, you could use UTF-32, which is fixed-width!
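
A minimal sketch of that idea using Erlang's :unicode module (the string here is just an example, with a precomposed é): every codepoint becomes exactly 4 bytes, so chunking at multiples of 4 bytes never splits a codepoint (multi-codepoint graphemes can still be split across chunks, though):

utf32 = :unicode.characters_to_binary("héllo", :utf8, {:utf32, :big})
byte_size(utf32)
# => 20 (5 codepoints × 4 bytes each)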

Alternatively you might have to implement something which is smart enough to know how many bytes each grapheme requires. You might want to check out elixir/unicode.ex at v1.11.4 · elixir-lang/elixir · GitHub

Basically: read a chunk of bytes, repeatedly consume graphemes off it (emitting a chunk whenever you have accumulated one that is large enough), store the rest in a buffer, continue reading, and repeat until done.
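
A minimal sketch of that buffering idea, operating on an in-memory string rather than a stream (the module and function names are made up for illustration):

defmodule GraphemeChunker do
  # Splits `string` into chunks of at most `max_bytes` bytes without
  # ever splitting a grapheme across two chunks.
  def chunk_graphemes(string, max_bytes) when max_bytes > 0 do
    do_chunk(string, max_bytes, "", [])
  end

  defp do_chunk("", _max, "", acc), do: Enum.reverse(acc)
  defp do_chunk("", _max, buf, acc), do: Enum.reverse([buf | acc])

  defp do_chunk(string, max, buf, acc) do
    {g, rest} = String.next_grapheme(string)

    if buf != "" and byte_size(buf) + byte_size(g) > max do
      # Adding this grapheme would exceed the limit: emit the buffer
      # and start a new chunk with the grapheme.
      do_chunk(rest, max, g, [buf | acc])
    else
      # Note: a single grapheme larger than max_bytes ends up as its
      # own oversized chunk rather than being split mid-grapheme.
      do_chunk(rest, max, buf <> g, acc)
    end
  end
end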

1 Like

@cmkarlsson impressive! I've learnt something today thanks to you :smiley:

I see, yeah, I was just curious because normally big amounts of data like that come from files or uploads, which could be streamed themselves, avoiding loading the whole thing into memory. But in your case, as it's a ported lib function that takes a loaded string, you don't have much control over that, so yeah, binary pattern matching seems like the way to go then.

2 Likes

Be sure to use String.graphemes/1 if there is any possibility of Unicode data in your strings; String.codepoints/1 can potentially not do what you want. See here for more: String — Elixir v1.11.4
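
For instance, a combining diaeresis (U+0308) is a separate codepoint but part of a single grapheme:

iex> String.codepoints("e\u0308")
["e", "̈"]
iex> String.graphemes("e\u0308")
["ë"]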

2 Likes

Thanks @John-Goff, I do want to support Unicode strings.

Finally, with all your help I managed to put together working code (decided to go with the stream option for now): ex_azure_storage/azure_file_share.ex at fileshare_create_file · csokun/ex_azure_storage · GitHub

1 Like

What if I were to upload a file whose content is not valid UTF-8? In general, in this case you should not care about the encoding of the file at all and use @cmkarlsson's solution, but even then you do not need to construct a list.

FYI, that snippet will discard the last partial chunk if the string's length does not divide evenly by chunk_size:

iex(1)> for <<chunk::size(3)-binary <- "abcdefgh">> do chunk end
["abc", "def"]

(if the last partial chunk was included then it would also return "gh")

Here's a simple recursive function that will return the last chunk (and not return any empty strings):

@doc """
Chunks a string into chunks of a given byte size (NOT unicode safe)

## Examples

    iex> chunk_string("abcdefgh", 3)
    ["abc", "def", "gh"]

    iex> chunk_string("abcdef", 3)
    ["abc", "def"]
"""
def chunk_string(string, chunk_size) when chunk_size > 0 do
  case string do
    <<chunk::size(chunk_size)-binary, rest::binary>> ->
      [chunk | chunk_string(rest, chunk_size)]

    "" ->
      []

    rest ->
      [rest]
  end
end
1 Like

A relevant library that came to my attention recently is text_chunker_ex by Revelry (thanks to @hugobarauna for sharing this in Elixir Radar 420).

The aim of the library is to split text to be fed to AI models. As I understand it, it mimics some functionality of LangChain. I have not used it (yet), but it seems interesting!

3 Likes