How to download big files?

Hello

Why are Tesla and HTTPoison so slow?
Ruby’s “open” downloads files 10-20 times faster.
Or maybe I’m doing something wrong.

How do I download files of about 100-500 MB by URL?

Thanks!

2 Likes

There is a big difference between Ruby and Elixir… With Elixir, it is possible to spawn multiple processes, each downloading one file.

It would be nice to see some of your HTTPoison code; for me it is working fine with

    case HTTPoison.get(link) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        File.write!(filename, body)
        {:reply, :ok, state}
      ...
    end

But what I like the most is running multiple downloads with Task.async/await:

    tasks = Enum.map(list, fn({link, filename} = _tuple) ->
      Task.async(fn ->
        :poolboy.transaction(:worker,
          &(GenServer.call(&1, {:download, link, filename}, @genserver_call_timeout)),
          @task_async_timeout)
      end)
    end)

    result = Enum.map(tasks, fn(task) -> Task.await(task, @task_async_timeout) end)

The code is incomplete…
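
For reference, a minimal sketch of what such a pooled worker could look like (the module name and the plain HTTPoison.get/3 call inside it are assumptions, not taken from the code above):

    # Hypothetical poolboy worker; poolboy checks one of these out per download.
    defmodule Downloader.Worker do
      use GenServer

      def start_link(_args), do: GenServer.start_link(__MODULE__, nil)

      def init(state), do: {:ok, state}

      def handle_call({:download, link, filename}, _from, state) do
        case HTTPoison.get(link, [], recv_timeout: 300_000) do
          {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
            File.write!(filename, body)
            {:reply, :ok, state}

          {:ok, %HTTPoison.Response{status_code: status}} ->
            {:reply, {:error, status}, state}

          {:error, %HTTPoison.Error{reason: reason}} ->
            {:reply, {:error, reason}, state}
        end
      end
    end

The pool itself would be started elsewhere (e.g. under a supervisor via :poolboy.child_spec/3) and registered under the :worker name used in the transaction above.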

I translated an old scraper written in Ruby to Elixir. It is 30x faster for my use case :slight_smile:

1 Like

I use this code:

    body = HTTPoison.get!(link, ["User-Agent": "Elixir"], [recv_timeout: 300_000]).body
    File.write!(file_path, body)

1 Like

If the file is big, it might not fit into memory, so it’s better to use the stream_to option with HTTPoison and append to the file using IO.binwrite.

So it’ll be something like

    def download!(file_url, filename) do
      file = if File.exists?(filename) do
        File.open!(filename, [:append])
      else
        File.touch!(filename)
        File.open!(filename, [:append])
      end

      %HTTPoison.AsyncResponse{id: ref} = HTTPoison.get!(file_url, %{}, stream_to: self())

      append_loop(ref, file)
    end

    defp append_loop(ref, file) do
      receive do
        %HTTPoison.AsyncChunk{chunk: chunk, id: ^ref} ->
          IO.binwrite(file, chunk)
          append_loop(ref, file)

        %HTTPoison.AsyncEnd{id: ^ref} ->
          File.close(file)

        # Need something to handle errors like request timeouts and such,
        # otherwise it will loop forever. I don't know what HTTPoison returns
        # in case of an error; you can inspect `_other` below to find out
        # and match on the error to exit the loop early.
        _other ->
          append_loop(ref, file)
      end
    end

Note that receive won’t work inside a GenServer callback.
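
For a GenServer, a rough sketch of the same idea (module name and state shape are assumptions) is to let HTTPoison deliver the async messages to the server process and write the chunks in handle_info/2:

    # Hypothetical GenServer that receives the stream_to messages in handle_info/2.
    defmodule StreamingDownload do
      use GenServer

      def start_link({file_url, filename}),
        do: GenServer.start_link(__MODULE__, {file_url, filename})

      def init({file_url, filename}) do
        file = File.open!(filename, [:append])
        %HTTPoison.AsyncResponse{id: ref} = HTTPoison.get!(file_url, [], stream_to: self())
        {:ok, %{ref: ref, file: file}}
      end

      def handle_info(%HTTPoison.AsyncChunk{chunk: chunk, id: ref}, %{ref: ref, file: file} = state) do
        IO.binwrite(file, chunk)
        {:noreply, state}
      end

      def handle_info(%HTTPoison.AsyncEnd{id: ref}, %{ref: ref, file: file} = state) do
        File.close(file)
        {:stop, :normal, state}
      end

      # Status, headers and any error messages also land here; match on them as needed.
      def handle_info(_other, state), do: {:noreply, state}
    end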

7 Likes

Yeah, not using that option for large files will result in a lot of appending to previous chunks. Every now and then the whole previous binary needs to get copied because no “appending” space is left. This also puts a lot of pressure on the binary-heap garbage collector and causes a peak memory consumption of at least twice the source file size… not to mention all the slow-downs due to the GC runs.

:stream_to, though, causes “instant” handling of received chunks. This way the chunks have to be GC’d every now and then, but there is not that much appending and copying going on.

2 Likes

Thanks a lot, now Elixir is faster than ever ))

2 Likes

How can I download and immediately read the file?

If your code is okay with receiving data in chunks, then process them right in append_loop (you would probably then rename it receive_loop or something like that). This way you won’t even have to write the contents to the filesystem.

If not, you can File.read or File.stream! the file (by passing the path to it) after append_loop returns.
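
For example, a minimal sketch building on download!/2 above (URL and file name are placeholders):

    # Download to disk first, then read it back.
    download!("https://example.com/big.file", "big.file")

    # Either read the whole file into memory at once…
    contents = File.read!("big.file")

    # …or process it lazily without loading it all, e.g. line by line.
    "big.file"
    |> File.stream!()
    |> Enum.each(&IO.write/1)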

1 Like

I’m a bit late to this thread, but I’ve created Downstream, a package for streaming downloads with HTTPoison.

7 Likes