Writing 1K images to disk using Task.async_stream

I am downloading images via HTTP requests, which gives me binary image data, and writing each one to a file like this:

File.write(image_with_dir, image, [:binary])

This whole operation of making the HTTP request and then writing to disk is done in:

|> List.flatten()
|> Enum.sort()
|> Task.async_stream(&(inline_process.(&1, images_directory)), max_concurrency: System.schedulers_online() * 2, timeout: :infinity)
|> Stream.run()
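For reference, here is a minimal, self-contained sketch of this kind of pipeline. The download step is stubbed out (no real HTTP happens), and `inline_process` / `images_directory` are stand-ins for the names in the snippet above:

```elixir
# Sketch of the fetch-and-write pipeline. The "download" is stubbed with
# a string derived from the URL so the example runs without a network.
images_directory =
  Path.join(System.tmp_dir!(), "images_demo_#{System.unique_integer([:positive])}")

File.mkdir_p!(images_directory)

inline_process = fn url, dir ->
  image = "fake image bytes for " <> url        # stand-in for the HTTP download
  path = Path.join(dir, Path.basename(url))
  File.write!(path, image)                      # opens, writes, and closes the file
end

["http://example.com/a.jpg", "http://example.com/b.jpg"]
|> Enum.sort()
|> Task.async_stream(&inline_process.(&1, images_directory),
  max_concurrency: System.schedulers_online() * 2,
  timeout: :infinity
)
|> Stream.run()
```

Note that `File.write!/2` opens and closes the file on every call, which is exactly the overhead the replies below address.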

When I decrease max_concurrency the process slows down by roughly 2 minutes; System.schedulers_online() returns 8 on this machine.

With the current max_concurrency it is faster, but disk IO starts hitting its limits.

The purpose of writing these files is to send them to Dropbox in batches of 1000, since a Dropbox upload session supports up to 1000 images at a time.

Is there a better way to write images to disk? Maybe keep them in memory, but I don’t know. Any help would be wonderful. This operation also runs on a CUDA GPU machine, but I am not sure how I could use the GPU for this purpose.

This process is user defined: a user can ask for fewer or more than 1000 images, and those can be handled by one or multiple Task.async_streams.

I want to save them on disk so that if the process breaks or the application restarts, I can resume the download from where it left off.
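One simple way to get that resume behaviour, sketched below under the assumption that the file name can be derived from the URL, is to skip URLs whose files already exist on disk before starting the Task.async_stream:

```elixir
# Sketch: skip images already written to disk, so a restart
# resumes where the previous run left off.
images_directory =
  Path.join(System.tmp_dir!(), "resume_demo_#{System.unique_integer([:positive])}")

File.mkdir_p!(images_directory)

already_done? = fn url ->
  File.exists?(Path.join(images_directory, Path.basename(url)))
end

urls = ["http://example.com/img_001.png", "http://example.com/img_002.png"]

# Pretend img_001.png was finished by a previous run:
File.write!(Path.join(images_directory, "img_001.png"), <<0>>)

remaining = Enum.reject(urls, already_done?)
```

One caveat: a crash mid-write can leave a truncated file that would wrongly pass this check. Writing to a temporary name and calling `File.rename/2` once the write completes makes the existence check trustworthy.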

:wave:

Could the “How to download big files?” thread be helpful, maybe?

Actually, it’s about writing them to disk efficiently.

The answer handles that part as well.

Open the destination file in raw mode with a big buffer (at least 512KB, if not a few megabytes). This also avoids the file operations going through a single internal process, which happens with all files not opened in raw mode.

Okay, then I couldn’t find it in that link :slight_smile:

How do I open a file in raw mode, and with a big buffer as well?

You can try File.open(filename, [:append, {:delayed_write, buffer_size, delay}]) and then write to it with IO.binwrite; the post I linked above describes that part in more detail.

Try the standard File.open docs. :yum:

And yeah, @idi527 is right.

Since I am back at my machine:

file = File.open!(path, [:raw, :write, {:delayed_write, 524_288, 2_000}])

This gives you a much faster writing speed, using a generous 512KB buffer and up to 2 seconds of delay before the data eventually makes it to disk (which should be plenty even on slow-ish servers).

From then on you can use IO.write or IO.binwrite on the returned handle. And don’t forget to call File.close at the end!

A file opened in raw mode wouldn’t work with IO.write and IO.binwrite, I think?

iex(1)> {:ok, fd} = File.open("some.file", [:raw, :write])
{:ok,
 {:file_descriptor, :prim_file,
  %{
    handle: #Reference<0.3732712904.755892237.167770>,
    owner: #PID<0.105.0>,
    r_ahead_size: 0,
    r_buffer: #Reference<0.3732712904.755892226.167656>
  }}}
iex(2)> IO.write(fd, "hello")
** (FunctionClauseError) no function clause matching in :io.request/2

    The following arguments were given to :io.request/2:

        # 1
        {:file_descriptor, :prim_file,
         %{
           handle: #Reference<0.3732712904.755892237.167770>,
           owner: #PID<0.105.0>,
           r_ahead_size: 0,
           r_buffer: #Reference<0.3732712904.755892226.167656>
         }}

        # 2
        {:put_chars, :unicode, "hello"}

    (stdlib 3.12.1) io.erl:565: :io.request/2
    (stdlib 3.12.1) io.erl:63: :io.o_request/3

:file.write can be used instead.

iex(2)> :file.write(fd, "hey")
:ok

Also, there are some ownership complications, similar to sockets: only the process that opened the file can write to it. Not sure if that’s a problem for the Task.async_stream approach described in the OP (it shouldn’t be, as long as the file is opened and written within the same task process).
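Putting those points together, here is a sketch where each task opens, writes, and closes its own raw file handle with `:file.write/2`, so the ownership restriction never comes into play (file names and contents are made up for the example):

```elixir
# Sketch: each task owns its raw file handle, so the
# opener-only-writes restriction is not an issue.
dir = Path.join(System.tmp_dir!(), "raw_demo_#{System.unique_integer([:positive])}")
File.mkdir_p!(dir)

write_image = fn {name, bytes} ->
  {:ok, fd} =
    :file.open(Path.join(dir, name), [:raw, :write, {:delayed_write, 524_288, 2_000}])

  :ok = :file.write(fd, bytes)
  :ok = :file.close(fd)
end

[{"one.bin", <<0, 1>>}, {"two.bin", <<2, 3>>}]
|> Task.async_stream(write_image, max_concurrency: System.schedulers_online() * 2)
|> Stream.run()
```

Since the handle never escapes the anonymous function, raw mode's ownership rule is satisfied by construction.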


Actually yeah, you’re right. Apologies.

Since it sounds like you don’t need to keep the files around long-term, you could mount a ramdisk so they never have to touch the disk (assuming you have enough RAM).
