Slowness issue due to logger_file_backend

Recently I started using logger_file_backend to write info-level logs to a file, which is then shipped to Kibana through Elasticsearch.
While doing performance testing we ran into a slowness issue whenever logger_file_backend is enabled.

I am able to reproduce the issue locally with the following Logger configuration:

    config :logger,
      backends: [
        :console,
        {LoggerFileBackend, :test}
      ],
      # Note: sync_threshold and discard_threshold are top-level Logger
      # options, not :console backend options, so they must live here.
      sync_threshold: 100,
      discard_threshold: 500

    config :logger, :console,
      format: "[$level] $message\n",
      level: :info

    config :logger, :test,
      metadata: [:request_id, :user_id, :api],
      path: "./log/test.log",
      level: :info

I created a simple “hello world” API to check the performance and ran it with the Apache benchmarking tool: `ab -k -c 500 -n 500 http://localhost:4000/api/v1/hello`

In the output I can see the following response times:

    [info] Sent 200 in 30ms
    [info] Sent 200 in 30ms
    [info] Sent 200 in 30ms
    [info] Sent 200 in 30ms
    [info] Sent 200 in 30ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 32ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms
    [info] Sent 200 in 31ms

When I disable LoggerFileBackend in :backends, the same test gives the following results:

    [info] Sent 200 in 4ms
    [info] Sent 200 in 5ms
    [info] Sent 200 in 4ms
    [info] Sent 200 in 4ms
    [info] Sent 200 in 5ms
    [info] Sent 200 in 5ms
    [info] Sent 200 in 5ms
    [info] Sent 200 in 5ms

It looks like file writing is the bottleneck and messages are piling up in the Logger queue.
I have tried multiple settings for :sync_threshold and :discard_threshold, but none of them improves performance.

I am using the following Elixir and Erlang versions for my demo Phoenix application:
elixir 1.13.4-otp-23, erlang 23.2.1

Can anyone suggest how to optimize this so that we could even save :debug-level logs to the file? We do not want to lose any logs.
Please also suggest an alternative way to achieve this without sacrificing performance, if possible.

Those settings are not for performance optimization. They exist to make sure your app doesn’t fall over when there are too many logs to handle. Logger does this by first writing logs synchronously (the app waits on each log to be written, so it produces fewer logs), and if that doesn’t help, by not writing logs at all (fewer still). Under normal operation you don’t want to hit those thresholds.
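For reference, on Elixir versions before 1.15 these thresholds are top-level Logger options; a minimal sketch using the default values, annotated with what each regime does:

    config :logger,
      # Below sync_threshold queued messages, logging is asynchronous and
      # Logger.info/1 returns immediately. Above it, Logger switches to
      # synchronous mode: every log call blocks until the event is handled,
      # applying back pressure on the callers. Above discard_threshold,
      # new messages are dropped outright until the queue drains.
      sync_threshold: 20,
      discard_threshold: 500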

Avoiding a performance hit entirely is not possible: writing to disk is by definition slower than not touching the disk at all. I can’t say whether the slowdown you’re seeing is reasonable, but some slowdown is certainly expected.

You can look into other ways to handle logs that don’t require writing to disk, though you may need to consider your constraints around log durability.

Not necessarily, but I haven’t looked under the covers of Erlang’s or Elixir’s logging. I authored a small Golang logging wrapper some months ago (for sending log events to OpenObserve.AI): it spawned a background goroutine (the equivalent of an Elixir process), a logger call just sent a message to its channel, and the data was shipped over the network as it arrived on that channel.

There are ways to do this better, and I would be somewhat puzzled if Erlang / Elixir hadn’t gone about it the same way. I should probably go check the code.
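For illustration, a minimal Elixir sketch of that same fire-and-forget pattern (a toy module invented for this example; this is not how Logger or logger_file_backend is actually implemented):

    defmodule AsyncFileLogger do
      use GenServer

      def start_link(path) do
        GenServer.start_link(__MODULE__, path, name: __MODULE__)
      end

      # Fire-and-forget: cast/2 drops the message into the server's mailbox
      # and returns immediately, so the caller never waits on I/O.
      def log(line) do
        GenServer.cast(__MODULE__, {:log, line})
      end

      @impl true
      def init(path) do
        {:ok, File.open!(path, [:append, :utf8])}
      end

      @impl true
      def handle_cast({:log, line}, device) do
        # The slow disk write happens here, in the background process.
        IO.write(device, [line, ?\n])
        {:noreply, device}
      end
    end

The catch is that the mailbox here is unbounded: under sustained overload, queued messages simply accumulate in memory, which is exactly the failure mode the sync/discard thresholds above are guarding against.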

Was this golang library writing to disk? It doesn’t sound like it did.

No, but sending to the network can be even slower. The point is that the I/O was relegated to a background worker and never happened in-place at the time of the logger call.

That’s what the Logger does by default as well, as far as I know, but it switches to synchronous mode or to dropping logs when the backend gets overloaded.

That makes sense. Do you think that is what is happening to the OP? It does not seem to be the case, but who knows; we don’t know the entire story.

If you can update your Elixir and OTP versions, Elixir v1.15 comes with built-in support for file logging: Logger — Logger v1.15.7

How much log volume are we talking about: 1MB per second, 10MB per second, or 100MB per second?

I only log to the console and use a logging daemon such as svlogd, which should be faster than the built-in file logger, but at very high logging throughput you may need a custom solution.

Yes, that is what is happening in my application. The log traffic is huge, it piles up in the Logger queue, and after sync_threshold it goes into synchronous mode. I tried playing with the sync_threshold and discard_threshold settings, but it doesn’t help much once file logging is enabled.

Thanks for the quick reply @josevalim. I tried upgrading Elixir to v1.15 and setting the default handler config as described in the doc:

    config :logger, :default_handler,
      level: :info,
      config: [
        file: ~c"system.log",
        filesync_repeat_interval: 5000,
        file_check: 5000,
        max_no_bytes: 10_000_000,
        max_no_files: 5,
        compress_on_rotate: true
      ]

    config :logger, :default_formatter,
      format: "\n$time $metadata[$level] $message\n"

I agree that it gives built-in functionality to log to a file and removes the dependency on logger_file_backend, but its behaviour is almost the same as logger_file_backend’s, i.e. it also has overload protection.
Is there any way to keep this handler permanently in :async mode, and what would the consequences of doing so be?
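From the Erlang docs it looks like the relevant knobs would be the handler’s overload-protection thresholds (the documented logger_std_h options), which go in the same :config list; a sketch with illustrative values (the defaults are 10 / 200 / 1000):

    config :logger, :default_handler,
      level: :info,
      config: [
        file: ~c"system.log",
        # Raising these keeps the handler in async mode for longer, at the
        # cost of queued log events accumulating in memory while the disk
        # falls behind.
        sync_mode_qlen: 10_000,
        drop_mode_qlen: 100_000,
        flush_qlen: 200_000
      ]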
Regards

If your average log traffic exceeds the rate at which you can flush logs to disk, then you will never catch up, and logs will back up in memory (unless you are discarding) until you get an out-of-memory error.

The back pressure from transitioning to sync mode exists because the disk IO rate is insufficient, and the only relief is to push back against the source of the overload, lest you consume all memory and crash.

It sounds like you need additional disk IO throughput. Consider changing the disk type to a higher performance class, or adding disks to your instance and striping your filesystem. Depending on the IOPS your VM is currently pushing versus its nominal capacity, you may need an instance type with higher IO throughput.

Fundamentally, the IO is not keeping up; it’s not CPU bound and it’s not memory bound, so fix the root cause.
