Simplify adding production-grade logging to your standard Phoenix webapp

I’m prepping a go-live for a LiveView application we’ve been working on for months now. One of the things I’ve spent more time on than expected was getting the logs just right. It took me some time to realise that modern libraries/frameworks like Phoenix and Oban expect you to hook into the Telemetry events for logging and error reporting.

I’m under the impression that setting up logging for production use is an underserved aspect of shipping a Phoenix application. The logs for development are great. They contain all the right log messages, in a human-readable format (and now even stream to the browser console too). Not too much, not too little, and without any fiddling. Great! For production use, this setup doesn’t seem usable though. I can imagine everybody having a slightly different preference here, but still, I wasn’t expecting to have to delve into all these different helper packages to whip up a working solution.

At first I was hopeful that Logster was going to be my one-stop solution for this. This library was inspired by the Ruby Lograge package. But it turns out it’s doing its own encoding into the logfmt format. Although I like the logfmt format, it’s difficult to make sure other libraries that are not instrumented by Logster produce a similar structure. I like the library for having an opinion on which logs are useful 95% of the time. But I don’t like that it also takes on the responsibility of formatting the messages. I think it’s more idiomatic to defer formatting decisions to the Logger.Formatter system.

We’re also using Oban, which requires its own handlers for logging and error reporting. In the end I settled on hooking up my own handlers for all the relevant Telemetry events. But I’m sure I’m missing some important ones, as I’m not really experienced with this operational concern.
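For the Oban side, this is roughly the shape of handler I mean (a sketch, assuming Oban’s `[:oban, :job, :exception]` event; double-check the exact event names and metadata keys against the `Oban.Telemetry` docs for your version):

```elixir
defmodule MyApp.ObanLogger do
  require Logger

  # Call once at application start, e.g. in Application.start/2.
  def attach do
    :telemetry.attach(
      "oban-job-exception",
      [:oban, :job, :exception],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:oban, :job, :exception], _measurements, meta, _config) do
    # meta is assumed to carry :kind, :reason, :stacktrace and the %Oban.Job{}
    Logger.error(
      "Oban job failed: " <> Exception.format(meta.kind, meta.reason, meta.stacktrace),
      worker: meta.job.worker,
      queue: meta.job.queue
    )
  end
end
```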

Is everybody having a similar struggle? If so, this might indicate that there is still some room for improvement, so that sensible (opinionated) production-grade logging can be added easily to any project, in a matter of minutes. That probably would have saved me a lot of hours researching this, and piecing together the packages.

I’m curious to learn about how other people are tackling this aspect, and how you think things might be improved. Please let me know if I’m missing something obvious here!

1 Like

Personally I think that using logs in text format is by far not the best solution when it comes to keeping visibility into the system.

Besides the fact that if you log a lot you will eventually start losing logs, because the system cannot persist them fast enough, you also end up with a big mess of text logs that you have to navigate through and somehow make sense of.

I have used metrics and time-series databases like InfluxDB before, with a much better outcome when it comes to visibility into systems/subsystems and introspection. This also goes perfectly hand in hand with how Telemetry in Elixir works, because I ultimately decide the shape of the output data.
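A tiny sketch of what I mean by deciding the shape: a pure function that renders a telemetry event as an InfluxDB line-protocol point (transport, batching, timestamps, and escaping are left out; the names are illustrative):

```elixir
defmodule MyApp.InfluxReporter do
  # Turn a telemetry event into an InfluxDB line-protocol string:
  #   measurement,tag=value field=value
  def to_line_protocol(event, measurements, tags) do
    name = Enum.join(event, "_")
    tag_str = Enum.map_join(tags, ",", fn {k, v} -> "#{k}=#{v}" end)
    field_str = Enum.map_join(measurements, ",", fn {k, v} -> "#{k}=#{v}" end)
    "#{name},#{tag_str} #{field_str}"
  end
end

MyApp.InfluxReporter.to_line_protocol(
  [:phoenix, :router_dispatch, :stop],
  %{duration: 5},
  %{route: "/"}
)
# => "phoenix_router_dispatch_stop,route=/ duration=5"
```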

If you are set on just outputting text from your telemetry events to console logs, just make a generic handler that logs everything; it should be no more than 10 lines of code.
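Something along these lines (the event names here are just examples; list whichever events your dependencies emit):

```elixir
defmodule MyApp.TelemetryLogger do
  require Logger

  # Example events; replace with the ones your app actually cares about.
  @events [
    [:phoenix, :router_dispatch, :stop],
    [:oban, :job, :stop]
  ]

  def attach do
    :telemetry.attach_many("generic-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event(event, measurements, _metadata, _config) do
    Logger.info("#{Enum.join(event, ".")} #{inspect(measurements)}")
  end
end
```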

2 Likes

I‘m wondering what exactly you expect from „production grade logging“. If stuff is fine in dev, then how does production change that?

Though I‘d agree that logs are just a tiny piece of observability. Metrics are easier to visualize, and tracing (especially ad hoc) is imo way more useful for actually figuring out issues, as there‘s less of a need for it to exist in advance.

2 Likes

Logs are only a small part of the observability puzzle, I agree. But I still want to have them as good as they can be.

If you are set on just outputting text from your telemetry events to console logs, just make a generic handler that will just log everything, should be no more than 10 lines of code.

I don’t know what that would look like. You still have to declare which telemetry events are interesting to you in every project, and not all telemetry events translate one-to-one to a log event (you still need to choose which metadata to extract, for example).

I‘m wondering what exactly you expect from „production grade logging“. If stuff is fine in dev, then how does production change that?

Good question. I think it’s too verbose right now (depending on the log level). One line per request would suffice (a bit like the selling proposition of Lograge). For LiveView, requests are less relevant, and can be translated to handle_x callbacks.

Is the easy way out to limit production logging to the :info level, while keeping everything else at :debug?
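For the level split at least, that’s roughly what a freshly generated Phoenix app already configures (sketch of the standard config files):

```elixir
# config/prod.exs — only :info and above in production
config :logger, level: :info

# config/dev.exs — full :debug output during development
config :logger, level: :debug
```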

2 Likes

The industry seems to be moving towards adoption of structured logs and OpenTelemetry.

My Elixir apps write structured logs to journald. These logs are then received, processed, and exported to an external vendor by the OpenTelemetry Collector. This setup may not be suitable for mission-critical apps yet, but it works just fine for me.

OpenTelemetry Collector supports receiving logs from many sources which is really convenient.

1 Like

Do you have any snippets handy? I’ll be doing this very soon and I’m not looking forward to the research.

Sure, but I’m afraid you’ll still have to investigate the OpenTelemetry spec and configuration unless you have a dedicated team for that purpose.

First, you need to add systemd (kudos to @hauleth for such a great library) to the project.
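In mix.exs that looks something like this (the version constraint is an example; check Hex for the current release):

```elixir
# mix.exs
defp deps do
  [
    {:systemd, "~> 0.6"}
  ]
end
```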

Add the default (previously known as console) formatter to config.exs:

config :logger,
  level: :info,
  default_formatter: [
    format: "$metadata[$level] $message\n",
    metadata: [:request_id]
  ]

Disable the default handler in the prod environment in runtime.exs:

if config_env() == :prod do
  # INVOCATION_ID should be set automatically by systemd
  # If the app is being run outside of systemd we keep the default handler
  if System.get_env("INVOCATION_ID") do
    config :logger, default_handler: false

    config :my_app, :logger, [
      # this handler implements journald protocol for structured logs 
      {:handler, :systemd, :systemd_journal_h,
       %{
         config: %{
           fields: [
             # default metadata
             :syslog_timestamp,
             :level,
             :priority,
             {"SYSLOG_IDENTIFIER", ~c"my_app"},
             {"SERVICE_VERSION", System.get_env("RELEASE_VSN")},
             # custom metadata
             :request_id,
             :current_user_id,
             :my_other_var
           ]
         },
         formatter:
           Logger.Formatter.new(format: "$metadata[$level] $message", metadata: [:request_id])
       }}
    ]
  end
end

Add the new journald log handler in application.ex:

def start(_type, _args) do
  Logger.add_handlers(:my_app)

  children =
    [
      ...
      :systemd.ready()
    ]

  opts = [strategy: :one_for_one, name: MyApp.Supervisor]
  Supervisor.start_link(children, opts)
end

We’re done with the Elixir part of the puzzle. Open an ssh connection to the target machine and verify that logging works as expected:

# verify that REQUEST_ID and other fields are present
journalctl --unit my_app --reverse --output verbose

# filter by :emergency, :alert, :critical and :error priority levels
journalctl --unit my_app --priority 0..3

# filter by REQUEST_ID field
journalctl --unit my_app REQUEST_ID=0c3a38d5b7b3731863490d9de68a77c5

If it does, then proceed with installing OpenTelemetry Collector Contrib. I have an Ansible role for that, but skipped posting it for brevity. Let me know if you’re interested.

Otel Collector config in /etc/otelcol-contrib/config.yaml could look like:

receivers:
  journald:
    units:
      - my_app
    operators:
      - type: severity_parser
        parse_from: body.LEVEL
        preset: none
        mapping:
          debug: debug
          info: info
          info2: notice
          warn: warning
          error: error
          error2: critical
          fatal: alert
          fatal2: emergency
      - type: move
        from: body
        to: attributes.body
      - type: flatten
        field: attributes.body
      - type: move
        from: attributes.MESSAGE
        to: body
      - type: move
        from: attributes._HOSTNAME
        to: resource.host.name
      - type: move
        from: attributes.SYSLOG_IDENTIFIER
        to: resource.service.name
      - type: move
        from: attributes.SERVICE_VERSION
        to: resource.service.version
  otlp:
    protocols:
      http:
        endpoint: "localhost:4318"

exporters:
  otlphttp/openobserve:
    endpoint: https://api.openobserve.ai/api/my_organization_123
    headers:
      Authorization: Basic foobar=
      stream-name: default

processors:
  attributes:
    actions:
      - pattern: ^_.+
        action: delete
      - pattern: ^SYSLOG_.+
        action: delete
      - pattern: ^CODE_.+
        action: delete
      - key: PRIORITY
        action: delete
      - key: LEVEL
        action: delete
  transform:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - replace_pattern(body, "^.*\\[.+\\] ", "")
  batch:

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    logs:
      receivers: [journald]
      processors: [attributes, transform, batch]
      exporters: [otlphttp/openobserve]
      # add traces and metrics here
3 Likes