How to make sense of logs? (Log Analysis?)

debugging
troubleshooting
logs

#1

Hi!

In my quest to become the best Elixir dev I can be, I noticed one aspect of my
career that I’d like to improve upon. This is language agnostic, but since
Elixir has an awesome community, I’m sure you’ll be able to help, especially
since many of you have experience with distributed systems, where logging can
only get crazier.

So, have you read/studied material on how to log/review logs?

This is somewhat related to debugging, and there are parallels with other
professions as well, so it’s not only for programmers. Forensic accounting
comes to mind: dealing with multiple sources of information and trying to
piece things together.

I believe reviewing logs is an art: building a timeline, learning how things
work from the logs themselves. I think folks learn it over time, but it is such
a powerful weapon in the hands of the best engineers I’ve seen. I have not
found good material on it and was wondering if you could help me with this.

This goes hand in hand with debugging in a way. I’ve watched josevalim on
Twitch and picked up a ton of little things that people just don’t really talk
about. It’s not out of malice by any means; it just seems like these “things”
have to be learned over time. Well, they add up, and knowing them and using
them as tools is life changing. I guess that’s what you call experience…

I do think there is a systematic approach that many of the best share. Maybe
folks don’t even realize they follow one, or they do and I just don’t know
about it, so I thought I should ask.

Thanks for the help in advance!

Paulo

PS:

Here are a couple of resources that I think are related to this. Sorry if I’m not too clear, maybe your answers can help me focus my questions better.

Reverse engineering example:

I found a debugging book; I’m unsure if it would help with this, but it may:


#2

You can check this project for traces:

The official Elixir debugging page also has some nice tools for debugging:

As far as books go, I am not really aware of any, hope this helps :stuck_out_tongue:


#3

Thanks for the lib suggestions. My question is more about what comes after you have access to logging. There are great tools that help with aggregation and filtering, but I’m more interested in learning what to do with the logs once I have them.

With that said, there is probably a lot to learn about when and how to log, so thanks for the libs!


#4

Ahh, log aggregation?
For that I recommend Graylog. It’s free, and there are a lot of guides and a public forum to help you out.

It is a powerful tool that I really enjoyed using back in the day. As an alternative you can also build your own ELK stack, which you can tailor to your specific needs but which requires a lot more investment upfront.

As for what to log: don’t log everything. Log only the things you need. Define with your team a set of metrics that you need to keep an eye on, and log only what feeds them. Other rules also apply, such as not logging private data, but that’s for another topic.

What to do once you have the logs? I used to rely on the previous tool (Graylog) to perform aggregations and queries. It really was a life saver. I don’t think there is much you can do without a tool for that, but that’s my personal opinion.


#5

Here is my effort to share in short what I learned by experience about logging. The most important thing, in my opinion, is that your log messages should be:

  1. Informative: while concise, they should contain all the important information about the event, not just say “event X happened”. It helps to ask: if I were ever to look for this event in the logs, what information would I probably want to know? A practical example: a log event about a purchase might include the ID of the user that made the purchase.

  2. Discoverable: it is useless to log a lot of information if it is not possible to easily find it when you most need it. This is not only a matter of having the right tools; it also has to do with logging in such a way as to enable future retrieval. Should you need to search for logs related to a certain occurrence, what search terms are you likely to use? Also, when debugging, you often start by looking for some event in the log, but soon want to correlate it with other events happening around it. A practical example: in a web app, including a unique “request ID” in all logs pertaining to the same HTTP request makes it easy to correlate all log events that happened within that request. Including a user ID, or a hash of it, makes it possible to find everything logged about a specific user (just pay attention not to log sensitive user data). Tagging logs by service is also a common strategy. Multi-service architectures often pass around a “correlation ID” for the same request, which all services include in their log messages, in order to follow the request flow across services.

  3. Reviewed and maintained: when debugging an issue, take note of what would have helped you and what could be improved. Every application is different in terms of what is interesting to log and what is not. Keep improving your logging: add missing information or events, remove noise, and debugging will get better and better for you and your team.
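To make point 2 concrete in Elixir: the per-request ID can be set once as Logger metadata, so every log line emitted by that process carries it automatically (Plug ships a `Plug.RequestId` plug that does essentially this; the rest of the names here are just illustrative). A minimal sketch:

```elixir
require Logger

# Generate a per-request ID (Plug.RequestId does something similar).
request_id = Base.encode16(:crypto.strong_rand_bytes(8), case: :lower)

# Logger.metadata/1 attaches key/value pairs to the current process;
# every Logger call in this process will carry them from now on.
Logger.metadata(request_id: request_id)

# With `config :logger, :console, metadata: [:request_id]` in your
# config, this line prints the request_id alongside the message,
# so all lines from the same request can be grepped together.
Logger.info("purchase completed user_id=42")
```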

What do you think? I hope this helps


#6

Ciao @lucaong, that’s great advice. It covers how to log and how to keep logging sane, which is critical to this task.

Now, help me think: imagine I already have great logging, since I followed your instructions. Now what? There is something to be said about what to do with the logs once you have them. I feel it is almost like looking at a painting: some people know what they are looking at, and others just see some paint on a canvas. (sorry for the analogy :slight_smile:, best I could do)

As usual, it depends on many factors. But, in general, I do think there is a process, some kind of step by step that we follow, maybe even without knowing we do it. Maybe the steps are as follows:

  • Do a first read in chronological order to see if anything jumps out as you are looking for clues (cc: @fhunleth)
  • Anything weird? Warnings, errors?
  • Use clues in logs to build a timeline of events
  • If there isn’t enough logging, follow @lucaong’s advice and add more. Deploy and monitor.
  • (Imagine this is a bug for a second.) Find the error in the log. Look in the codebase for said log message. Go up the call stack and find the root cause.
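The first two steps above (read chronologically, flag anything weird) can be sketched as a toy Elixir pipeline, assuming you already have timestamped log lines collected in memory; the lines and format here are made up for illustration:

```elixir
log_lines = [
  "2024-05-01T10:00:02Z [error] payment failed request_id=abc",
  "2024-05-01T10:00:01Z [info] request started request_id=abc",
  "2024-05-01T10:00:03Z [info] request finished request_id=abc"
]

# ISO 8601 timestamps sort lexicographically, so a plain sort rebuilds
# the chronological timeline even if the lines arrived interleaved.
timeline = Enum.sort(log_lines)

# First pass: does anything jump out? Surface warnings and errors.
suspicious = Enum.filter(timeline, &(&1 =~ ~r/\[(warn|error)\]/))

Enum.each(suspicious, &IO.puts/1)
```

Real log tools do this filtering and ordering for you at scale, but the mental process is the same.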

Something like that? Anyway, I don’t want to turn this into an esoteric discussion; I just wanted some guidelines on how to do it properly. I want to understand that painting! :slight_smile:

Thank you for the reply, really good feedback, exactly the type of discussion I was looking for.


#7

Hi @pdgonzalez872,
the answer I have in mind is very general, and has to do with debugging more than with analyzing logs specifically, but I hope it can still be useful.

In my experience, debugging is an activity that is best approached with the scientific method.

Imagine you have a weird bug: something that is not immediately obvious, but puzzling and defying explanation. Say that application instances crash for no apparent reason. Assume you have collected the initial evidence but are still clueless about the root cause. Here is where the scientific method comes into play:

  • First of all, before you even start digging through the logs, formulate hypotheses. A good hypothesis is one that produces testable predictions. For example, if you suspect that the crash is related to a hardware problem, you should predict that all crashing instances are located on the same physical machine. If you instead suspect a memory leak, you might predict that you will find “out of memory” log entries. These are things you can practically validate.

  • Even more important, hypotheses are falsifiable: if you observe instances crashing on different hardware nodes, the hypothesis about a hardware failure becomes extremely unlikely, so you can archive it and move on to the next one. Don’t get attached to your hypothesis: if it turns out to be false, you still made a step forward in understanding.

  • Only after you know what you are looking for should you look at your logs, metrics, etc. It’s very important to stick to testing hypotheses instead of spending too much time randomly looking for patterns. Even though serendipity can sometimes help in simple cases, our brain is too prone to see a pattern even when there is really none. If you really suspect that a pattern is more than a coincidence, find a way to test it.

  • Typically, this goes in cycles: a hypothesis suggests a test, which brings some answers, which in turn bring more questions, more hypotheses, more tests, and so on. If a test result is inconclusive, go back to the drawing board with either new hypotheses or new ways to test. This testing cycle drives your investigation.

  • The best engineers I have had the pleasure to work with are egoless: they don’t focus on proving themselves right or on showing off their knowledge. They use this method so efficiently that it might seem they can always guess right. The truth is often the opposite: they are good precisely because they don’t guess, but rather follow a method.

Does this make sense?


#8

Wow, fantastic answer man.

I do think deep down there is a process. After reading your answer, I agree, this is an awesome way to approach issues and I bet you that the really good ones do this, whether they realize it or not. Sometimes people forget how they got to where they are :slight_smile:

Thank you for spending the time to write it. This is awesome.


#9

Great suggestions! Thanks for sharing!