What is your experience going to production with Elixir/Phoenix?

I couldn’t find a good overview post of people’s experience running Elixir in production. I’m also wondering if there are common adjustments that people make, and whether those should be ported back into the base template. While researching going to prod with Elixir, I gathered numerous links that I figured would help the next person.

Here are some thoughts, discussion points, and questions:

Context

  • Been using Elixir/Phoenix off and on for the past 7 years, but all for hobby projects
  • I now have an app with real customers
  • Using LiveView, Ash 3.0, tailwind, and hosting on fly.io
  • 20+ years running large sites in prod across numerous languages and frameworks (php, rails, node, .net)

The reconnecting flash of death for LiveView

  • I have to assume every “big” company using LiveView overrides it?
  • This very specific flash is why I have trouble believing anyone is really using LiveView, beyond hobby projects. Unless they are all customizing it, but if that is the case, should we just update the core one?
  • The red is really scary to end users…
  • Also, reconnecting will often get stuck, and simply not work. This is across browsers and devices. And on my laptop (Chromium and Brave), it will spike the CPU until I forcibly reload the page.
    • This happens locally, and on my fly machines (with rolling and bluegreen strategies)

Logging

  • Something as basic as the client ip should just work without googling. Shouldn’t need to find a plug for it…
  • I found request_id somewhat useless, so I added an “id” to the session, and I log that instead which makes it easier to follow a single user in the logs.
  • It is very useful to see basic user and tenant info in the logs, and this turned out to be a bit of a journey to get them in there.
  • Mostly because I also wanted to see logs for navigation within a liveview that don’t require a page reload. (A customer would tell me they are on a page, but I didn’t see any log event. And then it clicked in my head that some of our navigations use LV)
  • Very curious what others do to monitor their app early on…
    • I have Sentry set up for errors, and amplitude.com for monitoring user sessions
    • But good ol’ tailing the logs is still my go-to
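A minimal sketch of the approach described in the Logging points above: copy the client IP, a session id, and user/tenant ids into Logger metadata so they show up on every log line. The module name `RequestMetadata` and the assign keys `:session_id`, `:user_id` and `:tenant_id` are hypothetical, and trusting `x-forwarded-for` is only reasonable behind a proxy you control.

```elixir
defmodule RequestMetadata do
  @moduledoc """
  Hypothetical plug-style module: copies client IP, session id, user id and
  tenant id into Logger metadata for the current process.
  """

  # A module that exposes init/1 and call/2 can be used as a plug.
  def init(opts), do: opts

  def call(conn, _opts) do
    Logger.metadata(metadata_for(conn))
    conn
  end

  @doc "Builds the metadata keyword list from a conn (or conn-like map)."
  def metadata_for(conn) do
    assigns = Map.get(conn, :assigns, %{})

    [
      client_ip: client_ip(conn),
      # Assumption: earlier plugs stored these keys in conn.assigns.
      session_id: assigns[:session_id],
      user_id: assigns[:user_id],
      tenant_id: assigns[:tenant_id]
    ]
  end

  @doc """
  Prefers the first x-forwarded-for entry (set by a proxy in front of the app)
  and falls back to the peer address in conn.remote_ip.
  """
  def client_ip(%{req_headers: headers, remote_ip: remote_ip}) do
    case List.keyfind(headers, "x-forwarded-for", 0) do
      {_name, value} -> value |> String.split(",") |> hd() |> String.trim()
      nil -> remote_ip |> :inet.ntoa() |> to_string()
    end
  end
end
```

Logger metadata is stored per process, and a LiveView runs in a different process from the initial HTTP request, so the same `Logger.metadata/1` call would also need to happen in the LiveView (for example from an `on_mount` hook) for live navigation to be tagged. The logger formatter's `metadata` option also has to list these keys before they are printed.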

Drift from phoenix template

  • As the base template changes over time, how do people bring in new best practices?
  • Same with core_components. Mine are heavily modified, so any changes to the base will never get pulled in. The update to LiveView 1.0 looks like a bit of a pain.

Plugs

  • Any plugs or response headers, beyond CSP, that people change or add for production?

Erlang flags

  • Are there any erlang flags that are generally tweaked once an app is in prod with traffic?
  • I spent a lot of time in the past tweaking node, ruby and java flags, but maybe erlang is close to “it just works”?
  • :erlang.system_flag(:fullsweep_after, 20) is the only one I saw blogged about
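For anyone who wants to try the flag mentioned above, here is a small sketch of setting it and reading it back. The value 20 is taken from the bullet above and is not a recommendation; the change only applies to processes spawned afterwards.

```elixir
# system_flag/2 returns the previous value of the flag.
previous = :erlang.system_flag(:fullsweep_after, 20)
IO.inspect(previous, label: "previous fullsweep_after")

# system_info/1 reports the current default as a tagged tuple.
{:fullsweep_after, current} = :erlang.system_info(:fullsweep_after)
IO.inspect(current, label: "current fullsweep_after")

# A process spawned after the change picks up the new default.
pid = spawn(fn -> receive do: (:stop -> :ok) end)
{:garbage_collection, gc_opts} = Process.info(pid, :garbage_collection)
IO.inspect(gc_opts[:fullsweep_after], label: "new process fullsweep_after")
send(pid, :stop)
```

Processes that were already running keep their old setting, so in a real app the call would have to run early in application start to have much effect.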

hexdocs.pm search

temporary_assigns for LiveView

  • (I’m thinking out loud for this one…)
  • Shouldn’t everything be temp by default to keep memory low?
  • We should specifically have to say what is sticky instead, no?
  • If developers are more likely to bloat their servers’ memory with assigns, should we instead be forced to specifically call out what we need?
  • We would still be able to “move fast” coding, but the resulting code would be significantly more future proof
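For reference, this is roughly what opting in looks like today. A minimal sketch, assuming the `phoenix_live_view` dependency is available; `MyAppWeb.ReportLive` and `load_rows/0` are placeholder names.

```elixir
defmodule MyAppWeb.ReportLive do
  # Requires the phoenix_live_view dependency; names here are placeholders.
  use Phoenix.LiveView

  def mount(_params, _session, socket) do
    socket =
      socket
      # Kept in the LiveView process for its whole lifetime.
      |> assign(:title, "Monthly report")
      # Potentially large, only needed to produce the rendered markup.
      |> assign(:rows, load_rows())

    # :rows is reset to [] after each render instead of staying in memory.
    {:ok, socket, temporary_assigns: [rows: []]}
  end

  def render(assigns) do
    ~H"""
    <h1><%= @title %></h1>
    <ul>
      <li :for={row <- @rows}><%= row.label %></li>
    </ul>
    """
  end

  # Hypothetical data source standing in for a real query.
  defp load_rows, do: for(i <- 1..3, do: %{label: "Row #{i}"})
end
```

Here `:title` is "sticky" and `:rows` is temporary, which is the inverse of the default being proposed in the bullets above.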

Ash Queries

  • When writing Ecto queries, I would always (easily) specifically select only the attributes needed, even from joins
  • But with Ash, it is a bit cumbersome to select attributes in a query, resulting in a “select *” query, which sends shivers down my back
  • Are other people who use Ash in prod just selecting everything all the time?

Ecto pool_size

  • Ecto.Repo — Ecto v3.12.2
  • Ecto pool_size? - #4 by bobbypriambodo
  • I would instead love a “pool_size_strategy” that sets the pool size based on best practices according to how many cores are available
  • I scaled the machines my app was using and forgot to adjust the pool size
  • “pool_size_strategy: :best_practice”
  • Make me learn as little as possible about the nuances of your library when possible
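Ecto has no `pool_size_strategy` option, but the idea can be approximated in user code. The sketch below is hypothetical: the module name `PoolSize`, the "two connections per scheduler" rule, and the 5 to 40 clamp are all assumptions for illustration, not an Ecto recommendation.

```elixir
defmodule PoolSize do
  @moduledoc "Hypothetical helper that derives a pool size from scheduler count."

  # Assumed bounds; tune them against the database's connection limit.
  @floor 5
  @ceiling 40

  @doc "Two connections per online scheduler, clamped to the assumed bounds."
  def for_schedulers(schedulers \\ System.schedulers_online()) do
    (schedulers * 2) |> max(@floor) |> min(@ceiling)
  end

  @doc "An explicit POOL_SIZE value always wins over the derived one."
  def resolve(env \\ System.get_env("POOL_SIZE")) do
    case env do
      nil -> for_schedulers()
      value -> String.to_integer(value)
    end
  end
end

# The generated config/runtime.exs reads POOL_SIZE with a fixed fallback:
#   pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10")
# The same idea could be inlined there by replacing the "10" fallback with
# an expression based on System.schedulers_online().
IO.inspect(PoolSize.resolve(), label: "pool_size")
```

Whatever number comes out, the total across all machines (pool size times machine count) still has to fit under the database's connection limit, which is the part that is easy to forget when scaling.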

Database connection didn’t reconnect

  • fly.io recently had a network issue (Sept 1, 2024), and my app didn’t reconnect to the db once the issue was resolved
  • I had to stop and restart the app, which makes me think a default Ecto setting is wrong?

CSS oddities

  • This happens to me on fly.io occasionally (maybe once a month), as well as on my site
  • I will switch over to a tab with a site running phoenix, and it will look as if CSS never loaded
  • Easily fixed with a reload, but it would be scary to end users

A bunch of sites I perused over the past few weeks

25 Likes

Awesome post! I’ll specifically address the Ash querying parts.

It should be pretty straightforward to select attributes when running Ash queries. For example:

Ash.Query.select(query, [:foo, :bar, :baz])

If you are using a code interface function, you can provide a query option:

DomainOrResource.code_interface_function(..., query: Ash.Query.select(Resource, [:foo, :bar, :baz]))

Or you could add a select in the action:

read :home_page do
  prepare build(select: [:foo, :bar, :baz])
end

I could add a select option to code interface functions if that would make it more ergonomic to select individual fields, i.e. DomainOrResource.code_interface_function(..., select: [:foo, :bar, :baz])

2 Likes

Interesting write-up, @lardcanoe

The reconnecting flash of death

This is a bit of a problem at my current company, though it never seems to hang. We solved it by adding a delay before the flash appears. There is discussion around this here and I believe this will be available as an option in the next release of LV.

Having said that, this is something I have always wondered about and I never get much feedback on it. Many people say it’s never a problem, others say it sometimes is, others say it often is. I do wish there was more constructive discussion around this. Maybe there is and I’ve missed it. I do spend an obscene amount of time reading this forum, but of course said discussion could exist on other channels.

Another thing I will say is that I was running a project app on Fly and saw this constantly whereas my one other friend using the app said they rarely saw it. I’m currently running a client app on AWS and it’s not been much of a problem at all. I do find it odd that it continually seems to be a problem on Fly when it’s a Phoenix-heavy service. Maybe it has to do with the plan level? I have noooooo idea.

And of course I must point out to address the other part of your concern: you can re-style it to not be so scary!

Temporary_assigns

This is another area I think about a lot, though I think the default is good as it is. It is our state after all, so I think the expectation that the default is to store it is a good one. I haven’t worked on a LiveView site with tons of traffic before, but I do wonder how much of a memory burden storing all that state server-side really is. Like, so long as you’re being relatively conservative, I figure it can’t be worse than all the transient memory used by stacks that re-build the entire world on every request. Of course this is just a hypothesis and is something I’d love to see more discussion around (mostly hearing people’s experiences).

I do this sometimes and I don’t (yet) use Ash. I either need everything or use multiple schemas pointing at the same table to grab smaller slices of it for different scenarios. But this doesn’t really answer your question about Ash.

hexdocs.pm search

I agree here, though I’m going to throw a lil’ bit of your 'tude back at ya and say “feel free to get the ball rolling yourself to fix it” :upside_down_face: At least comment on the ongoing issue in the ex_doc repo to improve search. José and co are incredibly welcoming.

In any event, thanks for the post and I hope some good discussion comes out of this! :v:

3 Likes

Ah, that is the way to select while using the code interface. It isn’t too bad. I’ll switch over to that and see how it goes for the next few weeks.

Being able to pass in select: is the ideal path, but don’t change anything since the query approach works.

But what has been the best practice approach for all the projects that you use Ash for? The ones that pay the bills. Do you just select everything? Do you mostly select the attributes that you need, and if so, what approach do you use?

I’ve been torn the past few months because there are some things I’ve learned you just do, but I’ve been kicking the can down the road because it wasn’t obvious how to do them, or they were overly verbose or tedious. Converting code that does select * into code that selects only the needed attributes is really hard and very error prone. I should not have let myself go this long doing select *.

Honestly it’s hard to say. It depends entirely on the real characteristics of the app I’m building. Like is this an admin tool for some low number of users to munge some data? Don’t bother with selecting. Is this a consumer facing money-making app that has to sail and feel good to use? Then yeah, applying a select always is good practice. You could theoretically even enforce it with something like this:

preparations do
  prepare fn query, _ -> 
    Ash.Query.before_action(query, fn query -> 
      if query.select, do: query, else: Ash.Query.select(query, [])
    end)
  end
end

That makes selecting a requirement.

3 Likes

What I do see is LiveView clients reconnecting while still on an older version of the CSS bundle, so some classes might be missing.

There’s static_changed?/1 to address exactly that need. By default the CSS and JS bundles are tracked, but the page is not auto-reloaded.
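A minimal sketch of how `static_changed?/1` is typically wired up, assuming the `phoenix_live_view` dependency and that the layout's CSS and JS tags carry the `phx-track-static` attribute. `MyAppWeb.PageLive` is a placeholder name, and the banner wording is arbitrary.

```elixir
defmodule MyAppWeb.PageLive do
  # Requires the phoenix_live_view dependency; module name is a placeholder.
  use Phoenix.LiveView

  def mount(_params, _session, socket) do
    # True when the connected client loaded tracked static assets that
    # differ from the ones the server is currently serving.
    {:ok, assign(socket, static_changed?: static_changed?(socket))}
  end

  def render(assigns) do
    ~H"""
    <div :if={@static_changed?} id="reload-static">
      The app has been updated.
      <a href="#" onclick="window.location.reload()">Reload</a> to get the latest version.
    </div>
    <p>Page content goes here.</p>
    """
  end
end
```

Showing a prompt instead of forcing a reload keeps the user from losing unsaved form input; whether to auto-reload is a judgment call per app.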

Made me chuckle… :laughing:

3 Likes