Heroku performance problems - response times quadrupled and many timeouts (now solved)

phoenix
heroku
troubleshooting

#1

Hi,

We recently experienced weird problems after a deployment that really only changed a bunch of JavaScript.

Response times quadrupled and we would get many timeouts.

We use this buildpack in Heroku: https://github.com/HashNuke/heroku-buildpack-elixir

We tried several ENV var changes and played around with the DB pool size, all to no avail. In the end we changed one line in the buildpack:

# Always rebuild from scratch on every deploy?
always_rebuild=true

Setting this to true instead of false fixed our problems.

So this must mean we had a rogue build or something. I must say that this sometimes (very rarely) happens to me or my co-workers: our Phoenix app in dev mode will all of a sudden have terrible response times (in the order of seconds). Cleaning the _build folder and recompiling then fixes everything. It seems a similar thing happened on Heroku.


#2

Are you running a release or via mix? I’m wondering if sometimes the protocol consolidation artifacts aren’t being generated properly and so the requests end up using unconsolidated protocols, which are very slow (they hit the code server).


#3

We just use mix, no releases. How would we find out if what you’re saying about unconsolidated protocols happened to us?

Running this on heroku works:

iex(3)> Protocol.consolidated?(Enumerable)
true

#4

You could consider having that line printed via Logger on application boot to ensure that it’s always the case. Your problems seem to be intermittent, so perhaps it will intermittently be false.
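A minimal sketch of such a boot-time check, assuming it is called from your application's `start/2` callback. The module and function names (`ConsolidationCheck`, `log_status/1`) are made up for illustration; only `Protocol.consolidated?/1` and the `Logger` calls are real APIs:

```elixir
defmodule ConsolidationCheck do
  require Logger

  # Hypothetical helper: logs whether `protocol` is consolidated
  # and returns the boolean so the caller can act on it.
  def log_status(protocol \\ Enumerable) do
    consolidated? = Protocol.consolidated?(protocol)

    if consolidated? do
      Logger.info("#{inspect(protocol)} is consolidated")
    else
      # Logged at error level so it stands out in the Heroku logs.
      Logger.error("#{inspect(protocol)} is NOT consolidated")
    end

    consolidated?
  end
end
```

Calling `ConsolidationCheck.log_status()` once at boot leaves a line in the logs for every dyno start, so an intermittent `false` would show up there.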

Another possible explanation simply has to do with the load on your app when it starts. Mix lazily loads code, so the first several requests that hit your app will have much higher response times while the code is being loaded. If there’s a lot of load, this could lead to timeouts.
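If you want to rule lazy loading out, one option is to force all of the app's modules to load at boot. This is only a rough sketch: `EagerLoad` and `load_app/1` are made-up names, and it assumes the app's modules are listed in its application spec (which Mix generates for you):

```elixir
defmodule EagerLoad do
  # Hypothetical helper: loads every module listed in the spec of `app`
  # right away instead of waiting for the first call to each module.
  # Returns :ok, or {:error, [{module, reason}]} for modules that failed.
  def load_app(app) do
    # Application.spec/2 returns nil when the application is not loaded.
    modules = Application.spec(app, :modules) || []
    :code.ensure_modules_loaded(modules)
  end
end
```

You would call it with your own OTP app name, e.g. `EagerLoad.load_app(:my_app)` (`:my_app` is a placeholder), before the endpoint starts taking traffic.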


#5

Thanks @benwilson512

“Another possible explanation simply has to do with the load on your app when it starts. Mix lazily loads code, so the first several requests that hit your app will have much higher response times while the code is being loaded. If there’s a lot of load, this could lead to timeouts.”

The problems were ongoing, not just after deployments. Peak loads were between 500 and 1000 req/s. I really think it was some problem with the compiled code.

At least I have some pointers now. We could perhaps run this Protocol.consolidated?(Enumerable) check and submit an error to our error service if it returns false.
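Something along these lines might work. It is a sketch only: `ProtocolAudit` is a made-up module, and `report_fun` stands in for whatever client our error service provides, since I'm not assuming any particular API for it:

```elixir
defmodule ProtocolAudit do
  # Returns the protocols from the given list that are NOT consolidated.
  def unconsolidated(protocols) do
    Enum.reject(protocols, &Protocol.consolidated?/1)
  end

  # `report_fun` is a placeholder for the error service call;
  # it receives a single message string.
  def check_and_report(protocols, report_fun) when is_function(report_fun, 1) do
    case unconsolidated(protocols) do
      [] ->
        :ok

      bad ->
        report_fun.("Unconsolidated protocols: #{inspect(bad)}")
        {:error, bad}
    end
  end
end
```

For example, `ProtocolAudit.check_and_report([Enumerable, String.Chars], &IO.puts/1)` would print a message only when one of the listed protocols is not consolidated.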