Sometimes getting UndefinedFunctionError for a protocol

tomekowal · October 18, 2022, 8:34am

I am on Elixir 1.13.3 (compiled with Erlang/OTP 23), and I have a strange issue.

I have a relatively large project. Sometimes when I run mix test (around 1 in 100 runs), I get:

** (UndefinedFunctionError) function MyApp.Clients.PaymentsAPIClient.create_cancel/2 is undefined or private

That is definitely not true. The function is there in the 99 successful runs. MyApp.Clients.PaymentsAPIClient is a protocol, but it is defined normally in an .ex file inside the lib directory, so nothing is wrong with the paths. I even checked that the .beam file is there.

ls _build/test/lib/my_app/ebin | grep PaymentsAPIClient
Elixir.MyApp.Clients.PaymentsAPIClient.beam

There is another weird issue. The test file has nine tests using that protocol. The first failure says only that the function is undefined or private; the next eight also suggest the very function that was not found:

     ** (UndefinedFunctionError) function MyApp.Clients.PaymentsAPIClient.create_cancel/2 is undefined or private. Did you mean:

           * create_cancel/2

It is almost as if, during the execution of that file, the information about the functions in this protocol was not yet present, but by the time the error messages are printed it is there, so the error message suggests the same function it previously couldn’t find.

To reproduce the issue, I need to run while mix test; do :; done;. It is not the first run, so the files are already compiled.

I haven’t touched the consolidate_protocols flag, and I thought all protocols in lib are consolidated after compilation and before the tests run.
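One way to check that assumption at runtime is to ask the VM directly whether the protocol module is loaded, exports the function, and has been consolidated. The sketch below is self-contained, so it uses a hypothetical Demo.PaymentsClient protocol and Demo.FakeClient struct as stand-ins for the real modules; in the project you would pass MyApp.Clients.PaymentsAPIClient instead, for example from test/test_helper.exs.

```elixir
# Hypothetical stand-ins for the real protocol and its implementation.
defprotocol Demo.PaymentsClient do
  @doc "Stand-in for create_cancel/2 from the thread."
  def create_cancel(client, payment_id)
end

defmodule Demo.FakeClient do
  defstruct []
end

defimpl Demo.PaymentsClient, for: Demo.FakeClient do
  def create_cancel(_client, payment_id), do: {:ok, payment_id}
end

defmodule Demo.ProtocolCheck do
  # Collects the facts that matter when a protocol function is reported
  # as undefined: is the module loaded, is the function exported, and
  # has the protocol been consolidated?
  def report(protocol, fun, arity) do
    %{
      loaded?: Code.ensure_loaded?(protocol),
      exported?: function_exported?(protocol, fun, arity),
      consolidated?: Protocol.consolidated?(protocol)
    }
  end
end

IO.inspect(Demo.ProtocolCheck.report(Demo.PaymentsClient, :create_cancel, 2))
```

Run as a standalone script, the protocol is not consolidated, because consolidation is a Mix compilation step. Inside a Mix project with default settings, consolidated? would be expected to be true.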

Does anybody have an idea how to debug that? Or did someone have a similar issue?

Qqwy · October 18, 2022, 8:49am

What a strange issue. These kinds of things are very difficult to debug. I’ve encountered weird situations like this in projects written in ‘super dynamic’ languages where code might be redefined at every moment, but not yet in Elixir projects, as we usually resort to more structured techniques to modify program behaviour.

Some questions that might help (but who knows; probably not):

- Have you tried re-running the tests with exactly the same test order seed? If so, does the problem appear every time, or is it still only 1/100 runs even then?
- Does the problem only show up when you run your whole test suite, or also when you run this test as an isolated one? There might be some interdependency between tests. Maybe some other weird code causes the protocol to be redefined?
  - You can also try to run a ‘binary search’, halving the number of tests you run each time, to home in on which tests might be conflicting.
- How large is your codebase? How large is your test suite?

Sebb · October 18, 2022, 9:12am

Just a shot in the dark, no idea if that has anything to do with your problem.

Protocol consolidation is applied by default to all Mix projects during compilation. This may be an issue during test. For instance, if you want to implement a protocol during test, the implementation will have no effect, as the protocol has already been consolidated. One possible solution is to include compilation directories that are specific to your test environment in your mix.exs:

https://hexdocs.pm/elixir/1.14/Protocol.html#module-consolidation
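For reference, the mix.exs setup that the quoted docs passage leads into looks roughly like this. This is a sketch of the pattern from the Protocol docs; the module name, app name, and version are placeholders, not taken from the project in this thread.

```elixir
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      # Placeholder app name and version.
      app: :my_app,
      version: "0.1.0",
      # The docs also mention disabling consolidation in tests instead:
      #   consolidate_protocols: Mix.env() != :test
      elixirc_paths: elixirc_paths(Mix.env())
    ]
  end

  # Compile test-only support modules (for example, extra protocol
  # implementations) ahead of time, so they are included in consolidation.
  defp elixirc_paths(:test), do: ["lib", "test/support"]
  defp elixirc_paths(_), do: ["lib"]
end
```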

tomekowal · October 18, 2022, 12:01pm

Thanks, but I don’t believe that is it. The docs are about implementing a protocol during tests, but both the protocol and the only implementation live in lib, which means they are compiled ahead of time. I am not implementing the protocol during the test at any point.

tomekowal · October 18, 2022, 1:13pm

* Have you tried re-running the tests with exactly the same test order seed? If so, does the problem appear every time, or is it still only 1/100 runs even then?

Unfortunately, using the same seed does not reproduce the issue.

* Does the problem only show up when you run your whole test suite, or also when you run this test as an isolated one? There might be some interdependency between tests. Maybe some other weird code causes the protocol to be redefined?

I am going to run while mix test the_file.exs; do :; done to check whether running just this test produces the same problem.

* You can also try to run a ‘binary search’ where you halve the amount of tests you run every time to try to hone in on which tests might be conflicting.

That is next in line, but it will be very time-consuming.

* How large is your codebase? How large is your test suite?

1598 tests

dimitarvp · October 18, 2022, 2:23pm

Probably not a helpful suggestion but just for the sake of experiment, have you tried other OTP versions?

al2o3cr · October 18, 2022, 2:48pm

Are you using any other libraries that manipulate module loading? Those can cause unexpected “what do you mean that’s not defined” weirdness like this if they’re used in tests with async: true.

tomekowal · October 18, 2022, 3:21pm

I just tried running only the one failing module in a loop for more than an hour, and it passed every time. You might be right that it is an interaction between two test suites. I’ll investigate more tomorrow.

michallepicki · October 18, 2022, 4:40pm

I had a similar issue, turned out to be caused by misconfiguration. Maybe this helps:

tomekowal · October 18, 2022, 6:48pm

Thank you all, fellow alchemists!

With your combined help, I’ve managed to locate the issue. I wasn’t aware of it, but we are using the Mock library, which uses :meck under the hood, and :meck reloads the module on the fly for mocking. We had a test that:

a) used Mock with MyApp.Clients.PaymentsAPIClient
b) had async: true
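To see why that combination can produce an UndefinedFunctionError in an unrelated test, here is a minimal simulation that uses only the standard library. It does not use Mock or :meck; it just replaces a loaded module with a version that defines fewer functions, which is roughly the effect a code-reloading mock has on every process in the VM while it is active. Demo.Client, Demo.Swap, and create_refund/1 are hypothetical names.

```elixir
# Silence the "redefining module" warning; for this demonstration only.
Code.put_compiler_option(:ignore_module_conflict, true)

defmodule Demo.Client do
  def create_cancel(id, reason), do: {:ok, id, reason}
  def create_refund(id), do: {:ok, id}
end

defmodule Demo.Swap do
  # Replaces the loaded Demo.Client with a version that only defines
  # create_refund/1. Module replacement is global to the VM, so every
  # process sees the new version at once.
  def install_partial_replacement do
    Code.compile_string("""
    defmodule Demo.Client do
      def create_refund(_id), do: :mocked
    end
    """)

    :ok
  end

  # Plays the role of the unrelated async test that calls the real function.
  def call_create_cancel do
    {:ok, apply(Demo.Client, :create_cancel, [1, :duplicate])}
  rescue
    error in UndefinedFunctionError -> {:error, error}
  end
end

IO.inspect(Demo.Swap.call_create_cancel(), label: "before swap")
Demo.Swap.install_partial_replacement()
IO.inspect(Demo.Swap.call_create_cancel(), label: "during swap")
```

If the original module is put back before the error is formatted, the lookup for similar function names would find the function again, which may be why the later failures suggested create_cancel/2 itself.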

dimitarvp · October 18, 2022, 7:23pm

So what’s the solution, mark that test suite with async: false, or?

tomekowal · October 18, 2022, 7:34pm

Yes, that is it, just use async: false.

But not on the test suite that was failing. I needed to look for the other suite that uses with_mock MyApp.Clients.PaymentsAPIClient. The one failing was safe to use with async: true. The one that needed fixing passed the tests OK because it correctly mocked what it needed. Those libraries that mock stuff using code reloading are poisonous.
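Since the suite that needs fixing is not the one that fails, a quick text scan can help find candidates. The sketch below flags test files that mention one of Mock's macros (with_mock, with_mocks, test_with_mock, setup_with_mocks) and also contain async: true. Demo.MockAudit is a hypothetical helper, and the match is a plain text heuristic, so treat hits as files to review rather than confirmed problems.

```elixir
defmodule Demo.MockAudit do
  # Returns true when the source text both uses one of Mock's macros
  # and opts into async execution.
  def risky?(source) when is_binary(source) do
    uses_mock? =
      Regex.match?(~r/\b(with_mock|with_mocks|test_with_mock|setup_with_mocks)\b/, source)

    async? = Regex.match?(~r/async:\s*true/, source)

    uses_mock? and async?
  end

  # Scans test files under the given glob (the default is an assumption
  # about the usual project layout) and returns the risky paths.
  def risky_files(glob \\ "test/**/*_test.exs") do
    glob
    |> Path.wildcard()
    |> Enum.filter(fn path -> path |> File.read!() |> risky?() end)
  end
end

# Print every candidate file, one per line.
Enum.each(Demo.MockAudit.risky_files(), &IO.puts/1)
```

Each file it prints is a candidate for switching to async: false, or for moving away from a code-reloading mock.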

dimitarvp · October 18, 2022, 8:08pm

Super informative, thank you.

I’ve been using mock sparingly because I really liked its developer experience but this post changes things. I’ve been mostly using mox anyway, now I’ll just use it 100% of the time.

tomekowal · October 18, 2022, 8:26pm

For HTTP requests, I very much like using Tesla. It has adapters for numerous HTTP clients (Finch, Gun, Hackney, Httpc, Ibrowse and Mint) and has nicely composable middleware.

For tests, it defines a special Tesla.Mock adapter that does mocking through the process dictionary (very similar to what mox does). However, it does not require defining a behaviour, which is just a tiny bit less boilerplate in the end. The downside is that it is only for mocking HTTP requests.

dimitarvp · October 18, 2022, 8:42pm

Yep I use Tesla more and more lately myself. Didn’t know it had a small shortcut for mox, that’s even better!

Thanks.