Ecto Multi - Random connection closed errors from Db_Connection

pcharbon · February 28, 2019, 11:03am

I am puzzled at some random errors we are getting in staging. This is a new B2B website project with the back-end HTTP API using the Elixir / Ecto / Phoenix stack. We are getting close to release date and this is worrying us to no-end.

All queries work fine and we have gotten no errors whatsoever during testing and / or on staging website.
The problem seems to only be during actual insertion of new data via Ecto.multi.

I am personally pretty new to Elixir itself so I am at a bit of a loss at where to look. Here is an example transaction:

 multi =
      Multi.new
      |> Multi.insert(:salesorder, salesorder_changeset)                                      # Create a sales order operation
      |> Multi.run(:address, fn %{salesorder: salesorder} ->                                 # Create address changeset 
                    address_changeset =
                      %Salesorderaddress{salesorderid: salesorder.id}
                      |> Salesorderddress.changeset(soldto)
                    Repo.insert(address_changeset)
                    end )
    |> (more operations)

This sometimes work or not depending on the server mood it seems…
Here is the error we are getting in the console:

[error] #PID<0.453.0> running CmiserverWeb.Endpoint (cowboy_protocol) terminated
Server: localhost:4000 (http)
Request: POST /api/v1/auth/checkout/submitorder
** (exit) an exception was raised:
    ** (Mssqlex.Error) connection_closed
        (db_connection) lib/db_connection.ex:802: DBConnection.transaction/3
        (ecto) lib/ecto/repo/queryable.ex:21: Ecto.Repo.Queryable.transaction/4
        (cmiserver) lib/cmiserver/orders.ex:578: Cmiserver.Orders.submit_multi/4
        (cmiserver) lib/cmiserver_web/controllers/auth/checkout/checkout_submitorder.ex:22: CmiserverWeb.Auth.Checkout.SubmitOrderController.index/2
        (cmiserver) lib/cmiserver_web/controllers/auth/checkout/checkout_submitorder.ex:1: CmiserverWeb.Auth.Checkout.SubmitOrderController.action/2
        (cmiserver) lib/cmiserver_web/controllers/auth/checkout/checkout_submitorder.ex:1: CmiserverWeb.Auth.Checkout.SubmitOrderController.phoenix_controller_pipeline/2

I looked at the code of both the Ecto MS SQL Adapter (mssql-ecto) and its MS SQL driver (mssqlex) and from what I could figure out (definitely beyond my skills Elixir-wise) the error seems to come from Db_connection.

Due to our use of MSSQL (not a choice!) as the database server we are currently limited in dependencies to these versions:

      {:ecto, "~> 2.2.0"},
      {:mssql_ecto, "~> 1.2.0"},
      {:mssqlex, "~> 1.1.0"}

So far we have tried changing the :pool_size and the :timeout on the Repo without success either.
I am completely at a loss as where to look next. I wish we could go with Postgrex but changing the back-end of the in-house accounting system (C#, EF, etc) would be also a huge project.

Any help would be much appreciated.

Thank you in advance

dimitarvp · March 1, 2019, 10:49pm

Not sure we should blame db_connection. Looks like a connection is being closed before it has to.

I’m afraid your only option would be to try and communicate with MSSQL with hand-crafted code? Not experienced with it but this does look like a showstopper.

axelson · March 1, 2019, 10:56pm

For what it’s worth I often see a similar error with Postgres. But not at a high enough frequency to make it worthwhile for me to investigate deeply

pcharbon · March 2, 2019, 9:35am

I was afraid I would have to go the hand-crafted way and use db_connection directly. I am not sure I have the Elixir skills for it as this is my first project with it but we will look into it. I am sure db_connection is just reporting the connection closing which then bubbles up to mssqlex and its ecto-adapter. The problem I seem to have currently is also that with the number of layer you loose some control on what is happening.

Thank you guys.

wanton7 · March 2, 2019, 10:44am

I’m also planning to use Microsoft SQL Server so this worries me. Are you running same runtime version in production where this happens? I read there was race condition that was fixed in OTP 21.0.2 https://github.com/elixir-ecto/db_connection/issues/127

wanton7 · March 2, 2019, 12:36pm

Also what operating system are you running in production? Because if you running C# you might be running in it running it on Windows. If you are then you might be hitting some Windows specific issue. I’ve never read anyone running Elixir in production on Windows.

pcharbon · March 4, 2019, 10:17am

Indeed it is on a Windows Server 2012 R2 machine. I had read about that issue with the race condition but I was not sure it was the case. I upgraded the OTP version on some dev machine (Windows 10) and it still had same issue locally.
The system has not been released in production yet, only to a staging area. The weird thing is really that everything works fine Query wise but not when doing transactions. I guess it might be checking out multiple database connections from Poolboy for transactions while using a single one for queries?

We delayed the launch for now and are considering different options, including switching to Postgreslq and thus be able to work with latest release of Ecto and adapters.

Theoretically changing the database adapter for for the accounting system should be doable but that could also open a whole new bunch of issues.

Thanks for the link to the issue.

I will you know if we figure this out.

wanton7 · March 4, 2019, 2:12pm

It’s really odd that queries don’t have same connection closed problem if both are using same pool. Your error looks to me like code trying to create database transaction but can’t because connection is already closed.

From here Error handling

Connection termination - If a connection is terminated in an abnormal way, or if you try to use a connection that you have already terminated in a normal way by calling disconnect/1, the return value will be{error, connection_closed}. A connection could end abnormally because of an programming error in the Erlang ODBC application, but also if the ODBC driver crashes.

So it could be caused by ODBC driver crash. Maybe updating your ODBC driver will help? https://www.microsoft.com/download/details.aspx?id=56567

From here Erlang -- odbc

If trace mode is turned on this tells the ODBC driver to write a trace log to the file SQL.LOG that is placed in the current directory of the erlang emulator. This information may be useful if you suspect there might be a bug in the erlang ODBC application, and it might be relevant for you to send this file to our support. Otherwise you will probably not have much use of this.

Maybe you should turn on that log and check it if that reveals more information from your problem.

Also if you are running Windows 10 in dev machine and same problem exists, you could test with Linux using Windows Subsystem for Linux (WSL), if same problem exists? This should help Install WSL | Microsoft Learn
Then just go to cmd and write bash enter and you get Linux (Ubuntu shell). Then install Elixir to that Linux side and test. This blog might help as well https://medium.com/@colinrubbert/installing-elixir-phoenix-in-windows-10-w-bash-postgresql-ead9c1ce595c
Or you could just install Linux into virtual machine like VirtualBox and test it there. At least then you know if problem you are experiencing is Windows related or not.

wanton7 · March 4, 2019, 3:20pm

After more research I’ve seen some texts recommending to use ODBC driver SQL Server Native Client 11.0 when using Ecto. Like here https://github.com/findmypast-oss/mssql_ecto but from what I’ve read it’s deprecated (end of mainstream support) https://blogs.msdn.microsoft.com/sqlreleaseservices/snac-lifecycle-explained/. So if you are using that old driver maybe you shouldn’t and use latest ODBC Driver 17 for SQL Server instead

pcharbon · March 4, 2019, 3:50pm

I have actually tried with both the ODBC Driver 17 For SQL Server as well as ODBC 13.1 for SQL Server locally, but not on server.
It did not seem to have any effect unfortunately. I will try your suggestion and turn on Erlang ODBC logs and see if there is some information in there.

I saw that db_connection itself allows to pass a different timeout to the connection pool itself but did not see where it could be passed down from MSSQLEX.

It seems someone has started a branch to update the db_connection version to 2.0 in MSSQLEX and hopefully that can eventually help with this issue.

Thanks for the help.

Much appreciated

wanton7 · June 13, 2019, 9:31pm

Not sure if you still fighting with this one.

I found out there is a TDS driver https://github.com/livehelpnow/tds that you can use instead ODBC driver to connect to SQL Server 2008+. It supports TDS versions 7.3 and 7.4, that’s why it can only connect SQL Server 2008 and up. It seems to supports Ecto 3.1.0+.

Medium article related to it https://medium.com/asolvi/update-azure-sql-with-elixir-c77da7cc4197

pcharbon · June 13, 2019, 11:30pm

Never found a solution so will have a look at that. Thanks for letting me know about it.