I think Elixir 2.0 should drop structs

Yeah, thanks! I’m talking to my team now to see if we can build a project to reproduce what we thought we had been experiencing (to make sure we weren’t just deluding ourselves and misinterpreting data as well as to provide feedback here since it sounds like there are technical reasons we shouldn’t have experienced what we thought we were). And it’s totally possible we were wrong.

It might take a bit, though, as we’re in the middle of a bunch of stuff – but the basic setup will need to send structs between 2 nodes on different versions of a codebase using OTP 22 cluster, rpc and gzip. I’ll update this thread when we have something definitive.
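In the meantime, the core of that round trip can be sketched locally (an assumption on my part: this mimics the serialize/compress/deserialize path without a real two-node OTP 22 cluster; `RoundTrip` and `Payload` are made-up names). A struct is just a map with a `__struct__` key, so `term_to_binary`/`binary_to_term` preserve it byte for byte; any breakage would have to come from the receiving code expecting different fields:

```elixir
defmodule RoundTrip do
  defmodule Payload do
    defstruct [:id, :data]
  end

  def run do
    payload = %Payload{id: 1, data: "hello"}
    # :compressed stands in for the gzip aspect of the original setup
    bin = :erlang.term_to_binary(payload, [:compressed])
    decoded = :erlang.binary_to_term(bin)
    decoded == payload
  end
end
```

On a single codebase version this always returns `true`; the interesting case is decoding on a node compiled against a different struct definition.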

Thanks!

12 Likes

This is something that bit elixir-ls lately: Breaking changes in Range on 1.12 · Issue #11110 · elixir-lang/elixir · GitHub
TL;DR: the Elixir standard library itself was matching on struct fields, so structs created with code compiled on an earlier Elixir broke on a newer one.

@darkmarmot what you are experiencing is quite expected in any situation where a data schema may change over time. Data schema versioning issues pop up everywhere (e.g. databases, RPC protocols, even memory layout in C structs). That’s why code dealing with such data needs to be extra careful not to expect things that may not be there, and all changes to the schema should be additive rather than destructive (for forwards and backwards compatibility).

I second @wojtekmach on this: there is nothing about structs that is inherently more difficult to deal with when it comes to hot upgrades. As pointed out, pattern matching on structs is already basically duck typed. There is nothing stopping you from creating a struct by hand, e.g. %{__struct__: DateTime, foo: :bar}, and then passing it around. Obviously things will explode pretty quickly, but as long as matches are only looking at the __struct__ field, i.e. %DateTime{} = %{__struct__: DateTime, foo: :bar}, nothing will fail until you actually access one of the missing fields.
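The duck typing described above is easy to see in a few lines (a minimal sketch; the hand-built map is deliberately missing every real `DateTime` field):

```elixir
# A struct is just a map with a __struct__ key, so a hand-built map matches
# a struct pattern as long as the pattern only mentions fields it has:
fake = %{__struct__: DateTime, foo: :bar}

# %DateTime{} only asserts __struct__ == DateTime, so this match succeeds
# even though every real DateTime field is absent:
%DateTime{} = fake

# It only fails once a missing field is actually accessed:
# fake.year  # raises KeyError
```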

I think the trouble with hot upgrades in general is that it is very difficult to carefully reason about how the upgrade process will occur, which is why testing them is so important. You can of course look at the upgrade script to get an idea of what order things will happen in, but it won’t tell you if you have any old versions of data structures hanging around, regardless of whether it’s a struct, a record, or just a plain old tuple/list/map. Obviously anything you create post-upgrade will work properly, and anything you hold in process state can be upgraded predictably, but if you stuff a struct into an ETS table, do a hot upgrade, then fetch that struct out of ETS, you’re going to get the old version. So you need to make sure that upgrading the data in that table happens as part of the overall upgrade as well, which can be tricky to say the least, primarily for public/protected tables, since you can have readers/writers that are not upgraded yet and may choke on the new schema. Taking into account local vs external function calls, and intra/cross-node messaging, is just another set of layers on the problem.
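One defensive pattern for the ETS case above is to normalize rows on the way out of the table, so readers tolerate both old and new shapes rather than assuming the appup migrated everything. This is only a sketch of that idea (the `StaleReads.normalize/1` helper is my own illustration, not an existing API), using the Range change from 1.12 as the example:

```elixir
defmodule StaleReads do
  # Rows written before Range grew its :step field (Elixir 1.12) lack that
  # key; fill it in on read instead of trusting that the table was migrated.
  def normalize(%{__struct__: Range, first: first, last: last} = range) do
    Map.put_new(range, :step, if(first <= last, do: 1, else: -1))
  end
end
```

A reader that pipes every ETS lookup through such a function can survive a window where old and new data coexist.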

It takes an enormous amount of effort to properly orchestrate a system that uses hot upgrades. It’s an awesome capability to have at our disposal, but not only do you have to manage this complexity for your own code, but for that of all your dependencies as well, since it is extremely rare that any of them even bother to plan for hot upgrades, let alone write appups. I’d argue that it is rarely ever worth the effort to use them, except for very small, purpose-built components which are deployed separately from the rest of your system and can be carefully managed. For example, it’s ridiculous (IMO) to build a web application or backend API that uses hot upgrades. But let’s say that the web application provides an interface for a control plane that has some crazy high uptime requirement - the web application itself doesn’t need hot upgrades, but the control plane might, so you build them as separate deployments and design the web application to talk to the control plane using a protocol that rarely, if ever, changes.

I’m digressing wildly from the point here, I guess, but what I’m really getting at is that I don’t think Elixir has done anything to make hot upgrades more difficult than they already were - these problems were all very much present before Elixir existed, and due to the dynamic nature of both Erlang and Elixir, I’m not sure it’s even possible to build tooling that makes it substantially easier. The closest thing to “automatic” appups was what I built into Distillery, but that was only ever intended as a starting point for building out a hot upgrade, since it didn’t do any of the manual work that I mentioned earlier. More often than not, people would try to use them for hot upgrades without any manual auditing.

Elixir itself would need to define appups for each application, for every release, much like Erlang does, and do some level of testing to ensure they work, for there to be any chance of hot upgrades not ending badly anyway. Luckily, the bulk of Elixir is library code, but there are some things that would need to be hot-upgradeable (e.g. Registry). Hard to say whether the core team has the bandwidth for that though, which means if you are using hot upgrades and building on top of Elixir, you need to be writing the appups for things in Elixir that you use which require upgrading. I suspect very few people using hot upgrades are doing this.
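For reference, the appups mentioned above use the standard Erlang `.appup` term format; a hypothetical one might look roughly like this (the application, module, and version names are made up for illustration):

```erlang
%% myapp.appup (hypothetical): upgrade and downgrade instructions between
%% versions 1.0.0 and 1.1.0. 'Elixir.MyApp.Worker' is the Erlang name of an
%% Elixir module whose process state needs a code_change/3 migration.
{"1.1.0",
 [{"1.0.0", [{update, 'Elixir.MyApp.Worker', {advanced, []}}]}],
 [{"1.0.0", [{update, 'Elixir.MyApp.Worker', {advanced, []}}]}]}.
```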

13 Likes

@darkmarmot Yep, changes to a struct’s definition can be challenging in the context of hot code reloads, especially when you don’t own the struct, like Range from the standard library. Luckily, the Elixir core team took care of this and provided typespecs for all structs in the standard library.

Having these, it’s possible to detect incompatible changes to struct shapes in the GenServer.code_change/3 callback at runtime during a hot code reload. That can be done by validating the conformance of the struct value to the updated typespec, e.g. with the Domo library (a biased opinion, since I’m the author :smile:), like the following:

defmodule MyStruct do
  use Domo

  defstruct [:range]

  @type t :: %__MODULE__{range: Range.t() | nil}

  # Domo adds the following functions:
  # new!/1
  # new_ok/2
  # ensure_type!/1
  # ensure_type_ok/2
  # typed_fields/1
  # required_fields/1
end

# Running with Elixir (1.11.0)

iex(1)> struct = %MyStruct{range: 1..5}
%MyStruct{range: 1..5}
iex(2)> File.write!("state.bin", :erlang.term_to_binary(struct))      
:ok
iex(3)> File.read!("state.bin") |> :erlang.binary_to_term() |> MyStruct.ensure_type_ok()
{:ok, %MyStruct{range: 1..5}}

# Then running the same instruction with Elixir (1.12.2) gives the following error:

iex(1)> File.read!("state.bin") |> :erlang.binary_to_term() |> MyStruct.ensure_type_ok()
{:error,
 [
   range: "Invalid value %Inspect.Error{message: \"got FunctionClauseError with \
message \\\"no function clause matching in Inspect.Range.inspect/2\\\" while \
inspecting %{__struct__: Range, first: 1, last: 5}\"} for field :range of %MyStruct{}. \
Expected the value matching the %Range{first: integer(), last: integer(), step: pos_integer()} \
| %Range{first: integer(), last: integer(), step: neg_integer()} | nil type."
 ]}

The next step is to migrate the state that became invalid, as suggested by @lukaszsamson, or even discard it, depending on what is better for the concrete workflow.
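That migration step could be sketched like this (my own illustration, not part of Domo: `MyServer` is hypothetical, and the state is a plain map holding the range so the example stays self-contained). The code_change/3 clause repairs a pre-1.12 Range, which lacks the :step field, before the new code touches it:

```elixir
defmodule MyServer do
  use GenServer

  def init(state), do: {:ok, state}

  # Old state: the stored Range predates Elixir 1.12 and has no :step key.
  # Backfill it during the hot upgrade, before validation or any field access.
  def code_change(_old_vsn, %{range: %{first: f, last: l} = r} = state, _extra)
      when not is_map_key(r, :step) do
    step = if f <= l, do: 1, else: -1
    {:ok, %{state | range: Map.put(r, :step, step)}}
  end

  # State already in the new shape: pass it through unchanged.
  def code_change(_old_vsn, state, _extra), do: {:ok, state}
end
```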

1 Like