I think Elixir 2.0 should drop structs

Someone recently asked “What feature would you most like removed from your language?”

And, for me, it has to be structs, or at least the way Elixir currently handles them.

In my job, we do a lot of deployments on distributed clusters, either pushing hot code updates against running servers for small changes or running multiple versions of our code side-by-side in the cluster as we do rolling deployments for large changes.

Thus far, we’ve had zero downtime running with a cluster of roughly 30 servers (knock hard on wood) over the last year and a half (since we went into production).

Our biggest difficulty (and danger) has been Elixir’s structs. While Erlang was designed for our kind of environment, it feels like Elixir broke away from it with its struct implementation.

To explain, functions that handle structs don’t happily duck-type them. If you make changes to a struct definition and move it between nodes, you can’t match on it unless all the fields are in perfect agreement. So there’s no easy way to update them in the running system.

We don’t let any data that moves between nodes contain them.

And this is where it gets really ugly. A lot of Elixir’s base data types are implemented as structs, such as Range and DateTime.

So, what seems like a simple non-breaking change for the language, as when the recent version of Elixir added step to Range, could actually cause catastrophic chaos on our distributed system.
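To make the Range concern concrete, here is a minimal sketch, assuming Elixir 1.12 or later (where Range carries a step field). The module RangeShapeDemo and its functions are hypothetical, and Map.delete/2 is only used to simulate a range built by a node running an older Elixir.

```elixir
defmodule RangeShapeDemo do
  # Simulate a range as a pre-`step` node would have built it.
  # Assumption: on the current node, ranges carry :first, :last and :step.
  def old_shape_range do
    Map.delete(1..3, :step)
  end

  # Matches on the struct name only, so an old-shape range is accepted.
  def range?(%Range{}), do: true
  def range?(_other), do: false

  # Names :step in the pattern, so an old-shape range is rejected.
  def step_of(%Range{step: step}), do: {:ok, step}
  def step_of(_other), do: :error
end
```

Whether this bites in practice depends on which patterns the receiving code (including library code) actually uses.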

I love the language, so I would propose that Elixir consider making structs act as duck-typed maps in the future; I hate to see it limiting the power of the original VM and ecosystem.

Rather than removing structs I would be interested in hearing how structs and hot-upgrades could be improved to work better together. I think there’s value in both, and removing either would be a huge change that would likely make a large proportion of users (most users?) of Elixir unable to upgrade to this hypothetical version.

I would expect Erlang to be worse as records are even harder to match on structurally than Elixir’s structs are.

I would be happy to have things under the hood use the __struct__ field for things like protocols and types. I would just like to be able to mostly use them more like maps in general. As for records, they aren’t something that’s ever come up as an issue for us as we mostly don’t use them.

Removing structs = no protocols :cry:

But protocols are just “type”-based dependency injection (thinking of structs as types). I’d rather have explicit, options-based dependency injection. I think the only place they’re not interchangeable is in nested stuff.

Why can’t the language just dispatch on matching %{__struct__: some_name} and ignore the rest of the fields?

They come up rather a lot in Erlang libraries and codebases in my experience, so I think Elixir may do better than Erlang here.

This would not work if a field is added, as it would crash when the field is found to be missing. That, or you would have corrupt data after the upgrade due to missing fields.

The struct pattern is effectively an assertion that your data conforms to the schema. If an upgrade fails to migrate the state to the new format and these assertions do not catch it, errors are likely to surface later, when the data is actually used.
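As a rough illustration of that trade-off, consider a hypothetical Account struct where :email was added in the new release. Both functions below are invented for this sketch; the point is only where the failure shows up when old-shape data arrives.

```elixir
defmodule Account do
  # Hypothetical struct; imagine :email was added in the new release.
  defstruct [:id, :email]

  # Dispatches on the struct name only, so old-shape data gets in
  # and the failure happens later, inside the body.
  def domain(%Account{} = account) do
    # `account.email` raises KeyError when the key is absent.
    account.email |> String.split("@") |> List.last()
  end

  # Names the new field, so old-shape data is rejected at the function head.
  def strict_domain(%Account{email: email}) do
    email |> String.split("@") |> List.last()
  end
end
```

With a map that has the Account struct name but no :email key, domain/1 fails with a KeyError in its body, while strict_domain/1 fails with a FunctionClauseError at the boundary.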

Yeah, that could definitely be from my comparative lack of experience in Erlang.

Perhaps structs could also include a __version__ number such that code could handle structs with different formats?

There’s not any way to know when the version needs to be bumped or what should happen when there is a version mismatch. At best we can check the structure of the struct and crash if the data doesn’t match the schema, which is what structs currently do.

Until someone makes a sufficiently powerful type checking tool that could be run against both versions of the codebase, the business of upgrading state in hot upgrades is going to be a manual process that the programmer implements via the appropriate OTP callbacks. It’s a very challenging job, and I recall reading once that Ericsson spends as much time developing and testing their upgrades as their application code.
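For readers who have not written one, the relevant GenServer callback is code_change/3. Below is a minimal sketch with a hypothetical Counter server whose state gained a :step key in the new release; the state shapes and the default of 1 are assumptions made up for the example.

```elixir
defmodule Counter do
  use GenServer

  # Hypothetical state shapes:
  #   old release: %{count: integer}
  #   new release: %{count: integer, step: integer}

  @impl true
  def init(count), do: {:ok, %{count: count, step: 1}}

  @impl true
  def handle_call(:bump, _from, %{count: count, step: step} = state) do
    new_count = count + step
    {:reply, new_count, %{state | count: new_count}}
  end

  # Invoked during a release upgrade; migrates old-shape state by hand.
  # State that already has :step is left untouched.
  @impl true
  def code_change(_old_vsn, state, _extra) do
    {:ok, Map.put_new(state, :step, 1)}
  end
end
```

Nothing checks that this migration is complete or correct, which is exactly the manual burden described above.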

We often do daily multi-version deployments or hot code upgrades on our cluster. For us, the key is that the data transferred between nodes has to be inferred (no structs, just maps), and we avoid hot code loads that modify GenServer states (our hot loads usually just modify pure functions).
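One way such a "no structs on the wire" rule could be enforced is sketched below. This is an assumption about the general approach, not a description of our actual code: the Wire module and to_plain/1 are hypothetical names, and the sketch only walks map values and lists (not tuples or map keys).

```elixir
defmodule Wire do
  # Hypothetical helper: recursively turn structs into plain maps before a
  # term leaves the node, so receivers never depend on struct definitions.
  def to_plain(%_{} = struct), do: struct |> Map.from_struct() |> to_plain()

  def to_plain(map) when is_map(map) do
    Map.new(map, fn {key, value} -> {key, to_plain(value)} end)
  end

  def to_plain(list) when is_list(list), do: Enum.map(list, &to_plain/1)
  def to_plain(other), do: other
end
```

The cost is that the receiver loses the struct name, so things like protocol dispatch and Date or Range functions no longer apply to the converted data.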

Our setup is covered in some detail in this talk: KEYNOTE: Using the Beam to fight COVID-19 - Bryan Hunter | Code BEAM V - YouTube

P.S. – Your work on Gleam is inspiring!

I think this versioning problem is a really interesting problem actually. In Haskell there is a library called SafeCopy that attempts to solve the same problem; it is used for restoring binary serialized data in an updated version of the “same” logical type. Basically you have a separate type for each version that has existed, and a typeclass (similar to protocol) that is responsible for migrating the data from the previous version to some later version. I think something conceptually similar could be done in an Elixir library though the details would be completely different (there is no multi-type protocol in Elixir for one thing, but you could manipulate structs as raw maps in a migration function).

It is a very small fraction of the Elixir community that even uses hot code loading so I wouldn’t expect language changes to address it, but I think it could be done in a library and even be pretty usable with the right macros.
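A very loose sketch of what the library idea might look like, working on raw maps. Everything here is hypothetical: the UserMigration module, the :__version__ key, and the three versions of the "user" shape are invented for illustration, and this is not how SafeCopy itself is implemented.

```elixir
defmodule UserMigration do
  # Hypothetical versions of a "user" record, handled as raw maps:
  #   v1: %{name: "Ada Lovelace"}            (no :__version__ key)
  #   v2: %{first_name: ..., last_name: ...}
  #   v3: v2 plus :locale
  @current_version 3

  # Data without a :__version__ key is assumed to be version 1.
  def upgrade(data) when is_map(data) do
    migrate(data, Map.get(data, :__version__, 1))
  end

  # Already current: nothing to do.
  defp migrate(data, @current_version), do: data

  # v1 -> v2: split the single name field in two.
  defp migrate(data, 1) do
    [first, last] = String.split(data.name, " ", parts: 2)

    data
    |> Map.delete(:name)
    |> Map.merge(%{first_name: first, last_name: last, __version__: 2})
    |> migrate(2)
  end

  # v2 -> v3: add a default locale.
  defp migrate(data, 2) do
    data
    |> Map.merge(%{locale: "en", __version__: 3})
    |> migrate(3)
  end
end
```

Each step only knows how to get from one version to the next, so old data is carried forward through the whole chain; a macro layer could generate the boilerplate.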

I must be missing something because I don’t see where the problem is. Please bear with me. This Elixir code:

def f(%Date{} = date) do
  date
end

compiles to this Erlang code:

f(#{'__struct__' := 'Elixir.Date'} = _date@1) ->
    _date@1.

so at runtime there are no checks on anything besides the __struct__ field.

Of course, if you choose to write f(%Date{year: year}) and you pass something other than a map matching %{__struct__: Date, year: year}, then that’s gonna be a match error.

This function:

def f(%Date{} = date) do
  date.year
end

compiles to:

f(#{'__struct__' := 'Elixir.Date'} = _date@1) ->
    case _date@1 of
        #{year := _@1} -> _@1;
        _@1 when erlang:is_map(_@1) ->
            erlang:error({badkey, year, _@1});
        _@1 -> _@1:year()
    end.

so there are no additional checks either.

If you don’t want this behaviour of date.year, then call Map.get(date, :year)? Or implement Access for your structs? (You cannot implement it for things you don’t own.)
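To spell out the difference between the two access styles, here is a small sketch. The FieldAccess module is hypothetical; it just wraps date.year and Map.get/3 so they can be compared on a Date-shaped map that is missing its :year key.

```elixir
defmodule FieldAccess do
  # Dot access: raises KeyError if :year is missing.
  def strict_year(%Date{} = date), do: date.year

  # Map.get/3: returns the default (nil unless given) if :year is missing.
  def lenient_year(%Date{} = date, default \\ nil) do
    Map.get(date, :year, default)
  end
end
```

Both clauses accept the incomplete map, since %Date{} only checks the struct name; they differ only in what happens when the key is read.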

Is it less about your own code, where you can choose not to use structs, and more about libraries that you want to use which do use them, so that you run the danger of runtime errors when things change?

I’m really curious what exactly you mean by structs becoming more like duck-typed maps. Which semantics of structs are you proposing to change? How should they behave?

I think it’s about the internal Elixir structs.

I probably should’ve gotten with my team first to put together a list of places where stuff has gone badly for us here. I think we’ve gotten a bit of PTSD about using them at this point, maybe undeservedly in some places.

I have a vague recollection of problems with :erlang.binary_to_term and structs with different definitions, but they don’t seem to happen on 1.12 + OTP24:

iex(1)> defmodule Foo do
...(1)>   defstruct [:foo]
...(1)> end

iex(2)> %Foo{foo: "bar"}
%Foo{foo: "bar"}

iex(3)> %Foo{foo: "bar"} |> :erlang.term_to_binary()
<<131, 116, 0, 0, 0, 2, 100, 0, 10, 95, 95, 115, 116, 114, 117, 99, 116, 95, 95,
  100, 0, 10, 69, 108, 105, 120, 105, 114, 46, 70, 111, 111, 100, 0, 3, 102,
  111, 111, 109, 0, 0, 0, 3, 98, 97, 114>>

---RESTART IEX---

iex(1)> defmodule Foo do
...(1)>   defstruct [:foo, :bar]
...(1)> end

iex(2)> binary = <<131, 116, 0, 0, 0, 2, 100, 0, 10, 95, 95, 115, 116, 114, 117, 99, 116, 95, 95,
...(2)>   100, 0, 10, 69, 108, 105, 120, 105, 114, 46, 70, 111, 111, 100, 0, 3, 102,
...(2)>   111, 111, 109, 0, 0, 0, 3, 98, 97, 114>>

iex(3)> :erlang.binary_to_term(binary)
%{__struct__: Foo, foo: "bar"}

---RESTART IEX---

iex(1)> defmodule Foo do
...(1)>   defstruct [:bar]
...(1)> end

iex(2)> binary = <<131, 116, 0, 0, 0, 2, 100, 0, 10, 95, 95, 115, 116, 114, 117, 99, 116, 95, 95,
...(2)>   100, 0, 10, 69, 108, 105, 120, 105, 114, 46, 70, 111, 111, 100, 0, 3, 102,
...(2)>   111, 111, 109, 0, 0, 0, 3, 98, 97, 114>>

iex(3)> :erlang.binary_to_term(binary)
%{__struct__: Foo, foo: "bar"}

Even adding @enforce_keys didn’t change that last result :thinking:

Well, the explanation is simple: binary_to_term/1 knows nothing about structs. When you write %Foo{foo: "bar"}, what Elixir logically does is check in the compiler cache whether that construct is allowed and then create a map with all the values. When you do something like struct!(Foo, foo: "bar"), it will “behind the scenes” call the hidden “magical” function that validates the data. binary_to_term/1 is a “dumb” deserialiser that doesn’t know anything about Elixir structs, so it happily decodes the encoded map. There is no way to circumvent that, and trying to would make it really hard to use Distributed Erlang with structs. So it all comes down to the law of leaky abstractions.
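A small sketch of that contrast, reusing the Foo example from the previous post with :foo renamed to :bar. The WireDemo module and its function names are hypothetical; struct!/2 is used as the runtime path that does consult the struct definition.

```elixir
defmodule Foo do
  # The "new" definition: :foo was renamed to :bar.
  defstruct [:bar]
end

defmodule WireDemo do
  # The external term format round-trips any map unchanged;
  # nothing here looks at Foo's definition.
  def roundtrip(term) do
    term |> :erlang.term_to_binary() |> :erlang.binary_to_term()
  end

  # struct!/2 goes through the struct definition and rejects unknown keys.
  def build(fields) do
    {:ok, struct!(Foo, fields)}
  rescue
    KeyError -> {:error, :unknown_key}
  end
end
```

So a map shaped like the old %Foo{foo: "bar"} survives roundtrip/1 intact, while build(foo: "bar") is rejected.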

@darkmarmot The only case where things could get ugly after an upgrade is if you have renamed or removed keys from the struct; otherwise it should work perfectly fine. That is the reason Elixir uses map-backed structs instead of tuple-based records as Erlang does: you can “painlessly” add new fields or reorder fields in the struct without needing to change all the pattern matches and recompile the modules that use the structure. So either you have done some real magic that broke that, or you have done something that would break the invariants regardless of which construct you used. So show us what change caused that mayhem, because right now we are flying blind with our explanations.

@darkmarmot It seems in this thread there’s some disagreement or at least confusion about in what ways things can break due to changing struct shapes.

Would you be willing to share a specific piece of code (or distilled example) and elaborate on in what sense it broke? (The specific runtime error that happened, or whatever else happened.)