Ruby 2.2 and up garbage collect Symbols: Could Erlang do the same?

Qqwy · November 6, 2018, 11:23pm

At today’s Ruby/Elixir meetup here in Groningen, someone told me that since Ruby 2.2, Ruby garbage-collects symbols (the Ruby equivalent of Elixir’s atoms). This was very interesting to me, because creating symbols from user input is a common DoS opportunity that many Ruby and Erlang/Elixir applications have suffered from: the atom/symbol table fills up and uses more and more memory.

So my question: would it be possible to use an approach similar to Ruby’s in the BEAM VM as well? Ruby splits symbols into two kinds: one GC-able, because only Ruby code inside the current scope touched it, and one non-GC-able, because some native code touched it and might have kept a reference to it somewhere.
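
For anyone who wants to see the Ruby behaviour for themselves, here is a minimal sketch. It assumes Ruby 2.2 or newer; the method name `symbol_table_growth` and the string prefix are made up for illustration. It interns a batch of throwaway strings, forces a GC run, and reports how much `Symbol.all_symbols` grew. The exact number depends on the Ruby version and on what the collector can prove is unreachable, so treat it as an experiment rather than a guarantee.

```ruby
# Counts how many entries the symbol table gains after interning `count`
# throwaway strings and then forcing a full GC run.
def symbol_table_growth(count)
  before = Symbol.all_symbols.size
  # Symbols created via String#to_sym are the dynamic (collectable) kind.
  count.times { |i| "throwaway_user_input_#{i}".to_sym }
  GC.start
  Symbol.all_symbols.size - before
end

growth = symbol_table_growth(100_000)
puts "symbol table grew by #{growth} entries"
```

On a Ruby older than 2.2, the growth should be roughly equal to the number of strings interned, since nothing is ever removed from the table.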

lpil · November 7, 2018, 12:36am

This would have performance implications, as comparing atoms would no longer mean comparing two integers. Instead it would mean comparing binaries or lists, which requires traversing both of them, so longer atoms would be more expensive to compare. We compare atoms a lot in Elixir/Erlang, so it doesn’t seem very appealing to me. I would much rather be limited to never generating atoms at runtime.
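
To make the "comparing two integers" point concrete, here is a toy intern table in Ruby. It is only a sketch of the general interning technique, not how the BEAM (or Ruby) implements its table; the class `InternTable` and its methods are hypothetical names. Each distinct name is stored once and mapped to a small integer, so equality checks compare integers regardless of how long the name is. Nothing is ever removed, so the table only grows, which is the memory problem this thread is about.

```ruby
# Toy intern table: each distinct name gets exactly one small integer id.
class InternTable
  def initialize
    @ids = {}     # name => integer id
    @names = []   # integer id => name
  end

  # Returns the existing id for `name`, or assigns the next free one.
  def intern(name)
    @ids[name] ||= begin
      @names << name.dup.freeze
      @names.size - 1
    end
  end

  # Looks up the name that belongs to an id.
  def name_of(id)
    @names.fetch(id)
  end

  # Number of distinct names interned so far; it never decreases.
  def size
    @names.size
  end
end

table = InternTable.new
ok_a = table.intern("ok")
ok_b = table.intern("ok")
error = table.intern("error")
puts ok_a == ok_b   # integer comparison, independent of name length
puts ok_a == error
```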

fuelen · November 7, 2018, 5:23am

This is the problem, not the symbols/atoms themselves. Why should the VM add crutches to compensate for users misusing the feature?

michalmuskala · November 7, 2018, 8:03am

Of course it’s possible, there’s even an Erlang Enhancement Proposal from 2008 about doing it - EEP 20. I can only speculate that the only reason we don’t have it is that there are other things that were deemed more important and nobody spent cycles actually implementing it.

garazdawi · November 7, 2018, 8:37am

There have been suggestions to do it in Erlang for quite some time: http://erlang.org/eeps/eep-0020.md

However, the advantages have not outweighed the disadvantages yet. There are bound to be more discussions of the pros/cons in the EEP mailing list archive if you are interested.

hubertlepicki · November 7, 2018, 8:58am

I think it’s going to be harder in Erlang because of the way GC works there versus how Ruby’s GC works.

In Ruby, GC runs globally, within the GIL. In Erlang, it runs within the scope of a single process, freeing the memory of that single process (roughly speaking). I know you know that, just explaining for others btw.

Freeing up symbols would be a special edge case for the GC, so not something it’s designed to do in principle, and I suspect it’d be quite tricky to do without putting strain on the performance of the system. The proposal @michalmuskala links to is really smart (local/global atoms split), but it’s also tricky and I do not think it solves all cases. And there’s value in keeping the code base simple. As a result, I think we might get it some time, or never get it.

Qqwy · November 7, 2018, 9:15am

Because it is an example of a leaky abstraction, where the end developer (the application developer) has to constantly be mindful of what they are doing in order not to end up with an insecure system. The problem also announces itself incredibly late: you could have a codebase running for years before someone figures out that there is some unsafe string->atom conversion going on, which brings down your system.
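
Until the runtime handles this for you, the usual defence is to never intern arbitrary input. Erlang offers `list_to_existing_atom/1` and Elixir offers `String.to_existing_atom/1` for this. A rough Ruby sketch of the same idea with an explicit allowlist follows; `ALLOWED_KEYS` and `safe_symbol` are hypothetical names for illustration, not part of any library.

```ruby
# Hypothetical allowlist: the only keys the application expects from users.
ALLOWED_KEYS = %w[name email age].freeze

# Returns the matching symbol, or nil when the input is not on the allowlist,
# so untrusted input can never add a new entry to the symbol table.
def safe_symbol(input)
  key = input.to_s
  ALLOWED_KEYS.include?(key) ? key.to_sym : nil
end

p safe_symbol("email")
p safe_symbol("attacker_controlled_123")
```

The trade-off is that someone has to remember to route every conversion through the guard, which is exactly the kind of vigilance described above.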

@michalmuskala What a great proposal! I disagree with @hubertlepicki, and think that implementing this proposal in Erlang really can help without complicating things for the end user. As stated in the proposal:

I should confess that this proposal doesn’t entirely avoid the crashes and hangs problem. If an Erlang system can be persuaded to load modules from an untrustworthy source, it can still be made to try to create enough atoms to get into trouble. […] However, anyone who loads modules from untrustworthy sources should KNOW they are doing that; it is an obviously dangerous thing to do. list_to_atom/1 is NOT an obviously dangerous function, and it should not be any more dangerous than list_to_binary/1.

which sums up my opinion about this matter as well.

hubertlepicki · November 7, 2018, 9:18am

I agree it helps. It does not eliminate the problems completely, but yes, it’d be progress.

fuelen · November 7, 2018, 9:47am

I thought that was his professional obligation.

Qqwy · November 7, 2018, 11:15am

In theory, it is. In practice, the following situations are all too common (which is the reason for much of today’s buggy and insecure software):

Developers don’t know about a specific problem because they have not encountered it before. I was recently hit by this problem, even though I have been developing web-apps for 8+ years.
Because customers do not see and cannot validate the (in)security of a system or a service, many companies focus on developing quickly rather than securely. Thus there is a treasure trove of IoT devices that hackers can happily incorporate into their botnets, as well as things like IP cameras that are hacked so frequently that there’s a public directory of them now. This is a very unfortunate effect of our current capitalism-driven software ecosystem.
We are humans rather than deterministic machines, meaning that even when we try very hard to prevent it, we can and do make mistakes all the time. The Law of Leaky Abstractions means that there are many internal details of our code that surface from time to time to make life hard for us, and we cannot possibly keep all of those internal details in mind all the time while programming. So a platform that makes certain mistakes either impossible or at least less likely is a quality-of-life improvement at the least, and a significant improvement in the overall security and productivity of a stack at best.