Erlang opcode counting at runtime

Hello Elixir forum,

We at Purple are trying to build the first scalable decentralized computing platform and we chose Elixir for two reasons:

  • The Erlang VM and its hot-code loading capabilities can be leveraged for compiling and executing code dynamically.
  • Decentralized software needs to be bulletproof and the Erlang VM is just good at this.

The only problem is that, in order to prevent malicious software (e.g. infinite recursion) from running on nodes, there has to be a way to count the opcodes executed by the Erlang VM at runtime for a given piece of code.

We have not yet found an API for doing such a thing. Is it even possible?

You can analyse .beam files and the ‘Code’ chunk in particular. I have built an extension for VS Code that disassembles the beam files.
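For anyone who wants to poke at that chunk without an editor extension, here is a minimal sketch using `:beam_lib` from OTP. It only fetches the raw bytes of the 'Code' chunk of the standard `:lists` module (chosen purely as an example); decoding the opcodes is a separate job.

```elixir
# Fetch the compiled object code of a module that ships with OTP.
# :lists is only an example; any loaded, non-preloaded module should work.
{:lists, beam_binary, _path} = :code.get_object_code(:lists)

# Ask beam_lib for the raw "Code" chunk. Chunk ids are given as charlists.
{:ok, {:lists, [{_chunk_id, code_chunk}]}} =
  :beam_lib.chunks(beam_binary, [~c"Code"])

IO.puts("Code chunk size: #{byte_size(code_chunk)} bytes")
```

This is static information only: it tells you which instructions exist in the module, not how many of them run.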

Source analysis only gives all the possible opcodes that a certain program can call. What we are trying to do is to count the number of opcodes that have been called by a script while it runs. The problem is that scripts have to be killed if the number of opcodes they call exceeds a specific limit.

I’m not sure if you can do that in the BEAM, but since you have fired it up in some kind of sandbox anyway, you can simply kill the sandbox after X seconds.

We cannot use timers since the script is run on every node that is connected to the network and consensus on the execution of the script is required among nodes with varied hardware and different owners. Opcode counting allows for a script to always stop at a specific call on each node, regardless of the amount of time it took for that node to compute the script.

You can probably generate an instrumented version for each beam file. It will wrap calls to the original functions. That way you can measure at a function level. Just guessing here.

Another alternative would be to have custom version of the VM that will handle the counting.

Thanks for your help. We hope that we do not have to fork the VM in order to be able to do this.

Would the reduction count (as returned by process_info) be good enough for your use case?

I don’t quite understand how you plan to stop a process after an exact # of instructions without modifying the schedulers, though.

Note also that the opcode count may be different between VM versions, even with the same compiled files. The VM does some optimizations at load time.

Hmm, this would work. Wouldn’t the process be able to read its own reduction count and exit if it exceeds the imposed limit?

Or rather, assign another process to count the reductions and to send a kill signal to the first process when it hits the limit.

It seems to me that you are considering running user code inside the same BEAM environment as your management code. This is a bad idea; it will be very difficult to prevent user code from breaking whatever security restrictions you want to put on it, because the BEAM does not do this for you.

However, spinning up BEAM instances inside some virtual sandbox environment and having these communicate with your management-code and with each-other is definitely possible. You can hidden-remote-shell-connect an instrumentation node into a running cluster to perform monitoring that way.

That would definitely be possible. You have one process which monitors the processes that could run malicious code and keeps track of how many reductions they do. There are no problems with counting reductions: everything is done by calling functions, so it is a reasonable way to keep track of how much work is done. For example, Erlang has no loops as such, so looping is done with function calls.
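A minimal sketch of such a monitoring process, assuming polling with `Process.info/2` is acceptable. `ReductionWatchdog` and its arguments are made-up names for illustration. Because it polls, the worker is killed some time after crossing the limit, not at an exact instruction, so this alone would not give identical stopping points on every node.

```elixir
defmodule ReductionWatchdog do
  # Hypothetical helper: runs `fun` in a fresh process and kills that process
  # once its reduction count is seen to exceed `limit`.
  def run(fun, limit, poll_ms \\ 1) do
    parent = self()
    {pid, ref} = spawn_monitor(fn -> send(parent, {:result, self(), fun.()}) end)
    watch(pid, ref, limit, poll_ms)
  end

  defp watch(pid, ref, limit, poll_ms) do
    receive do
      {:result, ^pid, value} ->
        Process.demonitor(ref, [:flush])
        {:ok, value}

      {:DOWN, ^ref, :process, ^pid, reason} ->
        {:error, reason}
    after
      poll_ms ->
        case Process.info(pid, :reductions) do
          {:reductions, n} when n > limit ->
            Process.exit(pid, :kill)

            # Wait for the exit, then drop a result that raced with the kill.
            receive do
              {:DOWN, ^ref, :process, ^pid, _} -> :ok
            end

            receive do
              {:result, ^pid, _} -> :ok
            after
              0 -> :ok
            end

            {:error, :reduction_limit}

          _still_under_limit_or_already_dead ->
            watch(pid, ref, limit, poll_ms)
        end
    end
  end
end

# Usage: a cheap function finishes, an endless loop gets killed.
IO.inspect(ReductionWatchdog.run(fn -> 1 + 2 end, 1_000_000))

IO.inspect(
  ReductionWatchdog.run(fn -> Enum.each(Stream.cycle([:tick]), fn _ -> :ok end) end, 100_000)
)
```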

As @Qqwy pointed out, it is very difficult, in fact impossible, to safely restrict what code, any code, can do. Basically any code is allowed to do anything it wishes, and there is no way to make an internal sandbox which could limit what a process can access or do. The BEAM was never designed to be “safe” in this way.

The only way is to run separate BEAMs inside their own virtual sandboxes to be able to completely restrict them. A thing to note here is that if you run distributed Erlang then you are basically opening up all the nodes in the distributed system to each other. You can restrict access, but it is difficult and I wouldn’t trust it. This means that opening up your system to allow remote shells is definitely a no-no. This holds even if you are hidden, as hidden nodes can be found once they are connected.

The network talks via a gossip protocol and nodes only need to know the IPs of other nodes. The user code is passed around in UDP packets, so the code can be analysed before it is loaded into the system.

In order to prevent malicious code from interacting with the VM, the plan is to expand its AST and check whether it calls any Erlang or Elixir functions which interact with the VM. If this is the case, the nodes would simply reject the received source code.

Regarding opening up Erlang nodes to each other: since nodes speak via gossip, wouldn’t setting a different Erlang cookie for each node effectively prevent access from other nodes?

You can hide those calls behind dynamically building atoms, variables and apply. And you do not even need to use apply, as there are many other ways to inject arbitrary calls.
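To make that concrete, here is a small sketch. `NaiveCheck` is a made-up checker, not anything Purple has described; it rejects source whose AST contains a literal remote call into a blacklisted module. The second snippet builds the module name at runtime and goes through `apply/3`, so the checker sees nothing suspicious. The call used (`:os.getpid/0`) is harmless and only stands in for something worse.

```elixir
defmodule NaiveCheck do
  # Hypothetical blacklist, for illustration only.
  @blacklist [:erlang, :os, :code, System, File]

  # Returns true when no literal call into a blacklisted module is found.
  def safe?(source) do
    {_ast, found} =
      source
      |> Code.string_to_quoted!()
      |> Macro.prewalk(false, fn
        {{:., _, [mod, _fun]}, _, _args} = node, acc ->
          {node, acc or blacklisted?(mod)}

        node, acc ->
          {node, acc}
      end)

    not found
  end

  defp blacklisted?({:__aliases__, _, parts}) do
    Enum.all?(parts, &is_atom/1) and Module.concat(parts) in @blacklist
  end

  defp blacklisted?(mod) when is_atom(mod), do: mod in @blacklist
  defp blacklisted?(_other), do: false
end

direct = ":os.getpid()"

evasive = ~S"""
mod = String.to_atom("o" <> "s")
apply(mod, :getpid, [])
"""

IO.inspect(NaiveCheck.safe?(direct), label: "direct call passes check")
IO.inspect(NaiveCheck.safe?(evasive), label: "evasive call passes check")

# The evasive version still reaches :os when evaluated.
{os_pid, _bindings} = Code.eval_string(evasive)
IO.inspect(os_pid, label: "result of the hidden :os call")
```

Blacklisting `apply` and `String.to_atom` as well only moves the problem, since those are far from the only ways to build a call dynamically.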

And then you could load in a new module which is already compiled and bypasses all the checks you may put in the ast.

Yes, but then you disallow all access through distribution, for example running remote shells. So it’s an all-or-nothing deal with distribution and cookies. Then you would have to decide whether to run the nodes alive or not at all. There is no reason to have them alive if you are not going to allow distribution.

What I do in a couple of apps with user code (one in Lua, another in a few things simultaneously) is run their code in another process that must stop within a certain time bound or it is brutally killed (and a report is sent out to tell the user their code is borked). It has worked well for me so far, but it is definitely not opcode-count-specific.
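Roughly, that pattern can be sketched with the standard `Task` API as below. This is not the actual code from those apps; `run_bounded` and the timeout values are assumptions for illustration, and the "report to the user" part is left out.

```elixir
# An unlinked task supervisor, so a crash in user code does not take down
# the caller.
{:ok, sup} = Task.Supervisor.start_link()

# Hypothetical helper: run `fun` in its own process with a hard time limit.
run_bounded = fn fun, timeout_ms ->
  task = Task.Supervisor.async_nolink(sup, fun)

  # Wait up to timeout_ms; if there is no reply, kill the task without mercy.
  case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
    {:ok, value} -> {:ok, value}
    {:exit, reason} -> {:error, {:crashed, reason}}
    nil -> {:error, :timeout}
  end
end

IO.inspect(run_bounded.(fn -> 42 end, 1_000))
IO.inspect(run_bounded.(fn -> Process.sleep(:infinity) end, 50))
```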

Yeah this definitely becomes harder. In that case have you thought about just running a custom language that handles its own opcode counting? That is what I did with my safe_script library (complete enough for my use, but I doubt it’s complete enough for anyone else’s as obvious features are still missing). I just count the instructions and ‘suspend’ after a certain amount of instructions (returning a continuation object). It’s not ‘too’ hard to write actually, especially if it is a pure whitelisted interaction (no direct beam calls)…
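To give an idea of the shape of such an interpreter, here is a toy sketch. It is not the safe_script API; `TinyVM`, its three instructions, and the budget numbers are all invented for illustration. Every instruction costs one unit of budget, and when the budget runs out the VM hands back its state as a continuation that can be resumed or discarded.

```elixir
defmodule TinyVM do
  # Hypothetical stack machine with instructions {:push, n}, :add, {:jump, pc}.
  def run(program, budget), do: step(%{program: program, pc: 0, stack: []}, budget)

  # Continue a suspended run with a fresh budget.
  def resume(state, budget), do: step(state, budget)

  defp step(%{program: program, pc: pc, stack: stack} = state, budget) do
    case {Enum.at(program, pc), stack} do
      {nil, _} ->
        {:done, stack}

      {_op, _} when budget == 0 ->
        # Out of budget: return the whole state as the continuation.
        {:suspended, state}

      {{:push, n}, _} ->
        step(%{state | pc: pc + 1, stack: [n | stack]}, budget - 1)

      {:add, [a, b | rest]} ->
        step(%{state | pc: pc + 1, stack: [a + b | rest]}, budget - 1)

      {{:jump, target}, _} when is_integer(target) and target >= 0 ->
        step(%{state | pc: target}, budget - 1)

      {op, _} ->
        {:error, {:bad_instruction, op}}
    end
  end
end

# A terminating program, and an endless loop that gets suspended.
IO.inspect(TinyVM.run([{:push, 2}, {:push, 3}, :add], 10))
{:suspended, looping} = TinyVM.run([{:jump, 0}], 1_000)
IO.inspect(looping.pc, label: "suspended at pc")
```

Since the count depends only on the program and the budget, every node stops at the same instruction regardless of hardware, which is the property a timer cannot give.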

Yeah if you could ‘inject’ calls periodically into the running user code that do this and test, that would work well.

Precisely yes.

As long as it is fully whitelisted-only code, it is far easier to handle (though watch your interface properly!).

I would not run straight-beam code on the beam to do this though, I’d still use a higher level language on top…

And this is why.

The BEAM is a VM, not a sandbox of any real form.

And this is only one of many injection points.

However, please answer this:

What precisely are you trying to accomplish, and why? Not ‘how’ are you trying to accomplish it, but what end goal are you trying to accomplish? There might be a better way…

We are trying to implement smart contracts on a decentralized ledger. Normally, a virtual machine would need to be written in order to achieve this.

However, this is no easy task, and since Erlang already provides tools for loading code at runtime, we are exploring the possibility of implementing smart contracts without the need to write an entire VM from scratch.

I’m not sure I would do this then. At the very least it would be very easy to run out of atoms if you are starting to talk about loading that code at runtime. There are quite a variety of VMs out there, but one customized for the (seemingly restrictive) language would be best for this task, I’d bet.
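A quick way to see the atom problem, as a sketch: atoms are not garbage collected, so every new module, function, or atom name that arrives with loaded code stays in the table for the life of the VM. The names generated below are made up and only simulate what freshly loaded contracts would bring in.

```elixir
count_before = :erlang.system_info(:atom_count)
limit = :erlang.system_info(:atom_limit)

# Each previously unseen name permanently occupies one slot in the atom table.
for i <- 1..1_000 do
  String.to_atom("contract_fn_#{i}_#{System.unique_integer([:positive])}")
end

added = :erlang.system_info(:atom_count) - count_before
IO.puts("atoms added: #{added}, table limit: #{limit}")
```

Once the count reaches the limit the whole VM goes down, so anyone who can submit code can take the node with them.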