Is this a safe read/write structure for a globally accessed ETS data table?

I am running some design tests. The example here is not a real system but just some rough parts to test how things can fit together.

APPLICATION STRUCTURE

Hypothetically imagine a structure like this:

     APPLICATION
     |__SUPERVISOR
        |__SUPERVISORTEST
           |__DATACACHETEST
        |__WEBSOCKET

DataCacheTest is a GenServer that starts an ETS table which can be read/written by anyone anywhere. Reads and writes go directly to ETS instead of passing through the GenServer's call/cast system. Yes, there are race conditions.

WebSocket is an HTTP router and/or websocket upgrade by which users might access or manipulate the data in the ETS table.

PROBLEM

The problem I am trying to figure out is:

What happens if the DataCacheTest GenServer crashes? It will be restarted, but in the meantime the ETS table won’t exist, so queries on the ETS data from the HTTP system will fail. What happens in that moment, i.e. while the GenServer that owns the ETS table is restarting?
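
To make that window concrete, here is a minimal sketch that can be pasted into iex. The table name :demo_table and the throwaway owner process are made up for illustration and are not part of the test code further down. It only shows that a named ETS table is deleted when its owner exits, and that a lookup afterwards raises ArgumentError in the calling process.

```elixir
# Hypothetical demo: a throwaway process owns a named table, then dies.
parent = self()

owner =
  spawn(fn ->
    :ets.new(:demo_table, [:set, :public, :named_table])
    :ets.insert(:demo_table, {:ets_key, 1})
    send(parent, :table_ready)
    # Keep the owner (and therefore the table) alive until it is killed.
    Process.sleep(:infinity)
  end)

receive do
  :table_ready -> :ok
end

# While the owner is alive, any process can read the public table.
before_crash = :ets.lookup(:demo_table, :ets_key)

# Kill the owner and wait until it is really gone.
ref = Process.monitor(owner)
Process.exit(owner, :kill)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
end

# The table is deleted together with its owner...
table_after_crash = :ets.whereis(:demo_table)

# ...so a lookup now raises ArgumentError in the *calling* process.
lookup_after_crash =
  try do
    :ets.lookup(:demo_table, :ets_key)
  rescue
    ArgumentError -> :no_table
  end

IO.inspect({before_crash, table_after_crash, lookup_after_crash})
```

In the real system the supervisor restarts the owner almost immediately, so the window is short, but the restarted table starts out empty.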

1) CHECK IF EXISTS?

One idea is to check whether the GenServer exists, with a function like this inside DataCacheTest:

def exists do
    pid = Process.whereis(__MODULE__)
    is_pid(pid) # true if the process is registered, false otherwise
end

And then use that before any read/write operations on the module, i.e. in the websocket module:

if DataCacheTest.exists() do
    DataCacheTest.read_value()
end

But I feel like this is wrong, because you could theoretically pass the exists() check and still fail at read_value() if the process dies in between. Or it dies but the process registry isn’t updated yet.

You don’t want the error reaching the HTTP socket module. I’m not sure how crashing works there, but you certainly don’t want that module to crash, as all your users would be disconnected.
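
On the worry about disconnecting everyone: an uncaught error only takes down the process it happens in, plus anything linked to it. As far as I know, Cowboy handles each connection in its own process, so a failed lookup would normally fail that one request rather than the whole server. The sketch below imitates this with plain processes. The name :missing_table is made up, and spawn_monitor is only a stand-in for the web server's request process.

```elixir
# Hypothetical stand-in for one request handler hitting a missing table.
{pid, ref} =
  spawn_monitor(fn ->
    # Fails with :badarg because the table does not exist.
    :ets.lookup(:missing_table, :some_key)
  end)

# The parent (think: the rest of the system) only receives a message.
exit_reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, reason} -> reason
  after
    1_000 -> :timeout
  end

# The failing process is gone, the parent is still running.
parent_alive? = Process.alive?(self())

# For an uncaught error the exit reason has the shape {reason, stacktrace}.
# The raw reason is :badarg, which Elixir presents as ArgumentError in rescue.
{:badarg, _stacktrace} = exit_reason
IO.inspect(parent_alive?)
```

The crash is still logged as an error report, but a log entry for one process is not the same as the caller or the whole system going down.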

2) TRY/RESCUE ALL ETS OPERATIONS?

Knowing that any ETS operation can fail in this case, my best idea is to wrap every :ets.lookup and :ets.insert call inside DataCacheTest like so:

    result = try do
        :ets.lookup(:doesntexist, :doesntexist)
    rescue
        ArgumentError ->
            IO.puts("Caught ETS failure")
            @failure_value
    end

i.e.

    try do :ets.lookup(:doesntexist, :doesntexist) rescue ArgumentError -> IO.puts("Caught ETS failure") end
    try do :ets.insert(:doesntexist, {:doesntexist, :whatever}) rescue ArgumentError -> IO.puts("Caught ETS failure") end

These two operations do seem to catch the errors without the red error text when I test them in the console. I’m guessing any time you see red error text, something may have crashed? I’m not sure. Is that the case?
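
If the rescue approach is used, one way to avoid a magic failure value is to return tagged tuples, so callers have to acknowledge the failure case explicitly. This is only a sketch: SafeEts, safe_lookup/2 and safe_insert/2 are hypothetical names, not part of ETS or of the DataCacheTest module.

```elixir
defmodule SafeEts do
  @moduledoc "Hypothetical wrapper that turns a missing table into {:error, :no_table}."

  # Returns {:ok, rows} or {:error, :no_table} instead of raising.
  def safe_lookup(table, key) do
    {:ok, :ets.lookup(table, key)}
  rescue
    # :ets raises ArgumentError for a missing table (and for other bad arguments).
    ArgumentError -> {:error, :no_table}
  end

  # Returns :ok or {:error, :no_table} instead of raising.
  def safe_insert(table, object) do
    true = :ets.insert(table, object)
    :ok
  rescue
    ArgumentError -> {:error, :no_table}
  end
end

# Usage against a table that does not exist:
missing_read = SafeEts.safe_lookup(:doesntexist, :doesntexist)
missing_write = SafeEts.safe_insert(:doesntexist, {:doesntexist, :whatever})

# Usage against a real table:
:ets.new(:safe_ets_demo, [:set, :public, :named_table])
ok_write = SafeEts.safe_insert(:safe_ets_demo, {:ets_key, 1})
ok_read = SafeEts.safe_lookup(:safe_ets_demo, :ets_key)

IO.inspect({missing_read, missing_write, ok_write, ok_read})
```

Every caller now has to decide what {:error, :no_table} means for it, which is extra handling code in its own right.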

Is this the right idea?

Thanks for any thoughts.

TEST CODE

1) Application

Hypothetical Application that starts the system hierarchy above.

defmodule MyApplication.Application do
    use Application

    @impl true
    def start(_type, _args) do
        web_socket = {Plug.Cowboy, plug: TestWebSocket, scheme: :https, options: [port: 4100]}
        supervisor_test = {SupervisorTest, nil}
        children = [web_socket, supervisor_test]
        opts = [strategy: :one_for_one, name: MyApplication.Supervisor]
        # start/2 must return {:ok, pid}, which is what Supervisor.start_link/2 returns
        Supervisor.start_link(children, opts)
    end
end

2) ETS Supervisor

Not really needed, I suppose, but I thought it might be nice to have a separate supervisor for the ETS GenServer, so I tested creating this.

# test run as Supervisor.start_link([{SupervisorTest, nil}], strategy: :one_for_one)
defmodule SupervisorTest do
    use Supervisor

    def start_link(args) do
        Supervisor.start_link(__MODULE__, args, name: __MODULE__)
    end

    # how to initialize supervisor on start up
    def init(_args) do
        children = [
            {DataCacheTest, nil}
        ]
        Supervisor.init(children, strategy: :one_for_one)
    end

end

3) ETS Table Owner

This GenServer creates an ETS table and has some plain module functions (no call/cast) for accessing that data, which anyone anywhere can use.

defmodule DataCacheTest do 

    # Supervisor.start_link([{ DataCacheTest, nil}], strategy: :one_for_one)
    # DataCacheTest.read_value()
    # DataCacheTest.increment_value()
    # DataCacheTest.crash()

    use GenServer

    @key_name :ets_key
    @value_name :ets_value
    @table_name :data_cache_test_ets_table
    @default_value 1

    def start_link(args) do
        GenServer.start_link(__MODULE__, args, name: __MODULE__)
    end

    #===============================================
    # (i) INITIALIZATION FUNCTION (CREATE ETS TABLE)
    #===============================================
    def init(_args) do
        create_table()
        {:ok, nil}
    end
    def create_table() do
        case :ets.info(@table_name) do
            :undefined ->
                :ets.new(@table_name, [:set, :public, :named_table, {:read_concurrency, false}, {:write_concurrency, false} ])
                # RETURNS the table name (@table_name) because of :named_table
            _->
                nil
        end
    end

    #=====================================
    # ii) ETS ACCESSOR FUNCTIONS - intentionally not using call/cast and not atomic, yes there will be racing
    #=====================================
    def read_value() do
        result = :ets.lookup(@table_name, @key_name) #=== WRAP IN TRY/RESCUE???
        case result do
            []-> #doesn't exist, insert default
                :ets.insert(@table_name, {@key_name, @default_value})  #=== WRAP IN TRY/RESCUE???
                @default_value
            [{@key_name,  value}]-> #found key/value
                value
            _->
                #not sure why we are here, create table again maybe
                IO.puts("FAILED READING VALUE")
                @default_value
        end
    end
    def increment_value() do
        current_val = read_value()
        new_val = current_val + 1
        :ets.insert(@table_name, {@key_name, new_val})  #=== WRAP IN TRY/RESCUE???
        new_val
    end

    #================
    # (iii) CRASH
    #================
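    def crash() do
        # An exit signal with a reason other than :normal kills the GenServer
        # (it does not trap exits), which also deletes the ETS table it owns.
        pid = Process.whereis(__MODULE__)
        Process.exit(pid, :crash)
    end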
end
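
As an aside on the acknowledged race in increment_value/0: the read-then-insert sequence can lose updates when two processes interleave. If that ever matters, :ets.update_counter/4 performs the increment atomically inside ETS and inserts a default object first when the key is missing. A small sketch, using a made-up table name :counter_demo:

```elixir
table = :ets.new(:counter_demo, [:set, :public, :named_table])

# update_counter(table, key, increment, default_object):
# if :ets_key is missing, {:ets_key, 1} is inserted first, then incremented.
first = :ets.update_counter(table, :ets_key, 1, {:ets_key, 1})

# 50 concurrent increments; each one is atomic, so none are lost.
tasks =
  for _ <- 1..50 do
    Task.async(fn -> :ets.update_counter(:counter_demo, :ets_key, 1, {:ets_key, 1}) end)
  end

Enum.each(tasks, &Task.await/1)

[{:ets_key, final}] = :ets.lookup(:counter_demo, :ets_key)
IO.inspect({first, final})
```

This does not help with a missing table, only with lost updates while the table exists.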

4) Websocket/Server

This module runs queries on the ETS data directly, without going through call/cast.

defmodule TestWebSocket do
    use Plug.Router

    plug Plug.Logger
    plug :match
    plug :dispatch

    get "/get_value" do
        value = DataCacheTest.read_value() # WHAT HAPPENS IF ETS GENSERVER IS CRASHING?
        conn
        |> put_resp_content_type("text/plain")
        |> send_resp(200, to_string(value))
    end

    get "/increment" do
        value = DataCacheTest.increment_value()  # WHAT HAPPENS IF ETS GENSERVER IS CRASHING?
        conn
        |> put_resp_content_type("text/plain")
        |> send_resp(200, to_string(value))
    end

    match _ do
        send_resp(conn, 404, "not found")
    end
end

If your GenServer literally only creates an ETS table and acts as its owner, and the GenServer process does nothing else, there is no reason to assume that it would crash. If you start it early in your supervision tree, you can safely assume that access to the table will succeed. Anything else would be trying to prematurely handle errors that should never happen, complicating your code unnecessarily.
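
To illustrate the point about starting it early: a supervisor starts its children in list order, and each child's init/1 has to return before the next child is started. So if the table owner is listed before anything that reads the table, the table exists by the time the readers come up. A minimal sketch, with hypothetical modules TableOwner and Reader standing in for DataCacheTest and the web server:

```elixir
defmodule TableOwner do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(nil) do
    # The owner does nothing except create and hold the table.
    :ets.new(:owner_demo_table, [:set, :public, :named_table])
    :ets.insert(:owner_demo_table, {:ets_key, 1})
    {:ok, nil}
  end
end

defmodule Reader do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def seen, do: GenServer.call(__MODULE__, :seen)

  @impl true
  def init(nil) do
    # Safe: TableOwner.init/1 has already returned when this runs.
    {:ok, :ets.lookup(:owner_demo_table, :ets_key)}
  end

  @impl true
  def handle_call(:seen, _from, state), do: {:reply, state, state}
end

# Owner first, readers after it.
{:ok, _sup} = Supervisor.start_link([TableOwner, Reader], strategy: :rest_for_one)

seen_at_startup = Reader.seen()
IO.inspect(seen_at_startup)
```

The :rest_for_one strategy here is my own assumption rather than something from the original code: it restarts the children listed after the owner if the owner ever dies, so they never run against a missing table. With :one_for_one, as in the question, only the owner is restarted.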

If you wrap any access in magic failure codes, you will complicate things as you now have to handle those instead. If you don’t, you aren’t any further than not handling them in the first place. That’s one part of what is commonly referred to as “let it crash”: Don’t try to handle things you’re not expecting. Your very simple GenServer process crashing is definitely not something to expect and if it happened, something else would be very wrong in your system.
