General Strategies for Creating/Populating a SQLite Database as a Static/Read-Only Asset

Hi everyone,

I’ve got a little hobby project (https://plaintexttides.com) that I’m working on, and I’m playing with SQLite a bit as part of it. I’m really only using a database to save myself some computation and API calls and to speed things up, so I’m looking to use a SQLite database to essentially do that “work” ahead of time by bulk downloading the predictions at dev time rather than at runtime…

Essentially I’m bulk downloading and storing the tide predictions from NOAA for all the tide stations in the US for this calendar year: 4 tides a day, 365 days, and about 3,300 stations. Not a huge amount of data, but instead of reaching out to NOAA each time a user requests the info, I have a little (~100 MB) SQLite database to query using Ecto.

I currently have it live in “priv/” and all seems to work nicely, but I don’t have a great strategy for how to “build” that database from the ground up. Using migrations, which is what I have now, doesn’t seem quite like the “right” approach since I’ll never be running them at runtime, but maybe it is.

Currently this is the only Repo in my app, but there’s a possibility that I could add a PostgreSQL (or another SQLite) repo in the future, so I want to keep my options open, and since it’s a hobby project I don’t have to worry as much about YAGNI.

I was wondering if any of you had suggestions or approaches that you’ve had success with for building SQLite databases as something like static assets, maybe as a Mix task or otherwise?
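
For concreteness, the kind of thing I’m imagining is a dedicated Mix task that builds the file at dev time, very roughly like this (just a sketch; the tide.build_db task name and the Tide.Seeder module are made up):

defmodule Mix.Tasks.Tide.BuildDb do
  use Mix.Task

  @shortdoc "Builds priv/tide.db from the cached NOAA dumps"

  @impl Mix.Task
  def run(_args) do
    # create the schema first, then start the app so the Repo is available
    Mix.Task.run("ecto.create")
    Mix.Task.run("ecto.migrate")
    Mix.Task.run("app.start")

    # Tide.Seeder would hold the logic that currently lives in seeds.exs
    Tide.Seeder.run()
  end
end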

Thanks for any suggestions and thoughts.

  1. Etso may be useful?

  2. Bundling a 100MB file in priv over Hex would be quite heavy

  3. Would suggest checking how tzdata handles the distribution of its database (roughly along the lines of the sketch below). Also check the approach used by evision and exla.
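
For example, a tzdata-style “download on first boot” flow could look roughly like this (purely a sketch: the URL, the target path handling, and the Tide.DbFetcher module are made up):

defmodule Tide.DbFetcher do
  # Fetches the prebuilt SQLite file once, instead of shipping ~100 MB in priv/.
  @db_url "https://example.com/releases/tide-2023.db"

  def ensure_db!(path) do
    if !File.exists?(path) do
      {:ok, _} = Application.ensure_all_started(:inets)
      {:ok, _} = Application.ensure_all_started(:ssl)

      {:ok, {{_, 200, _}, _headers, body}} =
        :httpc.request(:get, {String.to_charlist(@db_url), []}, [], body_format: :binary)

      File.mkdir_p!(Path.dirname(path))
      File.write!(path, body)
    end

    path
  end
end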


The best way is to have the SQLite/CSV/JSON/XML files in the priv directory and fetch all of the data at compile time. Using said data you could generate Elixir code, so you can fetch specific data using pattern matching.

I wonder if storing that data in a directory of CSV files (one per table) wouldn’t be better for tasks like this. It would not only be easier to correct the data manually, but also to regenerate it using Elixir code.

# config
  live_reload: [
    patterns: [
      ...,
      ~r"priv/data/.*(csv|json|sqlite3|xml)$"
    ]
  ]

# code:
defmodule Example do
  alias NimbleCSV.RFC4180, as: CSV

  path_to_posts = Path.join([:code.priv_dir(:my_app), "data", "posts.csv"])

  # recompile this module whenever the CSV file changes
  @external_resource path_to_posts

  # parse_stream/1 skips the header row and yields each row as a list of columns,
  # so turn the rows into maps first (columns assumed here: id, slug, title)
  fetched_posts =
    path_to_posts
    |> File.stream!()
    |> CSV.parse_stream()
    |> Enum.map(fn [id, slug, title] -> %{id: id, slug: slug, title: title} end)

  for post <- fetched_posts do
    def get_post(unquote(post.id)), do: unquote(Macro.escape(post))
    # or match on the slug instead:
    def get_post(unquote(post.slug)), do: unquote(Macro.escape(post))
  end

  # …
end

I would recommend reading this article: Welcome to our blog: how it was made! - Dashbit Blog

These are great suggestions.

I do like the relational nature and overall niceness of Ecto, so I hadn’t thought about taking the code-generation approach for this project, though I have used it successfully in the past.

To give a little more context on what my files and models look like, I’ll drop them here. Like I said, it’s just a hobby project to give myself a playground to stay relatively up to speed on Phoenix, and to scratch my own itch of being very annoyed that all the tide prediction sites are covered with ads and user-hostile patterns.

defmodule Tide.Station do
  use Ecto.Schema
  import Ecto.Changeset
  import Ecto.Query

  @primary_key {:id, :string, autogenerate: false}

  schema "stations" do
    field :name, :string
    field :latitude, :float
    field :longitude, :float
    field :time_zone_correction, :integer
    field :time_zone_name, :string
    field :distance, :float, virtual: true
    has_many :predictions, Tide.Prediction
  end
end

defmodule Tide.Prediction do
  use Ecto.Schema
  import Ecto.Changeset
  import Ecto.Query

  @primary_key {:id, :string, autogenerate: false}

  schema "predictions" do
    belongs_to :station, Tide.Station
    field :timestamp, :utc_datetime
    field :volume, :float
    field :type, :string
  end
end

The data files, which currently live in priv/predictions, are just dumps from the nice folks at https://api.tidesandcurrents.noaa.gov/api/prod/datagetter and look like this:

[
{"t":"2023-01-01T03:46:00Z","type":"L","v":"0.078"},
{"t":"2023-01-01T09:56:00Z","type":"H","v":"2.820"},
{"t":"2023-01-01T16:36:00Z","type":"L","v":"0.301"},
{"t":"2023-01-01T22:09:00Z","type":"H","v":"2.347"}
...
]

And to wrap things up, my current seed script is just a seeds.exs that does this:

# Download and cache the raw NOAA predictions for any station we don't have yet
for station <- Tide.Repo.all(Tide.Station) do
  if !File.exists?("priv/predictions/#{station.id}.json") do
    {:ok, predictions} = Tide.tide_predictions(station.id, ~D[2023-01-01])
    json_data = Jason.encode_to_iodata!(predictions)
    File.write!("priv/predictions/#{station.id}.json", json_data)
  end
end

# Decode each cached JSON file and insert its predictions in a single transaction
for file_name <- File.ls!("priv/predictions/") do
  [station_id, _extension] = String.split(file_name, ".")

  changesets =
    File.read!("priv/predictions/#{file_name}")
    |> Jason.decode!()
    |> Enum.map(fn x ->
      Tide.Prediction.changeset(%Tide.Prediction{}, %{
        "type" => x["type"],
        "volume" => x["v"],
        "timestamp" => x["t"],
        "station_id" => station_id
      })
    end)

  Tide.Repo.transaction(fn ->
    # insert!/1 so a bad row rolls back the whole file rather than being silently skipped
    Enum.each(changesets, &Tide.Repo.insert!/1)
  end)
end

It hasn’t been cleaned up yet, so it’s a little messy and there’s probably some duplicative logic in there.

But anyway, at the end of it I end up with a priv/tide.db SQLite database that’s closer to 200 MB. I’m not mad about that and don’t mind the size or anything, though obviously there are lots of places for optimization. What I’d love to understand are some general patterns and approaches for generating static resources like this.
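
One of the optimizations on my list is batching the inserts with Repo.insert_all instead of one changeset per row. Inside the same per-file loop as in seeds.exs above, that could look roughly like this (just a sketch, reusing the field names from the schemas):

rows =
  File.read!("priv/predictions/#{file_name}")
  |> Jason.decode!()
  |> Enum.map(fn x ->
    {:ok, timestamp, _offset} = DateTime.from_iso8601(x["t"])
    {volume, _rest} = Float.parse(x["v"])

    %{station_id: station_id, type: x["type"], volume: volume, timestamp: timestamp}
  end)

# insert_all skips changeset casting, so values must already be the right types;
# chunking keeps each statement under SQLite's bind-parameter limit
rows
|> Enum.chunk_every(200)
|> Enum.each(&Tide.Repo.insert_all(Tide.Prediction, &1))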

So why not make your code work as nicely as Ecto? You can collect even more data at compile time or write your own preload functions. Just take a look at the code below and see for yourself how simple it is:

defmodule Example do
  # some post data
  posts = [
    %{"contents" => "Contents #1", "id" => 1, "title" => "Title #1"},
    %{"contents" => "Contents #2", "id" => 2, "title" => "Title #2"},
    %{"contents" => "Contents #3", "id" => 3, "title" => "Title #3"},
    %{"contents" => "Contents #4", "id" => 4, "title" => "Title #4"},
    %{"contents" => "Contents #5", "id" => 5, "title" => "Title #5"}
  ]

  # some comment data
  comments = [
    %{"id" => 1, "post_id" => 1, "contents" => "Contents #1 - #1"},
    %{"id" => 2, "post_id" => 1, "contents" => "Contents #1 - #2"},
    %{"id" => 3, "post_id" => 2, "contents" => "Contents #2 - #1"},
    %{"id" => 4, "post_id" => 2, "contents" => "Contents #2 - #2"},
    %{"id" => 5, "post_id" => 3, "contents" => "Contents #3 - #1"},
    %{"id" => 6, "post_id" => 3, "contents" => "Contents #3 - #2"},
    %{"id" => 7, "post_id" => 4, "contents" => "Contents #4 - #1"},
    %{"id" => 8, "post_id" => 4, "contents" => "Contents #4 - #2"},
    %{"id" => 9, "post_id" => 5, "contents" => "Contents #5 - #1"},
    %{"id" => 10, "post_id" => 5, "contents" => "Contents #5 - #2"}
  ]

  grouped_comments =
    for %{"id" => id, "post_id" => post_id} = comment <- comments, reduce: %{} do
      acc ->
        # define a function to get a comment by id
        def get_comment(unquote(id)), do: unquote(Macro.escape(comment))
        # and add the comment to the accumulator
        Map.update(acc, post_id, [comment], &[comment | &1])
    end

  # in case we want to do more of the work at runtime,
  # instead of fetching all comments eagerly,
  # we simply define the relation,
  # just like an intermediate table in Ecto
  for {post_id, comments} <- grouped_comments do
    comment_ids = Enum.map(comments, & &1["id"])
    def get_comment_ids(unquote(post_id)), do: unquote(comment_ids)
  end

  for %{"id" => id} = post <- posts do
    comments = Enum.sort_by(grouped_comments[id], & &1["id"])
    post_with_comments = Map.put(post, "comments", comments)
    # define a function to get post only
    def get_post(unquote(id)), do: unquote(Macro.escape(post))
    # or define a function to get post with comments
    def get_post_with_comments(unquote(id)), do: unquote(Macro.escape(post_with_comments))
  end

  # preloading a belongs_to is dead simple,
  # since the comment map already holds the post_id
  def preload(comment, :post) do
    comment |> Map.fetch!("post_id") |> get_post() |> then(&Map.put(comment, "post", &1))
  end

  # on the other hand, preloading a has_many requires some association data,
  # like the comment ids we defined above
  def preload(post, :comments) do
    post
    |> Map.fetch!("id")
    |> get_comment_ids()
    |> Enum.map(&get_comment/1)
    |> then(&Map.put(post, "comments", &1))
  end
end

It’s precisely the goal of Etso… or ETSo

Fill the ETS table at startup, then query it with Ecto.
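
A minimal sketch of what that could look like, assuming Etso’s Ecto adapter (the Tide.MemRepo and Tide.MemLoader names are made up):

defmodule Tide.MemRepo do
  # the repo also needs to be configured and added to the supervision tree
  use Ecto.Repo, otp_app: :tide, adapter: Etso.Adapter
end

defmodule Tide.MemLoader do
  # called once from the application supervisor after Tide.MemRepo starts
  def load do
    # copy the stations out of the on-disk SQLite db into ETS
    for station <- Tide.Repo.all(Tide.Station) do
      station
      |> Map.take([:id, :name, :latitude, :longitude, :time_zone_correction, :time_zone_name])
      |> then(&struct!(Tide.Station, &1))
      |> Tide.MemRepo.insert!()
    end

    :ok
  end
end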

Not always … fetching a few libraries just for simple for loops like the ones above doesn’t make much sense.

I’d handle this with two parts:

  • Use an in-memory write-through cache (such as cachex or nebulex) to serve requests from memory and also write the results of the request to your SQLite db (roughly as sketched after this list)
  • A GenServer that is part of your supervision tree that populates the sqlite db with all the results for this calendar year

Personally I like that better than a pure memory-only solution because when you restart the server you won’t need to make 3000 requests to NOAA’s API.
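
A rough sketch of that write-through lookup with Cachex (the :tide_cache name, the Tide.Predictions module, and persist_predictions/2 are assumptions, not code from the app):

defmodule Tide.Predictions do
  import Ecto.Query

  # serve from Cachex first, fall back to SQLite, fall back to NOAA and write through
  def for_station(station_id, date) do
    Cachex.fetch(:tide_cache, {station_id, date}, fn _key ->
      case Tide.Repo.all(from p in Tide.Prediction, where: p.station_id == ^station_id) do
        [] ->
          {:ok, predictions} = Tide.tide_predictions(station_id, date)
          persist_predictions(station_id, predictions)
          {:commit, predictions}

        predictions ->
          {:commit, predictions}
      end
    end)
  end

  defp persist_predictions(_station_id, _predictions) do
    # write the freshly fetched predictions through to the SQLite db,
    # e.g. with the same changeset/insert logic as in seeds.exs
    :ok
  end
end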


I would do what @axelson proposed, but I’d also think in a slightly different direction. If you can afford to use cached external resources, and if you expose an HTTP interface, why not put an HTTP cache like Varnish in front? You can make use of Cache-Control and ETag headers and lift the responsibility off your application.
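
For example, from a Phoenix controller the headers could be set roughly like this (a sketch; TideWeb.StationController and render_tides/1 are made-up names, and the max-age is arbitrary):

defmodule TideWeb.StationController do
  use TideWeb, :controller

  def show(conn, %{"station_id" => station_id}) do
    body = render_tides(station_id)

    conn
    |> put_resp_header("cache-control", "public, max-age=86400")
    # a cheap ETag derived from the body so the cache can revalidate
    |> put_resp_header("etag", ~s("#{:erlang.phash2(body)}"))
    |> text(body)
  end

  # placeholder for however the plain-text tide table gets rendered
  defp render_tides(_station_id), do: "..."
end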
