Is it a good idea to create a new database engine as an Elixir library?

To start, I want to point out that I don’t know much about creating database engines, and I know that it’s not a “5 minute” project.
What I’m thinking of is creating a database engine in Elixir. Such a project, with good documentation, could in the future be an easy-to-learn alternative to other databases.

For example:

defmodule Example do
  alias MyLibrary.DB

  def get_database do
    DB.select_db(:db_name)
  end

  def setup do
    get_database()
    |> DB.create_table(:items)
    |> DB.add_column(:name, &is_bitstring/1)
  end

  def fill do
    get_database()
    |> DB.insert_to_table(:items, %{name: "First item"})
  end

  def queries do
    get_database()
    |> DB.all(:items)
    # ...
    get_database()
    |> DB.count(:items)
    # ...
    get_database()
    |> DB.where(:items, name: {:like, "* item"})
  end
end
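
As an aside, a per-column predicate like &is_bitstring/1 above could be enforced at insert time with something as small as this sketch (the TypeCheck module and validate_row/2 are invented names, not an existing API):

defmodule MyLibrary.DB.TypeCheck do
  # columns: a map of column name => predicate, e.g. %{name: &is_bitstring/1}
  # row: the map being inserted
  def validate_row(columns, row) do
    Enum.reduce_while(columns, :ok, fn {col, check}, :ok ->
      case Map.fetch(row, col) do
        {:ok, value} ->
          if check.(value),
            do: {:cont, :ok},
            else: {:halt, {:error, {col, value}}}

        :error ->
          # column missing from the row; leave that decision to the engine
          {:cont, :ok}
      end
    end)
  end
end

Here validate_row(%{name: &is_bitstring/1}, %{name: "First item"}) returns :ok, while a non-string name returns {:error, {:name, value}}.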

What do you think about it?

Haha, I was about to post a similar topic. I think having a DB as a library would be a very good idea; there is often a need for a simple persistent store. I think it would need to be compatible with Ecto to be really useful. I personally wouldn’t care
about performance or having a lot of features (we all know what to use for demanding cases), but having a no-pain, simple option for simple projects would be very cool.

I don’t want to create a SQL string parser in my database engine. Phoenix support would be a must-have. If someone wants to write an Ecto adapter, they can just do it. I want to implement more features than an Ecto adapter could expose, so I don’t care about it too much.

If you say that it’s a good idea, then I can start work on this project.
Can you suggest a good way to read and save data to a database file?
I just need some info/tips to make it fast and use as little memory as possible. Once I understand the core (read/write), I can write the other methods (filtering, selecting, sorting, …).

I am not a DB engine person, so I would not be a good source of advice.

I have a question related to this topic for @josevalim and/or other more experienced developers.
I found an old Stack Overflow answer about reading and saving a HashDict.
Is it a good approach to write code like this, with a struct holding all the data I need to read and save (database info + data)?
For example:

defmodule MyLib.DB do
  defmodule MyDB do
    defstruct config: %{encoding: :utf8}, tables: []
  end

  def create_db do
    data =
      %MyDB{}
      |> :erlang.term_to_binary()

    File.write!("path_to_db_file", data)
  end

  def select_db do
    File.read!("path_to_db_file")
    |> :erlang.binary_to_term()
  end

  # + query methods on the %MyDB{} struct ...
end

Won’t this be a problem with large databases (many GB)?

It will definitely not scale to many gigs, no. ^.^

Current databases have a few things they do to make lookups more efficient, including a header/footer in the storage file (or a side file) that maps indexes to data (or indexes to other indexes). They tend to use a journal file that gets synchronized to the main file on occasion. They tend to use bloom filters to look up positions faster (in the general case), B-trees for lookup speed (and many special cases), etc…

I’d probably start by outputting two files: one of indexes, the other of raw data. The raw data file you only ever append to (compact it and remove old entries in sweeps later; a ‘row’ can just be a serialized Erlang term). For the indexes file you could probably get away with serializing Erlang terms for now; just have it be indexes that point either to other indexes or to a location in the data file, and focus on optimizing this file first, then the data file to make compaction easier (maybe even add de-duplication?). If you want an immutable system like PostgreSQL and such (which also makes implementing transactions easier), then as you ‘update’ entries do not overwrite old data blobs but rather append new ones and just update the index file to point to the new location; if not immutable, then update in place if the new value is the same size or smaller, else append. Etc… :slight_smile:
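
A minimal sketch of that two-file layout in Elixir, with invented file names, length-prefixed rows, and the whole index kept as one serialized map for simplicity (a real engine would want an on-disk B-tree or similar):

defmodule SketchStore do
  @data_file "items.dat"
  @index_file "items.idx"

  # Append a row as a length-prefixed serialized term; record its offset.
  def insert(id, row) do
    blob = :erlang.term_to_binary(row)
    {:ok, io} = File.open(@data_file, [:append, :binary])
    {:ok, offset} = :file.position(io, :eof)
    IO.binwrite(io, <<byte_size(blob)::32, blob::binary>>)
    File.close(io)
    put_index(id, offset)
  end

  # Look up the offset in the index, then seek into the data file and decode.
  def fetch(id) do
    offset = load_index() |> Map.fetch!(id)
    {:ok, io} = File.open(@data_file, [:read, :binary])
    {:ok, <<size::32>>} = :file.pread(io, offset, 4)
    {:ok, blob} = :file.pread(io, offset + 4, size)
    File.close(io)
    :erlang.binary_to_term(blob)
  end

  defp put_index(id, offset) do
    index = if File.exists?(@index_file), do: load_index(), else: %{}
    File.write!(@index_file, :erlang.term_to_binary(Map.put(index, id, offset)))
  end

  defp load_index, do: @index_file |> File.read!() |> :erlang.binary_to_term()
end

An ‘update’ here is just insert/2 again with the same id: the new blob is appended and the index repointed, exactly the immutable scheme described above.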

EDIT: You can also make caches to keep often-accessed parts of the files in memory as well.


Also, https://github.com/uwiger/sext might be useful if you want to serialize the output but need the same term ordering as Erlang/Elixir itself; otherwise don’t use it, as it adds overhead.
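
The point of sext is that the encoded binaries compare byte-wise in the same order as the original terms compare, so sorted index files keep Erlang term order; roughly:

a = :sext.encode({1, :foo})
b = :sext.encode({2, :bar})
# byte-wise comparison of the encodings agrees with term comparison:
true = (a < b) == ({1, :foo} < {2, :bar})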

There’s a part of me that strongly wonders whether you want to create a new database ENGINE, or a new interface to existing engines that could potentially be used within the runtime.

Mnesia, Riak and CouchDB come to mind.

I am not sure about the original poster’s intent, but I’d argue that something much simpler would be useful. Think sites with a few hundred to low tens of thousands of records.

Hmm,
as I already pointed out, I’m new to database engines. I need something like block diagrams: short, step-by-step info on what to do.

I expected that; I just happened to find that old answer :smile:

It’s exactly that part that I don’t understand very well.
If I understand correctly what you wrote, I need to do something like:

  1. A first file (CSV?) like:
    table_name,row_id,line_number
    where line_number refers to a line number in the second file
  2. A second file (CSV?) like:
    first_column_binary_value,second_column_binary_value

That way I can easily get all rows from a table, and specific row(s) by selected id(s), am I right?

btw, if I understand correctly, database engines save data in a binary format. How do they filter results so quickly?

I’ll probably create my own database engine, because I want to implement lots of special features.

For example:

defmodule Example do
  alias MyLibrary.DB

  def get_database do
    DB.select_db(:db_name)
  end

  def query do
    get_database()
    # first filter the main results:
    |> DB.where(:users, name: {:like, "* Kowalski"})
    # fetch emails - another table in another schema:
    |> DB.fetch_references(:emails, schema: :private_data)
    # filter the emails list by length of the e-mail address:
    |> DB.where(:emails, address: [length: {:bigger_than, 5}])
    # filter users by the count of filtered emails:
    |> DB.where(:emails, _: [count: {:bigger_than, 0}])
    # at the end, select specific columns:
    |> DB.select([:first_name, :last_name, :emails])
  end
end

Of course it’s not a real example, and of course it’s not the final API; it’s just a simple example of a complex search. I don’t know whether this example (and lots, lots more) would work in an existing database, or whether such a database would support all Elixir data types.
CouchDB is accessible only through a REST API. I’m not a database engineer, but I don’t think it is the fastest database engine, and it doesn’t support all Elixir data types (it returns plain JSON, which can’t represent, for example, the difference between a list and a tuple).
Riak is only a key-value data store.
btw, I didn’t write this before: I want to implement two ways of saving data, classic database tables and (like Riak) key-value entries. For example, site settings in a Phoenix model need only one row with a lot of columns; instead of that, I would provide key-value data storage.
I also want to implement relations (fetching methods), and of course relations between key-value pairs and tables.

If you want something to ‘use’, and you’re not making it to learn, you really should use PostgreSQL.

No, I’m not going to “only use”, like with Windows :smile:
It’s time to learn some things from this topic. I think I can use some new knowledge from it. Further in the future I will think about other projects, and the algorithms discussed in this topic may be really useful.
Also, I’m not alone with this idea.
Did I understand you correctly (with the CSV files)?

If you need a “not much hassle” DB out of the box, you could just use mnesia. It’s simple, stores all your Erlang terms, and just does the job. It seems criminally underused in the Elixir community, though.
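
For reference, a minimal mnesia session from Elixir looks roughly like this (RAM-only table, made-up table and fields, no error handling):

# start mnesia and create a table with {id, name} rows
:ok = :mnesia.start()
{:atomic, :ok} = :mnesia.create_table(:items, attributes: [:id, :name])

# writes and reads run inside transactions
{:atomic, :ok} =
  :mnesia.transaction(fn ->
    :mnesia.write({:items, 1, "First item"})
  end)

{:atomic, [{:items, 1, "First item"}]} =
  :mnesia.transaction(fn ->
    :mnesia.read(:items, 1)
  end)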


Good point, will try it out.

I use it! I have 8 tables loaded when my app loads for various purposes. ^.^

/me comes from the Erlang world though…

@andre1sk, @gon782 and @OvermindDL1: Isn’t the mnesia database limited to 2 GB?
For example, the ext4 filesystem has a file size limit of 16 TiB.
And one more thing: mnesia saves just a tuple of any data, right? Is there any type checking (like &is_bitstring/1)?

The DB I’m looking for (or want to create):

  1. should be limited only by hardware
  2. of course needs to be fast
  3. should have Elixir features like fault tolerance
  4. needs to save any Elixir term
  5. needs an easy and well-documented Elixir API
  6. needs to allow grouping databases into schemas (like in PostgreSQL)
  7. should have type checking for column values
  8. should allow saving data in a database table or in key-value storage, and allow creating relations between them, for example: store only one site settings entry, but some fields are relations (has_many) to a translations table (site description)
  9. should have methods to fetch relations (in most databases we need to build SQL join queries), for example:
defmodule Example do
  def sample do
    # the :emails column of a :users row would hold a reference like:
    # %DB.Reference{column: :id, schema: :private_data, table: :emails, value: 5}
    users_query() # ...
    |> fetch_reference(:emails)
  end
end

I think that if I’m going to create a DB, I need to have a directory structure like:

# database info:
"#{path_to_base_dir}/#{database_name}/info.#{extension}"
# table info:
"#{path_to_base_dir}/#{database_name}/#{schema_name}/#{table_name}/info.#{extension}"
# Note: I know that today I will not reach the ext4 file size limit (16 TiB),
# but history shows how fast our data gets bigger and bigger.
# key-value data:
"#{path_to_base_dir}/#{database_name}/#{schema_name}/#{kv_name}/#{number}.#{extension}"
# table data:
"#{path_to_base_dir}/#{database_name}/#{schema_name}/#{table_name}/#{number}.#{extension}"
# + cache and other additional files

I was searching for how databases save their files, but I found only definitions.
For example: I know that databases use caching mechanisms. I know what a cache is, but I have never written a database cache mechanism, so I don’t know how to properly start or how to make it fast and useful.
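
The simplest in-memory cache on the BEAM is an ETS table used read-through; a tiny sketch, where read_row_from_disk/1 stands in for the real file lookup:

defmodule RowCache do
  # a named, public ETS table holds the hot rows
  def init do
    :ets.new(:row_cache, [:set, :public, :named_table])
  end

  def fetch(id) do
    case :ets.lookup(:row_cache, id) do
      [{^id, row}] ->
        # cache hit: served straight from memory
        row

      [] ->
        # cache miss: read from the file, then remember it
        row = read_row_from_disk(id)
        :ets.insert(:row_cache, {id, row})
        row
    end
  end

  defp read_row_from_disk(_id), do: raise("hypothetical disk lookup goes here")
end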
I will explore some GitHub projects with simple database implementations to see example algorithms used to save, read, and filter data.
Do you know any simple projects (to learn from, preferably in Ruby)?

The only thing I can say with respect to the original topic/question: no, it’s not a good idea. What you make will not be better than what’s out there, and a combination of mnesia/Postgres should cover everything from “super simple” to “reasonably interesting without being crazy”. I would stress that mnesia is at its simplest and best when handling fairly ephemeral data, such as sessions, that you need to look up fairly fast but that doesn’t actually take up much space. You’ll be able to map them easily to user IDs and the like, so lookup should be fast. This also fits in nicely with using ram_copies and disc_copies in your DB specification, so that you can both persist things and serve them from RAM without going to external sources.
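
Concretely, the storage type is chosen per table (and per node) at creation time; a sketch with made-up fields, assuming a disc schema was created first with :mnesia.create_schema/1:

# kept in RAM *and* persisted to disk on this node:
:mnesia.create_table(:sessions,
  attributes: [:id, :user_id, :data],
  disc_copies: [node()]
)

# RAM only, lost on restart; fine for truly ephemeral data:
:mnesia.create_table(:presence,
  attributes: [:user_id, :socket],
  ram_copies: [node()]
)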

As a learning project, I’m all for creating things, of course, so if that’s your objective I’d suggest you just get to work on it. As something meant to fill a void in the community, I don’t see any real voids for you to fill.

mnesia has a wrapper in Amnesia, which shouldn’t be very hard to use.

The disc copy size limit of 4 GB (more information here; it’s not 100% correct to simply say 4 GB is the limit) comes from DETS and applies to mnesia. In terms of how you store something, it does indeed store tuples, which you can simply create/manage by way of records.
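
In Elixir those record-shaped tuples can be defined with the Record module; a small sketch:

defmodule MyRecords do
  require Record

  # user(id: 1, name: "Ann") expands to the tuple {:user, 1, "Ann"},
  # which is exactly the shape mnesia stores for a :user table
  Record.defrecord(:user, id: nil, name: nil)
end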

I wouldn’t use mnesia for everything, but for simple data storage that I won’t have to query in super complicated ways, I’d reach for it fairly fast. Everything else means I’ll probably just reach for Postgres, and in the unlikely event that that choice is not the right one, I certainly wouldn’t implement anything new but instead go elsewhere.

In general, I would urge you to try more alternatives and get comfortable with them before making something yourself, or to build that something out of these tools instead of creating the underlying mechanisms. It’s more reasonable to build comfortable wrappers on top of something else, so that you have a good base to build on.


1st section: thx for your opinion

2nd section: that’s most of what I want (learning); I don’t know what it could become in the future, but I’m not starting a new project with zero knowledge and calling it the best :smile:
I just asked whether somebody is interested in, and has some skills in, this topic.
Of course I will start by reading some books and projects about creating database engines.

3rd section: yes, I already saw a GitHub project.

4th and 5th sections: yes, 4 GB, my mistake. Now I correctly understand what mnesia targets, thx.

6th: yes, I already saw some database comparisons and agree that creating a wrapper is good at the start, but I want to learn mostly about reading, writing, and binary search in huge files, rather than writing SQL code. I just feel that it will be really helpful (or maybe important) further in the future. :smile:

For now it’s not a foreground project for me. I want to write some other projects first, but after that I want to have much bigger knowledge of database engine algorithms and of working with binary files.

It seems Mnesia is overkill for simple situations, so I’d say there is still room for something much simpler.
