Most performant way to search a file for a string?

I’m still working on password validation, and part of that is making sure a user doesn’t try to use a common password. I’ve downloaded a list of common passwords.

I’m trying to compare the user’s desired password to the text file containing 1 million common passwords (one password per line).

The code below works great, and is relatively fast. I’m wondering if I can get it to perform any better though? Preferably without loading the entire file into memory at once, but if that will give a large performance boost, I could be convinced…

  #
  # Password rarity is checked against the largest list of common passwords
  # found here: https://github.com/danielmiessler/SecLists/tree/master/Passwords/Common-Credentials
  #
  defp validate_password_rarity(changeset, field) do
    proposed_password = get_field(changeset, field)
    proposed_password = "#{proposed_password}\n" # Append line break here to avoid needing to remove 1M line breaks from the stream.

    common_password_stream =
      File.stream!(Path.join(:code.priv_dir(:my_app), "static/10-million-password-list-top-1,000,000.txt"))
      |> Stream.filter(fn common_password -> proposed_password == common_password end)
      |> Enum.to_list()

    if length(common_password_stream) == 0 do
      changeset
    else
      add_error(changeset, field, "The given password is too common. Please choose a less common password.")
    end
  end
1 Like

Hi, I had a similar problem, and after some benchmarking I found that the fastest way for me was to store sorted data in ETS and use binary search. Here is an example: https://gist.github.com/alukasz/7dcf1dc097e321565c372dc0de5ea1fe

Name                          ips        average  deviation         median         99th %
ETS + binary search       24.14 K      0.0414 ms    ±15.77%      0.0400 ms      0.0690 ms
stream + filter         0.00162 K      616.14 ms     ±5.05%      622.06 ms      641.31 ms

Comparison: 
ETS + binary search       24.14 K
stream + filter         0.00162 K - 14876.58x slower
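
The idea, roughly (a simplified sketch rather than the exact gist code; the table name, key layout, and load function are just for illustration): each trimmed password is stored under its index in sorted order, and the lookup binary-searches over those indices.

defmodule CommonPasswords do
  @table :common_passwords_sorted

  # Load once: trim each line, sort, and store every password under its index.
  def load(path) do
    :ets.new(@table, [:set, :named_table, read_concurrency: true])

    path
    |> File.stream!()
    |> Enum.map(&String.trim/1)
    |> Enum.sort()
    |> Enum.with_index()
    |> Enum.each(fn {password, index} -> :ets.insert(@table, {index, password}) end)

    :ok
  end

  # Binary search over the index range 0..size-1.
  def common?(password) do
    search(password, 0, :ets.info(@table, :size) - 1)
  end

  defp search(_password, low, high) when low > high, do: false

  defp search(password, low, high) do
    mid = div(low + high, 2)
    [{^mid, candidate}] = :ets.lookup(@table, mid)

    cond do
      candidate == password -> true
      candidate < password -> search(password, mid + 1, high)
      true -> search(password, low, mid - 1)
    end
  end
end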
6 Likes

Why don’t you want all the password entries in memory? 1M entries is probably only a few MB, and most machines have gigabytes of memory.

If all the entries are not in memory then getting the file content into your program will be an I/O bottleneck.

If you do this on a modern machine the OS will cache the file anyway, so it will be in memory - but it would be better to build your own hash table than to repeatedly thrash the I/O system and garbage collector.

Put the password entries in a hash table (say ets) and do a hash lookup; no search is involved - this is O(1) and independent of the number of entries in the table.

And yes, the hash table takes space (a few MB), but not using a hash table also takes space (in memory buffers in the OS) and puts repeated pressure on the I/O and GC.

The best way is to read the entire file atomically, split it into lines and inject them into a resident hash table.
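
Something like this, sketched in Elixir (untested; the table name and file path just mirror the original post):

table = :ets.new(:common_passwords, [:set, :named_table, read_concurrency: true])

Path.join(:code.priv_dir(:my_app), "static/10-million-password-list-top-1,000,000.txt")
|> File.read!()
|> String.split("\n", trim: true)
|> Enum.each(fn password -> :ets.insert(table, {password}) end)

# Checking a password is then a single constant-time lookup, no file I/O:
:ets.member(:common_passwords, proposed_password)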

I guess I’ve assumed that you want to check passwords often and read the big table infrequently. If you only need to check one password every few hours then it’s irrelevant how you do it, since anything you write will be sufficiently fast.

If you want to check thousands of passwords per second then you must build an in-memory hash table.

It all depends upon what you want to do

23 Likes

I guess the most elegant and run-time performant way is to read the file at compile time and generate a function for every record:

defmodule CommonPass do
  @external_resource passwords_path =
                       Path.join(
                         :code.priv_dir(:common_pass),
                         "10-million-password-list-top-100000.txt"
                       )

  for line <- File.stream!(passwords_path) do
    pass = String.trim_trailing(line)

    def common?(unquote(pass)), do: true
  end

  def common?(_), do: false
end

The problem is, a list with 100_000 passwords takes some time to compile and a list with 1_000_000 records takes about forever to compile, so I don’t have benchmarks.

2 Likes

Joe, the use-case is for initial user setup (setting a password for the first time) and for when users want to change their own password (or possibly when an administrator wants to change a user’s password). This is amongst a pool of maybe 10,000 - 50,000 users who will be performing the task manually as needed. So the passwords will likely get checked in groups and clusters as a user tries to set a new password, possibly dozens of times a day, but absolutely nothing as frequent as the thousands of passwords per second.

The list of common passwords is 8.5MB uncompressed text.

I like your hash table idea for my specific needs, and am working on an implementation. I realize now that it was silly to try to avoid loading the entire list into memory all at once.
If you (or anyone else) have any insight for the implementation, I could use the help. I am very new to Elixir. Regardless, I will post my solution and then maybe I could get some feedback if it is way off the mark.

Cheers!

Put them in SQLite.

2 Likes

Have you considered using FST? I’d probably just do that with https://github.com/BurntSushi/fst via rustler. I expect this approach to be more memory efficient (the FST for the 1m passwords you linked is 6MB) and more performant than both ets and module heap (I get ~1.3m ops/second for lookups).

Operating System: macOS"
CPU Information: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
Number of Available Cores: 4
Available memory: 8 GB
Elixir 1.7.3
Erlang 21.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 μs
parallel: 1
inputs: none specified
Estimated total run time: 7 s


Benchmarking FST.contains/2...

Name                     ips        average  deviation         median         99th %
FST.contains/2        1.32 M        0.76 μs    ±83.11%           1 μs           1 μs

Or using @alukasz’s benchmark:

Operating System: macOS"
CPU Information: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
Number of Available Cores: 4
Available memory: 8 GB
Elixir 1.7.3
Erlang 21.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 μs
parallel: 1
inputs: none specified
Estimated total run time: 21 s


Benchmarking ETS + binary search...
Benchmarking FST.contains/2...
Benchmarking stream + filter...

Name                          ips        average  deviation         median         99th %
FST.contains/2           157.73 K        6.34 μs    ±93.33%           6 μs           9 μs
ETS + binary search       20.15 K       49.62 μs    ±20.30%          48 μs          84 μs
stream + filter         0.00022 K     4584302 μs     ±7.67%     4584302 μs     4832922 μs

Comparison:
FST.contains/2           157.73 K
ETS + binary search       20.15 K - 7.83x slower
stream + filter         0.00022 K - 723087.63x slower

But that’s a naive implementation where the FST is behind a mutex. It’s probably possible to speed it up if the mutex were lifted …

4 Likes

I had not considered an FST solution. I’m not sure how I would even begin to implement something like that in Elixir (even after reading your links), but those benchmarks have me interested.

In the meantime, I’ve been working on the hash table solution mentioned by Joe. I came up with the code below, which works the first time but then fails on a second run (it looks like it tries to create the ETS table when it already exists). If I wait some time for the process to time out and the ETS table is deleted, the code works again. I assume I can do some sort of check to see if the table exists before creating a new one with whereis/1, but I can’t quite figure that out yet.

  defp validate_password_rarity(changeset, field) do
    proposed_password = get_field(changeset, field)
    proposed_password = "#{proposed_password}\n" # Append line break here to avoid needing to remove 1M line breaks from the source.
    
    common_password_cache = :ets.new(:common_passwords, [:set, :named_table, read_concurrency: true])

    Path.join(:code.priv_dir(:my_app), "static/10-million-password-list-top-1,000,000.txt")
    |> File.stream!
    |> Enum.each(fn common_password -> :ets.insert(common_password_cache, {common_password}) end)

    password_is_common = :ets.lookup(:common_passwords, proposed_password)

    if length(password_is_common) == 0 do
      changeset
    else
      add_error(changeset, field, "The given password is too common. Please choose a less common password.")
    end
  end

When run once, the code above takes ~1.6s to complete. I would think the subsequent runs would go much faster if I could figure out how to get it to run a second time.
For comparison, the code I posted in the original post takes ~1.1s to run on my machine.
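
For reference, here is roughly the kind of check I was imagining (untested, and assuming :ets.whereis/1 from OTP 21+ is the right call):

# Only create and populate the table if it doesn't already exist.
if :ets.whereis(:common_passwords) == :undefined do
  :ets.new(:common_passwords, [:set, :named_table, read_concurrency: true])

  Path.join(:code.priv_dir(:my_app), "static/10-million-password-list-top-1,000,000.txt")
  |> File.stream!()
  |> Enum.each(fn common_password -> :ets.insert(:common_passwords, {common_password}) end)
end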

Sorry, I am not familiar with the proper benchmarking techniques. =(

You’d probably want to create the ets table right after the app starts.

In your application.ex:

# in start/2
children = [
  BadPasswords,
  # ...
]
# ...
defmodule BadPasswords do
  use GenServer

  @table __MODULE__

  @spec contains?(String.t()) :: boolean
  def contains?(password) do
    :ets.member(@table, password)
  end

  @doc false  
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  @doc false
  def init(_opts) do
    @table = :ets.new(@table, [:named_table])
    send(self(), :populate)
    {:ok, []}
  end

  @doc false
  def handle_info(:populate, state) do
    :code.priv_dir(:rums)
    |> Path.join("static/10-million-password-list-top-1,000,000.txt")
    |> File.stream!()
    |> Enum.each(fn common_password ->
      # note that you can also do batch inserts
      :ets.insert(@table, {String.trim(common_password)})
    end)

    {:noreply, state}
  end
end
# in your changeset validation:
defp validate_password_rarity(changeset, field) do
  proposed_password = get_field(changeset, field)
  if BadPasswords.contains?(proposed_password) do
    add_error(changeset, field, "The given password is too common. Please choose a less common password.")
  else
    changeset
  end
end

But I don’t think this is what Joe suggested … You might want to hash the strings before inserting them into ets.
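
That could look something like this (a sketch only; :crypto.hash/2 with :sha256 is just one choice of hash):

# In handle_info(:populate, ...), insert a digest instead of the plain string:
:ets.insert(@table, {:crypto.hash(:sha256, String.trim(common_password))})

# And in contains?/1, hash the candidate the same way before the membership check:
:ets.member(@table, :crypto.hash(:sha256, password))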

3 Likes

This is a common problem, and your solution would make a great open source contribution when you’re finished.

But I don’t think this is what Joe suggested … You might want to hash the strings before inserting them into ets.

Although even that approach is probably good enough:

Name                               ips        average  deviation         median         99th %
BadPasswords.contains?/1      467.86 K        2.14 μs  ±1189.88%           2 μs           3 μs
FST.contains?/2               164.73 K        6.07 μs    ±59.90%           6 μs           9 μs
ETS + binary search            20.63 K       48.47 μs    ±17.74%          47 μs       77.89 μs
stream + filter              0.00022 K  4630160.50 μs     ±9.05%  4630160.50 μs     4926331 μs

Comparison:
BadPasswords.contains?/1      467.86 K
FST.contains?/2               164.73 K - 2.84x slower
ETS + binary search            20.63 K - 22.68x slower
stream + filter              0.00022 K - 2166275.09x slower

Note that the FST.contains?/2 is still unoptimized.

I’m not sure how I would even begin to implement something like that in Elixir (even after reading your links), but those benchmarks have me interested.

I’m just using BurntSushi/fst via rustler. I can post a demo project with all these benchmarks on github if you want.

1 Like

Thanks for all of the help. I’m looking into setting up the ETS table at app start-up like you suggested. I’m still working with Joe’s solution of an O(1) lookup.

I’d love to do this without bringing in additional resources like rustler if I can, and the one-function-per-password technique sounds like compile time would be a bear, so I’m still focusing on Joe’s technique.

1 Like

Great, note that I posted a possible solution with just ETS tables in my previous comments.

if I can, and the one-function-per-password technique sounds like compile time would be a bear

Yeah, it only works when you have very small sets, like <20k. And <1k ideally. And FSTs are great when you have very large sets with mostly similar elements. For everything in-between, hash maps are usually enough.

FSTs also support fuzzy matching and range queries, but since you probably don’t need those, ETS is enough.

1 Like

I see your code about setting up the ETS table when the app first loads. I thought that was all it was, but I believe you’re saying it also describes an Erlang/Elixir-only FST solution? I haven’t parsed it yet, so I just assumed it was the direct-lookup solution…

I’ll study it until I understand what it’s doing and report back.

No, it doesn’t. For FSTs I called into rust from elixir. Sorry for the confusion.

1 Like

Thanks for clearing that up.

In your latest example you note that you can also do batch inserts into the ETS table. I saw that in the docs, but just was not able to take advantage of it. I am so new to Elixir I just didn’t know how to get the contents of the file into a list for batch insertion. And would it be faster?

Right now we’re streaming the file, but we could open it all at once, and Enum.to_list and batch insert?
Sorry for being so green!

Let me first see about implementing your suggestion. That’s my first step…

Might be. But with 1m elements, it won’t make much of a difference, I think. Both the FST and the ETS table load in much less than a second on my laptop during benchmarks.

Right now we’re streaming the file, but we could open it all at once, and Enum.to_list and batch insert?
Sorry for being so green!

Sure, you can read the whole file at once, since it’s just 8MB; no Enum.to_list is needed though:

bad_passwords =
  filepath
  |> File.read!()
  |> String.split("\n", trim: true)
  |> Enum.map(fn bad_password -> {String.trim(bad_password)} end)

:ets.insert(@table, bad_passwords)

But I’d still try and use a stream, maybe with some batching:

filepath
|> File.stream!()
|> Stream.map(fn bad_password -> {String.trim(bad_password)} end)
|> Stream.chunk_every(1000)
|> Enum.each(fn bad_passwords ->
  :ets.insert(@table, bad_passwords)
end)

But again, I don’t expect it to speed up the loading process considerably, since it’s already quite fast.

2 Likes

Thanks for all of your help! I followed that just fine.

I’ve implemented your pre-load-the-ETS-table solution, and the app’s load time didn’t increase much that I could tell, so that’s good. The password checking now takes ~0.7s on my machine, which is quite nice and faster than the other two solutions I tried (~1.6s and ~1.1s respectively).

Keep in mind my time benchmarks come straight from a Phoenix app, so they include other unrelated tasks, and everything related to sending a new HTML page, etc.

Edit: I realized I should see how long the page takes to load without running this code. Apparently it takes the same ~0.7s with or without the validate_password_rarity check, which makes sense. :+1:

Just for fun I did this in Erlang - the code and arguments are pretty much the same for Elixir

First I read the password file, split it into lines, insert them into an ets table, and dump the ets table to a file. Note: these file-at-a-time operations are by far the most efficient way to get data into and out of the system.

build() ->
    {ok, B} = file:read_file("10-million-password-list-top-1000000.txt"),
    Parts = binary:split(B,<<"\n">>,[global]),
    Tab = ets:new(passwords,[]),
    [ets:insert(Tab,{I,[]}) || I<- Parts],
    ets:tab2file(Tab, "passwords.tab").

It doesn’t matter how long this takes since I only do it once.

To check a password we read the table back into memory and do a lookup

lookup(Password) ->    
    {ok, Tab} = ets:file2tab("passwords.tab"),
    case ets:lookup(Tab, Password) of
    [] -> false;
    _  -> true
    end.

I can’t think of a faster way to do this.

Now for some timings - I measured the time to do ets:file2tab; this was 0.75 seconds.
I timed the lookups by looking up every password in the list - this took 0.4 microseconds per lookup (on a 3.1GHz Intel Core i7 MacBook Pro), so you could check 2.2 million passwords/second.

Reading the table file takes 0.75 seconds, so the “read file” + single check approach could do 4800 passwords per hour, and you’d be transferring about 40 GBytes of data per hour into your program - that might be acceptable; it depends what else you are doing. It will use a fair bit of CPU reading the ets file.

Keeping your ets table in memory costs virtually nothing (in terms of GC) - you don’t need a gen server to access it since ets tables are shared anyway.

10 Likes

ETS is a great solution. Another one to consider is to use a database.

If you’re already using something like Postgres or MySQL, this problem basically evaporates. Databases are incredibly good at random lookups/reads, especially on sets that don’t change (or rarely change) and have no/few writes.

You can do:

CREATE TABLE passwords (id SERIAL PRIMARY KEY, password varchar(50))

and

CREATE INDEX passwords_password ON passwords(password)

Then, create a prepared statement for

SELECT id FROM passwords WHERE password = ?

and query it like you would anything else.
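
From Elixir, that lookup could be sketched like this (assuming an Ecto repo named MyApp.Repo and the table above; Postgrex prepares and caches the statement for you):

# Parameterised lookup against the passwords table; $1 is the Postgres placeholder.
%{rows: rows} =
  Ecto.Adapters.SQL.query!(
    MyApp.Repo,
    "SELECT 1 FROM passwords WHERE password = $1 LIMIT 1",
    [proposed_password]
  )

password_is_common? = rows != []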

(I am not taking into consideration whether you are comparing hashes or something else, the above is just raw passwords which may or may not be ok for your security profile. Customize the above to fit your specific use case.)

What are the benefits of this approach?

  • Simple application code. A single database lookup. This is probably the biggest. You don’t have to manage anything at all, really.
  • Databases are incredibly good at random lookups, this has consistent performance that will be comparable with ETS, even if it is slightly slower due to a database network hop.
  • Postgres is about as stable/predictable/reliable as it gets, I can’t conceive of a scenario where something breaks your code here.
  • The password list is persistent in your database. That is to say, you load it once, and forget about it. It’s not going anywhere, and you don’t have to reload it or worry about that.
  • Easy to add new password lists to this list if you need to. Just insert them like anything else.
  • Any app in your cluster can use this list if it can talk to your database.
  • The password list takes no main memory space in your app. This isn’t a big one in this case, as 1,000,000 items is tiny no matter how you slice it, but if you were to grow this to 10 or 100 million things, then a database will scale without breaking a sweat.

Drawbacks?

  • Depends on your performance profile. If you really, really need this to be microsecond-speed, then having the data in-memory is going to be faster. A database SELECT with an index scan is incredibly performant and predictable, but there is still a network hop to the database. Anything that needs microsecond latency will use a more boutique solution, but this should be fast enough for any web app ever made.
  • Nothing else immediately comes to mind. Postgres “just works” :smiley:

Anyway, this is the approach I would go for if someone had me code it today. ETS is great, this is another option. This for me falls into the “set it and forget it” bucket, which is my preferred bucket when it comes to code :smiley:

5 Likes