Most performant way to search a file for a string?

Great, note that I posted a possible solution with just ETS tables in my previous comments.

if I can, and the one-function-per-password technique sounds like compile time would be a bear

Yeah, it only works when you have very small sets, like <20k. And <1k ideally. And FSTs are great when you have very large sets with mostly similar elements. For everything in-between, hash maps are usually enough.

FSTs also support fuzzy matching and range queries, but since you probably don’t need those, ETS is enough.

1 Like

I see your code about setting up the ETS when the app first loads. I thought that was all it was, but I believe you’re saying it also describes an Erlang/Elixir-only FST solution? I haven’t parsed it yet, so I just assumed it was the direct-lookup solution…

I’ll study it until I understand what it’s doing and report back.

No, it doesn’t. For FSTs I called into rust from elixir. Sorry for the confusion.

1 Like

Thanks for clearing that up.

In your latest example you note about being able to do batch inserts into the ETS table. I saw that in the docs, but just was not able to take advantage of it. I am so new to Elixir I just didn’t know how to get the contents of the file into a list for batch insertion. And would it be faster?

Right now we’re streaming the file, but we could open it all at once, and Enum.to_list and batch insert?
Sorry for being so green!

Let me first see about implementing your suggestion. That’s my first step…

Might be. But with 1m elements, it won’t make much of a difference, I think. Both FSTs and ETS loads in much less than a second on my laptop during benchmarks.

Right now we’re streaming the file, but we could open it all at once, and Enum.to_list and batch insert?
Sorry for being so green!

Sure, you can open the file at once, since it’s just 8MB, no Enum.to_list is needed though:

bad_passwords = 
  filepath
  |> File.open!()
  |> String.split("\n")
  |> Enum.map(fn bad_password -> {String.trim(bad_password)} end)

:ets.insert(@table, bad_passwords)

But I’d still try and use a stream, maybe with some batching:

filepath
|> File.stream!()
|> Stream.map(fn bad_password -> {String.trim(bad_password)} end)
|> Stream.chunk_every(1000)
|> Enum.each(fn bad_passwords ->
  :ets.insert(@table, bad_passwords)
end)

But again, I don’t expect it to speed up the loading process considerably, since it’s already quite fast.

2 Likes

Thanks for all of your help! I followed that just fine.

I’ve implemented your pre-load the ETS table solution and load times of the app didn’t increase much that I could tell, so that’s good. The password checking now takes ~0.7s on my machine which is quite nice, and faster than the other two solutions I tried (~1.6s and ~1.1s respectively).

Keep in mind my time benchmarks come straight from a Phoenix app, so they include other unrelated tasks, and everything related to sending a new HTML page, etc.

Edit: I realized I should see how long the page takes to load without running this code. Apparently it takes the same ~0.7s with or without the validate_password_rarity check, which makes sense. :+1:

Just for fun I did this in Erlang - the code and arguments are pretty much the same for Elixir

First I read the password file, split it into lines and insert into an ets table and dump the ets table. Note: these file-at-a-time operations are by far the most efficient way to get data into and out of the system.

build() ->
    {ok, B} = file:read_file("10-million-password-list-top-1000000.txt"),
    Parts = binary:split(B,<<"\n">>,[global]),
    Tab = ets:new(passwords,[]),
    [ets:insert(Tab,{I,[]}) || I<- Parts],
    ets:tab2file(Tab, "passwords.tab").

It doesn’t matter how long this takes since I only do it once.

To check a password we read the table back into memory and do a lookup

lookup(Password) ->    
    {ok, Tab} = ets:file2tab("passwords.tmp"),
    case ets:lookup(Tab, Password) of
    [] -> false;
    _  -> true
    end.

I can’t think of a faster way to do this.

Now for some timings - I measured the time to do ets:file2tab this was 0.75 seconds
I timed the lookups by looking up every password in the list - this took 0.4 microseconds per lookup. (this is on a 3.1GHz Intel Core i7 macbook pro) so you could check 2.2 Million passwords/second.

Reading the password file takes 0.75 seconds so the “read file” + single check could do 4800 passwords per hour and you’d be transferring about 40 GBytes of data per hour into your program - That might be acceptable - it depends what else you are doing. It will use a fair bit of CPU reading the ets file.

Keeping your ets table in memory costs virtually nothing (in terms of GC) - you don’t need a gen server to access it since ets tables are shared anyway.

10 Likes

ETS is a great solution. Another one to consider is to use a database.

If you’re already using something like Postgres or Mysql this problem basically evaporates. Databases are incredibly good at random lookups/reads, especially on sets that don’t change or rarely change and have no/few writes.

You can do:

CREATE TABLE passwords (id SERIAL PRIMARY KEY, password varchar(50))

and

CREATE INDEX passwords_password ON passwords(password)

Then, create a prepared statement for

SELECT id FROM passwords WHERE password = ?

and query it like you would anything else.

(I am not taking into consideration whether you are comparing hashes or something else, the above is just raw passwords which may or may not be ok for your security profile. Customize the above to fit your specific use case.)

What are the benefits of this approach?

  • Simple application code. A single database lookup. This is probably the biggest. You don’t have to manage anything at all, really.
  • Databases are incredibly good at random lookups, this has consistent performance that will be comparable with ETS, even if it is slightly slower due to a database network hop.
  • Postgres is about as stable/predictable/reliable as it gets, I can’t conceive of a scenario where something breaks your code here.
  • The password list is persistent in your database. That is to say, you load it once, and forget about it. It’s not going anywhere, and you don’t have to reload it or worry about that.
  • Easy to add new password lists to this list if you need to. Just insert them like anything else.
  • Any app in your cluster can use this list it if they can talk to your database.
  • Password list takes no main memory space in your app. This isn’t a big one in this case, as 1,000,000 items is tiny no matter how you slice it, but if you were to combine this to 10 or 100 million things, then a database will scale without breaking a sweat.

Drawbacks?

  • Depends on your performance profile. If you really, really need this to be microsecond-speed, then having the data in-memory is going to be faster. A database SELECT with an index scan is incredibly performant and predictable, but there is still a network hop to the database. Anything that needs microsecond latency will use a more boutique solution, but this should be fast enough for any web app ever made.
  • Nothing else immediately comes to mind. Postgres “just works” :smiley:

Anyway, this is the approach I would go for if someone had me code it today. ETS is great, this is another option. This for me falls into the “set it and forget it” bucket, which is my preferred bucket when it come to code :smiley:

5 Likes

It would be very interesting to time Postgres and compare with the ets figures.

What’s the max rate for database queries per second? - any figures here?

1 Like

The most performant way to solve this problem definitely is to use a Bloom filter.

You might get a small number of false positives, but this is not a problem for your use case (disallowing a couple of inputs not actually in the list of bad passwords would not harm anyone).

Once the bloom filter has been created, consulting it runs in constant(or at least, far faster than linear) time.

To do this in Elixir, I know there are a couple of libraries such as Bloomex, although I haven’t used it myself yet.

1 Like

I’m seriously wondering why try to solve a problem that has already been solved by very smart engineers already
If it is a simple lookup and nothing more, SQLite is your friend
(or ETS or DETS if you need persistence)
On the pro sides, SQLite is a file that you can move around, it has no network latency, it can’t go down, and can be queried using ECTO (or standard SQL) and you won’t need to add Rust to the dependencies

https://www.sqlite.org/fasterthanfs.html

To answer your why question - I for one would like to know which of ets or sqlite is the faster - and also if a bloom filter is faster.

Without having performed an experiment running code on the same machine it is impossible to predict which is fastest. Whenever I predict performance without actually running the code I’m almost invariably wrong.

3 Likes

I started adding different approaches mentioned here to this benchmark.


UPDATE: Couldn’t get esqlite and ets:tab2file-based lookup to give reasonable results. Will try again tomorrow.

2 Likes

I’ve just tried sqlite on 1M passwords (not 10M) and can now directly compare with ets.

Name       Build (sec)   FileSize (MB)  Lookup Time (micro sec) 
ets        0.68          24             0.455 
sqlite3    719           41             53

Build is the time to build the ets table (or sqlite3) database the FileSize is the on-disk size of the ets-table/database Time is the lookup time (microseconds)

I suspect that the differences are due to the fact that ets is in memory and I understand what I’m doing, whereas the sqlite table is on disk and I haven’t a clue what I’m doing - it’s my first attempt at using sqlite so I’m probably doing it all wrong. Even if sqlite is in memory for the lookups there are an awful lot of round-trips between Erlang and sqlite and there is a semantic mismatch between the internal representations of data in Erlang and sqlite and conversion always takes time.

aside: First time I got sqlite working as well, so I’m quite pleased.

It’s basically (in SQL)

 create table test (pwd text primary key); 
 insert into test values ('password1);
 insert into test values ('password2);
 insert into test values ('password3);
  ...
 insert into test values ('password1000000');

then to lookup

 select * from test where pwd='something'

Seems to work - but rather slow.

What is more disturbing is the memory usage - 30 MB for 1M passwords would indicate 300MB for 10M passwords - now 300MB is getting to the point where I worry about space efficiency. So while a hash table is fastest it’s not space efficient - I suspect that bloom filters might be the way to go.

14 Likes

Thank @joeerl for taking the time to test it

Even without the numbers I was sure ETS would be faster for lookups, I suggested to use sqlite because one of the conditions was

Preferably without loading the entire file into memory at once,

But if that’s not an issue, ETS is totally the way to go

and with DETS you can persist and then transparently reload the data in memory

SQLite has the advantage of being more portable, you can query the file everywhere there is a SQLite client available (it’s just a lib)

Since you already wrote the code and I’m lazy (:stuck_out_tongue:) can you run it again loading SQLite in memory?

You just have to change the file path to ‘:memory:’

https://sqlite.org/inmemorydb.html

Even in memory db would be several times slower (~20 times slower) due to sql parsing and serialization. Or at least it was when I tried it yesterday … And I disagree that using sqlite is more convenient than just transferring plaintext passwords around, which is what you’d do if you were to use in memory ets, fst, or bloom filters. I still think fst is the most usable approach here since it’s very fast, “lossless”, and additionally compresses the data.


There are ways to speed up sqlite significantly by using different backends (lmdb, for example), and side stepping the sql parser, and batching reads, but that’s probably not the point of these benchmarks.

I mentioned Postgres being a reasonable choice for a broad variety of problems, so I figured I’d back it up. Results first:

# these benchmarks made with Benchee

clark$> PORT=4000 MIX_ENV=prod mix run bench.exs
mixwarning: found quoted keyword "test" but the quotes are not required. Note that keywords are always atoms, even when quoted, and quotes should only be used to introduce keywords with foreign characters in them
  mix.exs:58

Operating System: macOS"
CPU Information: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Number of Available Cores: 12
Available memory: 16 GB
Elixir 1.7.3
Erlang 21.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 μs
parallel: 1
inputs: none specified
Estimated total run time: 14 s


Benchmarking check password ecto query builder...
Benchmarking check password postgrex prepared...

Name                                        ips        average  deviation         median         99th %
check password postgrex prepared        10.12 K       98.79 μs    ±10.25%          96 μs         139 μs
check password ecto query builder        8.47 K      118.04 μs    ±11.38%         114 μs         169 μs

Comparison: 
check password postgrex prepared        10.12 K
check password ecto query builder        8.47 K - 1.19x slower
warning: unused import Ecto.Query
  bench.exs:1

Next, benchmark setup:

import Ecto.Query                                                                                       
                                                                                                        
query_prepared = "SELECT id FROM passwords WHERE password = $1"                                         
                                                                                                        
{:ok, _pid} =                                                                                           
  Postgrex.start_link(hostname: "localhost", username: "clark", password: "", database: "passtest_dev", 
pool: DBConnection.Poolboy, pool_size: 20, name: DB)                                                    
                                                                                                        
{:ok, prepared} = Postgrex.prepare(DB, "get_password", query_prepared, pool: DBConnection.Poolboy)      
                                                                                                        
Benchee.run(%{                                                                                          
  "check password ecto query builder" => fn ->                                                          
    Ecto.Query.select("passwords", [p], p.id)                                                           
    |> Ecto.Query.where([p], p.password == "austin")                                                    
    |> Passtest.Repo.one!                                                                               
  end,                                                                                                  
  "check password postgrex prepared" => fn ->                                                           
     Postgrex.execute!(DB, prepared, ["austin"], pool: DBConnection.Poolboy)                            
  end                                                                                                   
})

I generated a standard phoenix app, created a database, and then ran this migration:

defmodule Passtest.Repo.Migrations.CreatePasswords do                                                   
  use Ecto.Migration                                                                                    
                                                                                                        
  def change do                                                                                         
    create table("passwords") do                                                                        
      add :password, :string, size: 50                                                                  
    end                                                                                                 
                                                                                                        
    create index("passwords", [:password])                                                              
  end                                                                                                   
end  

The data file I used was the 1M passwords one from the link posted earlier in the thread:

clark$> wc -l ~/code/pass.txt
999999 /Users/clark/code/pass.txt

clark$> head -10 ~/code/pass.txt
123456
password
12345678
qwerty
123456789
12345
1234
111111
1234567
dragon

To be transparent, the above is with BEAM and the Postgres server instance running on my local machine. This decreases latency vs what you would experience in a production setting where databases and app servers live on their own physical/virtual hardware separated by at least one network hop. Even so, starting from a baseline 99%ile of 139 μs is encouraging. For something that isn’t a hot loop, 139 μs is blisteringly low latency. Especially something where network time is going to dominate computation time by one or two orders of magnitude.

I’m going to paraphrase a comment I read elsewhere that was put forward in a similar context: performance is rarely about what is theoretically possible. It’s almost always a function of what you can achieve with the time and resources you have available. Further, it’s a function of what is necessary for the situation. With enough time, money, and enough smart people, could we theoretically optimize this down to the fewest possible machine instructions and the least number of main memory hits? We could. Is that useful? I doubt it. The performance of common password checking is never going to be the thing that makes or sinks your product or application. For this problem, it’s likely that there is a threshold that when hit is “fast enough”, and anything else is unnoticeable by the end user and/or adds code or architectural complexity wildly disproportionate to its benefits.

Getting these passwords into Postgres and queryable by Ecto took about 15 minutes total from my first reading of the post. As I’m already going to be using both in my webapp, I add almost no code and zero dependencies I wasn’t already using. Then, I’ll deploy and monitor. Then, optimize. I hope this is useful to understanding my approach.

6 Likes

Because nobody actually went ahead and tried the bloomfilter-solution out yet, I did:

  • The given code creates the bloom filter of the 1 million most common passwords at compile time. This requires about a minute of compilation time. (and recompilations are obviously only required if you alter the code in there).
  • At runtime, it is really fast, outperforming the Postgres example in this topic by ~ a factor eight. I expect this to scale (i.e. the other solutions will slow down more than this one) when you introduce more than the one million passwords.

Benchee output: (Obviously, I did not try out the code of the other examples on my machine, but I do know that my machine currently is not the most high-spec machine there is, so I think these results are reproducible by you as well).

Benchmark.run
Operating System: Linux"
CPU Information: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Number of Available Cores: 8
Available memory: 7.62 GB
Elixir 1.6.6
Erlang 20.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 μs
parallel: 1
inputs: none specified
Estimated total run time: 7 s


Benchmarking Bloomex.prohibited?...

Name                          ips        average  deviation         median         99th %
Bloomex.prohibited?       81.49 K       12.27 μs    ±56.09%          11 μs          34 μs

Github Repo

The compiled Elixir.PasswordBloomex.beam file that contains the bloom filter is by the way about 2.7 MBs in size.


I do find it interesting that this is still slower than the BadPasswords.contains? and FST.contains? that @idi527 mentioned. I wonder how the graph would look when we go up to even more passwords. (And maybe there is a Bloom Filter library in Rust that we could rustler to instead as well?

2 Likes

I think I tried both a rust nif (which was strangely very slow) and bloomex in my benchmark above …

results
benchmark

I didn’t add sqlite, postgres, and ets:tab2file approaches since they were too slow.


BTW, in my benchmarks, bloomex was actually faster than fst.

3 Likes

Such a search sounds like a perfect use case for the trie data structure. It would be interesting to see a comparison. An example implementation of tries in Erlang can be found here.

3 Likes