Most performant way to search a file for a string?

It would be very interesting to time Postgres and compare with the ets figures.

What's the max rate for database queries per second? Any figures here?

1 Like

The most performant way to solve this problem is definitely to use a Bloom filter.

You might get a small number of false positives, but this is not a problem for your use case (disallowing a couple of inputs not actually in the list of bad passwords would not harm anyone).

Once the Bloom filter has been created, consulting it runs in constant (or at least far faster than linear) time.

To do this in Elixir, there are a couple of libraries such as Bloomex, although I haven't used them myself yet.
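For illustration, a minimal sketch of what that could look like with Bloomex (based on its scalable/add/member? functions; the file name, capacity, and error-rate figures here are just assumptions):

bloom =
  "pass.txt"                                 # hypothetical file, one password per line
  |> File.stream!()
  |> Enum.reduce(Bloomex.scalable(1_000_000, 0.001, 0.1, 2), fn line, bf ->
    Bloomex.add(bf, String.trim(line))
  end)

Bloomex.member?(bloom, "123456")             # => true
Bloomex.member?(bloom, "not-a-common-pass")  # => false (or a rare false positive)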

1 Like

I'm seriously wondering why we're trying to solve a problem that has already been solved by very smart engineers.
If it is a simple lookup and nothing more, SQLite is your friend
(or ETS, or DETS if you need persistence).
On the pro side, SQLite is a single file you can move around, it has no network latency, it can't go down, it can be queried using Ecto (or standard SQL), and you won't need to add Rust to your dependencies.

https://www.sqlite.org/fasterthanfs.html

To answer your why question - I for one would like to know which of ets or sqlite is faster - and also whether a Bloom filter is faster still.

Without having performed an experiment running code on the same machine it is impossible to predict which is fastest. Whenever I predict performance without actually running the code I’m almost invariably wrong.

3 Likes

I started adding different approaches mentioned here to this benchmark.


UPDATE: Couldn’t get esqlite and ets:tab2file-based lookup to give reasonable results. Will try again tomorrow.

2 Likes

I’ve just tried sqlite on 1M passwords (not 10M) and can now directly compare with ets.

Name       Build (sec)   FileSize (MB)  Lookup Time (micro sec) 
ets        0.68          24             0.455 
sqlite3    719           41             53

Build is the time to build the ets table (or sqlite3 database), FileSize is the on-disk size of the ets table/database, and Lookup Time is the time per lookup in microseconds.

I suspect that the differences are due to the fact that ets is in memory and I understand what I'm doing, whereas the sqlite table is on disk and I haven't a clue what I'm doing - it's my first attempt at using sqlite, so I'm probably doing it all wrong. Even if sqlite keeps the data in memory for the lookups, there are an awful lot of round-trips between Erlang and sqlite, there is a semantic mismatch between the internal representations of data in Erlang and sqlite, and the conversion always takes time.
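For reference, the ets side of the comparison is roughly the following (a sketch only, not the actual benchmark code; the table options and file names are assumptions):

# Build: a set table keyed on the password itself.
table = :ets.new(:passwords, [:set, :public, read_concurrency: true])

"pass.txt"
|> File.stream!()
|> Enum.each(fn line -> :ets.insert(table, {String.trim(line)}) end)

# Persist to disk and reload (this is where the on-disk file size comes from).
:ok = :ets.tab2file(table, 'passwords.ets')
{:ok, reloaded} = :ets.file2tab('passwords.ets')

# Lookup: a membership test is a single hash probe.
:ets.member(reloaded, "123456")          # => true
:ets.member(reloaded, "not-in-the-list") # => false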

Aside: it's the first time I've got sqlite working as well, so I'm quite pleased.

It’s basically (in SQL)

 create table test (pwd text primary key);
 insert into test values ('password1');
 insert into test values ('password2');
 insert into test values ('password3');
  ...
 insert into test values ('password1000000');

then to lookup

 select * from test where pwd='something'

Seems to work - but rather slow.

What is more disturbing is the memory usage - 30 MB for 1M passwords would indicate 300 MB for 10M passwords, and 300 MB is getting to the point where I worry about space efficiency. So while a hash table is fastest, it's not space efficient - I suspect that Bloom filters might be the way to go.
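Back-of-the-envelope, using the standard Bloom filter sizing formula (bits per element = -ln(p) / (ln 2)^2; the 0.1% false-positive rate below is just an assumed target):

n = 10_000_000                                                   # passwords
p = 0.001                                                        # assumed acceptable false-positive rate
bits_per_element = -:math.log(p) / :math.pow(:math.log(2), 2)    # ≈ 14.4
total_mb = n * bits_per_element / 8 / 1_000_000                  # ≈ 18 MB

So roughly 18 MB for 10M entries, compared to the ~300 MB a hash table would need.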

14 Likes

Thanks @joeerl for taking the time to test it.

Even without the numbers I was sure ETS would be faster for lookups; I suggested using SQLite because one of the conditions was:

Preferably without loading the entire file into memory at once,

But if that’s not an issue, ETS is totally the way to go

And with DETS you can persist the data and then transparently reload it into memory.
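For example, a DETS-backed setup could look roughly like this (a sketch; the file name and table options are assumptions):

# Open (or create) the disk table and fill it once.
{:ok, disk} = :dets.open_file(:passwords_disk, file: 'passwords.dets', type: :set)

"pass.txt"
|> File.stream!()
|> Enum.each(fn line -> :dets.insert(disk, {String.trim(line)}) end)

# Later (e.g. at application start), pull it back into an in-memory ets table.
ram = :ets.new(:passwords_ram, [:set, :public, read_concurrency: true])
ram = :dets.to_ets(disk, ram)

:ets.member(ram, "123456")  # => true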

SQLite has the advantage of being more portable: you can query the file anywhere a SQLite client is available (it's just a lib).

Since you already wrote the code and I’m lazy (:stuck_out_tongue:) can you run it again loading SQLite in memory?

You just have to change the file path to ‘:memory:’

https://sqlite.org/inmemorydb.html

Even an in-memory db would be several times slower (~20 times) due to SQL parsing and serialization. Or at least it was when I tried it yesterday … And I disagree that using sqlite is more convenient than just transferring plaintext passwords around, which is what you'd do if you were to use in-memory ets, fst, or bloom filters. I still think fst is the most usable approach here since it's very fast, "lossless", and additionally compresses the data.


There are ways to speed up sqlite significantly by using different backends (lmdb, for example), sidestepping the SQL parser, and batching reads, but that's probably not the point of these benchmarks.

I mentioned Postgres being a reasonable choice for a broad variety of problems, so I figured I’d back it up. Results first:

# these benchmarks made with Benchee

clark$> PORT=4000 MIX_ENV=prod mix run bench.exs
warning: found quoted keyword "test" but the quotes are not required. Note that keywords are always atoms, even when quoted, and quotes should only be used to introduce keywords with foreign characters in them
  mix.exs:58

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Number of Available Cores: 12
Available memory: 16 GB
Elixir 1.7.3
Erlang 21.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 μs
parallel: 1
inputs: none specified
Estimated total run time: 14 s


Benchmarking check password ecto query builder...
Benchmarking check password postgrex prepared...

Name                                        ips        average  deviation         median         99th %
check password postgrex prepared        10.12 K       98.79 μs    ±10.25%          96 μs         139 μs
check password ecto query builder        8.47 K      118.04 μs    ±11.38%         114 μs         169 μs

Comparison: 
check password postgrex prepared        10.12 K
check password ecto query builder        8.47 K - 1.19x slower
warning: unused import Ecto.Query
  bench.exs:1

Next, benchmark setup:

import Ecto.Query                                                                                       
                                                                                                        
query_prepared = "SELECT id FROM passwords WHERE password = $1"                                         
                                                                                                        
{:ok, _pid} =                                                                                           
  Postgrex.start_link(hostname: "localhost", username: "clark", password: "", database: "passtest_dev", 
pool: DBConnection.Poolboy, pool_size: 20, name: DB)                                                    
                                                                                                        
{:ok, prepared} = Postgrex.prepare(DB, "get_password", query_prepared, pool: DBConnection.Poolboy)      
                                                                                                        
Benchee.run(%{                                                                                          
  "check password ecto query builder" => fn ->                                                          
    Ecto.Query.select("passwords", [p], p.id)                                                           
    |> Ecto.Query.where([p], p.password == "austin")                                                    
    |> Passtest.Repo.one!                                                                               
  end,                                                                                                  
  "check password postgrex prepared" => fn ->                                                           
     Postgrex.execute!(DB, prepared, ["austin"], pool: DBConnection.Poolboy)                            
  end                                                                                                   
})

I generated a standard phoenix app, created a database, and then ran this migration:

defmodule Passtest.Repo.Migrations.CreatePasswords do                                                   
  use Ecto.Migration                                                                                    
                                                                                                        
  def change do                                                                                         
    create table("passwords") do                                                                        
      add :password, :string, size: 50                                                                  
    end                                                                                                 
                                                                                                        
    create index("passwords", [:password])                                                              
  end                                                                                                   
end  

The data file I used was the 1M passwords one from the link posted earlier in the thread:

clark$> wc -l ~/code/pass.txt
999999 /Users/clark/code/pass.txt

clark$> head -10 ~/code/pass.txt
123456
password
12345678
qwerty
123456789
12345
1234
111111
1234567
dragon
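
For completeness, a chunked insert_all along these lines would load the file into the passwords table (a sketch of one way to do it; the chunk size is an arbitrary assumption):

# Stream the password list into Postgres in batches.
"pass.txt"
|> File.stream!()
|> Stream.map(&%{password: String.trim(&1)})
|> Stream.chunk_every(5_000)
|> Enum.each(fn batch -> Passtest.Repo.insert_all("passwords", batch) end)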

To be transparent, the above is with BEAM and the Postgres server instance running on my local machine. This decreases latency vs what you would experience in a production setting, where databases and app servers live on their own physical/virtual hardware separated by at least one network hop. Even so, starting from a baseline 99th percentile of 139 μs is encouraging. For something that isn't a hot loop, 139 μs is blisteringly low latency, especially when network time is going to dominate computation time by one or two orders of magnitude.

I'm going to paraphrase a comment I read elsewhere that was put forward in a similar context: performance is rarely about what is theoretically possible. It's almost always a function of what you can achieve with the time and resources you have available. Further, it's a function of what is necessary for the situation. With enough time, money, and enough smart people, could we theoretically optimize this down to the fewest possible machine instructions and the least number of main-memory hits? We could. Is that useful? I doubt it. The performance of common-password checking is never going to be the thing that makes or sinks your product or application. For this problem, there is likely a threshold that, once hit, is "fast enough"; anything beyond it is unnoticeable by the end user and/or adds code or architectural complexity wildly disproportionate to its benefits.

Getting these passwords into Postgres and queryable by Ecto took about 15 minutes total from my first reading of the post. As I'm already going to be using both in my webapp, I add almost no code and zero dependencies I wasn't already using. Then I'll deploy and monitor. Then optimize. I hope this is useful for understanding my approach.

6 Likes

Because nobody had actually gone ahead and tried out the Bloom-filter solution yet, I did:

  • The given code creates the Bloom filter of the 1 million most common passwords at compile time. This requires about a minute of compilation time (and recompilation is obviously only needed if you alter the code in there). A rough sketch of the idea follows right below.
  • At runtime it is really fast, outperforming the Postgres example in this topic by roughly a factor of eight. I expect this gap to widen (i.e. the other solutions will slow down more than this one) when you introduce more than the one million passwords.
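
Roughly, the compile-time embedding works like this (a sketch only; the file path, capacity, and error rate are assumptions, and the real code in the repo may differ):

defmodule PasswordBloomex do
  # Build the filter once, at compile time, from a one-password-per-line file.
  @passwords_path "priv/passwords.txt"
  @external_resource @passwords_path

  @bloom @passwords_path
         |> File.stream!()
         |> Enum.reduce(Bloomex.scalable(1_000_000, 0.001, 0.1, 2), fn line, bf ->
           Bloomex.add(bf, String.trim(line))
         end)

  # The runtime check is just a lookup in the embedded filter.
  def prohibited?(password), do: Bloomex.member?(@bloom, password)
end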

Benchee output (obviously I did not run the code of the other examples on my machine, but my machine is not the most high-spec one around, so I think these results should be reproducible by you as well):

Benchmark.run
Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Number of Available Cores: 8
Available memory: 7.62 GB
Elixir 1.6.6
Erlang 20.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 μs
parallel: 1
inputs: none specified
Estimated total run time: 7 s


Benchmarking Bloomex.prohibited?...

Name                          ips        average  deviation         median         99th %
Bloomex.prohibited?       81.49 K       12.27 μs    ±56.09%          11 μs          34 μs

Github Repo

By the way, the compiled Elixir.PasswordBloomex.beam file that contains the Bloom filter is about 2.7 MB in size.


I do find it interesting that this is still slower than the BadPasswords.contains? and FST.contains? that @idi527 mentioned. I wonder how the graph would look when we go up to even more passwords. (And maybe there is a Bloom filter library in Rust that we could call through Rustler instead as well?)

2 Likes

I think I tried both a Rust NIF (which was strangely very slow) and bloomex in my benchmark above …

results
benchmark

I didn’t add sqlite, postgres, and ets:tab2file approaches since they were too slow.


BTW, in my benchmarks, bloomex was actually faster than fst.

3 Likes

Such a search sounds like a perfect use case for the trie data structure. It would be interesting to see a comparison. An example implementation of tries in Erlang can be found here.
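
As a starting point for such a comparison, a minimal map-based trie in Elixir could look like this (an illustrative sketch, not the linked Erlang implementation):

defmodule Trie do
  # Store one grapheme per level; mark complete words with a :stop key.
  def insert(trie, word) do
    case String.next_grapheme(word) do
      nil -> Map.put(trie, :stop, true)
      {ch, rest} -> Map.update(trie, ch, insert(%{}, rest), &insert(&1, rest))
    end
  end

  def member?(trie, word) do
    case String.next_grapheme(word) do
      nil -> Map.get(trie, :stop, false)
      {ch, rest} ->
        case Map.fetch(trie, ch) do
          {:ok, sub} -> member?(sub, rest)
          :error -> false
        end
    end
  end
end

# Build from the password list and query:
# trie = Enum.reduce(passwords, %{}, fn pwd, t -> Trie.insert(t, pwd) end)
# Trie.member?(trie, "123456")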

3 Likes