Brute - A way to create streams of combinations

uri · February 1, 2017, 8:13pm

Hey guys,

I recently created a 0.1.0 version of my project Brute. Brute is a way of generating various combinations for a given character set. For example it can generate the the combinations of the set {a, b} of length five with the following:

iex>  Brute.generic('ab', 5) |> Enum.to_list
["aaaaa", "baaaa", "abaaa", "bbaaa", "aabaa", "babaa", "abbaa", "bbbaa",
  "aaaba", "baaba", "ababa", "bbaba", "aabba", "babba", "abbba", "bbbba",
  "aaaab", "baaab", "abaab", "bbaab", "aabab", "babab", "abbab", "bbbab",
  "aaabb", "baabb", "ababb", "bbabb", "aabbb", "babbb", "abbbb", "bbbbb"]

You can also provide ranges

iex>  Brute.generic(?0..?4, 2) |> Enum.to_list
["00", "10", "20", "30", "40", "01", "11", "21", "31", "41", "02", "12", "22",
  "32", "42", "03", "13", "23", "33", "43", "04", "14", "24", "34", "44"]

And if you provide a range for the “depth” parameter, it will give you a the sets ordered by depth

iex>  Brute.generic('ab', 1..3) |> Enum.to_list
["a", "b", "aa", "ba", "ab", "bb", "aaa", "baa", "aba", "bba", "aab", "bab",
  "abb", "bbb"]

I plan on creating a cache for these sets so for example, if a set of depth 10 is requested it might be sped up if sets 1-9 are already present in the cache.

I’m sure there are tons of improvements I can make before a 1.0.0 release, so I would greatly appreciate any feedback!

edit: Woops forgot to add the actual link!

Eiji · February 1, 2017, 10:46pm

Here are my suggestions:

first_data = ["a", ?b, 'c', <<100>>]
Brute.generic(first_data, :all) # Enum.count
Brute.generic(first_data, 2)
Brute.generic(first_data, 2, return_type: :map_first_to_second)
Brute.generic(first_data, 2, return_type: :map_second_to_first)
Brute.generic(first_data, 2, return_type: :map_any_to_any)
Brute.generic(first_data, [2, 3, 5])
Brute.generic(first_data, 2, return_type: :list)
Brute.generic(first_data, 2, return_type: :tuple)
Brute.generic(first_data, 2, method: :generic_custom, module: MyCustomBrute)

second_data = [a: 5, b: 10]
Brute.generic(second_data, 2, method: :generic_keys, return_type: :list)
Brute.generic(second_data, 2, method: :generic_values, return_type: :tuple)
Brute.generic(second_data, 2, method: :generic_keys_and_values)

third_data = "abcd"
Brute.generic(third_data, 2) # String.graphemes
Brute.generic(third_data, 2, return_type: [:list, :tuple])

fourth_data = {:a, :b, :c}
Brute.generic(fourth_data, 2) # Atom.to_string + String.to_atom

uri · February 2, 2017, 4:24pm

Thanks for your suggestions @Eiji. I’ll see how the project evolves and consider these changes.

mgwidmann · February 2, 2017, 7:13pm

I would like to see use cases for this in which a typical for loop would not work well, especially since I think the for loop Elixir provides is fairly powerful. I think without that, it will be a hard sell to get people to pick this up.

jmitchell · February 2, 2017, 8:12pm

Cool project!

In case you’re interested, our projects may eventually benefit from each other. Backtrex implements a generic backtracking algorithm, and your project is a specialization of that problem over ranges and acceptable lengths. Brute’s input ranges correspond to Backtrex’s values callback, and similarly an acceptable length in Brute corresponds to the number of unknowns specified in Backtrex’s unknowns callback. Your API is helping me realize I should consider supporting a variable amount of unknowns. Currently Brute doesn’t seem to do any additional filtering like Bactrex’s valid? callback, but it may not be all that helpful since users can simply pipe to Stream.filter/2.

For now Backtrex only returns the first solution (in Brute’s case the head of the output Stream.t), but there are plans to return a Stream.t like Brute does. Eventually I plan to support parallel search at the expense of deterministic order, which could be handy for brute as well if it proves efficient enough. Then again, Brute doesn’t have to assign a unique ID for every unknown, so maybe it would cause unnecessary overhead.

jmitchell · February 2, 2017, 8:20pm

I may be mistaken, but doesn’t the variable length specification for each acceptable output go beyond what for already naturally can?

I agree, though, that features to distinguish from for will ultimately be the selling point. Perhaps allowing for constraints like “no repeated values” (allow “abc” but not “aaa”) would help. Although possible to express in a for expression it could get unwieldy for high upper bounds on the length.

Eiji · February 2, 2017, 8:32pm

I would like to have one more method:

Brute.similar('something') # defaults: 1 and  'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQSTUVWXYZ'
# polish letters:
Brute.similar('something', 2, 'aąbcćdeęfghijklłmnńoópqrstuvwxyzźżAĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻ')
Brute.similar("something", 1,  "aąbcćdeęfghijklłmnńoópqrstuvwxyzźżAĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻ")
# where:
# 1st parameter is query list/string with one of more possible typos
# 2nd parameter is a number of allowed typos for query
# 3rd parameter is character list/string that could be replaced with possible typos

This could be helpful for search engines.

jmitchell · February 2, 2017, 8:44pm

That’s a nifty idea! Generate all strings within a certain hamming distance from a specified string. A couple interesting edge cases:

If the alphabet specified doesn’t include every character in the specified string, do you merge the missing ones into the alphabet or not?
In this context are we interchanging (ASCII) characters, unicode glyphs, …? Should this be a configurable option?

mgwidmann · February 2, 2017, 9:39pm

Yes, his example is not straight forward to write with for, but is also more of an academic exercise than a real world problem. I was suggesting that he showed something similar to how it could be used in the wild.

Eiji · February 2, 2017, 9:43pm

In this case these characters should be removed before start real work on it. Optionally we could add a way to merge them, but I believe that in most cases we will going to remove them.
Think about search for something that you already saw like our forum tags.
We need to find a tag in know List of characters (here is: english: a-z + “-” character"). Here any non-English character is bad and should be omitted.
Depends on source data where user is searching for. In one case user is searching English tags and in other Chinese articles. Think also about multi-language projects.

Here is my proposition for options:

Special characters: characters that should not be replaced (in our tags example it’s “-” character).
We could implement two methods.

In first
if option on_unknown_character: is set to :skip (default we will emit warning message about skipped character(s) + skip them else if that option is changed to :merge then we will not emit warning (because we already expect unknown characters) and merge characters

In second:
(method that’s name ends with “!” character) we will throw error when found unknown character

Note: Known characters are: specified characters and special characterrs (if set).
An option to set language (like: :pl_PL, so we do not need to pass whole alphabet to characters. We would pass lang + other characters (if any) + special characters.

Eiji · February 2, 2017, 9:51pm

Examples:

Random data generation - for testing algorithms, scalability etc
Search engines helper methods (like in my example)
Other projects/libraries that need generic data

It’s not for every Phoenix blog. It’s a API for optimised algorithms to generate data in known ranges/rules.
I hope that helps.

uri · February 3, 2017, 11:21pm

Comprehensions (for keyword in Elixir) are really cool, but how would you use it to generate combinations? For example, if i asked for all of the combinations of length 4 for the character-set ‘abcd’ how would you do it?

uri · February 3, 2017, 11:26pm

Awesome, I’ll check it out!

uri · February 3, 2017, 11:29pm

I really like the idea of Brute.similar/2! I’ll probably start exploring an implementation.

OvermindDL1 · February 3, 2017, 11:31pm

Well for a quick permutation:

iex> data = 'abcd'
iex> for a<-data, b<-data, c<-data, d<-data, do: [a, b, c, d]
['aaaa', 'aaab', 'aaac', 'aaad', 'aaba', 'aabb', 'aabc', 'aabd', 'aaca', 'aacb',
 'aacc', 'aacd', 'aada', 'aadb', 'aadc', 'aadd', 'abaa', 'abab', 'abac', 'abad',
 'abba', 'abbb', 'abbc', 'abbd', 'abca', 'abcb', 'abcc', 'abcd', 'abda', 'abdb',
 'abdc', 'abdd', 'acaa', 'acab', 'acac', 'acad', 'acba', 'acbb', 'acbc', 'acbd',
 'acca', 'accb', 'accc', 'accd', 'acda', 'acdb', 'acdc', 'acdd', 'adaa', 'adab',
 ...]

uri · February 3, 2017, 11:55pm

Yeah that works but for a set of length 5 you’d have to add an additional generator, which for a simple case is totally fine. I wonder if you can use macros to generate these comprehensions based on a param.

OvermindDL1 · February 4, 2017, 2:21am

Pretty easily done yeah, as long as you know the size at compile-time.