Representation of bit-vector in Ecto (Postgresql)

Hello,

I’m working on an efficient way to represent user activity within an organization for sales analytics (emphasis on efficient). All sales want is the ability to see if a specific user in an org was doing a specific action on a given day and the percentage of users performing some action.

I was thinking of representing the information using a bit-vector where each user is a specific bit position. For example:

1> add_activity = fn(id, activity) -> Bitwise.bor(activity, Bitwise.bsl(1, id)) end
#Function<41.105768164/2 in :erl_eval.expr/6>
2> activity = add_activity.(10, 0)
1024
3> activity = add_activity.(17, activity)
132096
4> Integer.to_string(activity, 2)
"100000010000000000"
5> user_active? = fn(id, activity) -> Bitwise.bor(Bitwise.bsl(1, id), activity) == activity end
#Function<41.105768164/2 in :erl_eval.expr/6>
6> user_active?.(10, activity)
true
7> user_active?.(18, activity)
false

Big numbers for sure, but also pretty efficient.

Now I’d like to save this data, where a schema may look like:

  schema "user_activity_metrics" do
    field :entity_id, Types.UUID, primary_key: true, default: Types.UUID.null()
    field :activity_date, :date, primary_key: true
    field :uploads,   ## Type???
    field :downloads, ## Type???
    field :emails,    ## Type???
end

I’d love your input on how to represent bit-vectors Type in Postgres that is Ecto friendly and efficient. Efficient meaning in terms of space, but also being able to make simple queries. I know Postgresql supports types bit and bit varying, but last time I looked you couldn’t do bitwise searching if it was larger than a word IRC.

I could use {:array, :boolean}, but not too sure on space.

Suggestions welcome

You can always create your own custom ecto type, it’s easy to do and you have full flexibility to make what kind of abstraction you want at elixir level.

This is the problem you should focus on finding the answer. I think you might be able to create your own procedures if that is not possible, however I would resort to that only if there is no other option.

It is also worth considering how much space overhead bit varying has and compare it with a simpler solution like a join table, because it might be so small that it would make no sense.

I looked at creating postgresql procedures to glue word-length chunks together…but pooping barbed wire looks more fun.

However, a custom Ecto type may be a good alternative.

The “textbook” approach for this would be to upsert into a table of unique {entity_id, user_id, date, activity} tuples every time a user does activity on an entity.

That approach has a downside: if there are MANY possible values of activity, it may need to be a large data type.

In your example, it seems like there’s only three possible values. You could use an enum (will take 4 bytes), or three booleans (1 byte each), or even pack the three into a single number 0-7.

Bit strings have several downsides:

  • querying them is expensive; there’s not much index acceleration available for bitwise operations
  • they take up extra space, especially if they are small - a bit(1) reports its size (via pg_column_size(1::bit(1))) as 9 bytes. Trying larger sizes suggests the data is stored as an 8-byte length + 1-or-more data bytes.
  • they waste significant space when users are deleted or inactive: every user that didn’t do an action on entity_id for that day needs a 0 bit somewhere. Imagine the only user that touched entity_id = 123 today is user_id = 12345: the bit string would need to write 1543 bytes of zeros before a solitary 1.
1 Like

Wow…thanks for that, I’m shocked

Postgrex supports the bitstring type which accepts a binary so I think should be quite easy to use in Ecto, maybe even as a String.t (even though its not UTF-8. bitstring maps to the Postgres [varbit[(PostgreSQL: Documentation: 16: 8.10. Bit String Types) type.

Then use binary pattern matching to extract the values you want.

2 Likes