My first hex package - parsing social components

I have just published my first hex package - social_parser! (I am still fairly new to Elixir so be nice :slight_smile:).

It’s a very small library that parses out hashtags, mentions and links from a given input string. I am looking for any feedback on the code: being fairly new, I haven’t fully found my feet yet or nailed down best practices.

I am aware that it’s very basic at determining the end of a component at this point; that is something I need to think about next.

@swelham: How about doing it with regular expressions?
In the configuration we could define regular expressions for each target social provider, like:

config :social_parser, :patterns,
  facebook: [mention: ~r/@([a-z]+)/],
  my_new_social_site: [mention: ~r/\+([a-z0-9_-]+)/]

# Format: social_identifier: [type_identifier: regex]

In the parse function we could pass a message and a social provider, then match all patterns for the selected provider.
I think code like that would be simple to write and really configurable. We could define new patterns, so we would be able to quickly add support for newly created social sites and their new features. That way the code does not need to be rewritten for each social site.
What do you think about it?
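As a rough sketch of the idea above: the module name `PatternParserSketch` and its functions are made up for illustration and are not part of social_parser. The patterns are hard-coded in a function here to keep the example self-contained; in a real project they could presumably be read from the application configuration instead.

```elixir
defmodule PatternParserSketch do
  # Hypothetical patterns, mirroring the configuration idea above.
  def patterns do
    [
      facebook: [mention: ~r/@([a-z]+)/],
      my_new_social_site: [mention: ~r/\+([a-z0-9_-]+)/]
    ]
  end

  # Returns a keyword list of {type, [captured_value, ...]} for the provider.
  # Raises if the provider has no patterns defined.
  def parse(message, provider) do
    patterns()
    |> Keyword.fetch!(provider)
    |> Enum.map(fn {type, regex} ->
      matches =
        regex
        |> Regex.scan(message, capture: :all_but_first)
        |> List.flatten()

      {type, matches}
    end)
  end
end
```

Calling `PatternParserSketch.parse("hi @john and @jane", :facebook)` would return the captured names grouped under `:mention`. Note that each pattern scans the whole message, so the string is walked once per pattern.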

I went with the manual parsing approach for performance as it only requires iterating over the string once. I have been bitten in the past by using regex for parsing multiple patterns, especially if the regex becomes complex.
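For readers unfamiliar with the single-pass approach, here is a minimal sketch of the general technique. It is not the actual social_parser code: the module name `SinglePassSketch` is hypothetical, links are left out, and it assumes a component simply ends at the next space or at the end of the input.

```elixir
defmodule SinglePassSketch do
  # Walks the binary once, byte by byte. A component starts at "#" or "@"
  # and (simplifying assumption) ends at the next space or end of input.
  def parse(input), do: scan(input, nil, %{hashtags: [], mentions: []})

  # End of input: store any open component and restore the original order.
  defp scan(<<>>, current, acc) do
    acc = flush(current, acc)
    %{hashtags: Enum.reverse(acc.hashtags), mentions: Enum.reverse(acc.mentions)}
  end

  # A space closes the open component, if there is one.
  defp scan(<<" ", rest::binary>>, current, acc), do: scan(rest, nil, flush(current, acc))

  # A marker seen outside a component opens a new one.
  defp scan(<<"#", rest::binary>>, nil, acc), do: scan(rest, {:hashtags, ""}, acc)
  defp scan(<<"@", rest::binary>>, nil, acc), do: scan(rest, {:mentions, ""}, acc)

  # Inside a component: append the current byte to its text.
  defp scan(<<byte, rest::binary>>, {type, text}, acc),
    do: scan(rest, {type, <<text::binary, byte>>}, acc)

  # Outside a component: skip the byte.
  defp scan(<<_byte, rest::binary>>, nil, acc), do: scan(rest, nil, acc)

  # Stores a finished component; empty components (a lone marker) are dropped.
  defp flush(nil, acc), do: acc
  defp flush({_type, ""}, acc), do: acc
  defp flush({type, text}, acc), do: Map.update!(acc, type, &[text | &1])
end
```

The input is only traversed once no matter how many component types are recognised, which is the property being traded against the configurability of regex patterns. The sketch is deliberately naive: a marker in the middle of a word (as in an email address) still opens a component.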

I do agree that being able to configure custom matches would be cool. Maybe a performance test of manual parsing vs regex is needed!

In most cases performance does not matter.
On Facebook most visits come from mobile devices, so we should not expect thousands of lines to parse.
Twitter has reportedly been considering raising its character limit from 140 to 10_000.
I think 10_000 characters is more than most users will write, so I expect the text to match to be much smaller.
I recommend running 3 tests:

  1. LOW: Only one complex sentence
  2. MED: 10 complex sentences - I think this is roughly the real-world maximum
  3. HIGH: 10_000 characters

I think regular expressions would do fine at the LOW and MED levels. A much bigger difference should only show up with thousands of characters. If the test results turn out as I expect, I think it’s worth implementing regular expressions.
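The three levels above could be timed with something like the following sketch. Everything in it is an assumption for illustration: the `BenchSketch` module, the sample sentence, the run count and the simple `~r/[#@]\w+/` pattern standing in for a real parser. It uses `:timer.tc/1` from the Erlang standard library, which is crude; a dedicated benchmarking library would give more reliable numbers.

```elixir
defmodule BenchSketch do
  # Hypothetical sample sentence containing a few components.
  @sentence "Thanks @alice and @bob for the #elixir tips, see http://example.com #hex "

  # Builds the three input sizes proposed above.
  def inputs do
    [
      low: @sentence,
      med: String.duplicate(@sentence, 10),
      high: String.duplicate(@sentence, 200) |> String.slice(0, 10_000)
    ]
  end

  # Runs `fun` on `input` `runs` times and returns the average time
  # per run in microseconds.
  def average_us(fun, input, runs \\ 1_000) do
    {total, _result} =
      :timer.tc(fn ->
        Enum.each(1..runs, fn _ -> fun.(input) end)
      end)

    total / runs
  end
end

# Stand-in for the parser under test; swap in the real function to compare.
regex_parse = fn input -> Regex.scan(~r/[#@]\w+/, input) end

for {level, input} <- BenchSketch.inputs() do
  IO.puts("#{level}: #{BenchSketch.average_us(regex_parse, input)} µs per run")
end
```

Passing the manual parser and the regex parser through the same `average_us/3` call on the same inputs would give a like-for-like comparison at each level.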

What do you think about named captures (see: Regex.named_captures/3)? With them we can match multiple regular expressions in one call. We could also translate a keyword list of regular expressions into one bigger regex, which would keep the project easy to configure.
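A sketch of how the combined regex could look. The module name `CombinedRegexSketch` and the three pattern sources are assumptions for illustration. One caveat: `Regex.named_captures/3` only returns the first match, so this sketch uses `Regex.scan/3` together with `Regex.names/1` to collect every match in one pass.

```elixir
defmodule CombinedRegexSketch do
  # Hypothetical keyword list of {type, pattern source}, one entry per component.
  def sources do
    [hashtag: "#\\w+", mention: "@\\w+", link: "https?://\\S+"]
  end

  # Joins every pattern into one alternation of named groups, e.g.
  # (?<hashtag>#\w+)|(?<mention>@\w+)|(?<link>https?://\S+)
  def combine(sources) do
    sources
    |> Enum.map(fn {type, source} -> "(?<#{type}>#{source})" end)
    |> Enum.join("|")
    |> Regex.compile!()
  end

  # Scans the message once and tags each match with the name of the
  # group that captured it. Groups that did not take part are returned
  # as empty strings, so the first non-empty value identifies the type.
  def parse(message, sources) do
    regex = combine(sources)
    names = Regex.names(regex)

    regex
    |> Regex.scan(message, capture: names)
    |> Enum.map(fn values ->
      {name, value} =
        names
        |> Enum.zip(values)
        |> Enum.find(fn {_name, captured} -> captured != "" end)

      {String.to_existing_atom(name), value}
    end)
  end
end
```

For example, `CombinedRegexSketch.parse("hi @bob #elixir", CombinedRegexSketch.sources())` would return the matches in the order they appear in the message, each tagged with its type. In practice the regex would be compiled once and reused rather than rebuilt on every call.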

I will spend some time doing the performance tests and see how much difference there is between the two.

I also think named_captures looks great, so I will use that when trying the regex tests.