Extract keywords from sentence

papakay · December 24, 2017, 5:45am

How can I extract keywords from a sentence?

I’m building a blog application, and I want to be able to extract keywords from the title of each blog post.

Or is there a library that can help achieve this?

Thank you.

kokolegorille · December 24, 2017, 7:33am

Is this what you are looking for?

iex> "roses are red\n violet are blue\n"
...> |> String.split("\n")
...> |> Enum.flat_map(&String.split/1)
["roses", "are", "red", "violet", "are", "blue"]

The code comes from this videocast on GenStage.

What do you mean by keywords? Are they some kind of tags?

Starting from this code, you could easily remove duplicates and blacklist some common words, like articles, verbs etc.
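As a rough sketch of that idea (not a definitive implementation): the module name `TitleWords` and the stop-word list below are assumptions made up for illustration, and the regex split, which also strips punctuation, is my own choice rather than something from the videocast.

```elixir
defmodule TitleWords do
  # Hypothetical stop-word list (an assumption); extend it to suit your content.
  @stop_words ~w(a an the are is do does you have how)

  # Lowercases the text, splits it on anything that is not a letter or digit,
  # then drops stop words and duplicates (first occurrence wins).
  def candidates(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> Enum.reject(&(&1 in @stop_words))
    |> Enum.uniq()
  end
end

TitleWords.candidates("roses are red\n violet are blue\n")
# => ["roses", "red", "violet", "blue"]
```

What is left still depends entirely on how good the stop-word list is.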

papakay · December 24, 2017, 7:02pm

Thanks immensely, @kokolegorille, for this code snippet.

By keywords I mean getting the most important words in the sentence.

E.g. with a topic like “Do you have a Leadership Brain?”, “Leadership” and “Brain” are the keywords, and that’s exactly what I would like to extract from the topic.

NobbZ · December 24, 2017, 7:29pm

Given only that sentence, “you” or “do” or any other word could be a keyword or important as well.

Probably the easiest way to recognize your “keywords” is to provide an additional field in the form where users can enter them manually.

If, however, you really want to recognize them fully automatically, you need to learn either advanced statistics, natural language processing, or both…

papakay · December 24, 2017, 7:32pm

Thanks @NobbZ, I will go with your advice. I will allow users to manually specify the keywords.

Eiji · December 25, 2017, 12:05pm

@NobbZ: I don’t know how big this project is. If it’s not something super-professional with Perfect Forward Secrecy etc., then he could use a simpler algorithm without learning advanced statistics.

@papakay: Here is an example SQL query I wrote to fetch the most used words from the database:

-- Plain SQL has no variables, so we use a WITH clause instead.
with words_with_count as (
  select
    word,
    count(word) as count
  from (
    -- lower() is the SQL equivalent of Elixir's String.downcase/1;
    -- we do not need to tell upper and lower case apart.
    -- string_to_array() splits each title on spaces, and unnest() works
    -- a bit like List.flatten/1: it turns the rows of arrays into one row
    -- per word, so we get every word of every title.
    select unnest(string_to_array(lower(title), ' ')) as word
    from posts
  ) as result
  -- Skip the words in this list; feel free to add or remove items.
  where word not in ('do', 'how', 'you', 'a', 'the', 'does', 'etc')
  group by word
)
select word
from words_with_count
-- Skip words that appear only once; feel free to raise the threshold.
where count > 1
-- Sort by how many times each word appears across all titles.
order by count desc
-- Return at most 10 words; feel free to decrease/increase.
limit 10;

To run it using Ecto you could use Ecto.Adapters.SQL.query/4. The rows of the result (each row is a one-element list) should contain the most used words in the title column of the posts table.

Once you have fetched them, you can do something like:
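If the titles are already loaded in memory, the same counting can also be sketched in plain Elixir. This is only an approximation of the SQL under stated assumptions: the module name `TitleStats` and the sample titles are made up for illustration, empty strings from repeated spaces are dropped (the SQL keeps them), and ties are broken alphabetically, which the SQL does not guarantee.

```elixir
defmodule TitleStats do
  # Same exclusion list as in the SQL.
  @stop_words ~w(do how you a the does etc)

  # Returns up to `limit` words that occur more than once across all titles,
  # most frequent first.
  def top_words(titles, limit \\ 10) do
    titles
    |> Enum.flat_map(&(&1 |> String.downcase() |> String.split(" ", trim: true)))
    |> Enum.reject(&(&1 in @stop_words))
    |> Enum.reduce(%{}, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)
    |> Enum.filter(fn {_word, count} -> count > 1 end)
    |> Enum.sort_by(fn {word, count} -> {-count, word} end)
    |> Enum.take(limit)
    |> Enum.map(fn {word, _count} -> word end)
  end
end

# Hypothetical sample data
titles = [
  "Do you have a Leadership Brain",
  "Leadership lessons",
  "Brain food",
  "Leadership today"
]

TitleStats.top_words(titles)
# => ["leadership", "brain"]
```

For a large posts table the SQL version is likely the better fit, since the counting then happens in the database.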

database_keywords = … # use my SQL here to find them
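# the SQL lowercases every word, so downcase the new title before comparing
keyword_candidates = new_post.title |> String.downcase() |> String.split()
found_keywords = Enum.filter(keyword_candidates, &(&1 in database_keywords))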
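A self-contained sketch of that filtering step, with the database result replaced by a hard-coded list: the module name `KeywordMatcher`, the sample keywords, and the sample title are assumptions for illustration only. It also strips punctuation, so a title word like "Leadership!" can still match.

```elixir
defmodule KeywordMatcher do
  # Returns the words of `title` that also appear in `database_keywords`,
  # in title order and without duplicates.
  def match(title, database_keywords) do
    known = MapSet.new(database_keywords)

    title
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> Enum.uniq()
    |> Enum.filter(&MapSet.member?(known, &1))
  end
end

# Hypothetical stand-in for the words returned by the SQL query
database_keywords = ["leadership", "brain"]

KeywordMatcher.match("Train your Brain for Leadership!", database_keywords)
# => ["brain", "leadership"]
```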

Note: I wrote this SQL in about 5 minutes and checked it using psql (PostgreSQL 10.1). It definitely does not implement any advanced algorithm, but it could be helpful in smaller projects where a much more complicated algorithm is not needed. I would not be surprised if someone found a 10x faster way with half as much SQL code.

Note 2: I briefly described my SQL in comments. You can remove them or keep them; comments do not affect the code. If you still have a question, feel free to post it in this topic.