How can I analyse data (keywords) in Elixir?

Hi there!
I’m having a tough time understanding what to do:

I need to categorize a product by its keywords.
I have several “Topics”, for example: “New Technologies”, “Health”.
Each of those “Topics” has several “Subtopics”, for example: “New Technologies => Internet of Things (IoT)”, “New Technologies => 3-D Printing”.

On average, there are about 250 categorized products in each “Subtopic”.
I went ahead and grabbed all the keywords from the already categorized products and made a list for each “Subtopic” with the keywords and their frequencies, for example: %{"keyword": "stereolithography", "uniqueCount": 8, "totalCount": 17}

So, long story short: I have a bunch of keyword lists that I need to somehow “measure” and “rank”.
I guess my next step should be building a keyword list for each “Topic” by combining its “Subtopics” into a “dictionary” of all the “important” keywords… I have no idea how to do it. The keywords overlap among “Subtopics”, but I still need to “measure” them somehow…
Is there any “cool and easy” Elixir library that could be useful in my case?
Please share any tips!

Could you provide a real example of the data structure you have and what you expect the output to be?
This will improve the quality of the answers you’ll get from the community.

Sure, I have a bunch of Excel files with this kind of structure:
For Topic: "New Technologies"

Subtopic: "Artificial Intelligence"

Keywords                      Unique Count   Total Count
artificial intelligence       125            190
machine learning              34             65
big data                      29             54
artificial intelligence ai    18             23
neural networks               16             20
social media                  16             21
decision making               15             32

Subtopic: "Internet of Things"

Keywords                      Unique Count   Total Count
internet things iot           22             38
iot                           18             36
internet things               16             66
big data                      12             12
cloud computing               12             19
sensors                       9              10
internet                      8              9
privacy                       7              10

Subtopic: "Machine Learning"

Keywords                      Unique Count   Total Count
machine learning              111            164
artificial intelligence       41             86
big data                      29             64
data mining                   26             43
supervised learning           23             26
neural networks               22             26
deep learning                 19             26
support vector machines       19             22
internet things               17             30
machine learning algorithms   17             19

And there is a table like this for every Subtopic inside the “New Technologies” Topic.

Topic Subtopic
New Technologies 4th Industrial Revolution
Sustainability (Energy) Core
Sustainability (Energy) Extended
3-D Printing
Artificial Intelligence (AI)
Augmented & Virtual Reality (& wearables)
Big Data & Data Analytics
Cloud & Fog Computing
Internet of Things (IoT)
Machine Learning
Wireless Technology

As you can see, a lot of keywords overlap, for example: “artificial intelligence”, “big data”, “machine learning”.

Now I want to create a combined keyword list for the “New Technologies” Topic.
But I don’t know how to “measure” the value of a given keyword. I am really lacking the theory for this kind of stuff… For example, the keyword “big data” is in every single one of those Subtopics, so I need to rank it lower because it is not unique, but I don’t know the formula or algorithm to do it…

So it would be really nice to use some library (maybe not even an Elixir library) to achieve this. But I don’t know of any either…

And there is also a problem with the keywords themselves… Some of them mean the same thing, like “internet things iot”, “iot”, and “internet things”. All of them are basically the same thing, yet in my dataset they are different. But for now I want to treat them as unique keywords, for the sake of simplicity and the MVP.

Any tips?

First step would be to normalize your data…
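
For what it's worth, here is a minimal sketch of what such a normalization step could look like in Elixir. The module name `KeywordNormalizer`, the alias table, and the row shape (`keyword`, `unique_count`, `total_count`) are all assumptions made for illustration, and the alias entries are only guesses at which variants mean the same thing. It assumes Elixir 1.10 or later for the `:desc` sorter.

```elixir
defmodule KeywordNormalizer do
  # Hypothetical alias table: maps variant spellings to one canonical keyword.
  @aliases %{
    "internet things iot" => "internet of things",
    "internet things" => "internet of things",
    "iot" => "internet of things",
    "artificial intelligence ai" => "artificial intelligence"
  }

  # Lowercase, trim, collapse repeated whitespace, then apply the alias table.
  def canonical(keyword) do
    cleaned =
      keyword
      |> String.downcase()
      |> String.trim()
      |> String.replace(~r/\s+/, " ")

    Map.get(@aliases, cleaned, cleaned)
  end

  # Merge rows that share a canonical keyword by summing their counts,
  # then sort by unique_count, highest first.
  def normalize(rows) do
    rows
    |> Enum.group_by(fn row -> canonical(row.keyword) end)
    |> Enum.map(fn {keyword, group} ->
      %{
        keyword: keyword,
        unique_count: group |> Enum.map(& &1.unique_count) |> Enum.sum(),
        total_count: group |> Enum.map(& &1.total_count) |> Enum.sum()
      }
    end)
    |> Enum.sort_by(& &1.unique_count, :desc)
  end
end
```

Summing `unique_count` across variants can overcount products that carry more than one variant, so the merged numbers are only an approximation.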

After this, sure, you can transform your data; Elixir is good at that :slight_smile:

Hi!
I don’t understand what you mean by “transform your data”. Can you please specify what exactly you mean?

@pragdave explains it much better than I would do :slight_smile:

TL;DR: in FP, almost everything is expressed as input → function → output, that is, transforming input into output.
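
As a small illustration of that idea (a sketch, not a full solution): each step below takes data in and hands the transformed data to the next step. `RowParser` is a hypothetical name, and the input format (keyword words followed by two integers) is assumed from the tables earlier in the thread.

```elixir
defmodule RowParser do
  # Turn a line like "machine learning 34 65" into a map.
  # Assumes the last two fields are the unique and total counts.
  def parse_row(line) do
    parts = String.split(line)
    {keyword_words, [unique, total]} = Enum.split(parts, -2)

    %{
      keyword: Enum.join(keyword_words, " "),
      unique_count: String.to_integer(unique),
      total_count: String.to_integer(total)
    }
  end
end

# input -> function -> output, chained with the pipe operator
ranked_keywords =
  "big data 29 54\nmachine learning 34 65"
  |> String.split("\n", trim: true)
  |> Enum.map(&RowParser.parse_row/1)
  |> Enum.sort_by(& &1.unique_count, :desc)
  |> Enum.map(& &1.keyword)

# ranked_keywords is ["machine learning", "big data"]
```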

I am still confused… I understand that I need to write a bunch of functions that transform my data. I am just clueless about which algorithms I should use in those functions…
So what I lack is an understanding of “what” to do, rather than “how”.
Should I use something like GloVe for the “measuring” and “ranking”? Or maybe it is easier to use some Python library (that I don’t know about yet)… Those are the kinds of questions that concern me right now.

I don’t know what you really want to achieve… but what about something like this?

sum(points) / number of occurrences in subtopic

I want to create a simple “measure” system that can rank the keywords in each Subtopic by their “importance”. My current thinking about “importance” is: “if a given keyword appears more frequently across different products inside a Subtopic, then that keyword is more important…” But I don’t know what to do about keywords that overlap across many Subtopics. For example, the keyword “big data” is probably more important inside “Artificial Intelligence” than inside “Internet of Things”. But I don’t know how to “measure” that mathematically.

I tried simply computing sum(Unique_Count, Total_Count * 0.5). I pulled the 0.5 coefficient out of my head, and there is probably a proper formula for this… That way I can get a ranking inside a single Subtopic, but I need a ranking across the whole Topic (all Subtopics). That is where I am struggling right now.
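
To make that concrete, here is a minimal sketch of the within-Subtopic score described above, read as Unique_Count + 0.5 * Total_Count. The module name `SubtopicRank` and the row shape are assumptions for illustration, and the 0.5 weight is the ad hoc coefficient from the post, kept as a parameter so it is easy to change.

```elixir
defmodule SubtopicRank do
  # score = unique_count + weight * total_count
  # The default weight of 0.5 is the ad hoc value from the post, not a derived one.
  def score(row, weight \\ 0.5) do
    row.unique_count + weight * row.total_count
  end

  # Return {keyword, score} pairs for one Subtopic, highest score first.
  def rank(rows, weight \\ 0.5) do
    rows
    |> Enum.map(fn row -> {row.keyword, score(row, weight)} end)
    |> Enum.sort_by(fn {_keyword, score} -> score end, :desc)
  end
end
```

With the first three “Artificial Intelligence” rows this gives 220.0 for “artificial intelligence”, 66.5 for “machine learning”, and 56.0 for “big data”.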

If I understand you right, I would go for something like Map.merge.
Here is a small example:

cat1 = %{a: 2, b: 3}
cat2 = %{b: 1, c: 1}
Map.merge(cat1, cat2, fn _key, v1, v2 -> v1 + v2 end)
# => %{a: 2, b: 4, c: 1}

And then just rank according to the total sums.
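
Extending that to more than two Subtopics could look roughly like this. It is only a sketch: `TopicMerge` is a hypothetical module name, and the input is assumed to be a map of Subtopic name to a `%{keyword => count}` map.

```elixir
defmodule TopicMerge do
  # subtopics: %{subtopic_name => %{keyword => count}}
  # Returns {keyword, summed_count} pairs, highest sum first
  # (ties broken alphabetically by keyword).
  def combine(subtopics) do
    subtopics
    |> Map.values()
    |> Enum.reduce(%{}, fn counts, acc ->
      # Same Map.merge/3 idea as above: add the counts when a keyword repeats.
      Map.merge(acc, counts, fn _keyword, a, b -> a + b end)
    end)
    |> Enum.sort_by(fn {keyword, count} -> {-count, keyword} end)
  end
end
```

Note that plain sums reward keywords that appear in many Subtopics, so this alone does not demote overlapping keywords such as “big data”.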

Here is a solution that I used (it is Python):

https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794
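
The linked article is in Python. For anyone who wants to stay in Elixir, below is a rough sketch of a TF-IDF-style score that treats each Subtopic as one “document”. This is not the article's code. The module name `TfIdf`, the input shape, and the exact variant of the formula (count divided by the Subtopic's total, times ln(N / df)) are assumptions chosen for illustration. It assumes Elixir 1.10 or later and that every Subtopic has at least one keyword with a positive count.

```elixir
defmodule TfIdf do
  # subtopics: %{subtopic_name => %{keyword => count}}
  # Returns %{subtopic_name => [{keyword, score}, ...]}, highest score first.
  def scores(subtopics) do
    n = map_size(subtopics)

    # Document frequency: in how many Subtopics does each keyword occur?
    df =
      subtopics
      |> Map.values()
      |> Enum.flat_map(&Map.keys/1)
      |> Enum.frequencies()

    Map.new(subtopics, fn {name, counts} ->
      total = counts |> Map.values() |> Enum.sum()

      ranked =
        counts
        |> Enum.map(fn {keyword, count} ->
          tf = count / total
          # A keyword present in every Subtopic gets idf = ln(1) = 0.
          idf = :math.log(n / Map.fetch!(df, keyword))
          {keyword, tf * idf}
        end)
        |> Enum.sort_by(fn {_keyword, score} -> score end, :desc)

      {name, ranked}
    end)
  end
end
```

With this variant a keyword that occurs in every Subtopic, like “big data” in the example data, scores 0 everywhere, which is the “rank it lower because it is not unique” behaviour asked about earlier. A smoothed idf would soften that if a hard zero is too aggressive.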
