How can I analyse data (keywords) in Elixir?

MihailPertsev · November 18, 2020, 9:43pm

Hi there!
I am having a tough time understanding what to do:

I need to categorize a product by its keywords.
I have several “Topics”, for example: “New Technologies”, “Health”.
Each of those “Topics” has several “Subtopics”, for example: “New Technologies => Internet of Things (IoT)”, “New Technologies => 3-D Printing”

There are on average about 250 categorized products in each “Subtopic”.
I went ahead and grabbed all the keywords from the already categorized products, and made a list of keywords and their frequencies for each “Subtopic”, for example: %{"keyword": "stereolithography", "uniqueCount": 8, "totalCount": 17}

So, long story short: I have a bunch of keyword lists that I need to somehow “measure” and “rank”.
I guess my next step should be to create, for each “Topic”, a combined “dictionary” of all the “important” keywords from its “Subtopics”… I have no idea how to do it. The keywords overlap among “Subtopics”, but I still need to “measure” them somehow…
Is there any “cool and easy” Elixir library that can be useful in my case?
Please share any tips!

thiagomajesk · November 19, 2020, 12:57am

Could you provide a real example of the data structure you have and how you expect the output to be?
This will improve the quality of the answers you’ll get from the community.

MihailPertsev · November 19, 2020, 10:48am

Sure, I have a bunch of Excel files with this kind of structure:
For Topic: "New Technologies"

Subtopic: "Artificial Intelligence"

Keywords	Unique Count	Total Count
artificial intelligence	125	190
machine learning	34	65
big data	29	54
artificial intelligence ai	18	23
neural networks	16	20
social media	16	21
decision making	15	32

Subtopic: "Internet of Things"

Keywords	Unique Count	Total Count
internet things iot	22	38
iot	18	36
internet things	16	66
big data	12	12
cloud computing	12	19
sensors	9	10
internet	8	9
privacy	7	10

Subtopic: "Machine Learning"

Keywords	Unique Count	Total Count
machine learning	111	164
artificial intelligence	41	86
big data	29	64
data mining	26	43
supervised learning	23	26
neural networks	22	26
deep learning	19	26
support vector machines	19	22
internet things	17	30
machine learning algorithms	17	19

And it is like this for every Subtopic inside the “New Technologies” Topic:

Topic	Subtopic
New Technologies	4th Industrial Revolution
	Sustainability (Energy) Core
	Sustainability (Energy) Extended
	3-D Printing
	Artificial Intelligence (AI)
	Augmented & Virtual Reality (& wearables)
	Big Data & Data Analytics
	Cloud & Fog Computing
	Internet of Things (IoT)
	Machine Learning
	Wireless Technology
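
A minimal sketch of how one of the keyword tables above could be loaded into Elixir data, assuming each sheet is exported as tab-separated text with a header row. The module name `KeywordTable` and the map keys `:keyword`, `:unique_count` and `:total_count` are made up for illustration and are not part of the original files.

```elixir
defmodule KeywordTable do
  # Parses "keyword<TAB>unique count<TAB>total count" rows into maps.
  # Assumes the first line is a header row and every other line has three columns.
  def parse(text) do
    text
    |> String.split("\n", trim: true)
    |> Enum.drop(1)
    |> Enum.map(fn line ->
      [keyword, unique, total] = line |> String.trim() |> String.split("\t")

      %{
        keyword: keyword,
        unique_count: String.to_integer(unique),
        total_count: String.to_integer(total)
      }
    end)
  end
end

# Two rows copied from the "Artificial Intelligence" table
text =
  "Keywords\tUnique Count\tTotal Count\n" <>
    "artificial intelligence\t125\t190\n" <>
    "machine learning\t34\t65\n"

rows = KeywordTable.parse(text)
IO.inspect(rows)
```

Keeping each row as a small map makes the later ranking steps plain `Enum` transformations.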

As you can see a lot of keywords are overlapping. For example: “artificial intelligence”, “big data”, “machine learning”.

Now I want to create a combined keywords list for the “New Technologies” topic.
But I don’t know how to “measure” the value of a given keyword. I am really lacking in theory about this kind of stuff… For example: the keyword “big data” is in every single one of those Subtopics, so I need to rank it lower because it is not unique, but I don’t know the formula or algorithm to do it…

So, it would be really nice to use some library (maybe not an Elixir library) to achieve it. But I don’t know of any either…

MihailPertsev · November 19, 2020, 10:55am

And there is also a problem with the keywords themselves… Some of them mean the same thing, like “internet things iot”, “iot”, “internet things”. All of them are basically the same thing, yet in my dataset they are different. But for now I want to treat them as unique keywords, for the sake of simplicity and an MVP.

MihailPertsev · November 21, 2020, 12:39am

Any tips?

kokolegorille · November 21, 2020, 2:52am

First step would be to normalize your data…

After this, you can certainly transform your data; Elixir is good at that.

MihailPertsev · November 21, 2020, 6:51pm

Hi!
I can’t understand what you mean by “transform your data”. Can you please specify what exactly you mean?

kokolegorille · November 21, 2020, 6:58pm

@pragdave explains it much better than I would do

TLDR: in FP, almost everything is expressed as input → function → output, which is transforming input into output.

MihailPertsev · November 21, 2020, 8:24pm

I am still confused… I understand that I need to write a bunch of functions that would transform my data. I am just clueless about which algorithms I should use in those functions…
So, what I lack is an understanding of “what” to do, rather than “how”.
Should I use something like GloVe for the “measuring” and “ranking”? Or maybe it is easier to use some Python library (that I don’t know about yet)… Those are the kinds of questions that concern me right now.

kokolegorille · November 21, 2020, 8:42pm

I don’t know what you really want to achieve… but what about something like this?

sum(points) / number of occurrences in subtopic

MihailPertsev · November 21, 2020, 9:11pm

I want to create a simple “measure” system that can rank the keywords in each Subtopic by their “importance”. My current thinking about “importance” is: “if a given keyword appears more frequently across different products inside a Subtopic, then that keyword is more important…” But I don’t know what to do about keywords that overlap across many Subtopics. For example, the keyword “big data” is probably more important inside “Artificial Intelligence” than inside “Internet of Things”. But I don’t know how to “measure” that mathematically.
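
One common way to express this kind of “importance” is a TF-IDF-style weight: the keyword's share of the counts inside a Subtopic, multiplied by a factor that shrinks as the keyword shows up in more Subtopics. The sketch below is only an illustration under assumptions. The module name `KeywordRank` is hypothetical, the unique counts are copied from the first few rows of the three tables above, and adding `1.0` to the logarithm is one smoothing choice among several (without it, a keyword present in every Subtopic would score exactly zero).

```elixir
defmodule KeywordRank do
  # subtopics: %{subtopic_name => %{keyword => unique_count}}
  # Returns %{subtopic_name => [{keyword, score}, ...]}, highest score first.
  def tf_idf(subtopics) do
    n = map_size(subtopics)

    # In how many subtopics does each keyword appear?
    doc_freq =
      subtopics
      |> Map.values()
      |> Enum.flat_map(&Map.keys/1)
      |> Enum.frequencies()

    Map.new(subtopics, fn {name, counts} ->
      total = counts |> Map.values() |> Enum.sum()

      ranked =
        counts
        |> Enum.map(fn {keyword, count} ->
          # Share of this keyword among the counts listed for the subtopic
          tf = count / total
          # Assumed smoothing: ln(N / df) + 1, so widespread keywords keep a small weight
          idf = :math.log(n / Map.fetch!(doc_freq, keyword)) + 1.0
          {keyword, tf * idf}
        end)
        |> Enum.sort_by(fn {_keyword, score} -> score end, :desc)

      {name, ranked}
    end)
  end
end

subtopics = %{
  "Artificial Intelligence" => %{
    "artificial intelligence" => 125,
    "machine learning" => 34,
    "big data" => 29,
    "neural networks" => 16
  },
  "Internet of Things" => %{
    "iot" => 18,
    "internet things" => 16,
    "big data" => 12,
    "sensors" => 9
  },
  "Machine Learning" => %{
    "machine learning" => 111,
    "artificial intelligence" => 41,
    "big data" => 29,
    "internet things" => 17
  }
}

ranked = KeywordRank.tf_idf(subtopics)
IO.inspect(ranked["Artificial Intelligence"])
```

With these truncated numbers, “neural networks” (unique count 16) ends up above “big data” (unique count 29) inside “Artificial Intelligence”, because “big data” appears in all three Subtopics.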

MihailPertsev · November 21, 2020, 9:22pm

I don’t know what you really want to achieve… but what about something like this?

sum(points) / number of occurrences in subtopic

I tried simply computing Unique_Count + Total_Count * 0.5. I made up the 0.5 coefficient, and there is probably some kind of proper formula for this… That way I can get a ranking inside a single Subtopic, but I need a ranking inside the Topic (across all Subtopics). That is where I am struggling right now.
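
For reference, the single-Subtopic scoring described here could look roughly like the following sketch. The module name `SubtopicScore` and the map keys are hypothetical, the rows are taken from the “Artificial Intelligence” table, and the default weight of 0.5 is the ad-hoc coefficient from this post, not a derived value.

```elixir
defmodule SubtopicScore do
  # score = unique_count + weight * total_count
  # Returns [{keyword, score}, ...] sorted from highest to lowest score.
  def rank(rows, weight \\ 0.5) do
    rows
    |> Enum.map(fn %{keyword: keyword, unique_count: unique, total_count: total} ->
      {keyword, unique + weight * total}
    end)
    |> Enum.sort_by(fn {_keyword, score} -> score end, :desc)
  end
end

rows = [
  %{keyword: "machine learning", unique_count: 34, total_count: 65},
  %{keyword: "artificial intelligence", unique_count: 125, total_count: 190},
  %{keyword: "big data", unique_count: 29, total_count: 54}
]

scores = SubtopicScore.rank(rows)
IO.inspect(scores)
# => [{"artificial intelligence", 220.0}, {"machine learning", 66.5}, {"big data", 56.0}]
```

Passing the weight as an argument makes it easy to compare how different coefficients change the order.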

otuv · November 21, 2020, 10:57pm

If I understand you right, I would go for something like Map.merge.
Like this small example:

cat1 = %{a: 2, b: 3}
cat2 = %{b: 1, c: 1}
# Sum the values of keys that appear in both maps
Map.merge(cat1, cat2, fn _key, v1, v2 -> v1 + v2 end)
# => %{a: 2, b: 4, c: 1}

And then just rank according to the total sums.
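
Extended to more than two Subtopics, the same idea could be sketched as a reduce over a list of maps followed by a sort. The variable names are made up, and the unique counts are copied from the first rows of the tables earlier in the thread.

```elixir
subtopic_counts = [
  %{"artificial intelligence" => 125, "machine learning" => 34, "big data" => 29},
  %{"iot" => 18, "internet things" => 16, "big data" => 12},
  %{"machine learning" => 111, "artificial intelligence" => 41, "big data" => 29}
]

# Fold every subtopic map into one topic-level map, summing counts on key clashes
topic_totals =
  Enum.reduce(subtopic_counts, %{}, fn counts, acc ->
    Map.merge(acc, counts, fn _keyword, v1, v2 -> v1 + v2 end)
  end)

# Rank keywords by their summed count, highest first
ranked = Enum.sort_by(topic_totals, fn {_keyword, total} -> total end, :desc)
IO.inspect(ranked)
```

Note that plain sums push a keyword that appears in every Subtopic, such as “big data”, up the list rather than down, so this gives an overall frequency ranking and not a uniqueness ranking.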

MihailPertsev · January 16, 2021, 3:43pm

Here is a solution that I used (it is Python):

https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794
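
For readers who want to stay in Elixir, the final categorization step can be sketched without any library once each Subtopic has keyword weights (for example TF-IDF-style ones). This is not the code from the linked article. The module name `ProductClassifier` is hypothetical and the weights below are invented purely for illustration.

```elixir
defmodule ProductClassifier do
  # weights: %{subtopic => %{keyword => weight}}
  # Sums the weights of the product's keywords per subtopic and returns
  # the {subtopic, score} pair with the highest score.
  def classify(product_keywords, weights) do
    weights
    |> Enum.map(fn {subtopic, keyword_weights} ->
      score =
        product_keywords
        |> Enum.map(&Map.get(keyword_weights, &1, 0.0))
        |> Enum.sum()

      {subtopic, score}
    end)
    |> Enum.max_by(fn {_subtopic, score} -> score end)
  end
end

# Invented weights, for illustration only
weights = %{
  "Internet of Things" => %{"iot" => 0.9, "sensors" => 0.6, "big data" => 0.1},
  "Machine Learning" => %{"machine learning" => 0.9, "deep learning" => 0.5, "big data" => 0.1}
}

{best_subtopic, _score} = ProductClassifier.classify(["sensors", "big data", "iot"], weights)
IO.inspect(best_subtopic)
```

Keywords that are unknown to a Subtopic simply contribute nothing to its score, which keeps the sketch small; a real classifier would probably also need to handle ties and products with no known keywords.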