Hi there!
I'm having a hard time figuring out what to do:
I need to categorize a product by its keywords.
I have several “Topics”, for example: “New Technologies”, “Health”.
Each of those “Topics” has several “Subtopics”, for example: “New Technologies => Internet of Things (IoT)”, “New Technologies => 3-D Printing”.
There are on average about 250 categorized products in each “Subtopic”.
I went ahead and grabbed all keywords from the already categorized products, and made a list for each “Subtopic” with keywords and their frequency, for example: %{"keyword": "stereolithography", "uniqueCount": 8, "totalCount": 17}
So, long story short: I have a bunch of keyword lists that I need to somehow “measure” and “rank”.
I guess my next step should be to build, for each “Topic”, a combined “dictionary” of all the “important” keywords from its “Subtopics”, but I have no idea how to do it. The keywords overlap among “Subtopics”, but I still need to “measure” them somehow…
Is there any “cool and easy” Elixir library that can be useful in my case?
Please share any tips!
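For reference, here is a minimal sketch of how the per-Subtopic frequency lists described above could be built. It is written in Python rather than Elixir purely for illustration. It assumes that each product is available as a plain list of keyword strings, that "uniqueCount" means the number of products containing the keyword, and that "totalCount" means the number of occurrences across all products. The function name `keyword_counts` and the input shape are hypothetical.

```python
from collections import Counter


def keyword_counts(products):
    """Build keyword frequency rows for one Subtopic.

    products: list of keyword lists, one list per product (assumed shape).
    """
    unique = Counter()  # number of products that contain the keyword
    total = Counter()   # number of occurrences across all products
    for keywords in products:
        total.update(keywords)        # counts every occurrence
        unique.update(set(keywords))  # counts each keyword once per product
    # Rows ordered by total count, highest first.
    return [
        {"keyword": kw, "uniqueCount": unique[kw], "totalCount": total[kw]}
        for kw, _ in total.most_common()
    ]


# Tiny made-up input, only to show the call shape.
rows = keyword_counts([["iot", "iot", "sensors"], ["iot", "privacy"]])
```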
Could you provide a real example of the data structure you have and what you expect the output to look like?
This will improve the quality of the answers you’ll get from the community.
Sure, I have a bunch of Excel files with this kind of structure:
For Topic: "New Technologies"
Subtopic: "Artificial Intelligence"
| Keywords | Unique Count | Total Count |
| --- | --- | --- |
| artificial intelligence | 125 | 190 |
| machine learning | 34 | 65 |
| big data | 29 | 54 |
| artificial intelligence ai | 18 | 23 |
| neural networks | 16 | 20 |
| social media | 16 | 21 |
| decision making | 15 | 32 |
Subtopic: "Internet of Things"
| Keywords | Unique Count | Total Count |
| --- | --- | --- |
| internet things iot | 22 | 38 |
| iot | 18 | 36 |
| internet things | 16 | 66 |
| big data | 12 | 12 |
| cloud computing | 12 | 19 |
| sensors | 9 | 10 |
| internet | 8 | 9 |
| privacy | 7 | 10 |
Subtopic: "Machine Learning"
| Keywords | Unique Count | Total Count |
| --- | --- | --- |
| machine learning | 111 | 164 |
| artificial intelligence | 41 | 86 |
| big data | 29 | 64 |
| data mining | 26 | 43 |
| supervised learning | 23 | 26 |
| neural networks | 22 | 26 |
| deep learning | 19 | 26 |
| support vector machines | 19 | 22 |
| internet things | 17 | 30 |
| machine learning algorithms | 17 | 19 |
And I have a list like this for every Subtopic inside the “New Technologies” Topic:
| Topic | Subtopic |
| --- | --- |
| New Technologies | 4th Industrial Revolution |
| | Sustainability (Energy) Core |
| | Sustainability (Energy) Extended |
| | 3-D Printing |
| | Artificial Intelligence (AI) |
| | Augmented & Virtual Reality (& wearables) |
| | Big Data & Data Analytics |
| | Cloud & Fog Computing |
| | Internet of Things (IoT) |
| | Machine Learning |
| | Wireless Technology |
As you can see, a lot of keywords overlap, for example: “artificial intelligence”, “big data”, “machine learning”.
Now I want to create a combined keyword list for the “New Technologies” Topic.
But I don’t know how to “measure” the value of a given keyword. I am really lacking in theory about this kind of stuff… For example: the keyword “big data” is in every single one of those Subtopics, so I need to rank it lower because it is not unique, but I don’t know the formula or algorithm to do it…
So, it would be really nice to use some library (maybe not an Elixir library) to achieve it. But I don’t know of any either…
And there is also a problem with the keywords themselves… Some of them mean the same thing, like “internet things iot”, “iot”, “internet things”. They are all basically the same thing, yet in my dataset they are different. But for now I want to treat them as unique keywords, for the sake of simplicity and an MVP.
I am still confused… I understand that I need to write a bunch of functions that would transform my data. I am just clueless about which algorithms I should use in those functions…
So, what I lack is an understanding of “what” to do, rather than “how”.
Should I use something like GloVe to “measure” and “rank” them? Or maybe it is easier to use some Python library (that I don’t know about yet)… Those are the kinds of questions that concern me right now.
I want to create a simple “measure” system that can rank the keywords in each Subtopic by their “importance”. My current thinking about “importance”: “if a given keyword appears more frequently across different products inside a Subtopic, then this keyword is more important…” But I don’t know what to do about keywords that overlap across many Subtopics. For example, the keyword “big data” is probably more important inside “Artificial Intelligence” than inside “Internet of Things”. But I don’t know how to mathematically “measure” it.
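One standard technique that matches this description is TF-IDF weighting: a keyword's frequency inside a Subtopic is multiplied by a factor that shrinks as the keyword shows up in more Subtopics. Below is a minimal sketch in Python (chosen only for illustration, not Elixir), assuming each Subtopic is treated as one "document" and the Unique Count is used as the term frequency. The smoothed factor `ln((1 + N) / (1 + df)) + 1` is the variant scikit-learn uses by default; it is one possible choice, not the only one. The data is an abbreviated copy of the tables in this thread, and the function name `tf_idf` is hypothetical.

```python
import math

# Unique counts copied from the tables in this thread (abbreviated).
subtopics = {
    "Artificial Intelligence": {
        "artificial intelligence": 125,
        "machine learning": 34,
        "big data": 29,
        "neural networks": 16,
    },
    "Internet of Things": {
        "internet things iot": 22,
        "iot": 18,
        "internet things": 16,
        "big data": 12,
    },
    "Machine Learning": {
        "machine learning": 111,
        "artificial intelligence": 41,
        "big data": 29,
        "neural networks": 22,
        "internet things": 17,
    },
}


def tf_idf(subtopics):
    """Score each keyword per Subtopic: unique count times smoothed IDF."""
    n = len(subtopics)
    # df[kw] = number of Subtopics whose list contains the keyword.
    df = {}
    for counts in subtopics.values():
        for kw in counts:
            df[kw] = df.get(kw, 0) + 1
    return {
        name: {
            kw: count * (math.log((1 + n) / (1 + df[kw])) + 1)
            for kw, count in counts.items()
        }
        for name, counts in subtopics.items()
    }


scores = tf_idf(subtopics)
for name, kw_scores in scores.items():
    ranked = sorted(kw_scores.items(), key=lambda item: item[1], reverse=True)
    print(name, [(kw, round(score, 1)) for kw, score in ranked])
```

With only these three Subtopics, “big data” appears in all of them and gets the smallest factor (exactly 1.0), while a keyword found in a single Subtopic such as “iot” gets ln(2) + 1, roughly 1.69. The frequency part still separates Subtopics: “big data” scores higher in “Artificial Intelligence” (unique count 29) than in “Internet of Things” (unique count 12). Dividing the unique count by the number of products in the Subtopic first would make Subtopics of different sizes comparable, if those sizes are known.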
I don’t know what you really want to achieve… but what about something like this?
sum(points) / number of occurrences in subtopic
I tried simply computing sum(Unique_Count, Total_Count * 0.5). I made up the 0.5 coefficient myself, and there is probably some proper formula for this… That way I can get a ranking inside a single Subtopic, but I need a ranking inside the Topic (all Subtopics). That is where I am struggling right now.
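A minimal sketch of one way to extend that score from a single Subtopic to the whole Topic, in Python for illustration: sum each keyword's counts over all Subtopics, then apply the same formula. It assumes that adding unique counts across Subtopics is acceptable, although a product filed under two Subtopics would be counted twice. The 0.5 weight is the ad hoc coefficient from the post above, the data is an abbreviated copy of the tables in this thread, and the names `merge_subtopics` and `score` are hypothetical.

```python
from collections import defaultdict

# (unique_count, total_count) pairs copied from the tables in this thread (abbreviated).
subtopics = {
    "Artificial Intelligence": {
        "artificial intelligence": (125, 190),
        "machine learning": (34, 65),
        "big data": (29, 54),
    },
    "Internet of Things": {
        "iot": (18, 36),
        "internet things": (16, 66),
        "big data": (12, 12),
    },
    "Machine Learning": {
        "machine learning": (111, 164),
        "artificial intelligence": (41, 86),
        "big data": (29, 64),
        "internet things": (17, 30),
    },
}


def merge_subtopics(subtopics):
    """Sum unique and total counts per keyword over all Subtopics."""
    merged = defaultdict(lambda: [0, 0])
    for rows in subtopics.values():
        for kw, (unique, total) in rows.items():
            merged[kw][0] += unique
            merged[kw][1] += total
    return {kw: tuple(counts) for kw, counts in merged.items()}


def score(unique, total, weight=0.5):
    """The formula from the post above; weight=0.5 is an ad hoc choice."""
    return unique + weight * total


topic = merge_subtopics(subtopics)
ranked = sorted(topic, key=lambda kw: score(*topic[kw]), reverse=True)
for kw in ranked:
    print(kw, topic[kw], score(*topic[kw]))
```

A plain merge like this does not penalize overlap: a keyword present in every Subtopic accumulates counts from all of them. If the Topic-level list is meant to tell “New Technologies” apart from other Topics such as “Health”, overlap between Subtopics of the same Topic may be harmless, and the down-weighting would instead apply to keywords that also appear in other Topics, for example through an inverse-document-frequency factor computed across Topics.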