ML Classification with string input, and turning strings to Tensors

Hey all, I’m trying to train a classification model (trying EXGBoost and KNN) to predict a packaging type based on order content.

My data is mostly numerical (total volume, item count, total weight), but one important piece is textual: a “semi-structured” string that represents item shapes and their counts in the order, as a list of count-code pairs, where each code stands for a shape such as ‘200g can’. For example:

  • 2-C16
  • 3-C12, 4-A24
  • 3-C12, 4-A24, 2-C16
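
For concreteness, here’s one way such a string could be parsed (the module and function names are just illustrative):

```elixir
# Minimal sketch (plain Elixir, no ML libraries): parse a string
# like "3-C12, 4-A24" into a map of item code => count.
defmodule OrderContent do
  # parse("3-C12, 4-A24") #=> %{"C12" => 3, "A24" => 4}
  def parse(content) do
    content
    |> String.split(",", trim: true)
    |> Enum.map(&String.trim/1)
    |> Map.new(fn pair ->
      [count, code] = String.split(pair, "-", parts: 2)
      {code, String.to_integer(count)}
    end)
  end
end
```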

From some research, I found that one approach to using this data alongside my numerical data is to turn those strings into a vector and then normalize it.

Is there something similar to Word2Vec? The plan is to turn that into a Tensor and then normalize it.

I’m also considering turning it to binary, and turning that binary into a Tensor.

I’m not really a data scientist, so any help / direction is very much appreciated :slight_smile:

Thanks!

Unfortunately, I can’t help with that.

But I’d suggest you also ask for help in the “machine-learning” channel in EEF’s Slack. The Elixir machine-learning community is quite active there.

Thanks again Hugo! And thanks for the EEF Slack, a gem :slight_smile:

If the structured text is made of a limited alphabet that can be turned into categorical data, that’s probably the most effective thing to do. What I mean is that something like “200g can”, “1500ml bottle”, etc. can be turned into 3 attributes like “size: 200, unit: g, type: can”, or “size: 1500, unit: ml, type: bottle”, where “size” is numerical, while “unit” and “type” are categorical. Many classification algorithms like decision trees and random forests can use a mix of numerical and categorical data.
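
For example, a parsing sketch along these lines (the regex and field names are purely illustrative):

```elixir
# Sketch: split a label like "200g can" or "1500ml bottle" into
# size (numerical) plus unit and type (categorical).
defmodule ItemLabel do
  @pattern ~r/^(?<size>\d+)(?<unit>[[:alpha:]]+)\s+(?<type>.+)$/

  # to_attributes("200g can") #=> %{size: 200, unit: "g", type: "can"}
  def to_attributes(label) do
    %{"size" => size, "unit" => unit, "type" => type} =
      Regex.named_captures(@pattern, label)

    %{size: String.to_integer(size), unit: unit, type: type}
  end
end
```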

If it’s really unstructured text, things will be more difficult. Something like Word2Vec is in principle quite easy to train, but you would need a lot of data in order to train effective embeddings.

In sum, my recommendation would be to parse the structure and treat it as numerical and categorical data if at all possible, as that makes things a lot easier, and enables you to use simpler algorithms that are likely to perform better even with limited data. Only treat it as text if it’s really unstructured text.

Ah I now realize that what you have is an order, represented as a set of items each with its quantity. The challenge is to turn that multiset into a tensor. I assume that the order of the items is irrelevant: if so, sequence models designed to process text (where the order of tokens matters) are probably not the best approach.

If the total number of possible items is known in advance and not too large, you can turn each order into a vector where each element corresponds to a possible item, and its numerical value is the quantity of that item in the order. If, say, you have 1000 different articles, each order would become a vector of 1000 elements. Most values would be 0, meaning that the item is not in the order, while the others would be set to the quantity of the corresponding item.

As a simplified example, say that there are only 3 possible items: apples, oranges, and lemons. You would then need a 3-dimensional vector to represent orders, one dimension per possible article. An order that contains 3 apples and nothing else would be represented by the vector [3, 0, 0], one that contains 1 orange and 5 lemons by the vector [0, 1, 5], and one that contains 2 apples and 1 lemon by the vector [2, 0, 1].
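
In code, with the item vocabulary fixed up front, that could look roughly like this (plain Elixir, names illustrative):

```elixir
# Sketch: turn an order (item => quantity map) into a fixed-length
# count vector, given the known list of possible items.
items = ["apple", "orange", "lemon"]

to_vector = fn order ->
  Enum.map(items, fn item -> Map.get(order, item, 0) end)
end

to_vector.(%{"apple" => 3})                 # => [3, 0, 0]
to_vector.(%{"orange" => 1, "lemon" => 5})  # => [0, 1, 5]
to_vector.(%{"apple" => 2, "lemon" => 1})   # => [2, 0, 1]
```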

I had quite good results in a similar case, where I was training a model on orders with a total of 20k possible different articles. In my case, to reduce the dimensionality of the input and to capture latent factors, I trained an autoencoder on those 20k-dimensional sparse vectors, and used the encoder part to produce smaller but dense vectors to feed to my model. If that path seems promising for your case, I can provide more details. Note that you would still need a suitable amount of data for the autoencoder to learn meaningful representations.
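
The general shape of such an autoencoder in Axon would be roughly as follows; the dimensions are placeholders, not the actual ones from my project, and the training loop is omitted:

```elixir
# Rough sketch of an autoencoder for dimensionality reduction.
input_dim = 20_000
latent_dim = 64

# Encoder: compress the sparse order vector into a small dense one.
encoder =
  Axon.input("order", shape: {nil, input_dim})
  |> Axon.dense(512, activation: :relu)
  |> Axon.dense(latent_dim, activation: :relu)

# Decoder: reconstruct the original vector (linear output, trained
# with e.g. mean squared error). After training, only the encoder is
# kept to produce the dense vectors fed to the downstream model.
autoencoder =
  encoder
  |> Axon.dense(512, activation: :relu)
  |> Axon.dense(input_dim)
```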

Hey Luca, thanks for getting back to me, this is really insightful.

The true article count is around 1.5k, but we previously labeled them down to around 65 different types, irrespective of contents, so that’s what I’m currently working with. E.g.:

A1 → 200g can
A3 → 1kg bag

So if I understand correctly, a 65-element vector would not need to be encoded, and for each order I would build a 65-element tensor where each article label is mapped to an index and the value at that index is its count. (I’m reading your edits as I type this :P)

That makes total sense to me; my next step would be learning how to feed a (2D?) tensor as an input into my models. I’m using the EXGBoost library and Scholar’s KNN.

Thanks a lot!

Exactly :slight_smile: a 65-dimensional vector seems small enough that it should not need further encoding. It’s basically adding 65 numerical attributes to your dataset, which sounds ok.

I would not even create 2D tensors, but really just append the 65 numerical attributes created from the items in the order to the ones you already had, obtaining a vector of 65 + N elements, where N is the number of numerical attributes you already had.

Following from the simple example above with apples, oranges, and lemons, let’s say you also have the total volume and item counts as the first two numerical attributes. An order with an item count of 2, and a total volume of 200, containing 3 apples and 5 lemons would then be represented by the vector [2, 200, 3, 0, 5] (or, represented as a map of attributes, %{ item_count: 2, total_volume: 200, apples: 3, oranges: 0, lemons: 5 }).
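
For instance, something along these lines (plain Elixir plus Nx; the field names are just illustrative):

```elixir
# Sketch: append the per-item counts to the existing numerical
# attributes and build one row per order.
items = ["apple", "orange", "lemon"]

row = fn order ->
  counts = Enum.map(items, &Map.get(order.item_counts, &1, 0))
  [order.item_count, order.total_volume | counts]
end

orders = [
  %{item_count: 2, total_volume: 200, item_counts: %{"apple" => 3, "lemon" => 5}}
]

# One row per order, e.g. [2, 200, 3, 0, 5] for the order above.
x = Nx.tensor(Enum.map(orders, row))
```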

I am not an expert on XGBoost, but I think it’s an ensemble of decision trees like random forests, and it should therefore be able to ingest such a dataset just fine. Each of those 65 additional attributes (representing the quantity of each specific “normalized item” in the order) would be treated just like the other numerical attributes such as total volume or item count.
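
As a hedged sketch (again, I’m no EXGBoost expert, so check its docs for the training options), assuming `x` is the feature tensor built above, `y` a tensor of integer class labels (one packaging type per order), and `x_test` a held-out feature tensor:

```elixir
# Train with default options; boosting rounds, objective, etc.
# are left to the EXGBoost documentation.
model = EXGBoost.train(x, y)

# Predict packaging types for unseen orders.
predictions = EXGBoost.predict(model, x_test)
```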

Same goes for KNN, although I assume that in this case you might benefit from normalizing the attributes before feeding them into the model.
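
For example, a min-max scaling of each column of the feature tensor `x` to [0, 1] with plain Nx (Scholar also ships preprocessing helpers worth checking) could look like:

```elixir
# Column-wise min-max scaling; assumes no constant columns,
# otherwise max - min would be 0 for those columns.
min = Nx.reduce_min(x, axes: [0])
max = Nx.reduce_max(x, axes: [0])

x_scaled = Nx.divide(Nx.subtract(x, min), Nx.subtract(max, min))
```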

Got it!

Correct, XGBoost is a decision-tree-based algorithm like random forest. I normalize my numerical attributes for both models, but I’ll keep the counts for the item attributes as they are.

Would you mind sharing which algorithm you used for your problem? Maybe I’ll look into that after this.

And again thanks a lot!

In my case I was not training a classifier, but rather a sort of recommender system that provides recommendations given a partial shopping basket (in a specific B2B domain where baskets typically contain many hundreds of items). The best performing model was a neural network built as a mix of a variational autoencoder plus some additional embeddings to represent the specific tendencies of different geographical areas. The real trick was the specific way we trained the autoencoder, but I don’t think I can share that detail here, and it would not be relevant for your case anyway :slight_smile:

For classification, not knowing much about the specifics of your case, an ensemble of decision trees sounds like a great option to me: it’s simple enough to work without massive data, and involves very few hyperparameters. Neural networks like the one I used in my case often require a lot of experimentation before one finds suitable hyperparameters, therefore I would definitely not recommend them as the first choice.

I see, thanks for sharing anyway!

If we move on to needing a neural net, I’ll reach out again :slight_smile:

It is likely that a random forest would outperform a neural network for this task, especially if you don’t have a large amount of training data. I think your choice of model is already the best bet: you will need to invest much less effort and will probably get equal if not better results. Moreover, EXGBoost can plot the trained trees and give you insight into its rationale for the classification, something that a neural network would hide.

That said, if you are curious, the neural net architecture that I would use in your case is rather simple, just a standard multi-class classifier. Start with an input layer with a dimension corresponding to the total number of attributes and progressively shrink the dimension in each layer until you have an output layer with the same dimension as the total number of classes. Use a suitable loss function like categorical cross entropy to train your model. Use a softmax activation for your output layer, so you can interpret the output as the probability of each class.

The issue with even such a simple neural network is that there is a huge number of choices to make, such as the number of layers, the dimension of each layer, the activation functions, the learning rate and number of epochs, the specific optimizer, whether to use batch normalization, etc. Starting simple, with a very small number of layers (even just a single hidden layer) and no fancy tricks, is probably the best way to quickly evaluate whether it even makes sense to follow that path.
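
As a rough Axon sketch of that kind of classifier (the sizes and hyperparameters are placeholders to experiment with, and `train_data` is assumed to be a stream of {features, one-hot labels} batches):

```elixir
num_features = 68  # e.g. your numerical attributes + the 65 item counts
num_classes = 10   # number of packaging types (placeholder)

# Single hidden layer to start, softmax output so the result can be
# read as per-class probabilities.
model =
  Axon.input("features", shape: {nil, num_features})
  |> Axon.dense(32, activation: :relu)
  |> Axon.dense(num_classes, activation: :softmax)

# Train with categorical cross entropy and the Adam optimizer.
trained_state =
  model
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam)
  |> Axon.Loop.run(train_data, %{}, epochs: 50)
```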

Hey,

I’m the author of the EXGBoost library.

Just wanted to chime in here and see if you needed any help still on this.

Let me know!

Hey Acalejos, awesome of you to reach out.

It would be great if you could give me some thoughts on the following (I’m assuming you’ve followed from the top :slight_smile:)

  • Given the numerical attributes like weight, volume, etc., and the item counts, I’m normalizing all those values together. Would you say it’s better to only normalize the numerical data and keep the counts as they are?

  • I’ve also been advised to try out a random forest or SVM, given that I only have 1800 rows. Would you know where I could find implementations of those algorithms? I’ve seen a library called Evision, but it seems a little outdated.

Thanks!

  1. The biggest thing I would worry about with any approach where you append a large-dimension vector onto existing columns of data is having too high a dimensionality and drowning out the other features (I saw it mentioned that it’d only be length 65, which should be OK, just something to be aware of). You could try both normalizing the counts and not; it shouldn’t be too much trouble to experiment with that. Considering the counts are already ordinal, and depending on how you normalize, it might not make any difference.

  2. I’m not aware of any random forest libraries, but there is an SVM implementation in Scholar; it just doesn’t appear to be in an official release yet. You can see it here: scholar/lib/scholar/linear/svm.ex at main · elixir-nx/scholar · GitHub

I’m pretty sure Evision is a vision library, so I’m not sure how that would help you. It also seems perfectly up to date to me, although I haven’t used it.