Help me convert this TabTransformer model to Nx

Hey everyone, I have some Python code that creates and trains a TabTransformer model; I then extract its embedding layer so I can use it to generate embedding vectors for tabular data that I have.

Here is the code:

import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define the TabTransformer model
class TabTransformer(nn.Module):
    def __init__(self, num_features, num_classes, dim_embedding=64, num_heads=4, num_layers=4):
        super(TabTransformer, self).__init__()
        self.embedding = nn.Linear(num_features, dim_embedding)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim_embedding, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim_embedding, num_classes)
    def forward(self, x):
        x = self.embedding(x)
        x = x.unsqueeze(1)  # Adding a sequence length dimension
        x = self.transformer(x)
        x = torch.mean(x, dim=1) # Pooling
        x = self.classifier(x)
        return x

data = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "product_category": [1, 2, 1, 3, 2],
    "ammount": [100, 200, 150, 300, 250]
})

X = data.drop("customer_id", axis = 1)
y = data["customer_id"] - 1

# Splitting the dataset into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.0, random_state=42)
X_train = X
X_test = X
y_train = y
y_test = y

# Standardizing the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model parameters
num_features = X_train_scaled.shape[1]
num_classes = 5  # Adjusted based on unique customer ids

# Initialize the model, loss, and optimizer
model = TabTransformer(num_features, num_classes).to(torch.device('cpu'))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Converting data to tensors
X_train_tensor = torch.FloatTensor(X_train_scaled)
y_train_tensor = torch.LongTensor(y_train.values)

# Training loop
for epoch in range(100):
    optimizer.zero_grad()
    output = model(X_train_tensor)
    loss = criterion(output, y_train_tensor)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

# Evaluation
model.eval()
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.LongTensor(y_test.values)

with torch.no_grad():
    predictions = model(X_test_tensor)
    _, predicted_classes = torch.max(predictions, 1)
    accuracy = (predicted_classes == y_test_tensor).float().mean()
    print(f'Test Accuracy: {accuracy.item()}')

# Get embedding vectors for the test data using the learned embedding layer
embedding = model.embedding

with torch.no_grad():
    embedding_vectors = embedding(X_test_tensor)

This works great, but now I want to convert this code to Elixir so I can serve embeddings from my backend without needing to connect to a separate Python service.

I’m very new to the ML world, so I’m honestly not sure how to start with this, especially the part that actually creates the model itself.

Update

So, I started trying to convert it. So far I was able to generate the inputs and the scaler, but I’m totally stuck on the model creation: I have no idea which Axon functions are equivalent to the ones used in the TabTransformer class.

Mix.install([
  {:kino_explorer, "~> 0.1.20"},
  {:axon, "~> 0.7"},
  {:scholar, github: "elixir-nx/scholar"},
  {:table_rex, "~> 4.0", override: true}
])

alias Explorer.{Series, DataFrame}
alias Scholar.Preprocessing.StandardScaler

require DataFrame
require Series

data = DataFrame.new(
  customer_id: [1, 2, 3, 4, 5],
  product_category: [1, 2, 1, 3, 2],
  amount: [100, 200, 150, 300, 250]
)

x = DataFrame.discard(data, :customer_id)
y = 
  data
  |> DataFrame.select(:customer_id)
  |> DataFrame.mutate(customer_id: customer_id - 1)

x_train = Nx.stack(x, axis: 1)
y_train = Nx.stack(y, axis: 1)
x_test = Nx.stack(x, axis: 1)
y_test = Nx.stack(y, axis: 1)

scaler = StandardScaler.fit(x_train, axes: [0])

x_train_scaled = StandardScaler.transform(scaler, x_train)
x_test_scaled = StandardScaler.transform(scaler, x_test)

{_, num_features} = Nx.shape(x_train_scaled)

# Adjusted based on unique customer ids
num_classes = 5
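
Just to show where I’m stuck, here is my rough (untested) guess at how the non-transformer layers might map, with Axon.dense standing in for nn.Linear; I have no idea what the equivalent of nn.TransformerEncoder would be:

# Untested guess: nn.Linear -> Axon.dense; the nn.TransformerEncoder part is missing.
dim_embedding = 64

model_guess =
  Axon.input("features", shape: {nil, num_features})
  # nn.Linear(num_features, dim_embedding)
  |> Axon.dense(dim_embedding, name: "embedding")
  # ??? whatever the nn.TransformerEncoder equivalent is would go here ???
  # nn.Linear(dim_embedding, num_classes)
  |> Axon.dense(num_classes, name: "classifier")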

So, I found this notebook in Google Colab that has a Keras implementation of TabTransformer that is a little more low level and, IMO, easier to understand and convert to Axon.

I tried my hand at it and I think I was able to convert the first model from the notebook, the baseline one, which doesn’t contain the transformer part.

Here is my implementation of it:

Mix.install([
  {:kino_explorer, "~> 0.1.20"},
  {:axon, "~> 0.7"},
  {:scholar, github: "elixir-nx/scholar"},
  {:table_rex, "~> 4.0", override: true}
])

alias Explorer.{Series, DataFrame}
alias Scholar.Preprocessing.StandardScaler

require DataFrame
require Series

categorical_features_vocabulary = %{
  work_class: [
    " ?",
    " Federal-gov",
    " Local-gov",
    " Never-worked",
    " Private",
    " Self-emp-inc",
    " Self-emp-not-inc",
    " State-gov",
    " Without-pay"
  ],
  education: [
    " 10th",
    " 11th",
    " 12th",
    " 1st-4th",
    " 5th-6th",
    " 7th-8th",
    " 9th",
    " Assoc-acdm",
    " Assoc-voc",
    " Bachelors",
    " Doctorate",
    " HS-grad",
    " Masters",
    " Preschool",
    " Prof-school",
    " Some-college"
  ],
  marital_status: [
    " Divorced",
    " Married-AF-spouse",
    " Married-civ-spouse",
    " Married-spouse-absent",
    " Never-married",
    " Separated",
    " Widowed"
  ],
  occupation: [
    " ?",
    " Adm-clerical",
    " Armed-Forces",
    " Craft-repair",
    " Exec-managerial",
    " Farming-fishing",
    " Handlers-cleaners",
    " Machine-op-inspct",
    " Other-service",
    " Priv-house-serv",
    " Prof-specialty",
    " Protective-serv",
    " Sales",
    " Tech-support",
    " Transport-moving"
  ],
  relationship: [
    " Husband",
    " Not-in-family",
    " Other-relative",
    " Own-child",
    " Unmarried",
    " Wife"
  ],
  race: [
    " Amer-Indian-Eskimo",
    " Asian-Pac-Islander",
    " Black",
    " Other",
    " White"
  ],
  gender: [" Female", " Male"],
  native_country: [
    " ?",
    " Cambodia",
    " Canada",
    " China",
    " Columbia",
    " Cuba",
    " Dominican-Republic",
    " Ecuador",
    " El-Salvador",
    " England",
    " France",
    " Germany",
    " Greece",
    " Guatemala",
    " Haiti",
    " Holand-Netherlands",
    " Honduras",
    " Hong",
    " Hungary",
    " India",
    " Iran",
    " Ireland",
    " Italy",
    " Jamaica",
    " Japan",
    " Laos",
    " Mexico",
    " Nicaragua",
    " Outlying-US(Guam-USVI-etc)",
    " Peru",
    " Philippines",
    " Poland",
    " Portugal",
    " Puerto-Rico",
    " Scotland",
    " South",
    " Taiwan",
    " Thailand",
    " Trinadad&Tobago",
    " United-States",
    " Vietnam",
    " Yugoslavia"
  ]
}

categorical_feature_names = Map.keys(categorical_features_vocabulary)

# Embedding dimensions of the categorical features
embedding_dimensions = 16

# Number of MLP blocks in the baseline model
mlp_blocks = 2

mlp_hidden_units_factors = [2, 1]

dropout_rate = 0.2

categorical_inputs_names = [
  :work_class,
  :education,
  :marital_status,
  :occupation,
  :relationship,
  :race,
  :gender,
  :native_country
]

numeric_inputs_names = [
  :age,
  :education_number,
  :capital_gain,
  :capital_loss,
  :hours_per_week
]

encode_categorical_inputs = 
  fn inputs_names, embedding_dimensions -> 
    Enum.map(inputs_names, fn input_name ->
      vocabulary_size = 
        categorical_features_vocabulary
        |> Map.fetch!(input_name)
        |> Enum.count()
      
      input_name
      |> Atom.to_string()
      |> Axon.input(shape: {nil})
      |> Axon.embedding(vocabulary_size, embedding_dimensions, name: "embedding_#{input_name}")
    end)
  end

encode_numeric_inputs = fn input_names ->
  Enum.map(input_names, fn input_name ->
    input_name
    |> Atom.to_string()
    |> Axon.input(shape: {nil})
    |> Axon.reshape({:auto, 1})
  end)
end

encoded_categorical_features = 
  encode_categorical_inputs.(categorical_inputs_names, embedding_dimensions)

encoded_numeric_features = encode_numeric_inputs.(numeric_inputs_names)

features = 
  encoded_categorical_features
  |> Kernel.++(encoded_numeric_features)
  |> Axon.concatenate()

Axon.Display.as_graph(features, Nx.template({1}, :u32), direction: :left_right)

{_, feed_forward_units} = Axon.get_output_shape(features, Nx.template({1}, :u32)).shape

create_mlp = fn hidden_units, dropout_rate, activation, normalization_layer, name ->
  block = fn x ->
    Enum.reduce(hidden_units, x, fn units, x ->
      x
      |> normalization_layer.()
      |> Axon.dense(units, activation: activation)
      |> Axon.dropout(rate: dropout_rate)
    end)
  end
  
  Axon.block(block, name: name)
end

features = 
  Enum.reduce(1..mlp_blocks, features, fn index, features ->
    mlp = 
      create_mlp.([feed_forward_units], dropout_rate, :gelu, &Axon.layer_norm/1, "feed_forward_#{index - 1}")
  
    mlp.(features)
  end)

Axon.Display.as_graph(features, Nx.template({1}, :u32), direction: :left_right)

mlp_hidden_units = Enum.map(mlp_hidden_units_factors, & &1 * feed_forward_units)

mlp =
  create_mlp.(mlp_hidden_units, dropout_rate, :selu, &Axon.batch_norm/1, "MLP")
  
features = mlp.(features)

Axon.Display.as_graph(features, Nx.template({1}, :u32), direction: :left_right)

model = Axon.dense(features, 1, activation: :sigmoid, name: "sigmoid")

Axon.Display.as_graph(model, Nx.template({1}, :u32), direction: :left_right)

{init_fn, predict_fn} = Axon.build(model)

params = init_fn.(Nx.template({1}, :u32), %{})

inputs = %{
  "hours_per_week" => Nx.tensor([1]), 
  "capital_loss" => Nx.tensor([1]), 
  "capital_gain" => Nx.tensor([1]), 
  "education_number" => Nx.tensor([1]), 
  "age" => Nx.tensor([1]), 
  "native_country" => Nx.tensor([1]), 
  "gender" => Nx.tensor([1]), 
  "race" => Nx.tensor([1]), 
  "relationship" => Nx.tensor([1]), 
  "occupation" => Nx.tensor([1]), 
  "marital_status" => Nx.tensor([1]), 
  "education" => Nx.tensor([1]), 
  "work_class" => Nx.tensor([1])
}

predict_fn.(params, inputs)

For now I’m ignoring the part that loads the data and normalizes it.
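
When I get to training, my assumption (untested) is that it will be something like Axon.Loop, with the loss and optimizer passed as atoms and the data coming in as batches:

# Untested sketch of the training loop I expect to need, assuming `train_data`
# is an Enumerable of {inputs_map, targets} batches.
trained_state =
  model
  |> Axon.Loop.trainer(:binary_cross_entropy, :adam)
  |> Axon.Loop.run(train_data, %{}, epochs: 10)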

Now, here are some questions I ran into while writing this code:

  1. How can I be sure that the model I implemented is equivalent to the one in the Colab notebook? I tried running a predict with the same input, but it generates a different result every time the model is built, probably because some part of it is initialized randomly (maybe the embeddings?).
    I did generate a graph of both of them and they do seem to be equal, but I’m not sure… (I put an idea for checking this further down, after question 2.)

  2. The main difference between the baseline model I implemented and the TabTransformer in the Colab notebook is the addition of a multi-head attention transformer block.
    In the Colab they used keras.layers.MultiHeadAttention to create it:

attention_output = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=embedding_dims,
            dropout=dropout_rate,
            name=f"multihead_attention_{block_idx}",
        )(encoded_categorical_features, encoded_categorical_features)

The issue is that there is no equivalent function in Axon to generate the same layer.

I did find a possible implementation in the Bumblebee project here: bumblebee/lib/bumblebee/layers/transformer.ex at main · elixir-nx/bumblebee · GitHub

But, tbh, I’m not exactly sure how to use it and inject it into my model; its inputs don’t seem to translate exactly to the Keras ones. Any help here would be greatly appreciated!
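
About question 1: one idea I had (not sure if it’s the right approach) is to compare the layer/parameter summaries of both models instead of their outputs, since the outputs depend on the random initialization. In Keras that would be model.summary(), and on the Axon side I think Axon.Display.as_table gives something comparable:

# Idea for checking structural equivalence: compare layer names, output shapes
# and parameter counts against the Keras model.summary() output.
Axon.Display.as_table(model, Nx.template({1}, :u32))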

Update:

I think I figured out how to use the Bumblebee transformer to get the same effect as the Keras MultiHeadAttention:

Bumblebee.Layers.Transformer.multi_head_attention(
  encoded_categorical_features,
  encoded_categorical_features,
  encoded_categorical_features,
  num_heads: num_heads,
  hidden_size: embedding_dims, # not sure about this option
  dropout_rate: dropout_rate,
  name: "multi_head_attention_#{block_index}"
)

I didn’t test it yet, but from the documentation this should be equivalent. Now, I’m not sure if that function directly gives me the same thing as the Keras one, or if I should use Bumblebee.Layers.Transformer.block instead.
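
If that call works, my plan for wiring it into the rest of the block is to mirror the Keras notebook’s skip connections and layer norms with Axon.add and Axon.layer_norm. This is just an untested sketch of what I have in mind, assuming attention_output is the Axon node coming out of the Bumblebee call and encoded_categorical_features is the stacked {batch, num_categorical_features, embedding_dims} node:

# Untested sketch of one transformer block around the attention output,
# mirroring the Add + LayerNormalization + feed-forward structure from the notebook.
attention_skip =
  attention_output
  |> Axon.add(encoded_categorical_features)
  |> Axon.layer_norm(name: "layer_norm_1_#{block_index}")

feed_forward =
  create_mlp.([embedding_dims], dropout_rate, :gelu, &Axon.layer_norm/1, "feed_forward_#{block_index}")

attention_skip
|> feed_forward.()
|> Axon.add(attention_skip)
|> Axon.layer_norm(name: "layer_norm_2_#{block_index}")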