How to apply an L1/L2 penalty to a layer's output in Axon?

Hey all :wave:
I’m reading this book on machine learning. I’m trying to rewrite all the numpy/Keras examples with Nx/Axon, and for the first time I’ve hit one that I cannot easily reproduce.

There is this model written in Keras where the first two Dense layers have a penalty applied to the layer’s output via the activity_regularizer keyword argument.

Quoting the Keras docs:

Regularizers allow you to apply penalties on layer parameters or layer activity during optimization. These penalties are summed into the loss function that the network optimizes.

  • activity_regularizer: Regularizer to apply a penalty on the layer’s output

And here is the Keras model:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.regularizers import l1

model = Sequential()
model.add(Dense(100, activation='sigmoid', activity_regularizer=l1(0.0004)))
model.add(Dense(30, activation='sigmoid', activity_regularizer=l1(0.0004)))
model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['accuracy'])

I initially thought of using a custom loss function to do that, but I don’t think it’s the right approach, because a loss function is a configuration of the whole training run and not of a single layer.
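
To illustrate what I mean, here is a sketch (assuming the Axon.Loop API; custom_loss is just a name I made up): a custom loss passed to Axon.Loop.trainer only ever receives y_true and y_pred, so the individual layer outputs are out of reach there:

custom_loss = fn y_true, y_pred ->
  # no access to intermediate activations here, only targets and predictions
  Axon.Losses.categorical_cross_entropy(y_true, y_pred, reduction: :mean)
end

model |> Axon.Loop.trainer(custom_loss, :sgd)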

Then I thought of achieving it with a custom layer, something like this:

model =
  Axon.input("data")
  |> CustomLayers.dense_with_regularization(100)
  |> Axon.sigmoid()
  |> CustomLayers.dense_with_regularization(30)
  |> Axon.sigmoid()
  |> Axon.dense(2, activation: :softmax)

But I’m a bit lost here :confused: The input received by CustomLayers.dense_with_regularization is of course a tensor, while to compute the penalties I would need y_true, y_pred, and the weights. I’m probably overlooking something simple :see_no_evil:

To conclude, I found a reference to an L2 penalty function in Axon, but it is just a mention in the docs.

Any suggestion is really appreciated, thanks :man_bowing:

Cheers :v:



Hey all :wave:
I’m still trying to figure out the best way to implement this, but no luck so far.

I just found out that in a previous version of Axon it was possible to specify bias/kernel regularizers as options for Dense layers.

But this possibility was later removed in this commit (related GH issue):

Regularization should be a property of the optimization API and should not be connected with the model creation API.

I looked at the Axon.Optimizers and Axon.Updates APIs, but unfortunately I couldn’t really understand how to compose my own optimizer to build something similar to this Keras model :see_no_evil:

model = Sequential()
model.add(Dense(100, activation='sigmoid', activity_regularizer=l1(0.0004)))
model.add(Dense(30, activation='sigmoid', activity_regularizer=l1(0.0004)))
model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['accuracy'])
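
My best guess so far at what composing updates should look like is something along these lines (a hedged sketch based on the Axon.Updates docs, mirroring how RMSprop plus weight decay might chain; I’m not sure it’s right):

# each update transformation rewrites the gradients before they are applied:
# RMS scaling, then a decoupled weight-decay term, then scaling by -learning_rate
optimizer =
  Axon.Updates.scale_by_rms()
  |> Axon.Updates.add_decayed_weights(decay: 4.0e-4)
  |> Axon.Updates.scale(-1.0e-3)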

Any idea?

@seanmor5 sorry for tagging you directly, but I think you can point me in the right direction :grimacing: :pray:

Thank you in advance :bowing_man:

Take a look at these Stack Overflow discussions about this topic:

  • How to add a L1 or L2 regularization to weights in PyTorch (Stack Overflow)
  • L1/L2 regularization in PyTorch (Stack Overflow)

The key takeaway I have from this discussion is that L2 regularization on an individual layer isn’t necessary to train an accurate model. The optimizer can handle the regularization; keeping the model’s training under control is really the optimizer’s job. I would suggest training the model with different optimizers and hyperparameter choices rather than trying to add a regularizer to an individual layer. There is probably a reason the concept isn’t in PyTorch.
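
For example, something like this might do (a hedged sketch; I’m assuming Axon.Optimizers.adamw and its decay option here, with model being the Axon model from the first post and train_data a placeholder for your input stream). Decoupled weight decay penalizes large weights much like an L2 regularizer would, without touching the model definition:

# AdamW applies decoupled weight decay, which acts like an L2 penalty on the weights
optimizer = Axon.Optimizers.adamw(1.0e-3, decay: 4.0e-4)

model
|> Axon.Loop.trainer(:categorical_cross_entropy, optimizer)
|> Axon.Loop.metric(:accuracy)
|> Axon.Loop.run(train_data, %{}, epochs: 10)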


I removed the regularization API for exactly the reasons @meanderingstream mentioned:

  1. It’s not in PyTorch, and it didn’t seem very commonly used in TensorFlow
  2. Regularization is a concern of training/optimization and not the model, so I wanted to keep those decoupled

If you really want to replicate this, you might be able to do something like the following. First, create a custom layer with a stateful output. In training mode, calls to Axon’s predict function will return a map %{prediction: pred, state: state}, so the layer can forward the output activations for use in your objective function:

def activity_regularizer(input) do
  Axon.layer(&activity_regularizer_impl/2, [input])
end

deftransformp activity_regularizer_impl(input, opts \\ []) do
  opts = Keyword.validate!(opts, mode: :train)

  case opts[:mode] do
    :train ->
      # stash the activations in the layer state so the objective function can see them
      %Axon.StatefulOutput{output: input, state: %{"activations" => input}}

    :inference ->
      # the activity penalty does not apply during inference
      input
  end
end

And then wrap any layer you want to regularize in your custom layer like this:

model =
  Axon.input("data")
  |> Axon.dense(100)
  |> Axon.sigmoid()
  |> activity_regularizer()
  |> Axon.dense(30)
  |> Axon.sigmoid()
  |> activity_regularizer()
  |> Axon.dense(2, activation: :softmax)
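
To actually see the stored activations, you’d build the model in :train mode; the returned predict function then produces the %{prediction: ..., state: ...} map mentioned above (a sketch; n_features stands in for your input width):

# build in :train mode so predict_fn returns %{prediction: ..., state: ...}
{init_fn, predict_fn} = Axon.build(model, mode: :train)
model_state = init_fn.(Nx.template({1, n_features}, :f32), %{})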

Now you can create a custom objective function:

defn objective(predict_fn, model_state, inputs, y_true) do
  %{prediction: y_pred, state: state} = predict_fn.(model_state, inputs)
  # traverse the state for maps with an "activations" key
  penalty = get_activation_penalty(state)
  # compute the base loss and add the penalty to it
  loss = Axon.Losses.categorical_cross_entropy(y_true, y_pred, reduction: :mean)
  loss + penalty
end

deftransformp get_activation_penalty(state) do
  # state is a nested map, so we can reduce over it
  Enum.reduce(state, Nx.tensor(0.0), fn
    # apply an L1 penalty to any activations found in the state map; match on specific
    # layer names instead if you want to apply the penalty differently per layer
    {_key, %{"activations" => activations}}, acc ->
      # L1 penalty: lambda * sum(|activations|), with lambda = 0.0004 as in the Keras example
      Nx.add(acc, Nx.multiply(4.0e-4, Nx.sum(Nx.abs(activations))))

    _other, acc ->
      acc
  end)
end

Then you just need to write the optimization step and integrate it with a custom training loop. I know this probably isn’t very straightforward, given that in Keras you can just pass activity_regularizer and it just works; however, I think adding a similar API brings too much coupling between model creation/execution and model training. You are better off using the weight decay that’s available in the optimization API to achieve the same regularization effect.
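
For reference, a rough sketch of that optimization step might look like this (assuming an optimizer built with Axon.Optimizers or Axon.Updates, i.e. an {init_fn, update_fn} pair, plus the objective function above; untested):

defn train_step(predict_fn, update_fn, model_state, optimizer_state, inputs, y_true) do
  # differentiate the objective (loss + activity penalty) w.r.t. the model parameters
  {loss, gradients} =
    value_and_grad(model_state, fn state -> objective(predict_fn, state, inputs, y_true) end)

  # let the optimizer transform the raw gradients, then apply them to the parameters
  {updates, new_optimizer_state} = update_fn.(gradients, optimizer_state, model_state)
  new_model_state = Axon.Updates.apply_updates(model_state, updates)

  {new_model_state, new_optimizer_state, loss}
end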


Hey @meanderingstream and @seanmor5 :wave:

Thank you so much for your quick and kind replies, really appreciated :bowing_man:

I would suggest training the model with different optimizers and hyperparameter choices rather than trying to add a regularizer to an individual layer.

Yup, I wanted to replicate an example from Keras, but as you suggested, it’s better to try other optimizers :+1:

I think adding a similar API brings too much coupling between model creation/execution and model training. You are better off using the weight decay that’s available in the optimization API to achieve the same regularization effect.

I see, makes total sense, thanks for sharing the reason :+1:

I’ll nevertheless try to follow your suggestions and see if I can implement it; it’s a good occasion to look into Axon’s internals :nerd_face:

Thank you everyone, I wish you a great day! :blush: