Sam Stites

Optimization stream-of-conciousness

In Stanford’s CS231n first module titled: “Optimization: Stochastic Gradient Descent,” they describe how loss for multiclass SVM without regularization is described as:

$$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + 1) \right]$$

For their three-dimensional example across (W). Breaking this down across each row of (W), we get:

$$L_0 = \max(0, w_1^Tx_0 - w_0^Tx_0 + 1) + \max(0, w_2^Tx_0 - w_0^Tx_0 + 1)$$ $$L_1 = \max(0, w_0^Tx_1 - w_1^Tx_1 + 1) + \max(0, w_2^Tx_1 - w_1^Tx_1 + 1)$$ $$L_2 = \max(0, w_0^Tx_2 - w_2^Tx_2 + 1) + \max(0, w_1^Tx_2 - w_2^Tx_2 + 1)$$ $$L = (L_0 + L_1 + L_2) / 3 $$

These can be visualized as the following:

So the thing that I don’t get is… Nope. It all makes sense. I thought they wrote that you could “reorganize” the errors to get a convex shape, however this is wrong – as the giant letters “sum” clearly state. This is a sum of average loss error across all dimensions.

More interesting is this: the graph shown is not differentiable! It is a step-wise function. Mathematically, this isn’t sound, but subgradients still exist.

Other notes from “Optimization: Stochastic Gradient Descent” and “Backprop, Intuitions”: