Sam Stites

Optimization stream-of-consciousness

In the first module of Stanford’s CS231n, in the notes titled “Optimization: Stochastic Gradient Descent,” the loss for a multiclass SVM without regularization is given as:

$$L_i = \sum_{j \neq y_i} \left[ \max(0, w_j^T x_i - w_{y_i}^T x_i + 1) \right]$$
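
As a sanity check on my own reading, here is a minimal numpy sketch of that per-example loss. The function name, the `delta = 1` margin argument, and the array shapes are my choices, not something from the notes:

```python
import numpy as np

def svm_loss_single(W, x, y, delta=1.0):
    """Multiclass SVM (hinge) loss for a single example.

    W : (num_classes, dim) weight matrix, one row w_j per class
    x : (dim,) input vector
    y : integer index of the correct class
    """
    scores = W.dot(x)                               # s_j = w_j^T x for every class j
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0                                  # the j == y_i term is excluded from the sum
    return margins.sum()
```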


This is for their three-dimensional example across $W$ (three classes, three training points). Breaking it down per example, which in this setup also means per row of $W$, we get:

$$L_0 = \max(0, w_1^T x_0 - w_0^T x_0 + 1) + \max(0, w_2^T x_0 - w_0^T x_0 + 1)$$
$$L_1 = \max(0, w_0^T x_1 - w_1^T x_1 + 1) + \max(0, w_2^T x_1 - w_1^T x_1 + 1)$$
$$L_2 = \max(0, w_0^T x_2 - w_2^T x_2 + 1) + \max(0, w_1^T x_2 - w_2^T x_2 + 1)$$
$$L = (L_0 + L_1 + L_2) / 3$$
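
To convince myself that $L$ really is just the average of the per-example losses, this is how I would compute it for a toy version of their setup, reusing the sketch above. The concrete weights and points below are made up; only the shapes (three classes, three points) follow the example:

```python
# Toy stand-in for the three-class example; the numbers are arbitrary.
W  = np.array([[0.1], [0.5], [-0.2]])                        # one weight row per class (1-D inputs here)
xs = [np.array([1.0]), np.array([-2.0]), np.array([0.5])]    # x_0, x_1, x_2
ys = [0, 1, 2]                                               # correct class for each point

L0, L1, L2 = (svm_loss_single(W, x, y) for x, y in zip(xs, ys))
L = (L0 + L1 + L2) / 3                                       # full (unregularized) data loss
```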


These can be visualized as the following:

So the thing that I don’t get is… Nope. It all makes sense. I thought they wrote that you could “reorganize” the errors to get a convex shape, but that reading is wrong, as the giant summation symbols clearly state: the full loss is simply the average of the per-example losses.

More interesting is this: the graph shown is not differentiable everywhere! The loss is a piecewise-linear function with kinks wherever a max(0, ·) term switches on or off. Strictly speaking the gradient is undefined at those kinks, but subgradients still exist.
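
For instance, $\max(0, z)$ has no derivative at $z = 0$, but anything in $[0, 1]$ is a valid subgradient there. Here is a rough sketch of one such choice for the per-example loss above, treating a margin of exactly zero as inactive; again, the function name is mine:

```python
def svm_subgrad_single(W, x, y, delta=1.0):
    """One valid subgradient of the per-example SVM loss w.r.t. W.

    At a kink (margin exactly zero) the gradient is undefined; this picks
    the subgradient that treats the margin as inactive there.
    """
    scores = W.dot(x)
    margins = scores - scores[y] + delta
    margins[y] = 0
    active = (margins > 0).astype(float)    # which classes violate the margin

    dW = np.outer(active, x)                # +x on every violating row w_j
    dW[y] = -active.sum() * x               # -x on the correct class, once per violation
    return dW
```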

Other notes from “Optimization: Stochastic Gradient Descent” and “Backpropagation, Intuitions”: