Introduction

I was speaking to a friend recently about model complexity, when I remarked that model loss landscapes weren’t convex. The loss landscape has “tracks” along which you can smoothly change your model’s parameters without changing its loss.

There are two reasons why:

  • Standard neural network architectures allow multiple sets of weights to implement the same “function.” Roughly speaking, imagine increasing a weight in a layer and then reducing the corresponding weights in the next layer (see the sketch after this list).
  • Model training creates additional directions of travel where different implemented functions perform equally well on the dataset and have the same loss.
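
To make the first point concrete, here is a minimal numpy sketch of one such reparameterization, the relu scaling symmetry: scale everything feeding into a hidden unit by some $\alpha > 0$ and scale the weights leaving it by $1/\alpha$. The dimensions and random weights are arbitrary choices, purely for illustration.

```python
import numpy as np

# Arbitrary sizes and weights, purely for illustration.
dim_in, dim_l1, dim_l2 = 4, 8, 3
rng = np.random.default_rng(0)
W1 = rng.normal(size=(dim_in, dim_l1))
b1 = rng.normal(size=dim_l1)
W2 = rng.normal(size=(dim_l1, dim_l2))
b2 = rng.normal(size=dim_l2)
x = rng.normal(size=dim_in)

def forward(x, W1, b1, W2, b2):
    a = np.maximum(x @ W1 + b1, 0.0)  # relu
    return a @ W2 + b2

# Scale everything feeding into hidden unit c by alpha, and the weights
# leaving unit c by 1/alpha. Since relu(alpha * z) == alpha * relu(z) for
# alpha > 0, the composed function is unchanged.
c, alpha = 2, 3.5
W1_s, b1_s, W2_s = W1.copy(), b1.copy(), W2.copy()
W1_s[:, c] *= alpha
b1_s[c] *= alpha
W2_s[c, :] /= alpha

print(np.allclose(forward(x, W1, b1, W2, b2),
                  forward(x, W1_s, b1_s, W2_s, b2)))  # True
```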

Example of case 1:

Let’s say we have two layers, denoted by $(W_1, b_1)$ and $(W_2, b_2)$.

Shapes:

  • $x$ (dim_in)
  • $W_1$ (dim_in, dim_l1)
  • $b_1$ (dim_l1)
  • $W_2$ (dim_l1, dim_l2)
  • $b_2$ (dim_l2)
  • $a$ (dim_l1)
  • $y$ (dim_l2)

$a = \mathrm{relu}(x W_1 + b_1)$

$y = a W_2 + b_2$
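
As a sanity check on the shapes listed above, here is the same two-layer forward pass written out in numpy; the concrete dimension values are arbitrary placeholders.

```python
import numpy as np

# Arbitrary placeholder dimensions matching the shape list above.
dim_in, dim_l1, dim_l2 = 4, 8, 3
rng = np.random.default_rng(0)

x  = rng.normal(size=dim_in)             # (dim_in,)
W1 = rng.normal(size=(dim_in, dim_l1))   # (dim_in, dim_l1)
b1 = rng.normal(size=dim_l1)             # (dim_l1,)
W2 = rng.normal(size=(dim_l1, dim_l2))   # (dim_l1, dim_l2)
b2 = rng.normal(size=dim_l2)             # (dim_l2,)

a = np.maximum(x @ W1 + b1, 0.0)  # a = relu(x W1 + b1), shape (dim_l1,)
y = a @ W2 + b2                   # y = a W2 + b2,       shape (dim_l2,)
```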

Writing out the sums:

$$a_j = \mathrm{relu}\Big(b_{1j} + \sum_i x_i W_{1,ij}\Big) \qquad y_k = b_{2k} + \sum_j a_j W_{2,jk}$$

For a given unit $c$, let’s add $\nu$ to $b_{1c}$.

$$a_c = \mathrm{relu}\Big(b_{1c} + \Big(\sum_i x_i W_{1,ic}\Big) + \nu\Big) \qquad y_k = b_{2k} + \Big(\sum_{j \neq c} a_j W_{2,jk}\Big) + a_c W_{2,ck}$$

If $\mathrm{relu}\big(b_{1c} + \sum_i x_i W_{1,ic}\big) = 0$ and $\mathrm{relu}\big(b_{1c} + \big(\sum_i x_i W_{1,ic}\big) + \nu\big) = 0$:

  • Changing $\nu$ has no impact on the function, since the relu kills the change.
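
A quick numerical check of this “dead relu” case: the bias of unit $c$ is pinned far enough below the boundary (an arbitrary choice for illustration) that the whole $\nu$ shift is clipped away.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_l1, dim_l2 = 4, 8, 3
W1 = rng.normal(size=(dim_in, dim_l1))
b1 = rng.normal(size=dim_l1)
W2 = rng.normal(size=(dim_l1, dim_l2))
b2 = rng.normal(size=dim_l2)
x = rng.normal(size=dim_in)

def forward(x, b1):
    a = np.maximum(x @ W1 + b1, 0.0)  # relu
    return a @ W2 + b2

c, nu = 0, 0.5
# Pin unit c's pre-activation at -10 for this x, so it stays negative
# both before and after adding nu.
b1[c] = -10.0 - x @ W1[:, c]

b1_shift = b1.copy()
b1_shift[c] += nu

print(np.allclose(forward(x, b1), forward(x, b1_shift)))  # True: relu kills nu
```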

If $a_c$ increases by $\nu$:

  • We can keep each $y_k$ unchanged by reducing $b_{2k}$ by $\nu\, W_{2,ck}$, or by modifying $W_{2,ck}$ (see the sketch below).
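
A sketch of this active case, under the same arbitrary setup as above: unit $c$ is held well above the relu boundary, so $a_c$ increases by exactly $\nu$, and subtracting $\nu\, W_{2,ck}$ from each $b_{2k}$ restores the original outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_l1, dim_l2 = 4, 8, 3
W1 = rng.normal(size=(dim_in, dim_l1))
b1 = rng.normal(size=dim_l1)
W2 = rng.normal(size=(dim_l1, dim_l2))
b2 = rng.normal(size=dim_l2)
x = rng.normal(size=dim_in)

def forward(x, b1, b2):
    a = np.maximum(x @ W1 + b1, 0.0)  # relu
    return a @ W2 + b2

c, nu = 0, 0.5
# Pin unit c's pre-activation at +10 for this x, so it is active both
# before and after adding nu, and a_c increases by exactly nu.
b1[c] = 10.0 - x @ W1[:, c]

b1_shift = b1.copy()
b1_shift[c] += nu
# Compensate each output k by subtracting nu * W2[c, k] from b2[k].
b2_comp = b2 - nu * W2[c, :]

print(np.allclose(forward(x, b1, b2), forward(x, b1_shift, b2_comp)))  # True
```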

Things get a little trickier when part of the $\nu$ increase is cut off by the relu boundary, but informally you can always reduce $b_{2k}$ by $W_{2,ck}$ times however much $a_c$ actually increases, maintaining the same function $y(x)$.
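
Here is a sketch of that boundary case for a single input, again with arbitrary numbers: unit $c$’s pre-activation is pinned just below zero, so only part of the $\nu$ shift survives the relu, and the compensating change to $b_2$ has to use the amount $a_c$ actually increased for this particular $x$.

```python
import numpy as np

rng = np.random.default_rng(1)
dim_in, dim_l1, dim_l2 = 4, 8, 3
W1 = rng.normal(size=(dim_in, dim_l1))
b1 = rng.normal(size=dim_l1)
W2 = rng.normal(size=(dim_l1, dim_l2))
b2 = rng.normal(size=dim_l2)
x = rng.normal(size=dim_in)

def forward(x, b1, b2):
    a = np.maximum(x @ W1 + b1, 0.0)  # relu
    return a @ W2 + b2, a

c, nu = 0, 1.0
# Pin unit c's pre-activation at -0.3: adding nu = 1.0 pushes it across
# the relu boundary, so a_c only increases by 0.7, not the full nu.
b1[c] = -0.3 - x @ W1[:, c]

y0, a0 = forward(x, b1, b2)
b1_shift = b1.copy()
b1_shift[c] += nu
y1, a1 = forward(x, b1_shift, b2)

delta_ac = a1[c] - a0[c]            # 0.7 here, and it depends on x
b2_comp = b2 - delta_ac * W2[c, :]  # compensate by the actual increase

y2, _ = forward(x, b1_shift, b2_comp)
print(np.allclose(y0, y2))  # True for this x
```

Because the actual increase in $a_c$ depends on $x$, this compensation is per-input rather than a single parameter change, which is where the informality comes in.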