Why DNN Loss Landscapes aren't Convex
Introduction
I was speaking to a friend recently about model complexity when I remarked that model loss landscapes aren't convex: the loss landscape has “tracks” along which you can smoothly change your model’s parameters without changing its loss.
There are two reasons why:
- Standard neural network architectures allow multiple sets of weights to implement the same “function.” Roughly speaking, imagine increasing a weight in one layer and then reducing the corresponding weights in the next layer (see the sketch after this list).
- The training data creates additional directions of travel: different implemented functions can perform equally well on the dataset and therefore have the same loss (for example, two networks that disagree only on inputs outside the training set).
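Here’s a minimal NumPy sketch of the first point (the layer sizes, seed, and variable names are my own, purely for illustration): scaling a hidden unit’s incoming weights and bias by a positive factor while dividing its outgoing weights by the same factor leaves the implemented function unchanged, because relu is positively homogeneous.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer relu network: y = relu(x @ W1 + b1) @ W2 + b2
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

def forward(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

# Scale hidden unit c: multiply its incoming weights and bias by alpha > 0
# and divide its outgoing weights by alpha. Since relu(alpha * z) = alpha * relu(z)
# for alpha > 0, the two changes cancel exactly.
c, alpha = 2, 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[:, c] *= alpha
b1s[c] *= alpha
W2s[c, :] /= alpha

x = rng.normal(size=(5, 4))
print(np.allclose(forward(x, W1, b1, W2, b2),
                  forward(x, W1s, b1s, W2s, b2)))  # True: same function, different weights
```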
Example of case 1:
Let’s say we have two layers, denoted by $(W_1, b_1)$ and $(W_2, b_2)$.
Shapes:
- $x$ (dim_in)
- $W_1$ (dim_in, dim_l1)
- $b_1$ (dim_l1)
- $W_2$ (dim_l1, dim_l2)
- $b_2$ (dim_l2)
- $a$ (dim_l1)
- $y$ (dim_l2)
$a = relu(x W_1 + b_1)$
$y = a W_2 + b_2$
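A minimal NumPy version of these two layers with the shapes above (the dimension sizes and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_l1, dim_l2 = 4, 8, 3

x  = rng.normal(size=dim_in)            # (dim_in,)
W1 = rng.normal(size=(dim_in, dim_l1))  # (dim_in, dim_l1)
b1 = rng.normal(size=dim_l1)            # (dim_l1,)
W2 = rng.normal(size=(dim_l1, dim_l2))  # (dim_l1, dim_l2)
b2 = rng.normal(size=dim_l2)            # (dim_l2,)

a = np.maximum(x @ W1 + b1, 0.0)        # a = relu(x W1 + b1), shape (dim_l1,)
y = a @ W2 + b2                         # y = a W2 + b2,       shape (dim_l2,)
```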
Writing out the sums:
\[a_j = relu(b_{1_j} + \sum_i x_i w_{1_{ij}})\]
\[y_k = b_{2_k} + \sum_j a_j w_{2_{jk}}\]

For a given $c$, let’s add $\nu$ to $b_{1_c}$.
\[a_c = relu(b_{1_c} + \left(\sum_i x_i w_{1_{ic}}\right) + \nu)\]
\[y_k = b_{2_k} + \left(\sum_{j \neq c} a_j w_{2_{jk}} \right) + a_c w_{2_{ck}}\]

If $relu(b_{1_c} + \sum_i x_i w_{1_{ic}}) = 0$ and $relu(b_{1_c} + \left(\sum_i x_i w_{1_{ic}}\right) + \nu) = 0$:
- Changing $\nu$ has no impact on the function, since the relu clamps $a_c$ to zero either way.
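A quick numerical check of this case (a sketch only; the unit $c$ and value of $\nu$ are chosen so the pre-activation stays negative before and after the shift):

```python
import numpy as np

rng = np.random.default_rng(1)
dim_in, dim_l1, dim_l2 = 4, 8, 3
W1 = rng.normal(size=(dim_in, dim_l1))
b1 = rng.normal(size=dim_l1)
W2 = rng.normal(size=(dim_l1, dim_l2))
b2 = rng.normal(size=dim_l2)

def forward(x, b1):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

x = rng.normal(size=dim_in)
c = 0
# Force unit c's pre-activation well below zero, then nudge b1[c] by a
# nu small enough that it stays below zero: the relu clamps to 0 both times.
b1[c] = -(x @ W1[:, c]) - 5.0
nu = 1.0

b1_shifted = b1.copy()
b1_shifted[c] += nu
print(np.allclose(forward(x, b1), forward(x, b1_shifted)))  # True: output unchanged
```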
If instead the pre-activation $b_{1_c} + \sum_i x_i w_{1_{ic}}$ is positive, then $a_c$ increases by exactly $\nu$:
- We can maintain $y_k$ by reducing $b_{2_k}$ by $\nu\, w_{2_{ck}}$ (or by modifying $w_{2_{ck}}$).
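And a check of this second case (again a sketch with arbitrary sizes and seed; the pre-activation of unit $c$ is forced positive so $a_c$ really does increase by $\nu$, and we compensate in the output biases):

```python
import numpy as np

rng = np.random.default_rng(2)
dim_in, dim_l1, dim_l2 = 4, 8, 3
W1 = rng.normal(size=(dim_in, dim_l1))
b1 = rng.normal(size=dim_l1)
W2 = rng.normal(size=(dim_l1, dim_l2))
b2 = rng.normal(size=dim_l2)

def forward(x, b1, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

x = rng.normal(size=dim_in)
c = 0
# Force unit c's pre-activation well above zero so the relu is active
# before and after adding nu.
b1[c] = -(x @ W1[:, c]) + 5.0
nu = 1.0

b1_shifted = b1.copy()
b1_shifted[c] += nu               # a_c increases by exactly nu
b2_shifted = b2 - nu * W2[c, :]   # subtract nu * w2_ck from each b2_k to compensate

# Note: this exact compensation holds for inputs where unit c's relu stays
# active; the relu-boundary case is the subtlety discussed next.
print(np.allclose(forward(x, b1, b2), forward(x, b1_shifted, b2_shifted)))  # True
```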
Things get a little trickier when part of the $\nu$ increase is cut off by the relu boundary, but informally you can always reduce $b_{2_k}$ by $w_{2_{ck}}$ times however much $a_c$ actually increases to maintain the same function $y(x)$.