In [1]:
import torch
from torch import nn

- references
    - [BCELoss — PyTorch 1.13 documentation](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html)
    - [[pytorch 模型拓扑结构] 深入理解 nn.CrossEntropyLoss 计算过程（nn.NLLLoss(nn.LogSoftmax))](https://www.bilibili.com/video/BV1NY4y1E76o/)
    

## 1. How BCELoss is computed

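- inputs:
    - the raw network output (logits, before sigmoid), one scalar per sample; note that `nn.BCELoss` itself expects probabilities in $[0, 1]$, so the sigmoid in step 1 has to be applied before calling it
- computation & output:
    - step 1: apply sigmoid to turn the 1-d logits into probabilities $\hat{y_i}=p(\text{class}=1\mid x_i)$
    - step 2: compute the per-sample loss $\ell_i=-\left(y_i\log (\hat {y_i}) + (1-y_i)\log (1-\hat {y_i})\right)$
    - step 3: take the mean $\frac1n\sum_i\ell_i$ (the default `reduction='mean'`)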
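The three steps above can be sketched as one small function and compared against `nn.BCELoss`. This is only an illustrative sketch: the helper name `manual_bce` is hypothetical, and the logits and targets are hard-coded from the sample run shown further down in this notebook.

```python
import torch
from torch import nn


def manual_bce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: binary cross entropy from raw logits (steps 1-3)."""
    # step 1: logits -> probabilities p(class=1|x)
    y_hat = torch.sigmoid(logits)
    # step 2: per-sample loss
    per_sample = -(target * torch.log(y_hat) + (1 - target) * torch.log(1 - y_hat))
    # step 3: mean over the samples
    return per_sample.mean()


# fixed values copied from the sample run in this notebook (an assumption for reproducibility)
logits = torch.tensor([-1.7723, -1.5965, 0.8863])
target = torch.tensor([0.0, 1.0, 0.0])

manual = manual_bce(logits, target)
reference = nn.BCELoss()(torch.sigmoid(logits), target)
print(manual, reference)  # the two values should agree
```

The sketch skips the safeguards of the real implementation (for example, `nn.BCELoss` clamps its log terms), so it is meant for reading, not for training.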

In [22]:
m = nn.Sigmoid()
loss = nn.BCELoss()
inputs = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)
output = loss(m(inputs), target)

In [23]:
print(inputs, target)

tensor([-1.7723, -1.5965,  0.8863], requires_grad=True) tensor([0., 1., 0.])


In [24]:
# thresholding these probabilities at 0.5 would give the predictions [0, 0, 1]
m(inputs)

tensor([0.1453, 0.1685, 0.7081], grad_fn=<SigmoidBackward0>)

$$
\hat y=\sigma(z)=\frac{1}{1+\exp(-z)}
$$

In [25]:
1/(1+torch.exp(-inputs))

tensor([0.1453, 0.1685, 0.7081], grad_fn=<MulBackward0>)

$$
\begin{split}
\ell_i&=-\left(y_i\log (\hat {y_i}) + (1-y_i)\log (1-\hat {y_i})\right)\\
&=-\left(y_i\log (\sigma(z_i)) + (1-y_i)\log (1-\sigma(z_i))\right)
\end{split}
$$

In [26]:
output

tensor(1.0565, grad_fn=<BinaryCrossEntropyBackward0>)

In [27]:
-(target * torch.log(m(inputs)) + (1-target)*torch.log(1-m(inputs)))

tensor([0.1570, 1.7810, 1.2314], grad_fn=<NegBackward0>)

$$\frac1n\sum_i\ell_i$$

In [28]:
torch.mean(-(target * torch.log(m(inputs)) + (1-target)*torch.log(1-m(inputs))))

tensor(1.0565, grad_fn=<MeanBackward0>)

### 1.1 backward

In [29]:
output.backward()

In [30]:
inputs.grad

tensor([ 0.0484, -0.2772,  0.2360])

$$
\begin{split}
\frac{\partial \ell_i}{\partial z_i}&=-\left(y_i\frac{\sigma(z_i)(1-\sigma(z_i))}{\sigma(z_i)}-(1-y_i)\frac{\sigma(z_i)(1-\sigma(z_i))}{1-\sigma(z_i)}\right)\\
&=-\left(y_i(1-\sigma(z_i)) - (1-y_i)\sigma(z_i)\right)\\
&=-(y_i-\sigma(z_i))=\sigma(z_i)-y_i
\end{split}
$$

In [31]:
-(target - m(inputs))

tensor([ 0.1453, -0.8315,  0.7081], grad_fn=<NegBackward0>)

$$
\begin{split}
\ell&=\frac{1}3(\ell_1+\ell_2+\ell_3)\\
\frac{\partial \ell}{\partial z_i}&=\frac{1}3\frac{\partial \ell_i}{\partial z_i}
\end{split}
$$

In [32]:
-(target - m(inputs))/3

tensor([ 0.0484, -0.2772,  0.2360], grad_fn=<DivBackward0>)
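The same check can be repeated for an arbitrary batch size $n$, where the mean reduction contributes a factor $\frac1n$ instead of $\frac13$. A minimal sketch, assuming random data; the seed, the size `n = 5`, and the variable names are arbitrary choices for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
n = 5                 # arbitrary batch size
z = torch.randn(n, requires_grad=True)   # logits
y = torch.randint(0, 2, (n,)).float()    # binary targets

loss = nn.BCELoss()(torch.sigmoid(z), y)
loss.backward()

# closed form derived above: d(loss)/dz_i = (sigmoid(z_i) - y_i) / n
analytic = (torch.sigmoid(z.detach()) - y) / n
print(torch.allclose(z.grad, analytic, atol=1e-6))
```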

### 1.2 BCELoss vs. BCEWithLogitsLoss

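- BCEWithLogitsLoss = sigmoid + BCELoss, fused into a single op that takes the raw logits directly (the PyTorch docs describe it as more numerically stable than applying the two separately)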

In [33]:
loss2 = nn.BCEWithLogitsLoss()

In [35]:
# output = loss(m(inputs), target)
loss2(inputs, target)

tensor(1.0565, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
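For moderate logits the two losses agree, as above. The difference shows up with extreme logits. The sketch below uses a deliberately extreme, made-up logit of `-200` with target `1`: in float32 the sigmoid underflows to 0, and `nn.BCELoss` (which, per the PyTorch documentation, clamps its log terms to be at least -100) can no longer recover the true loss, whereas the fused version works on the logit directly.

```python
import torch
from torch import nn

z = torch.tensor([-200.0])  # extreme logit, chosen only for illustration
y = torch.tensor([1.0])

# two-step path: sigmoid(-200) underflows to 0 in float32, the log term is clamped
naive = nn.BCELoss()(torch.sigmoid(z), y)
# fused path: computed from the logit itself; the true loss here is about 200
fused = nn.BCEWithLogitsLoss()(z, y)

print(naive.item(), fused.item())
```

The saturated value also has a zero gradient through the sigmoid, which is one practical reason to prefer the fused loss when the network outputs logits.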

### 1.3 cross entropy loss


$$
H(p,q)=-\sum_x p(x)\log q(x)
$$

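- measures how far apart two probability distributions are (it is asymmetric, so not a true distance)
    - here: the label distribution $(y_i, 1-y_i)$ vs. the predicted distribution $(\hat{y_i}, 1-\hat{y_i})$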
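To make the connection concrete, the sketch below evaluates $H(p,q)$ on the two-point distributions of a single sample and compares it with the per-sample BCE formula. The helper name `cross_entropy` is hypothetical, and the numbers (label 1, predicted probability 0.1685) are taken from the second sample of the run above.

```python
import math

import torch


def cross_entropy(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: H(p, q) = -sum_x p(x) log q(x) for discrete p, q."""
    return -(p * torch.log(q)).sum()


y, y_hat = 1.0, 0.1685  # label and predicted probability of one sample

p = torch.tensor([y, 1 - y])          # label distribution (y, 1 - y)
q = torch.tensor([y_hat, 1 - y_hat])  # predicted distribution (y_hat, 1 - y_hat)

h = cross_entropy(p, q)
bce = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
print(h.item(), bce)  # both are the per-sample loss of that sample
```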

### 1.4 BCELoss vs. CrossEntropyLoss

- binary vs. multi-class classification
    - a single output vs. one output per class
- mapping to probabilities: sigmoid vs. softmax
- both compute the loss as a cross entropy
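The comparison can be checked numerically: a single logit $z$ under sigmoid is equivalent to the two-class logits $[0, z]$ under softmax, since $\mathrm{softmax}([0, z])_1 = \frac{e^{z}}{1+e^{z}} = \sigma(z)$. A minimal sketch under that construction, reusing the logits and targets from the run above (variable names are illustrative):

```python
import torch
from torch import nn

z = torch.tensor([-1.7723, -1.5965, 0.8863])  # one logit per sample
y = torch.tensor([0.0, 1.0, 0.0])             # binary targets

# binary view: single output, sigmoid applied inside the loss
binary = nn.BCEWithLogitsLoss()(z, y)

# multi-class view: two outputs per sample, [0, z], softmax applied inside the loss
two_class_logits = torch.stack([torch.zeros_like(z), z], dim=1)  # shape (3, 2)
multi = nn.CrossEntropyLoss()(two_class_logits, y.long())

print(binary, multi)  # the two losses should agree
```

Note the different target conventions: `nn.BCEWithLogitsLoss` takes float targets with the same shape as the logits, while `nn.CrossEntropyLoss` here takes integer class indices.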