Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (Microsoft Research)
Hypothesis / Main methods:
When deeper networks are able to start converging, the degradation problem occurs: the accuracy gets saturated and degrades rapidly. It is easier to optimize / fine tune the residual effect of each layer on the previous layer. Thus in their architecture, each layer learns the residual to the input.
The output is element-wise addition of input
F(x) is the residual.