To demonstrate a simple neural network with 2 input nodes, 2 hidden nodes, and 1 output node, we will walk through forward and backward propagation with ReLU activation, training on the inputs (2, 3) with a target output of 4. We will illustrate how the network learns by adjusting its weights through gradient descent over two iterations.
Network Structure
- Input Layer: 2 nodes
- Hidden Layer: 2 nodes with ReLU activation
- Output Layer: 1 node (linear activation)
Notations
- Inputs: x1, x2
- Weights between Input and Hidden Layer: w11, w12, w21, w22 (wij connects input i to hidden node j)
- Weights between Hidden and Output Layer: w1, w2
- Biases for Hidden Layer: b1, b2
- Bias for Output Layer: bo
- Learning Rate: η
Initial Parameters
Let's initialize the weights and biases with small, arbitrarily chosen values for simplicity:
- w11 = 0.1, w12 = 0.2, w21 = 0.3, w22 = 0.4
- w1 = 0.5, w2 = 0.6
- b1 = 0.1, b2 = 0.1, bo = 0.1
- η = 0.01 (learning rate)
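These values map directly onto a few plain Python variables. The snippet below is a minimal sketch; the names (x1, w11, eta, and so on) are simply chosen to mirror the notation above.

```python
# Inputs, target, and learning rate from the example
x1, x2 = 2.0, 3.0
y_target = 4.0
eta = 0.01

# Weights between input and hidden layer (wij: input i -> hidden node j)
w11, w12, w21, w22 = 0.1, 0.2, 0.3, 0.4

# Weights between hidden and output layer
w1, w2 = 0.5, 0.6

# Biases for the hidden nodes and the output node
b1, b2, bo = 0.1, 0.1, 0.1
```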
Forward Propagation
Hidden Layer Calculation:
z1 = w11⋅x1 + w21⋅x2 + b1
z2 = w12⋅x1 + w22⋅x2 + b2
Apply ReLU activation:
a1 = max(0, z1)
a2 = max(0, z2)
Output Layer Calculation:
ŷ = w1⋅a1 + w2⋅a2 + bo
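These three steps translate directly into code. Here is a minimal sketch; relu and forward are hypothetical helper names introduced for illustration, with the parameters passed in under the same names as the notation above.

```python
def relu(z):
    # ReLU activation: max(0, z)
    return max(0.0, z)

def forward(x1, x2, w11, w12, w21, w22, w1, w2, b1, b2, bo):
    # Hidden layer: weighted sums followed by ReLU
    z1 = w11 * x1 + w21 * x2 + b1
    z2 = w12 * x1 + w22 * x2 + b2
    a1, a2 = relu(z1), relu(z2)
    # Output layer: linear combination of the hidden activations
    y_hat = w1 * a1 + w2 * a2 + bo
    return z1, z2, a1, a2, y_hat
```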
Loss Calculation
Using Mean Squared Error (MSE) as the loss function:
L = (1/2)(ŷ − y)²
where y is the target output.
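In code this is a one-liner. A small sketch, assuming the 1/2 factor shown above (loss_fn is a hypothetical name):

```python
def loss_fn(y_hat, y):
    # Squared-error loss with the 1/2 factor used in this example
    return 0.5 * (y_hat - y) ** 2
```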
Backward Propagation
Output Layer Gradients:
∂L/∂ŷ = ŷ − y
For weights and bias:
∂L/∂w1 = ∂L/∂ŷ ⋅ a1
∂L/∂w2 = ∂L/∂ŷ ⋅ a2
∂L/∂bo = ∂L/∂ŷ
Hidden Layer Gradients:
∂L/∂a1 = ∂L/∂ŷ ⋅ w1
∂L/∂a2 = ∂L/∂ŷ ⋅ w2
ReLU derivative (1 if z > 0, else 0):
∂L/∂z1 = ∂L/∂a1 ⋅ 1(z1 > 0)
∂L/∂z2 = ∂L/∂a2 ⋅ 1(z2 > 0)
For weights and biases:
∂L/∂w11 = ∂L/∂z1 ⋅ x1
∂L/∂w21 = ∂L/∂z1 ⋅ x2
∂L/∂w12 = ∂L/∂z2 ⋅ x1
∂L/∂w22 = ∂L/∂z2 ⋅ x2
∂L/∂b1 = ∂L/∂z1
∂L/∂b2 = ∂L/∂z2
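All of these gradients follow from the chain rule and can be collected in a single function. The sketch below mirrors the derivation step by step; backward is a hypothetical helper name, and it returns the gradients keyed by parameter name.

```python
def backward(x1, x2, z1, z2, a1, a2, y_hat, y, w1, w2):
    # Gradient of the loss with respect to the prediction
    d_yhat = y_hat - y

    # Output-layer weights and bias
    grads = {"w1": d_yhat * a1, "w2": d_yhat * a2, "bo": d_yhat}

    # Backpropagate through the hidden activations and the ReLU
    d_a1, d_a2 = d_yhat * w1, d_yhat * w2
    d_z1 = d_a1 * (1.0 if z1 > 0 else 0.0)
    d_z2 = d_a2 * (1.0 if z2 > 0 else 0.0)

    # Hidden-layer weights and biases
    grads.update({
        "w11": d_z1 * x1, "w21": d_z1 * x2, "b1": d_z1,
        "w12": d_z2 * x1, "w22": d_z2 * x2, "b2": d_z2,
    })
    return grads
```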
Weight and Bias Update
wij = wij − η ⋅ ∂L/∂wij
bi = bi − η ⋅ ∂L/∂bi
wi = wi − η ⋅ ∂L/∂wi
bo = bo − η ⋅ ∂L/∂bo
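Every parameter follows the same rule, so the update can be written once. A minimal sketch (sgd_update is a hypothetical name; grads refers to the dictionary returned by the backward sketch above):

```python
def sgd_update(param, grad, eta):
    # Generic gradient-descent step: p <- p - eta * dL/dp
    return param - eta * grad

# Example: applying the rule to one weight
# w11 = sgd_update(w11, grads["w11"], eta)
```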
Iteration 1 Calculations
Let's perform the calculations for the first iteration.
Forward Propagation
Hidden Layer Calculation:
z1 = (0.1⋅2) + (0.3⋅3) + 0.1 = 1.2
z2 = (0.2⋅2) + (0.4⋅3) + 0.1 = 1.7
a1 = max(0, 1.2) = 1.2
a2 = max(0, 1.7) = 1.7
Output Layer Calculation:
ŷ = (0.5⋅1.2) + (0.6⋅1.7) + 0.1 = 1.72
Loss Calculation
L = (1/2)(1.72 − 4)² = 2.5992
Backward Propagation
Output Layer Gradients:
∂L/∂ŷ = 1.72 − 4 = −2.28
∂L/∂w1 = −2.28 ⋅ 1.2 = −2.736
∂L/∂w2 = −2.28 ⋅ 1.7 = −3.876
∂L/∂bo = −2.28
Hidden Layer Gradients:
∂L/∂a1 = −2.28 ⋅ 0.5 = −1.14
∂L/∂a2 = −2.28 ⋅ 0.6 = −1.368
∂L/∂z1 = −1.14 ⋅ 1 = −1.14
∂L/∂z2 = −1.368 ⋅ 1 = −1.368
∂L/∂w11 = −1.14 ⋅ 2 = −2.28
∂L/∂w21 = −1.14 ⋅ 3 = −3.42
∂L/∂w12 = −1.368 ⋅ 2 = −2.736
∂L/∂w22 = −1.368 ⋅ 3 = −4.104
∂L/∂b1 = −1.14
∂L/∂b2 = −1.368
Weight and Bias Update
w11 = 0.1 − 0.01 × (−2.28) = 0.1228
w21 = 0.3 − 0.01 × (−3.42) = 0.3342
w12 = 0.2 − 0.01 × (−2.736) = 0.22736
w22 = 0.4 − 0.01 × (−4.104) = 0.44104
w1 = 0.5 − 0.01 × (−2.736) = 0.52736
w2 = 0.6 − 0.01 × (−3.876) = 0.63876
b1 = 0.1 − 0.01 × (−1.14) = 0.1114
b2 = 0.1 − 0.01 × (−1.368) = 0.11368
bo = 0.1 − 0.01 × (−2.28) = 0.1228
Iteration 2 Calculations
Let's perform the calculations for the second iteration with the updated weights and biases.
Forward Propagation
Hidden Layer Calculation:
z1 = (0.1228⋅2) + (0.3342⋅3) + 0.1114 = 1.3596
z2 = (0.22736⋅2) + (0.44104⋅3) + 0.11368 = 1.89152
a1 = max(0, 1.3596) = 1.3596
a2 = max(0, 1.89152) = 1.89152
Output Layer Calculation:
ŷ = (0.52736⋅1.3596) + (0.63876⋅1.89152) + 0.1228 ≈ 2.048026
Loss Calculation
L = (1/2)(2.048026 − 4)² ≈ 1.905101
Backward Propagation
Output Layer Gradients:
∂L/∂ŷ = 2.048026 − 4 = −1.951974
∂L/∂w1 = −1.951974 ⋅ 1.3596 ≈ −2.653904
∂L/∂w2 = −1.951974 ⋅ 1.89152 ≈ −3.692198
∂L/∂bo = −1.951974
Hidden Layer Gradients:
∂L/∂a1 = −1.951974 ⋅ 0.52736 ≈ −1.029393
∂L/∂a2 = −1.951974 ⋅ 0.63876 ≈ −1.246843
∂L/∂z1 = −1.029393 ⋅ 1 = −1.029393
∂L/∂z2 = −1.246843 ⋅ 1 = −1.246843
∂L/∂w11 = −1.029393 ⋅ 2 = −2.058786
∂L/∂w21 = −1.029393 ⋅ 3 = −3.088179
∂L/∂w12 = −1.246843 ⋅ 2 = −2.493686
∂L/∂w22 = −1.246843 ⋅ 3 = −3.740529
∂L/∂b1 = −1.029393
∂L/∂b2 = −1.246843
Weight and Bias Update
w11 = 0.1228 − 0.01 × (−2.058786) ≈ 0.143388
w21 = 0.3342 − 0.01 × (−3.088179) ≈ 0.365082
w12 = 0.22736 − 0.01 × (−2.493686) ≈ 0.252297
w22 = 0.44104 − 0.01 × (−3.740529) ≈ 0.478445
w1 = 0.52736 − 0.01 × (−2.653904) ≈ 0.553899
w2 = 0.63876 − 0.01 × (−3.692198) ≈ 0.675682
b1 = 0.1114 − 0.01 × (−1.029393) ≈ 0.121694
b2 = 0.11368 − 0.01 × (−1.246843) ≈ 0.126148
bo = 0.1228 − 0.01 × (−1.951974) ≈ 0.142320
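Before drawing conclusions, the whole two-iteration walkthrough can be reproduced with a short, self-contained script. This is a sketch specific to this example (fixed input (2, 3), target 4, and the initial parameters above), not a general training routine; the printed values should match the hand calculations up to floating-point rounding.

```python
# Reproduce the two hand-worked iterations of this example end to end.
x1, x2, y = 2.0, 3.0, 4.0                 # inputs and target
eta = 0.01                                # learning rate

w11, w12, w21, w22 = 0.1, 0.2, 0.3, 0.4   # input -> hidden weights
w1, w2 = 0.5, 0.6                         # hidden -> output weights
b1, b2, bo = 0.1, 0.1, 0.1                # biases

for it in (1, 2):
    # Forward propagation
    z1 = w11 * x1 + w21 * x2 + b1
    z2 = w12 * x1 + w22 * x2 + b2
    a1, a2 = max(0.0, z1), max(0.0, z2)
    y_hat = w1 * a1 + w2 * a2 + bo
    loss = 0.5 * (y_hat - y) ** 2

    # Backward propagation (chain rule, as derived above)
    d_yhat = y_hat - y
    d_w1, d_w2, d_bo = d_yhat * a1, d_yhat * a2, d_yhat
    d_z1 = d_yhat * w1 * (1.0 if z1 > 0 else 0.0)
    d_z2 = d_yhat * w2 * (1.0 if z2 > 0 else 0.0)
    d_w11, d_w21, d_b1 = d_z1 * x1, d_z1 * x2, d_z1
    d_w12, d_w22, d_b2 = d_z2 * x1, d_z2 * x2, d_z2

    # Gradient-descent updates
    w11, w21 = w11 - eta * d_w11, w21 - eta * d_w21
    w12, w22 = w12 - eta * d_w12, w22 - eta * d_w22
    w1, w2 = w1 - eta * d_w1, w2 - eta * d_w2
    b1, b2, bo = b1 - eta * d_b1, b2 - eta * d_b2, bo - eta * d_bo

    print(f"iteration {it}: y_hat = {y_hat:.6f}, loss = {loss:.6f}")
```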
Conclusion
After two iterations, the output ŷ has moved closer to the target output (4), rising from 1.72 to roughly 2.048. The loss has decreased from 2.5992 to roughly 1.905, indicating that the network is learning and reducing the error. The weight and bias updates show how the network adjusts its parameters based on the gradients computed during backward propagation. Further iterations would continue this process, reducing the loss and bringing the output closer to the target.