Multivariable linear regression models and neural network models share several similarities, particularly when it comes to their structure and underlying principles. Here are some key similarities:
Multivariable Linear Regression
A multivariable linear regression model is given by:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
where $y$ is the dependent variable, $x_1, x_2, \ldots, x_n$ are the independent variables, and $w_0, w_1, \ldots, w_n$ are the coefficients (weights) of the model.
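To make the model concrete, the prediction is an intercept plus a weighted sum of the features. The sketch below uses NumPy; the helper name `predict_linear` and all numbers are made up for illustration only.

```python
import numpy as np

def predict_linear(x, w0, w):
    """Return w0 + w[0]*x[0] + ... + w[n-1]*x[n-1] for a single example."""
    return w0 + np.dot(w, x)

# Hypothetical values chosen only for illustration.
x = np.array([2.0, 3.0])    # features x1, x2
w = np.array([0.5, -1.0])   # weights w1, w2
w0 = 4.0                    # intercept

y_hat = predict_linear(x, w0, w)
print(y_hat)  # 4.0 + 0.5*2.0 + (-1.0)*3.0 = 2.0
```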
Neural Network Model (Single Layer)
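A simple neural network with a single layer and no hidden layer (a single-layer perceptron) can be represented as:
$$y = \sigma(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n)$$
where $y$ is the output, $x_1, x_2, \ldots, x_n$ are the inputs, $w_0, w_1, \ldots, w_n$ are the weights, and $\sigma$ is an activation function. With the identity activation $\sigma(z) = z$, this is exactly the linear regression model.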
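The structural similarity can be sketched in a few lines: a single neuron computes the same weighted sum as linear regression and then passes it through an activation function. The function names (`neuron`, `sigmoid`, `identity`) and the numbers are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, squashing z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def identity(z):
    """Identity activation: returns the weighted sum unchanged."""
    return z

def neuron(x, w0, w, activation):
    """Single neuron: weighted sum of the inputs followed by an activation."""
    z = w0 + np.dot(w, x)
    return activation(z)

# Hypothetical values chosen only for illustration.
x = np.array([2.0, 3.0])
w = np.array([0.5, -1.0])
w0 = 4.0

linear_output = neuron(x, w0, w, identity)   # same as linear regression: 2.0
sigmoid_output = neuron(x, w0, w, sigmoid)   # sigmoid(2.0), roughly 0.88
print(linear_output, sigmoid_output)
```

Swapping the activation is the only difference between the two calls, which is the sense in which linear regression can be viewed as a one-neuron network with an identity activation.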
=-=-=-=-=-=-=-=-
In multivariable linear regression, the model is represented as:
$$h_w(x) = w_0 + w_1 x_1 + \cdots + w_n x_n$$
The goal is to find the weights $w_0, w_1, \ldots, w_n$ that minimize the cost function, typically the mean squared error (MSE), written here with the conventional factor of $\frac{1}{2}$:
$$J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2$$
where:
- $m$ is the number of training examples.
- $h_w(x^{(i)})$ is the hypothesis function evaluated on the $i$-th training example, with $h_w(x) = w_0 + w_1 x_1 + \cdots + w_n x_n$.
- $y^{(i)}$ is the actual output for the $i$-th training example.
Gradient Descent Algorithm
Gradient descent iteratively updates the weights to minimize the cost function. The update rule for each weight $w_j$ is:
$$w_j := w_j - \alpha \frac{\partial J(w)}{\partial w_j}$$
where:
- $\alpha$ is the learning rate, a small positive number that controls the step size of each update.
- $\frac{\partial J(w)}{\partial w_j}$ is the partial derivative of the cost function with respect to $w_j$.
The partial derivative of the cost function with respect to $w_j$ is:
$$\frac{\partial J(w)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
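The expression for the partial derivative follows from the chain rule. The short derivation below assumes the cost is written with the $\frac{1}{2m}$ factor described in the code comments further down, and uses the convention $x_0^{(i)} = 1$ so that the intercept $w_0$ is handled like any other weight.

```latex
\begin{aligned}
\frac{\partial J(w)}{\partial w_j}
  &= \frac{\partial}{\partial w_j}\,\frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 \\
  &= \frac{1}{2m}\sum_{i=1}^{m} 2\left(h_w(x^{(i)}) - y^{(i)}\right)\frac{\partial h_w(x^{(i)})}{\partial w_j} \\
  &= \frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
\end{aligned}
```

The last step uses $\frac{\partial h_w(x^{(i)})}{\partial w_j} = x_j^{(i)}$, since $h_w$ is linear in the weights; the 2 from differentiating the square cancels the $\frac{1}{2}$ in the cost.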
Gradient Descent Steps
1. Initialize Weights: Start with initial guesses for the weights, typically zero or small random values.
2. Compute Predictions: Compute the predicted values $h_w(x)$ for all training examples.
3. Compute Cost: Calculate the cost function $J(w)$.
4. Compute Gradients: Calculate the partial derivatives of the cost function with respect to each weight.
5. Update Weights: Update the weights using the gradient descent update rule.
6. Repeat: Repeat steps 2-5 until the cost function converges (i.e., changes very little between iterations) or a specified number of iterations is reached.
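The steps above, including the convergence-based stopping rule, might be sketched as follows. This is a minimal illustration rather than a tuned implementation: the function name `gradient_descent_until_converged`, the tolerance `tol`, the cap `max_iterations`, the learning rate, and the toy data are all assumptions chosen for the example.

```python
import numpy as np

def gradient_descent_until_converged(X, y, alpha=0.05, tol=1e-9, max_iterations=10000):
    """Run gradient descent until the cost changes by less than tol between iterations."""
    m = len(y)
    w = np.zeros(X.shape[1])                    # Initialize Weights
    previous_cost = np.inf
    cost = np.inf
    iteration = 0
    for iteration in range(max_iterations):
        h = X.dot(w)                            # Compute Predictions
        cost = np.sum((h - y) ** 2) / (2 * m)   # Compute Cost
        if abs(previous_cost - cost) < tol:     # Stop once the cost has converged
            break
        gradients = X.T.dot(h - y) / m          # Compute Gradients
        w = w - alpha * gradients               # Update Weights
        previous_cost = cost
    return w, cost, iteration

# Toy data lying exactly on the line y = x (first column is the intercept term).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

w, final_cost, iterations_run = gradient_descent_until_converged(X, y)
print("Weights:", w)
print("Final cost:", final_cost)
print("Iterations run:", iterations_run)
```

On this data the weights should approach an intercept of 0 and a slope of 1. A tolerance on the change in cost is only one possible stopping criterion; a threshold on the gradient norm is another common choice.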
=-=-=-=-=-
In this example:
- X is the design matrix, with a column of ones for the intercept term and other columns for the features.
- y is the vector of target values.
- w is the vector of weights.
- alpha is the learning rate.
- num_iterations is the number of iterations for gradient descent.
The gradient_descent function iteratively updates the weights to minimize the cost function and returns the optimized weights and the history of the cost function values.
import numpy as np

# Function to compute the cost
# X: The matrix of input features (including a column of ones for the intercept).
# y: The vector of target values.
# w: The vector of weights (parameters).
# The cost function in linear regression, specifically the mean squared error (MSE), is often
# written with a factor of 1/2 for convenience in mathematical derivations, particularly when
# applying gradient descent. Differentiating the squared error produces a factor of 2, which
# the 1/2 cancels. Using 1/(2m) instead of 1/m does not change the location of the minimum;
# it only simplifies the gradient expression.
def compute_cost(X, y, w):
    m = len(y)
    # Predicted values: dot product of the input matrix X and the weight vector w.
    h = X.dot(w)
    # Cost: half the mean squared error of the predictions.
    cost = (1 / (2 * m)) * np.sum((h - y) ** 2)
    return cost

# Function to perform gradient descent
def gradient_descent(X, y, w, alpha, num_iterations):
    m = len(y)
    cost_history = np.zeros(num_iterations)
    for i in range(num_iterations):
        h = X.dot(w)
        gradients = (1 / m) * X.T.dot(h - y)
        # Update the weights by subtracting the learning rate times the gradients.
        w = w - alpha * gradients
        cost_history[i] = compute_cost(X, y, w)
    return w, cost_history

# Example data
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]])
y = np.array([1, 2, 3, 4, 5])

# Initialize weights
w = np.zeros(X.shape[1])

# Set hyperparameters
alpha = 0.01
num_iterations = 1000

# Perform gradient descent
w, cost_history = gradient_descent(X, y, w, alpha, num_iterations)
print("Weights:", w)
print("Cost history:", cost_history)
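One way to sanity-check the weights found by gradient descent is to compare them with a direct least-squares solve on the same data. The sketch below uses `np.linalg.lstsq`, which is a standard NumPy routine; the variable name `w_exact` is introduced here only for illustration.

```python
import numpy as np

# Same example data as above: first column is the intercept term.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Solve the least-squares problem min ||X w - y||^2 directly.
w_exact, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("Least-squares weights:", w_exact)
```

Because the targets lie exactly on the line y = x, the least-squares weights are an intercept of 0 and a slope of 1, up to floating-point error. With a small learning rate and a fixed iteration budget, the gradient descent weights should land near this solution without matching it exactly; more iterations or a larger (still stable) learning rate would be expected to narrow the gap.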