Introductory Machine Learning Algorithms: Gradient Descent
I. Overview
- Not a machine learning algorithm in itself
- A search-based optimization method
- Purpose: minimize a loss function
- Gradient ascent, by contrast, maximizes a utility function
On a one-dimensional curve, the derivative at a point tells how much the loss function $J$ changes per unit change of the parameter, i.e. $\frac{dJ}{d\theta}$. The derivative also points in the direction in which $J$ increases, so gradient descent moves against it to find the minimum of $J$; one step is $- \eta \frac{dJ}{d \theta}$.
$J = \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$
- $\eta$ is called the learning rate
- Its value affects how quickly the optimum is reached
- A badly chosen $\eta$ may even prevent convergence to the optimum
- $\eta$ is a hyperparameter of gradient descent
Not every function has a unique extremum.
Remedies:
- Run the algorithm several times from random initial points (see the sketch after this list)
- The initial point is itself a hyperparameter of gradient descent
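As a concrete illustration of random restarts, here is a minimal sketch; the toy function `f`, its derivative `df`, and every parameter value below are invented purely for this example:

```python
import numpy as np

# A toy non-convex function with more than one local minimum (for illustration only).
def f(x):
    return x ** 4 - 3 * x ** 2 + x

def df(x):
    return 4 * x ** 3 - 6 * x + 1

def descend(x0, eta=0.01, n_iters=1000, epsilon=1e-8):
    """Plain gradient descent from a single starting point x0."""
    x = x0
    for _ in range(int(n_iters)):
        last_x = x
        x = x - eta * df(x)
        if abs(f(x) - f(last_x)) < epsilon:
            break
    return x

# Random restarts: since the initial point is a hyperparameter,
# try several random ones and keep the result with the smallest loss.
np.random.seed(666)
results = [descend(x0) for x0 in np.random.uniform(-3, 3, size=10)]
best_x = min(results, key=f)
print(best_x, f(best_x))
```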
When gradient descent is used for linear regression, the loss function has a unique optimum.
II. Simulating Gradient Descent with a Quadratic Curve
import numpy as np
import matplotlib.pyplot as plt
# quadratic curve y = (x - 2.5)^2 - 1
plot_x = np.linspace(-1, 6, 141)
plot_y = (plot_x - 2.5) ** 2 - 1
# derivative of J at theta
def dJ(theta):
return 2*(theta-2.5)
# value of J at theta
def J(theta):
try:
return (theta -2.5)**2-1.
except:
return float('inf')
# Gradient descent. Every theta is recorded so the search path can be plotted.
# When two consecutive values of J differ by less than epsilon they are treated as equal;
# floating-point arithmetic has rounding error, so never compare them with == directly.
# If the loss keeps growing, J overflows to inf, hence the try/except above to avoid an
# error, and the iteration cap below to avoid an infinite loop.
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
theta = initial_theta
theta_history.append(initial_theta)
i_iter = 0
while i_iter < n_iters:
gradient = dJ(theta)
last_theta = theta
theta = theta - eta * gradient
theta_history.append(theta)
if abs(J(theta) - J(last_theta)) < epsilon:
break
i_iter += 1
def plot_theta_history():
plt.plot(plot_x, J(plot_x))
plt.plot(np.array(theta_history), J(np.array(theta_history)), color="r", marker="+")
plt.show()
# Case 1: eta = 0.01
eta = 0.01
theta_history = []
gradient_descent(0., eta)
plot_theta_history()
# Case 2: a larger learning rate, eta = 0.9
eta = 0.9
theta_history = []
gradient_descent(0., eta)
plot_theta_history()
# Case 3: eta = 1.1 is too large; the iteration diverges (stopped by the n_iters cap)
eta = 1.1
theta_history = []
gradient_descent(0., eta)
plot_theta_history()
III. Gradient Descent in Multiple Linear Regression
The gradient is a direction: it points where $J$ increases fastest, so each step moves by $- \eta \nabla J$, where $\nabla J = (\frac{\partial J}{\partial \theta_0},\frac{\partial J}{\partial \theta_1},...,\frac{\partial J}{\partial \theta_n})$.
(1) Analysis
Goal: make $\displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$ as small as possible, where $\hat y^{(i)} = \theta_0 + \theta_1X^{(i)}_1 + \theta_2X^{(i)}_2 +...+ \theta_nX^{(i)}_n$.
Substituting gives: $\displaystyle\sum_{i=1}^m (y^{(i)} - \theta_0 - \theta_1X^{(i)}_1 - \theta_2X^{(i)}_2 -...- \theta_nX^{(i)}_n)^2$
$$ \nabla J(\theta) = \left(\begin{matrix} \partial J / \partial \theta_0 \\ \partial J / \partial \theta_1 \\ \partial J / \partial \theta_2 \\ \cdots \\ \partial J / \partial \theta_n \end{matrix} \right) = \left(\begin{matrix} \displaystyle\sum_{i=1}^m 2(y^{(i)} - X_b^{(i)} \theta )\cdot (-1) \\ \displaystyle\sum_{i=1}^m 2(y^{(i)} - X_b^{(i)} \theta )\cdot (-X_1^{(i)}) \\ \displaystyle\sum_{i=1}^m 2(y^{(i)} - X_b^{(i)} \theta )\cdot (-X_2^{(i)}) \\ \cdots \\ \displaystyle\sum_{i=1}^m 2(y^{(i)} - X_b^{(i)} \theta )\cdot (-X_n^{(i)}) \\ \end{matrix} \right) = 2 \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) $$
where $X_b^{(i)} = (1, X_1^{(i)}, X_2^{(i)}, \dots, X_n^{(i)})$ is the $i$-th sample with a constant feature 1 prepended; stacking these rows gives the matrix $X_b$.
Each component of the gradient above grows with the number of samples m; its value should not depend on m, so we divide by m. This gives:
Goal: make $\frac{1}{m} \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$ (the MSE) as small as possible.
(Sometimes $\frac{1}{2m} \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$ is used instead; the difference is minor.)
$$ J(\theta) = MSE(y, \hat y) \qquad \nabla J(\theta) = \left(\begin{matrix} \partial J / \partial \theta_0 \\ \partial J / \partial \theta_1 \\ \partial J / \partial \theta_2 \\ \cdots \\ \partial J / \partial \theta_n \end{matrix} \right) = \frac{2}{m} \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) $$
(2) Implementation
Add a training method based on gradient descent to the file LinearRegression.py. The added code is:
    # train with (batch) gradient descent
def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
"""根据训练集X_train,y_train,使用梯度下降法训练出Linear Regression模型"""
assert X_train.shape[0] == y_train.shape[0], \
"the size of X_trian must be equal to the size of y_train."
def J(theta, X_b, y):
try:
return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
except:
return float('inf')
def dJ(theta, X_b, y):
res = np.empty(len(theta))
res[0] = np.sum(X_b.dot(theta)-y)
for i in range(1, len(theta)):
res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
return res * 2 / len(X_b)
def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
theta = initial_theta
cur_iter = 0
while cur_iter < n_iters:
gradient = dJ(theta, X_b, y)
last_theta = theta
theta = theta - eta * gradient
if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
break
cur_iter += 1
return theta
X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
initial_theta = np.zeros(X_b.shape[1])
self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)
        self.intercept_ = self._theta[0]
self.coef_ = self._theta[1:]
return self
(3) Testing
1. Prepare the data
import numpy as np
import matplotlib.pyplot as plt
# simulate data close to y = 3 * x + 4
np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 3 + 4 + np.random.normal(size=100)
X = x.reshape(-1, 1)
plt.scatter(x, y)
plt.show()
2. Test the model
from playML.LinearRegression import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit_gd(X, y)
# coefficients
lin_reg.coef_
# intercept
lin_reg.intercept_
IV. Optimizing the Algorithm: Vectorization
(1) Analysis
Write the gradient from Section III so that every component has the same form, by introducing a fictitious feature $X_0^{(i)} \equiv 1$:
$$ \nabla J(\theta) = \frac{2}{m} \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)})\cdot X_0^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) $$
$$ \downarrow $$
$$ \frac{2}{m} \cdot \left(X_b^{(1)}\theta - y^{(1)},\; X_b^{(2)}\theta - y^{(2)},\; X_b^{(3)}\theta - y^{(3)},\; \dots,\; X_b^{(m)}\theta - y^{(m)}\right) \cdot \left(\begin{matrix} X_0^{(1)} & X_1^{(1)} & X_2^{(1)} & \cdots & X_n^{(1)} \\ X_0^{(2)} & X_1^{(2)} & X_2^{(2)} & \cdots & X_n^{(2)} \\ X_0^{(3)} & X_1^{(3)} & X_2^{(3)} & \cdots & X_n^{(3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ X_0^{(m)} & X_1^{(m)} & X_2^{(m)} & \cdots & X_n^{(m)} \\ \end{matrix}\right) $$
$$ \Downarrow $$
$$ = (\frac{2}{m} \cdot (X_b\theta - y)^T \cdot X_b)^T \\ = \frac{2}{m} \cdot X_b^T \cdot (X_b\theta - y) $$
$$ \Downarrow $$
$$ \nabla J(\theta) = \frac{2}{m} \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)})\cdot X_0^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) = \frac{2}{m} \cdot X_b^T \cdot (X_b\theta - y) $$
(2) Implementation
Only the function dJ needs to change (in LinearRegression.py):
def dJ(theta, X_b, y):
# res = np.empty(len(theta))
# res[0] = np.sum(X_b.dot(theta)-y)
# for i in range(1, len(theta)):
# res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
# return res * 2 / len(X_b)
return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(X_b)
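As a quick sanity check (not part of the original file), the vectorized expression can be compared against the earlier loop version on some random data; the data shapes below are arbitrary:

```python
import numpy as np

def dJ_loop(theta, X_b, y):
    # component-by-component version from the previous section
    res = np.empty(len(theta))
    res[0] = np.sum(X_b.dot(theta) - y)
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
    return res * 2 / len(X_b)

def dJ_vec(theta, X_b, y):
    # vectorized version: (2/m) * X_b^T (X_b theta - y)
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(X_b)

np.random.seed(666)
X_b = np.hstack([np.ones((100, 1)), np.random.random((100, 5))])
y = np.random.random(100)
theta = np.random.random(X_b.shape[1])
print(np.allclose(dJ_loop(theta, X_b, y), dJ_vec(theta, X_b, y)))  # True
```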
(3) Testing
1. Prepare the data
import numpy as np
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
# split the dataset
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
2. Test the model
from playML.LinearRegression import LinearRegression
lin_reg2 = LinearRegression()
%time lin_reg2.fit_gd(X_train, y_train)
<font color=red>Problem:</font> the coefficients all come out as nan, which means the loss J kept growing during gradient descent until it overflowed. Reduce $\eta$ as follows:
%time lin_reg2.fit_gd(X_train, y_train, eta=0.000001)
lin_reg2.score(X_test, y_test)
The final $R^2$ is only about 0.27, a poor fit. Increasing the iteration count n_iters improves $R^2$ somewhat, but training becomes far too slow:
%time lin_reg2.fit_gd(X_train, y_train, eta=0.000001, n_iters=1e6)
lin_reg2.score(X_test, y_test)
The reason is that the <font color=red>features are on very different scales</font>. Inspecting the feature values of the first 10 training samples shows that their magnitudes differ widely, so a single learning rate produces step sizes that are far too large in some dimensions and far too small in others. Therefore the data must be <font color=red>normalized</font>; a quick check of the scales is sketched below.
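The snippet below (assuming X_train from the data-preparation step above) is one simple way to see the scale differences:

```python
# The per-feature ranges differ by several orders of magnitude,
# which is why one learning rate cannot suit every dimension.
print(X_train[:10, :])        # first 10 training samples
print(X_train.min(axis=0))    # per-feature minimum
print(X_train.max(axis=0))    # per-feature maximum
```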
3. Normalize the data
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
lin_reg3 = LinearRegression()
%time lin_reg3.fit_gd(X_train_standard, y_train)
X_test_standard = standardScaler.transform(X_test)
lin_reg3.score(X_test_standard, y_test)
(4) Advantages of gradient descent
m = 1000
n = 8000
big_X = np.random.normal(size=(m, n))
true_theta = np.random.uniform(0.0, 100.0, size=n+1)
big_y = big_X.dot(true_theta[1:]) + true_theta[0] + np.random.normal(0., 10., size=m)
big_reg1 = LinearRegression()
%time big_reg1.fit_normal(big_X, big_y)
#Out: Wall time: 15 s
big_reg2 = LinearRegression()
%time big_reg2.fit_gd(big_X, big_y)
#Out: Wall time: 2.11 s
The larger the number of features n, the bigger this gap. On the other hand, as the number of samples m grows, batch gradient descent slows down, because every sample participates in every gradient computation. This motivates a further improvement: <font color=red>stochastic gradient descent</font>.
V. Stochastic Gradient Descent
(1) Analysis
$$ \nabla J(\theta) = \frac{2}{m} \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_0^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) $$
This expression shows that every gradient evaluation uses all m samples, which is why this variant is also called batch gradient descent (Batch Gradient Descent).
Improvement: instead of sweeping over all m samples for every update, compute the gradient from a single sample:
$$ = 2 \left(\begin{matrix} (X_b^{(i)} \theta - y^{(i)}) \cdot X_0^{(i)} \\ (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) = 2 \cdot (X_b^{(i)})^T \cdot (X_b^{(i)}\theta - y^{(i)}) $$
During the stochastic search, the learning rate $\eta$ should shrink gradually, so that once the search reaches the neighbourhood of the minimum it does not wander away again: $\eta = \frac{a}{t + b}$, where t counts the updates performed so far, with the empirically chosen values a = 5, b = 50 (t0 and t1 in the code below).
This is in fact the idea behind simulated annealing.
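For intuition, the values produced by this schedule follow directly from the formula with a = 5, b = 50:

```python
# eta = a / (t + b): large steps early on, small steps later
a, b = 5, 50
for t in [0, 10, 100, 1000, 10000]:
    print(t, a / (t + b))
# 0 0.1, 10 0.0833..., 100 0.0333..., 1000 0.00476..., 10000 0.000497...
```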
(2) Implementation
Add a fit_sgd method to LinearRegression.py:
    # stochastic gradient descent
def fit_sgd(self, X_train, y_train, n_iters=5, t0=5, t1=50):
"""根据训练集X_train, y_train, 使用随机梯度下降法训练Linear Regression模型"""
assert X_train.shape[0] == y_train.shape[0], \
"the size of X_train must be equal to the size of y_train"
assert n_iters >= 1, \
"the n_iters must be large than 0"
def dJ_sgd(theta, X_b_i, y_i):
return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2
def sgd(X_b, y, initial_theta, n_iters, t0=5, t1=50):
            # learning-rate schedule: eta shrinks as t grows
def learning_rate(t):
return t0 / (t + t1)
theta = initial_theta
m = len(X_b)
for cur_iter in range(n_iters):
indexes = np.random.permutation(m)
X_b_new = X_b[indexes]
y_new = y[indexes]
for i in range(m):
gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
theta = theta - learning_rate(cur_iter*m + i) * gradient
return theta
X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
initial_theta = np.zeros(X_b.shape[1])
self._theta = sgd(X_b, y_train, initial_theta, n_iters, t0, t1)
self.intercept_ = self._theta[0]
self.coef_ = self._theta[1:]
return self
(3) Testing
1. Prepare the data
import numpy as np
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
# split the dataset
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
2. Normalize the data (preprocessing)
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
3. Test the model
from playML.LinearRegression import LinearRegression
lin_reg = LinearRegression()
%time lin_reg.fit_sgd(X_train_standard, y_train, n_iters=2)
lin_reg.score(X_test_standard, y_test)
%time lin_reg.fit_sgd(X_train_standard, y_train, n_iters=50)
lin_reg.score(X_test_standard, y_test)
%time lin_reg.fit_sgd(X_train_standard, y_train, n_iters=100)
lin_reg.score(X_test_standard, y_test)
4. Using the implementation from scikit-learn
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor()
%time sgd_reg.fit(X_train_standard, y_train)
sgd_reg.score(X_test_standard, y_test)
sgd_reg = SGDRegressor(n_iter_no_change=100)
%time sgd_reg.fit(X_train_standard, y_train)
sgd_reg.score(X_test_standard, y_test)
VI. Debugging the Gradient
(1) Analysis
Idea: approximate the gradient (derivative) numerically from the slope of a secant line and use it to check the analytic gradient.
Formula: $y'|_{x=x_0} = \displaystyle \lim_{x\rightarrow x_0} \frac{f(x) - f(x_0)}{x-x_0}$
$$ \text{slope / derivative at a point } \theta \text{ in two dimensions} \qquad \Rightarrow \qquad \frac{dJ}{d\theta} \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon} $$
Extending to higher dimensions:
$$ \Rightarrow \qquad \begin{cases} \theta = (\theta_0, \theta_1, \theta_2, \dots,\theta_n) \\ \dfrac{\partial J } {\partial \theta } = \left(\dfrac{\partial J } {\partial \theta_0 }, \dfrac{\partial J } {\partial \theta_1 }, \dfrac{\partial J } {\partial \theta_2 }, \cdots, \dfrac{\partial J } {\partial \theta_n }\right) \end{cases} $$
- For $\theta_0$:
$$ \begin{cases} \theta_0^+ = (\theta_0 + \epsilon,\theta_1,\theta_2,...,\theta_n) \\ \theta_0^- = (\theta_0 - \epsilon,\theta_1,\theta_2,...,\theta_n) \\ \frac{\partial J}{\partial \theta_0} \approx \frac{J(\theta_0^+) - J(\theta_0^-)}{2\epsilon} \end{cases} $$
- For $\theta_1$:
$$ \begin{cases} \theta_1^+ = (\theta_0,\theta_1 + \epsilon,\theta_2,...,\theta_n) \\ \theta_1^- = (\theta_0,\theta_1 - \epsilon,\theta_2,...,\theta_n) \\ \frac{\partial J}{\partial \theta_1} \approx \frac{J(\theta_1^+) - J(\theta_1^-)}{2\epsilon} \end{cases} $$
And so on for the remaining components.
(2) Implementation
I wrapped this routine in the file myMLDebug.py; a sketch of what such a helper can look like follows.
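The file itself is not reproduced in this post, so the following is only a minimal sketch built directly from the formulas above (the function name dJ_debug and its signature are my own choices, not necessarily what myMLDebug.py contains):

```python
import numpy as np

def dJ_debug(J, theta, X_b, y, epsilon=0.01):
    """Approximate the gradient of J at theta with central differences.

    J is a loss function of the form J(theta, X_b, y). This needs two
    evaluations of J per dimension, so it is slow and only meant for
    verifying an analytic gradient, not for training.
    """
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        res[i] = (J(theta_plus, X_b, y) - J(theta_minus, X_b, y)) / (2 * epsilon)
    return res
```

A typical check is to evaluate both gradients at the same random theta and confirm they are close, e.g. with np.allclose, using the same J and vectorized dJ as in the gradient-descent code above.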
VII. Summary
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent (a brief sketch follows this list)
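Mini-batch gradient descent is not implemented in the code above; the sketch below only illustrates the idea (the batch_size parameter and both helper functions are my own, written in the same style as fit_sgd): each update uses a small random batch rather than one sample or all m samples.

```python
import numpy as np

def dJ_mbgd(theta, X_b_batch, y_batch):
    # gradient of the MSE on one mini-batch, same formula as the vectorized dJ
    return X_b_batch.T.dot(X_b_batch.dot(theta) - y_batch) * 2.0 / len(X_b_batch)

def mbgd(X_b, y, initial_theta, n_iters=5, batch_size=16, t0=5, t1=50):
    theta = initial_theta
    m = len(X_b)
    t = 0
    for _ in range(n_iters):                   # n_iters passes over the data
        indexes = np.random.permutation(m)     # shuffle once per pass
        for start in range(0, m, batch_size):  # walk through the shuffled batches
            batch = indexes[start:start + batch_size]
            gradient = dJ_mbgd(theta, X_b[batch], y[batch])
            theta = theta - (t0 / (t + t1)) * gradient
            t += 1
    return theta
```

Compared with fit_sgd, the only change is that each gradient comes from batch_size rows of X_b, which trades some of SGD's noise for the vectorization speed of batch gradient descent.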
Advantages of the randomness:
- It can help jump out of a local optimum and find the global optimum
- Faster training
Many algorithms in machine learning exploit randomness, for example:
- Random search
- Random forests
Gradient ascent:
$$ +\eta \frac{dJ}{d\theta} $$
For an algorithm that uses gradient ascent, see the companion article "Introductory Machine Learning Algorithms: Principal Component Analysis (PCA)".