机器学习入门算法之梯度下降法(Gradient Descent)

一、概述

  • 不是一个机器学习算法
  • 是一种基于搜索的最优化算法
  • 作用:最小化一个损失函数
  • 梯度上升法:最大化一个效用函数

在函数曲线中,一点的导数代表自变量单位变化时,$损失函数J$响应的变化量,即为$\frac{dJ}{d\theta}$,另一方面,导数可以代表$损失函数J$增大的方向,通过梯度下降找到$损失函数J$的最小值,梯度的步长可表示为:$- \eta \frac{dJ}{d \theta} $.

$J = \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$

  • $\eta$ 称为学习率(learning rate)
  • $\eta$ 的取值影响获得最优解的速度
  • $\eta$ 取值不合适,甚至得不到最优解
  • $\eta$ 是梯度下降法的一个超参数

并不是所有函数都有唯一的极值点

解决方案:

  • 多次运行,随机化初始点
  • 梯度下降法的初始点也是一个超参数

在线性回归中使用梯度下降法,线性回归法的损失函数具有唯一的最优解。

二、用二次曲线模拟梯度下降法

import numpy as np
import matplotlib.pyplot as plt

#二次曲线y = (x-2.5)^2 -1
plot_x = np.linspace(-1, 6, 141)
plot_y = (plot_x - 2.5) ** 2 - 1

#求导函数值的函数
def dJ(theta):
    return 2*(theta-2.5)
#求函数值的函数
def J(theta):
    try:
        return (theta -2.5)**2-1.
    except:
        return float('inf')
#梯度下降,把每一次的值记录下来以便绘图,当两次结果相差少于epsilon时,认为两个数值相等,计算机处理浮点数有精度误差,所以不要直接相等。当损失函数越来越大的时候,损失函数J会超出最大值inf,所以上面捕捉了异常,纺织报错,下面也限制了最大循环次数防止死循环。
def gradient_descent(initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    theta_history.append(initial_theta)
    i_iter = 0
    while i_iter < n_iters:
        gradient = dJ(theta)
        last_theta = theta
        theta = theta - eta * gradient
        theta_history.append(theta)

        if abs(J(theta) - J(last_theta)) < epsilon:
            break
        
        i_iter += 1
            
def plot_theta_history():
    plt.plot(plot_x, J(plot_x))
    plt.plot(np.array(theta_history), J(np.array(theta_history)), color="r", marker="+")
    plt.show()
 
    
#用例1
eta = 0.01
theta_history = []
gradient_descent(0., eta)
plot_theta_history()

#用例2
eta = 0.9
theta_history = []
gradient_descent(0., eta)
plot_theta_history()

#用例3
eta = 1.1
theta_history = []
gradient_descent(0., eta)
plot_theta_history()

#用例4
eta = 1.1
theta_history = []
gradient_descent(0., eta)
plot_theta_history()

三、多元线性回归中的梯度下降法

梯度代表方向,对应$J$增大最快的方向:$- \eta \nabla J$,$\nabla J = (\frac{\partial J}{\partial \theta_0},\frac{\partial J}{\partial \theta_1},...,\frac{\partial J}{\partial \theta_n})$

(一)分析

目标:使 $\displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$ 尽可能小,($\hat y^{(i)} = \theta_0 + \theta_1X^{(i)}_1 + \theta_2X^{(i)}_2 +...+ \theta_nX^{(i)}_n$)

代入得:$\displaystyle\sum_{i=1}^m (y^{(i) }- \theta_0 - \theta_1X^{(i)}_1 - \theta_2X^{(i)}_2 -...- \theta_nX^{(i)}_n)$

$$ \nabla J(\theta) = \left(\begin{matrix} \partial J / \partial \theta_0 \\ \partial J / \partial \theta_1 \\ \partial J / \partial \theta_2 \\ \cdots \\ \partial J / \partial \theta_n \end{matrix} \right) = \left(\begin{matrix} \displaystyle\sum_{i=1}^m 2(y^{(i)} - X_b^{(i)} \theta )\cdot (-1) \\ \displaystyle\sum_{i=1}^m 2(y^{(i)} - X_b^{(i)} \theta )\cdot (-X_1^{(i)}) \\ \displaystyle\sum_{i=1}^m 2(y^{(i)} - X_b^{(i)} \theta )\cdot (-X_2^{(i)}) \\ \cdots \\ \displaystyle\sum_{i=1}^m 2(y^{(i)} - X_b^{(i)} \theta )\cdot (-X_n^{(i)}) \\ \end{matrix} \right) = 2 \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) $$

其中$X_b = (1 , X_1, X_2, ..., X_n)^T$

解出来的梯度中,每个元素的值应该与m值无关,所以这里除以一个m。得出,

目标:使 $\frac{1}{m} \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$ (MSE)尽可能小。

有时取:使 $\frac{1}{2m} \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$ ,不过差别不大

$$ J(\theta) = MSE(y, \hat y) \qquad \nabla J(\theta) = \left(\begin{matrix} \partial J / \partial \theta_0 \\ \partial J / \partial \theta_1 \\ \partial J / \partial \theta_2 \\ \cdots \\ \partial J / \partial \theta_n \end{matrix} \right) = \frac{2}{m} \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) $$

(二)编码

在文件LinearRegression.py中添加梯度下降法训练的函数。添加的代码如下:

    # 梯度下降法训练
    def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
        """根据训练集X_train,y_train,使用梯度下降法训练出Linear Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_trian must be equal to the size of y_train."

        def J(theta, X_b, y):
            try:
                return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
            except:
                return float('inf')

        def dJ(theta, X_b, y):
            res = np.empty(len(theta))
            res[0] = np.sum(X_b.dot(theta)-y)
            for i in range(1, len(theta)):
                res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
            return res * 2 / len(X_b)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
            theta = initial_theta
            cur_iter = 0
            while cur_iter < n_iters:
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                theta = theta - eta * gradient

                if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
                    break

                cur_iter += 1
            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

(三)算法测试

1.准备数据
import numpy as np
import matplotlib.pyplot as plt

#模拟一个接近y = x * 3 + 4的函数
np.random.seed(666)
x = 2 * np.random.random(size=100)
y = x * 3 + 4 + np.random.normal(size=100)
X = x.reshape(-1, 1)

plt.scatter(x, y)
plt.show()

2.模型测试
from playML.LinearRegression import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit_gd(X, y)

#系数
lin_reg.coef_
#截距
lin_reg.intercept_

四、算法优化——向量化

(一)分析

$$ = \frac{2}{m} \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)})\cdot X_0^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) \qquad X_0^{(i)} === 1 $$

$$ \downarrow \\ { \frac{2}{m} \cdot (X_b^{(1)}\theta - y^{(1)}, X_b^{(2)}\theta - y^{(2)}, X_b^{(3)}\theta - y^{(3)} ,..., X_b^{(m)}\theta - y^{(m)}) } \\ \left(\begin{matrix} X_0^{(1)} & X_1^{(1)} & X_2^{(1)} &\cdots& X_n^{(1)} \\ X_0^{(2)} & X_1^{(2)} & X_2^{(2)} &\cdots& X_n^{(2)} \\ X_0^{(3)} & X_1^{(3)} & X_2^{(3)} &\cdots& X_n^{(3)} \\ \cdots& \cdots& \cdots& \ddots& \cdots& \\ X_0^{(m)} & X_1^{(m)} & X_2^{(m)} &\cdots& X_n^{(m)} \\ \end{matrix}\right) $$

$$ \Downarrow $$

$$ = (\frac{2}{m} \cdot (X_b\theta - y)^T \cdot X_b)^T \\ = \frac{2}{m} \cdot X_b^T \cdot (X_b\theta - y) $$

$$ \Downarrow $$

$$ \nabla J(\theta) = \frac{2}{m} \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)})\cdot X_0^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) = \frac{2}{m} \cdot X_b^T \cdot (X_b\theta - y) $$

(二)编码

只需要修改函数$dJ$即可。
LinearRegression.py

        def dJ(theta, X_b, y):
            # res = np.empty(len(theta))
            # res[0] = np.sum(X_b.dot(theta)-y)
            # for i in range(1, len(theta)):
            #     res[i] = (X_b.dot(theta) - y).dot(X_b[:, i])
            # return res * 2 / len(X_b)
            return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(X_b)

(三)算法测试

1. 准备数据
import numpy as np
from sklearn import datasets

boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

#分离数据集
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
2. 模型测试
from playML.LinearRegression import LinearRegression
lin_reg2 = LinearRegression()
%time lin_reg2.fit_gd(X_train, y_train)

<font color=red>报异常:</font>发现系数均为nan,说明在梯度的过程中损失函数J越来越大,最后nan了,将$\eta$调小。如下

%time lin_reg2.fit_gd(X_train, y_train, eta=0.000001)
lin_reg2.score(X_test, y_test)

算法最终结果得到的$R^2$的值只有0.27,效率很低,考虑将运行次数n_iters调大,如下图。虽然$R^2$值有所提高,但是速度太慢了。

%time lin_reg2.fit_gd(X_train, y_train, eta=0.000001, n_iters=1e6)
lin_reg2.score(X_test, y_test)

之所以出现这种情况,是因为<font color=red>数据不在一个规模</font>上,查看训练集的前10个样本特征值如下图,可以看出规模有很大差别,就会导致步长有很大差别。

因此,要进行<font color=red>数据归一化</font>

3. 数据归一化
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()
standardScaler.fit(X_train)

X_train_standard = standardScaler.transform(X_train)
lin_reg3 = LinearRegression()
%time lin_reg3.fit_gd(X_train_standard, y_train) 

X_test_standard = standardScaler.transform(X_test)
lin_reg3.score(X_test_standard, y_test)

(四)梯度下降法的优势

m = 1000
n = 8000

big_X = np.random.normal(size=(m, n))
true_theta = np.random.uniform(0.0, 100.0, size=n+1)
big_y = big_X.dot(true_theta[1:]) + true_theta[0] + np.random.normal(0., 10., size=m)

big_reg1 = LinearRegression()
%time big_reg1.fit_normal(big_X, big_y)
#Out: Wall time: 15 s
big_reg2 = LinearRegression()
%time big_reg2.fit_gd(big_X, big_y)
#Out: Wall time: 2.11 s

当样本特征数量n越大是差距越明显,但是,当样本数量m变大时,梯度下降法的效率就会降低,因为算法的每一个样本都要参与运算。此时,需要对算法进行改进——<font color=red>随机梯度下降法</font>

五、随机梯度下降法(Stochastic Gradient Descent)

(一)分析

$$ \nabla J(\theta) = \frac{2}{m} \left(\begin{matrix} \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_0^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ \displaystyle\sum_{i=1}^m (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) $$

由式子可以看出每次计算都要把m个样本都参与进行运算,这种方法又称为批量梯度下降法(Batch Gradient Descent)

改进。把每一次运算都要遍历m个样本改为只对一个样本进行计算

$$ = 2 \left(\begin{matrix} (X_b^{(i)} \theta - y^{(i)}) \cdot X_0^{(i)} \\ (X_b^{(i)} \theta - y^{(i)}) \cdot X_1^{(i)} \\ (X_b^{(i)} \theta - y^{(i)}) \cdot X_2^{(i)} \\ \cdots \\ (X_b^{(i)} \theta - y^{(i)}) \cdot X_n^{(i)} \\ \end{matrix} \right) = 2 \cdot (X_b^{(i)})^T \cdot (X_b^{(i)}\theta - y^{(i)}) $$

在随机梯度搜索过程中,学习率$\eta$应该逐渐变小,防止当搜索到最小值附近时又逐渐偏离最小值。$\eta = \frac{a}{i_iters + b}$,这里采用经验法。a = 5, b = 50。

事实上这是 模拟退火的思想

(二)编码

添加函数fit_sgd,代码如下:
LinearRegression.py

#随机梯度下降法
    def fit_sgd(self, X_train, y_train, n_iters=5, t0=5, t1=50):
        """根据训练集X_train, y_train, 使用随机梯度下降法训练Linear Regression模型"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert n_iters >= 1, \
            "the n_iters must be large than 0"

        def dJ_sgd(theta, X_b_i, y_i):
            return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2

        def sgd(X_b, y, initial_theta, n_iters, t0=5, t1=50):
            #计算eta值
            def learning_rate(t):
                return t0 / (t + t1)

            theta = initial_theta
            m = len(X_b)
            for cur_iter in range(n_iters):
                indexes = np.random.permutation(m)
                X_b_new = X_b[indexes]
                y_new = y[indexes]
                for i in range(m):
                    gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
                    theta = theta - learning_rate(cur_iter*m + i) * gradient

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        self._theta = sgd(X_b, y_train, initial_theta, n_iters, t0, t1)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

(三)算法测试

1. 准备数据
import numpy as np
from sklearn import datasets

boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

#分离数据集
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
2.数据归一化处理(预处理)
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()
standardScaler.fit(X_train)

X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
3.模型测试
from playML.LinearRegression import LinearRegression

lin_reg = LinearRegression()
%time lin_reg.fit_gd(X_train_standard, y_train, n_iters=2) 
lin_reg.score(X_test_standard, y_test)

%time lin_reg.fit_sgd(X_train_standard, y_train, n_iters=50) 
lin_reg.score(X_test_standard, y_test)

%time lin_reg.fit_sgd(X_train_standard, y_train, n_iters=100) 
lin_reg.score(X_test_standard, y_test)

4.使用scikit-learn中封装的方法
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor()
%time sgd_reg.fit(X_train_standard, y_train)
sgd_reg.score(X_test_standard, y_test)

sgd_reg = SGDRegressor(n_iter_no_change=100)
%time sgd_reg.fit(X_train_standard, y_train)
sgd_reg.score(X_test_standard, y_test)

六、梯度的调试

(一)分析

原理:利用求曲线切线斜率的方法求梯度(导数)从而进行验证。

公式:$y'|_{x=x_0} = \displaystyle \lim_{x\rightarrow x_0} \frac{f(x) - f(x_0)}{x-x_0}$

$$ 二维平面上某点 \theta 的 斜率/导数 \qquad \Rightarrow \qquad \frac{dJ}{d\theta}=\frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon} $$

扩展到高维:

$$ \Rightarrow \qquad \begin{cases} \theta = (\theta_0, \theta_1, \theta_2, ...,\theta_n) \\ \frac{\partial J } {\partial \theta } = \frac{\partial J } {\partial \theta_0 }, \frac{\partial J } {\partial \theta_1 }, \frac{\partial J } {\partial \theta_2 }, \cdots, \frac{\partial J } {\partial \theta_n } \end{cases} $$

  • 对于$\theta_0$:

$$ \begin{cases} \theta_0^+ = (\theta_0 + \epsilon,\theta_1,\theta_2,,...,\theta_n) \\ \theta_0^- = (\theta_0 - \epsilon,\theta_1,\theta_2,,...,\theta_n) \\ \frac{\partial J}{\partial \theta_0} = \frac{J(\theta_0^+) - J(\theta_0^-)}{2\epsilon} \end{cases} $$

  • 对于$\theta_1$:

$$ \begin{cases} \theta_1^+ = (\theta_0,\theta_1 + \epsilon,\theta_2,,...,\theta_n) \\ \theta_1^- = (\theta_0,\theta_1 - \epsilon,\theta_2,,...,\theta_n) \\ \frac{\partial J}{\partial \theta_1} = \frac{J(\theta_1^+) - J(\theta_1^-)}{2\epsilon} \end{cases} $$

以此类推。

(二)编码

这里我将方法封装在了文件myMLDebug.py文件里。

七、总结

  • 批量梯度下降法
  • 随机梯度下降法
  • 小批量梯度下降法

随机的优点:

  • 跳出局部最优解,找到全局最优解
  • 更快的运行速度
  • 机器学习领域很多算法都要使用随机的特点

    • 随机搜索
    • 随机森林

梯度上升法:

$$ +\eta \frac{dJ}{d\theta} $$

有关梯度上升法的算法:见 机器学习入门算法之主成分分析法:

Last modification:February 7th, 2020 at 12:51 pm
如果觉得我的文章对你有用,请随意赞赏~