Introduction to Machine Learning Algorithms: Linear Regression

Test data used in this article: the Boston housing dataset from sklearn.datasets.

I. Simple Linear Regression

(1) Analytic Solution

$$ \text{Goal: make } \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2 \text{ as small as possible, where } \hat y^{(i)} = ax^{(i)} + b \\ \downarrow $$

$$ \text{Goal: find } a \text{ and } b \text{ that make } \displaystyle\sum_{i=1}^m (y^{(i)} - ax^{(i)} - b)^2 \text{ as small as possible} \quad ① \\ \downarrow $$

$$ \text{A classic least squares problem: minimize the squared error} \\ \downarrow $$

$$ a = \frac{ \displaystyle\sum_{i=1}^m(x^{(i)}-\bar x)(y^{(i)}-\bar y) }{ \displaystyle\sum_{i=1}^m(x^{(i)}-\bar x)^2 } \qquad b = \bar y - a\bar x \quad ② $$

Sketch of the step from ① to ②: use derivatives.

$$ J(a,b) = \displaystyle\sum_{i=1}^m (y^{(i)} - ax^{(i)} - b)^2 \\ \frac{\partial J(a,b)}{\partial a} = 0 \qquad \frac{\partial J(a,b)}{\partial b} = 0 \\ \text{Solving these two equations gives the result.} $$
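Spelling out the two derivative conditions, a brief sketch of the standard derivation:

$$ \frac{\partial J(a,b)}{\partial b} = -2\displaystyle\sum_{i=1}^m (y^{(i)} - ax^{(i)} - b) = 0 \;\Rightarrow\; b = \bar y - a\bar x $$

$$ \frac{\partial J(a,b)}{\partial a} = -2\displaystyle\sum_{i=1}^m (y^{(i)} - ax^{(i)} - b)\,x^{(i)} = 0 $$

Substituting $b = \bar y - a\bar x$ into the second equation and rearranging yields the expression for $a$ in ②.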

(2) Implementation

1. Encapsulating the SimpleLinearRegression class

import numpy as np

class SimpleLinearRegression:

    def __init__(self):
        """Initialize the Simple Linear Regression model"""
        self.a_ = None
        self.c_ = None

    def fit(self, x_train, y_train):
        """Train the Simple Linear Regression model on the training set x_train, y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        num = 0.0  # numerator
        d = 0.0  # denominator
        for x, y in zip(x_train, y_train):
            num += (x - x_mean) * (y - y_mean)
            d += (x - x_mean) ** 2
                        
        self.a_ = num / d
        self.c_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a dataset x_predict to predict on, return the result vector for x_predict"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.c_ is not None, \
            "must fit before predict!"

        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single datum x_single, return its predicted value"""
        return self.a_ * x_single + self.c_

    def __repr__(self):
        return "SimpleLinearRegression()"

2. Testing the linear regression model

import numpy as np
import matplotlib.pyplot as plt
from playML.SimpleLinearRegression import SimpleLinearRegression

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])

reg1 = SimpleLinearRegression()
reg1.fit(x, y)

x_predict = 6
reg1.predict(np.array([x_predict]))

reg1.a_
reg1.c_

y_hat1 = reg1.predict(np.array(x))
plt.scatter(x, y)
plt.plot(x, y_hat1, color = "r")
plt.axis([0, 6, 0, 6]) # range of the x and y axes
plt.show()

II. Code Optimization: Vectorized Operations

(1) Analysis

$$ { a = \frac{ \displaystyle\sum_{i=1}^m(x^{(i)}-\bar x)(y^{(i)}-\bar y) }{ \displaystyle\sum_{i=1}^m(x^{(i)}-\bar x)^2 } \\ \qquad\qquad\qquad \downarrow \atop \displaystyle\sum_{i=1}^m w^{(i)} \cdot v^{(i)} } \qquad { \text{numerator: } w \text{ corresponds to } (x - \bar x),\ v \text{ to } (y - \bar y) \atop \text{denominator: } w \text{ corresponds to } (x - \bar x),\ v \text{ also to } (x - \bar x) } $$

This further reduces to a vector operation:

$$ {\displaystyle\sum_{i=1}^m w^{(i)} \cdot v^{(i)} \\ \qquad \downarrow \atop w \cdot v } \qquad\qquad {w = (w^{(1)}, w^{(2)},...,w^{(m)}) \atop v = (v^{(1)}, v^{(2)},...,v^{(m)}) } $$
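As a quick sanity check (a standalone sketch, not part of the playML code), the dot product reproduces the for-loop sum on a small sample:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])
x_mean, y_mean = np.mean(x), np.mean(y)

# for-loop version of the numerator
num_loop = 0.0
for x_i, y_i in zip(x, y):
    num_loop += (x_i - x_mean) * (y_i - y_mean)

# vectorized version: w·v with w = x - x_mean, v = y - y_mean
num_vec = (x - x_mean).dot(y - y_mean)

print(num_loop, num_vec)  # identical results: 8.0
```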

(2) Implementation

1. Only the fit method of the SimpleLinearRegression class needs to change: the for loop is replaced by vectorized operations, and everything else stays the same.

    def fit(self, x_train, y_train):
        """Train the Simple Linear Regression model on the training set x_train, y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # the for loop is replaced by vectorized operations
        # numerator: (x - x_mean)·(y - y_mean)
        num = (x_train - x_mean).dot(y_train - y_mean)
        # denominator: (x - x_mean)·(x - x_mean)
        d = (x_train - x_mean).dot(x_train - x_mean)

        self.a_ = num / d
        self.c_ = y_mean - self.a_ * x_mean
        return self

2. Testing the model (same code as above, omitted here).

(3) Performance Test
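A minimal sketch of such a loop-vs-vectorized comparison (the data is synthetic and absolute timings vary by machine):

```python
import timeit
import numpy as np

m = 100_000
np.random.seed(666)
x = np.random.random(m)
y = 2.0 * x + 3.0 + np.random.normal(size=m)
x_mean, y_mean = np.mean(x), np.mean(y)

def fit_loop():
    """for-loop version of the fit computation"""
    num, d = 0.0, 0.0
    for x_i, y_i in zip(x, y):
        num += (x_i - x_mean) * (y_i - y_mean)
        d += (x_i - x_mean) ** 2
    return num / d

def fit_vectorized():
    """vectorized version using dot products"""
    return (x - x_mean).dot(y - y_mean) / (x - x_mean).dot(x - x_mean)

t_loop = timeit.timeit(fit_loop, number=10)
t_vec = timeit.timeit(fit_vectorized, number=10)
print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.4f}s")
```

On typical hardware the vectorized version is faster by one to two orders of magnitude.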

III. Metrics for Evaluating Linear Regression / Regression Evaluation

(1) Four evaluation metrics

1. Mean Squared Error (MSE)

$$ \frac{1}{m} \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2 $$

2. Root Mean Squared Error (RMSE): restores the original units of y

$$ \sqrt{ \frac{1}{m} \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2 } = \sqrt{ MSE } $$

3. Mean Absolute Error (MAE)

$$ \frac{1}{m} \displaystyle\sum_{i=1}^m |y^{(i)} - \hat y^{(i)}| $$

None of the metrics above directly measures how good the algorithm is on its own.

4. The best metric for evaluating linear regression: R Squared

$$ R^2 = 1 - \frac{\displaystyle\sum_i (\hat y^{(i)} - y^{(i)})^2}{\displaystyle\sum_i (\bar y-y^{(i)})^2} $$

$$ R^2 = 1 - \frac{(\displaystyle\sum_{i=1}^m (\hat y^{(i)} - y^{(i)})^2)/m} {(\displaystyle\sum_{i=1}^m (\bar y-y^{(i)})^2)/m} = 1 - \frac{ MSE(\hat y,y) }{ Var(y) } $$

In the $R^2$ formula, ${\displaystyle\sum_i (\hat y^{(i)} - y^{(i)})^2}$ is the error produced by predictions from our trained model.

In the $R^2$ formula, ${\displaystyle\sum_i (\bar y - y^{(i)})^2}$ is the error produced by predicting with the baseline model $y = \bar y$.

  • The formula can be read as $R^2 = 1 -$ (our model's error relative to the baseline's)
  • $R^2 \le 1$
  • The larger $R^2$, the better. When our model makes no errors at all, $R^2$ reaches its maximum value of 1.
  • When our model is exactly the baseline model, $R^2 = 0$.
  • If $R^2 < 0$, the learned model is worse than the baseline; most likely the data has no linear relationship.
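These properties can be checked on a toy example; the `r2` helper below is an inline restatement of the $R^2$ formula, included only so the sketch is self-contained:

```python
import numpy as np

def r2(y_true, y_predict):
    """R^2 = 1 - (sum of squared model errors) / (sum of squared baseline errors)"""
    return 1 - np.sum((y_true - y_predict) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

y = np.array([1., 3., 2., 3., 5.])

print(r2(y, y.copy()))                   # perfect model: 1.0
print(r2(y, np.full_like(y, y.mean())))  # baseline model: 0.0
print(r2(y, y[::-1]))                    # worse than the baseline: negative
```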

(2) Implementation

Function encapsulation (metrics.py)

import numpy as np
from math import sqrt


def accuracy_score(y_true, y_predict):
    """Compute the accuracy between y_true and y_predict"""
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"

    return sum( y_true == y_predict ) / len(y_true)


def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum( (y_true - y_predict) ** 2) / len(y_true)

def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict"""

    return sqrt( mean_squared_error(y_true, y_predict) )

def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"

    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)

def r2_score(y_true, y_predict):
    """Compute the R Squared between y_true and y_predict"""

    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)

(3) Testing the algorithm (example: Boston housing data)

1. Preparing the data

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

boston = datasets.load_boston()
print(boston.DESCR)

boston.feature_names

x = boston.data[:,5] # use only the number-of-rooms feature
x.shape

y = boston.target
y.shape

plt.scatter(x, y)
plt.show()

np.max(y) # the maximum is 50.0: the target is capped there
x = x[y<50.0] # drop the capped samples
y = y[y<50.0]

plt.scatter(x, y)
plt.show()

2. Predicting with simple linear regression

from playML.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)

x_train.shape
x_test.shape

from playML.SimpleLinearRegression import SimpleLinearRegression

reg = SimpleLinearRegression()
reg.fit(x_train, y_train)

reg.a_
reg.c_

plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), color="r")
plt.show()

y_predict = reg.predict(x_test)

3. Evaluating the model

from playML.metrics import mean_squared_error
from playML.metrics import root_mean_squared_error
from playML.metrics import mean_absolute_error
from playML.metrics import r2_score

#MSE
mean_squared_error(y_test, y_predict)
#RMSE
root_mean_squared_error(y_test, y_predict)
#MAE
mean_absolute_error(y_test, y_predict)
#R Squared
r2_score(y_test, y_predict)

4. MSE, RMSE, MAE and R Squared in scikit-learn

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

mean_squared_error(y_test, y_predict)
mean_squared_error(y_test, y_predict) ** 0.5 # sklearn exposes no separate RMSE function here; take the square root of MSE
mean_absolute_error(y_test, y_predict)
r2_score(y_test, y_predict)

(4) Encapsulating a score method into the SimpleLinearRegression class; the new parts are the r2_score import and the score method at the end.

import numpy as np
from .metrics import r2_score

class SimpleLinearRegression:

    def __init__(self):
        """Initialize the Simple Linear Regression model"""
        self.a_ = None
        self.c_ = None

    def fit(self, x_train, y_train):
        """Train the Simple Linear Regression model on the training set x_train, y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # the for loop is replaced by vectorized operations
        # numerator: (x - x_mean)·(y - y_mean)
        num = (x_train - x_mean).dot(y_train - y_mean)
        # denominator: (x - x_mean)·(x - x_mean)
        d = (x_train - x_mean).dot(x_train - x_mean)

        self.a_ = num / d
        self.c_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a dataset x_predict to predict on, return the result vector for x_predict"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.c_ is not None, \
            "must fit before predict!"

        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single datum x_single, return its predicted value"""
        return self.a_ * x_single + self.c_

    def score(self, x_test, y_test):
        """Evaluate the current model on the test set x_test and y_test"""

        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "SimpleLinearRegression()"

reg.score(x_test, y_test)
#Out:0.6129316803937322

IV. Multiple Linear Regression

(1) Analysis

Goal: make $\displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$ as small as possible, where $\hat y^{(i)} = \theta_0 + \theta_1X^{(i)}_1 + \theta_2X^{(i)}_2 +...+ \theta_nX^{(i)}_n$.

Transform the formula for $\hat y^{(i)}$:

  Let $X_0^{(i)}=1$; then $\hat y^{(i)} = \theta_0X^{(i)}_0 + \theta_1X^{(i)}_1 + \theta_2X^{(i)}_2 +...+ \theta_nX^{(i)}_n$

  Let $\theta = (\theta_0,\theta_1,\theta_2,...,\theta_n)^T$ and $X^{(i)} = ( X^{(i)}_0,X^{(i)}_1,X^{(i)}_2,...,X^{(i)}_n )$;

    then $\hat y^{(i)} = X^{(i)} \cdot \theta$

Extending to all $m$ samples, write the stacked feature matrix as $X_b$:

$X_b =\left[\begin{matrix} 1 & X^{(1)}_1 & X^{(1)}_2 & \cdots & X^{(1)}_n\\1 & X^{(2)}_1 & X^{(2)}_2 & \cdots & X^{(2)}_n \\ \vdots &\vdots &\vdots &\ddots &\vdots \\ 1 & X^{(m)}_1 & X^{(m)}_2 & \cdots & X^{(m)}_n \end{matrix} \right]$     $\theta = \left[\begin{matrix} \theta_0 \\ \theta_1\\ \theta_2 \\ \vdots \\ \theta_n \end{matrix} \right]$

$\hat y = X_b \cdot \theta$

In $\theta$, $\theta_0$ is the intercept and $\theta_1,\dots,\theta_n$ are the $n$ coefficients.

(2) Solution

$$ \text{Goal: make } \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2 \text{ as small as possible} \\ \downarrow \\ \text{Goal: make } (y - X_b \cdot \theta)^T(y - X_b \cdot \theta) \text{ as small as possible} \\ \downarrow \\ \theta = (X_b^TX_b)^{-1}X_b^T y $$
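The step from the matrix form of the objective to the closed-form $\theta$ comes from setting the gradient to zero (a sketch, assuming $X_b^TX_b$ is invertible):

$$ \nabla_\theta J(\theta) = 2X_b^T(X_b\theta - y) = 0 \;\Rightarrow\; X_b^TX_b\,\theta = X_b^Ty \;\Rightarrow\; \theta = (X_b^TX_b)^{-1}X_b^Ty $$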

The normal equation solution for multiple linear regression: $\theta = (X_b^TX_b)^{-1}X_b^T y$

  Drawback: high time complexity, $O(n^3)$ (roughly $O(n^{2.4})$ with optimized algorithms).

  Advantage: no need to normalize the data.
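As a sanity check (the synthetic data and the "true" coefficients below are made up for illustration), the normal equation recovers a known $\theta$:

```python
import numpy as np

np.random.seed(0)
m = 1000
X = np.random.random((m, 2))
# ground truth: y = 4 + 3*x1 + 5*x2, plus a little noise
y = 4.0 + X.dot(np.array([3.0, 5.0])) + np.random.normal(0.0, 0.01, m)

# X_b: prepend a column of ones, then apply theta = (X_b^T X_b)^{-1} X_b^T y
X_b = np.hstack([np.ones((m, 1)), X])
theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta)  # close to [4, 3, 5]
```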

(3) Implementation

1. Encapsulating the LinearRegression class

import numpy as np
from .metrics import r2_score

class LinearRegression:

    def __init__(self):

        # coefficients
        self.coef_ = None
        # intercept
        self.interception_ = None

        self._theta = None

    # fit using the normal equation
    def fit_normal(self, X_train, y_train):
        """Train the Linear Regression model on the training set X_train, y_train"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must equal to the size of y_train"

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict(self, X_predict):
        """Given a dataset X_predict to predict on, return the result vector for X_predict"""
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Evaluate the current model on X_test and y_test"""

        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

2. Testing the multiple linear regression model

X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]

X.shape
y.shape

from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

from playML.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)

reg.coef_
#array([-1.20354261e-01,  3.64423279e-02, -3.61493155e-02,  5.12978140e-02,
#       -1.15775825e+01,  3.42740062e+00, -2.32311760e-02, -1.19487594e+00,
#        2.60101728e-01, -1.40219119e-02, -8.35430488e-01,  7.80472852e-03,
#       -3.80923751e-01])
reg.interception_
#34.117399723204585
reg.score(X_test, y_test)
#0.8129794056212895

3. Using scikit-learn's linear regression model

#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
# sklearn's train/test split differs slightly from my implementation; for easier comparison I use my own split here
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

lin_reg.coef_
lin_reg.intercept_

lin_reg.score(X_test, y_test)

4. Predicting with a kNN regression model

# kNN uses KNeighborsClassifier for classification and KNeighborsRegressor for regression
from sklearn.neighbors import KNeighborsRegressor

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)
knn_reg.score(X_test, y_test)
#Out:0.5865412198300899

5. Grid-searching the best kNN hyperparameters and predicting

# grid search over hyperparameters
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]

knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
# n_jobs: parallelism; -1 uses all CPU cores
# verbose: verbosity of the progress output
grid_search.fit(X_train, y_train)

grid_search.best_params_
#Out: {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
grid_search.best_score_
#Out: 0.6340477954176972

grid_search.best_estimator_.score(X_test, y_test)
#Out: 0.7044357727037996


##grid_search.best_score_ uses GridSearchCV's cross-validation score
##grid_search.best_estimator_ returns the KNeighborsRegressor object; score here calls that object's score method
##Note also that the score comparisons made during the hyperparameter search use GridSearchCV's own scheme, which differs slightly from our score method.

V. Summary of Linear Regression

  • A typical parametric learning algorithm

    • Contrast with kNN: non-parametric learning
  • Solves only regression problems.

Even so, linear regression is the foundation of many classification methods.

Contrast with kNN: it can solve both classification and regression problems.

  • Makes an assumption about the data: linearity

Contrast with kNN, which makes no assumptions about the data.

  • Advantage: the resulting model is interpretable

The normal equation solution for multiple linear regression has high time complexity, so we turn to gradient descent instead; see "Introduction to Machine Learning Algorithms: Gradient Descent".

Last modification:August 17th, 2020 at 09:32 am