Introductory Machine Learning Algorithms: Linear Regression
Test data used in this article: the Boston housing dataset from sklearn.datasets
I. Simple Linear Regression
(1) Analytic Solution
$$ \text{Goal: make } \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2 \text{ as small as possible, where } \hat y^{(i)} = ax^{(i)} + b \\ \downarrow $$
$$ \text{Goal: find } a \text{ and } b \text{ that minimize } \displaystyle\sum_{i=1}^m (y^{(i)} - ax^{(i)} - b)^2 \quad ① \\ \downarrow $$
$$ \text{A typical least squares problem: minimize the squared error} \\ \downarrow $$
$$ a = \frac{ \displaystyle\sum_{i=1}^m(x^{(i)}-\bar x)(y^{(i)}-\bar y) }{ \displaystyle\sum_{i=1}^m(x^{(i)}-\bar x)^2 } \qquad b = \bar y - a\bar x \quad ② $$
Brief idea for going from ① to ②: take derivatives.
$$ J(a,b) = \displaystyle\sum_{i=1}^m (y^{(i)} - ax^{(i)} - b)^2, \qquad \frac{\partial J(a,b)}{\partial a} = 0, \qquad \frac{\partial J(a,b)}{\partial b} = 0 $$
Solving these two equations yields ②, as worked out below.
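Setting the partial with respect to $b$ to zero gives $b$; substituting it into the partial with respect to $a$ gives $a$:
$$ \frac{\partial J(a,b)}{\partial b} = -2\sum_{i=1}^m (y^{(i)} - ax^{(i)} - b) = 0 \;\Rightarrow\; b = \bar y - a\bar x $$
$$ \frac{\partial J(a,b)}{\partial a} = -2\sum_{i=1}^m (y^{(i)} - ax^{(i)} - b)\,x^{(i)} = 0 \;\Rightarrow\; a = \frac{\displaystyle\sum_{i=1}^m (x^{(i)}-\bar x)(y^{(i)}-\bar y)}{\displaystyle\sum_{i=1}^m (x^{(i)}-\bar x)^2} $$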
(2) Implementation
1. Encapsulating the SimpleLinearRegression class
import numpy as np

class SimpleLinearRegression:

    def __init__(self):
        """Initialize the Simple Linear Regression model."""
        self.a_ = None
        self.c_ = None

    def fit(self, x_train, y_train):
        """Fit the Simple Linear Regression model on the training set x_train, y_train."""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        num = 0.0  # numerator
        d = 0.0    # denominator
        for x, y in zip(x_train, y_train):
            num += (x - x_mean) * (y - y_mean)
            d += (x - x_mean) ** 2

        self.a_ = num / d
        self.c_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a data set x_predict, return the vector of predictions."""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.c_ is not None, \
            "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single sample x_single, return its predicted value."""
        return self.a_ * x_single + self.c_

    def __repr__(self):
        return "SimpleLinearRegression()"
2. Testing the linear regression model

import numpy as np
import matplotlib.pyplot as plt
from playML.SimpleLinearRegression import SimpleLinearRegression

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])

reg1 = SimpleLinearRegression()
reg1.fit(x, y)

x_predict = 6
reg1.predict(np.array([x_predict]))
reg1.a_
reg1.c_

y_hat1 = reg1.predict(x)
plt.scatter(x, y)
plt.plot(x, y_hat1, color="r")
plt.axis([0, 6, 0, 6])  # x and y axis limits
plt.show()
II. Code Optimization: Vectorized Computation
(1) Analysis
Both the numerator and the denominator of the formula for $a$ have the form of a sum of products:
$$ a = \frac{ \displaystyle\sum_{i=1}^m(x^{(i)}-\bar x)(y^{(i)}-\bar y) }{ \displaystyle\sum_{i=1}^m(x^{(i)}-\bar x)^2 } \quad\longrightarrow\quad \displaystyle\sum_{i=1}^m w^{(i)} \cdot v^{(i)} $$
For the numerator, $w$ corresponds to $(x - \bar x)$ and $v$ to $(y - \bar y)$; for the denominator, both $w$ and $v$ correspond to $(x - \bar x)$.
This sum can be further converted into a single vector dot product:
$$ \displaystyle\sum_{i=1}^m w^{(i)} \cdot v^{(i)} = w \cdot v, \qquad w = (w^{(1)}, w^{(2)},...,w^{(m)}), \quad v = (v^{(1)}, v^{(2)},...,v^{(m)}) $$
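As a quick sanity check (a minimal sketch with made-up numbers, not part of the original code), the loop form and the dot-product form give the same result in NumPy:

```python
import numpy as np

# illustrative vectors only; any two equal-length vectors work
w = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

loop_sum = sum(w_i * v_i for w_i, v_i in zip(w, v))  # element-by-element accumulation
vectorized = w.dot(v)                                # single vector dot product

print(loop_sum, vectorized)  # both print 32.0
```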
(2) Implementation
1. Only the fit method of the SimpleLinearRegression class needs to be modified (the numerator and denominator are now computed with dot products); everything else stays the same:
    def fit(self, x_train, y_train):
        """Fit the Simple Linear Regression model on the training set x_train, y_train."""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # the for loop is replaced by vectorized operations
        # numerator: (x - x̄) · (y - ȳ)
        num = (x_train - x_mean).dot(y_train - y_mean)
        # denominator: (x - x̄) · (x - x̄)
        d = (x_train - x_mean).dot(x_train - x_mean)

        self.a_ = num / d
        self.c_ = y_mean - self.a_ * x_mean
        return self
2. Testing the model (same code as in Section I; code omitted)
(3) Performance Test
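A minimal sketch of how the loop-based and vectorized fits could be timed against each other. SimpleLinearRegression1 is a hypothetical name for the original for-loop version (not defined above), and %timeit requires a Jupyter environment:

```python
import numpy as np

# synthetic data, large enough to make the difference visible
m = 1000000
big_x = np.random.random(size=m)
big_y = big_x * 2.0 + 3.0 + np.random.normal(size=m)

reg_loop = SimpleLinearRegression1()   # hypothetical: the original for-loop version
reg_vec = SimpleLinearRegression()     # the vectorized version above

%timeit reg_loop.fit(big_x, big_y)     # the loop version is markedly slower
%timeit reg_vec.fit(big_x, big_y)      # the vectorized version typically wins by a wide margin
```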
III. Metrics for Evaluating a Regression Algorithm
(1) Four Evaluation Metrics
1. Mean Squared Error (MSE)
$$ \frac{1}{m} \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2 $$
2. Root Mean Squared Error (RMSE): putting the error back into the original units of y
$$ \sqrt{ \frac{1}{m} \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2 } = \sqrt{ MSE } $$
3. Mean Absolute Error (MAE)
$$ \frac{1}{m} \displaystyle\sum_{i=1}^m |y^{(i)} - \hat y^{(i)}| $$
The three metrics above are scale-dependent, so by themselves they cannot tell us how good a model is.
4. The best metric for linear regression: R Squared
$$ R^2 = 1 - \frac{\displaystyle\sum_i (\hat y^{(i)} - y^{(i)})^2}{\displaystyle\sum_i (\bar y-y^{(i)})^2} $$
$$ R^2 = 1 - \frac{(\displaystyle\sum_{i=1}^m (\hat y^{(i)} - y^{(i)})^2)/m} {(\displaystyle\sum_{i=1}^m (\bar y-y^{(i)})^2)/m} = 1 - \frac{ MSE(\hat y,y) }{ Var(y) } $$
In $R^2$, the numerator $\displaystyle\sum_i (\hat y^{(i)} - y^{(i)})^2$ is the error produced by the predictions of our trained model.
The denominator $\displaystyle\sum_i (\bar y - y^{(i)})^2$ is the error produced by the baseline model $y = \bar y$.
- The formula can be read as: $R^2 = 1 -$ (the error of our model measured against the error of the baseline).
- $R^2 \le 1$.
- The larger $R^2$ is, the better. When the model makes no errors at all, $R^2$ reaches its maximum value of 1.
- When our model does no better than the baseline model, $R^2 = 0$.
- If $R^2 < 0$, the learned model is worse than the baseline model; in that case the data very likely has no linear relationship.
(2) Implementation
Encapsulated functions (metrics.py)
import numpy as np
from math import sqrt

def accuracy_score(y_true, y_predict):
    """Compute the classification accuracy between y_true and y_predict."""
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"
    return sum(y_true == y_predict) / len(y_true)

def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict) ** 2) / len(y_true)

def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict."""
    return sqrt(mean_squared_error(y_true, y_predict))

def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict."""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)

def r2_score(y_true, y_predict):
    """Compute the R Squared between y_true and y_predict."""
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)
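A quick sanity check of these functions on a tiny hand-made example (the numbers below are purely illustrative):

```python
import numpy as np
from playML.metrics import mean_squared_error, root_mean_squared_error, \
    mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.8, 5.1, 7.2])

print(mean_squared_error(y_true, y_pred))       # 0.03
print(root_mean_squared_error(y_true, y_pred))  # ~0.173
print(mean_absolute_error(y_true, y_pred))      # ~0.167
print(r2_score(y_true, y_pred))                 # 1 - 0.03 / (8/3) = 0.98875
```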
(3) Testing the Algorithm (example: Boston housing data)
1. Preparing the data
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

boston = datasets.load_boston()
print(boston.DESCR)
boston.feature_names

x = boston.data[:, 5]  # use only the number-of-rooms feature (RM)
x.shape
y = boston.target
y.shape

plt.scatter(x, y)
plt.show()

np.max(y)  # the target is capped at 50.0
x = x[y < 50.0]  # drop the samples whose target hits the cap
y = y[y < 50.0]

plt.scatter(x, y)
plt.show()
2. Predicting with simple linear regression
from playML.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
x_train.shape
x_test.shape
from playML.SimpleLinearRegression import SimpleLinearRegression
reg = SimpleLinearRegression()
reg.fit(x_train, y_train)
reg.a_
reg.c_
plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), color="r")
plt.show()
y_predict = reg.predict(x_test)
3. Evaluating the model
from playML.metrics import mean_squared_error
from playML.metrics import root_mean_squared_error
from playML.metrics import mean_absolute_error
from playML.metrics import r2_score
#MSE
mean_squared_error(y_test, y_predict)
#RMSE
root_mean_squared_error(y_test, y_predict)
#MAE
mean_absolute_error(y_test, y_predict)
#R Squared
r2_score(y_test, y_predict)
4. MSE, RMSE, MAE, and R Squared in scikit-learn

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

mean_squared_error(y_test, y_predict)
mean_squared_error(y_test, y_predict) ** 0.5  # RMSE: take the square root of sklearn's MSE
mean_absolute_error(y_test, y_predict)
r2_score(y_test, y_predict)
(4) Adding a score method to the SimpleLinearRegression class (see the r2_score import and the new score method in the listing below; everything else is unchanged)
import numpy as np
from .metrics import r2_score

class SimpleLinearRegression:

    def __init__(self):
        """Initialize the Simple Linear Regression model."""
        self.a_ = None
        self.c_ = None

    def fit(self, x_train, y_train):
        """Fit the Simple Linear Regression model on the training set x_train, y_train."""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # vectorized numerator: (x - x̄) · (y - ȳ)
        num = (x_train - x_mean).dot(y_train - y_mean)
        # vectorized denominator: (x - x̄) · (x - x̄)
        d = (x_train - x_mean).dot(x_train - x_mean)

        self.a_ = num / d
        self.c_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a data set x_predict, return the vector of predictions."""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.c_ is not None, \
            "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single sample x_single, return its predicted value."""
        return self.a_ * x_single + self.c_

    def score(self, x_test, y_test):
        """Evaluate the current model on the test set x_test, y_test (R Squared)."""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "SimpleLinearRegression()"
reg.score(x_test, y_test)
#Out:0.6129316803937322
IV. Multiple Linear Regression
(1) Analysis
Goal: make $\displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2$ as small as possible, where $\hat y^{(i)} = \theta_0 + \theta_1X^{(i)}_1 + \theta_2X^{(i)}_2 +...+ \theta_nX^{(i)}_n$.
Transform the formula for $\hat y^{(i)}$:
Let $X^{(i)}_0 = 1$; then $\hat y^{(i)} = \theta_0X^{(i)}_0 + \theta_1X^{(i)}_1 + \theta_2X^{(i)}_2 +...+ \theta_nX^{(i)}_n$.
Let $\theta = (\theta_0,\theta_1,\theta_2,...,\theta_n)^T$ and $X^{(i)} = ( X^{(i)}_0,X^{(i)}_1,X^{(i)}_2,...,X^{(i)}_n )$.
Then $\hat y^{(i)} = X^{(i)} \cdot \theta$.
Generalizing to all $m$ samples, write the design matrix as $X_b$:
$X_b =\left[\begin{matrix} 1 & X^{(1)}_1 & X^{(1)}_2 & \cdots & X^{(1)}_n\\1 & X^{(2)}_1 & X^{(2)}_2 & \cdots & X^{(2)}_n \\ \vdots &\vdots &\vdots &\ddots &\vdots \\ 1 & X^{(m)}_1 & X^{(m)}_2 & \cdots & X^{(m)}_n \end{matrix} \right]$ $\theta = \left[\begin{matrix} \theta_0 \\ \theta_1\\ \theta_2 \\ \vdots \\ \theta_n \end{matrix} \right]$
$\hat y = X_b \cdot \theta$
In $\theta$, $\theta_0$ is the intercept and $\theta_1$ through $\theta_n$ are the $n$ coefficients.
(2) Solution
$$ \text{Goal: minimize } \displaystyle\sum_{i=1}^m (y^{(i)} - \hat y^{(i)})^2 \\ \downarrow \\ \text{Goal: minimize } (y - X_b \cdot \theta)^T(y - X_b \cdot \theta) \\ \downarrow \\ \theta = (X_b^TX_b)^{-1}X_b^T y $$
The normal equation solution for multiple linear regression: $\theta = (X_b^TX_b)^{-1}X_b^T y$
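The step from the objective to the normal equation comes from setting the gradient with respect to $\theta$ to zero:
$$ J(\theta) = (y - X_b\theta)^T(y - X_b\theta), \qquad \nabla_\theta J(\theta) = -2X_b^T(y - X_b\theta) = 0 \;\Rightarrow\; X_b^TX_b\,\theta = X_b^Ty \;\Rightarrow\; \theta = (X_b^TX_b)^{-1}X_b^T y $$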
Drawback: high time complexity, $O(n^3)$ (about $O(n^{2.4})$ with optimized algorithms).
Advantage: the data does not need to be normalized.
(3) Implementation
1. Encapsulating the LinearRegression class
import numpy as np
from .metrics import r2_score

class LinearRegression:

    def __init__(self):
        """Initialize the Linear Regression model."""
        self.coef_ = None           # coefficients
        self.interception_ = None   # intercept
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Fit the Linear Regression model on X_train, y_train using the normal equation."""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must equal to the size of y_train"

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        """Given a data set X_predict, return the vector of predictions."""
        assert self.interception_ is not None and self.coef_ is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"

        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Evaluate the current model on the test set X_test, y_test (R Squared)."""
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"
2. Testing the multiple linear regression model
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
X.shape
y.shape
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
from playML.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)
reg.coef_
#array([-1.20354261e-01, 3.64423279e-02, -3.61493155e-02, 5.12978140e-02,
# -1.15775825e+01, 3.42740062e+00, -2.32311760e-02, -1.19487594e+00,
# 2.60101728e-01, -1.40219119e-02, -8.35430488e-01, 7.80472852e-03,
# -3.80923751e-01])
reg.interception_
#34.117399723204585
reg.score(X_test, y_test)
#0.8129794056212895
3. Using scikit-learn's linear regression model

#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
# sklearn's train/test split differs slightly from my own implementation,
# so for easier comparison I call my own split function here
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
lin_reg.coef_
lin_reg.intercept_
lin_reg.score(X_test, y_test)
4. Predicting with kNN regression

# kNN solves classification with KNeighborsClassifier and regression with KNeighborsRegressor
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)
knn_reg.score(X_test, y_test)
#Out:0.5865412198300899
5. Grid search for the best kNN hyperparameters

# use grid search to find the hyperparameters
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]

knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
# n_jobs: number of parallel jobs; -1 uses all CPU cores
# verbose: how much progress information to print
grid_search.fit(X_train, y_train)
#Out: GridSearchCV fitting progress log
grid_search.best_params_
#Out: {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
grid_search.best_score_
#Out: 0.6340477954176972
grid_search.best_estimator_.score(X_test, y_test)
#Out: 0.7044357727037996
## grid_search.best_score_ is a cross-validated score computed by GridSearchCV on the training set
## grid_search.best_estimator_ returns the fitted KNeighborsRegressor; above we call that object's score method on the test set
## Note also that during the hyperparameter search, candidates are compared using GridSearchCV's own cross-validated scoring, which is not the same as the single hold-out score we compute; a short sketch contrasting the two follows.
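A minimal sketch contrasting the two scores (assuming the objects from the cells above are still in scope):

```python
from sklearn.model_selection import cross_val_score

best_knn = grid_search.best_estimator_

# cross-validated R^2 on the training set, the same kind of score as best_score_
cv_scores = cross_val_score(best_knn, X_train, y_train, cv=5)
print(cv_scores.mean())

# single hold-out R^2 on the test set
print(best_knn.score(X_test, y_test))
```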
V. Summary of Linear Regression
- Linear regression is a typical example of parametric learning.
  - In contrast, kNN is non-parametric learning.
- Linear regression can only solve regression problems (even though it is the foundation of many classification methods).
  - In contrast, kNN can solve both classification and regression problems.
- It makes an assumption about the data: linearity.
  - In contrast, kNN makes no assumption about the data.
- Advantage: the model has good interpretability.

The normal-equation solution of multiple linear regression has a high time complexity, so it is replaced by gradient descent; see the introductory post on gradient descent:
[post cid="39" /]