Predicting the Number of Questions and Answers on a Q&A Website
Benchmarks: Predicting the Number of Questions and Answers on a Q&A Website
Univariate linear regression model (Python)
Average MAPE of this model's predictions: 0.09190
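For reference, MAPE (mean absolute percentage error) can be computed as in the minimal sketch below. It assumes the reported "average MAPE" is the unweighted mean of the MAPE for questions and the MAPE for answers; the source does not spell out the averaging. The helper names `mape` and `average_mape` are illustrative and the numbers are made up.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.05 means 5%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumes y_true contains no zeros; a zero would make the ratio undefined.
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

def average_mape(q_true, q_pred, a_true, a_pred):
    """Assumed scoring rule: unweighted mean of the two per-target MAPEs."""
    return (mape(q_true, q_pred) + mape(a_true, a_pred)) / 2.0

# Toy values, not taken from the competition data.
print(mape([100, 200], [110, 180]))
print(average_mape([100], [110], [50], [40]))
```

Lower is better: a score of 0.09190 means the predictions are off by about 9.2% on average.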
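This model uses only id as the independent variable of a univariate linear model. The regression coefficients can be obtained with any off-the-shelf function, or by writing stochastic gradient descent by hand.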
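The two routes mentioned above can be sketched as follows. This is an illustration on synthetic data, not the original benchmark script: `np.polyfit` stands in for "any off-the-shelf function", and `fit_sgd` is a hypothetical hand-written stochastic gradient descent. Standardizing the variables before SGD is a design choice made here so that a single learning rate works regardless of the scale of id.

```python
import numpy as np

def fit_sgd(x, y, lr=0.01, epochs=200, seed=0):
    """Fit y = slope * x + intercept by SGD on the squared error."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize both variables so the updates are well scaled.
    x_mean, x_std = x.mean(), x.std()
    y_mean, y_std = y.mean(), y.std()
    xs = (x - x_mean) / x_std
    ys = (y - y_mean) / y_std
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # One pass over the samples in random order, one update per sample.
        for i in rng.permutation(len(xs)):
            err = w * xs[i] + b - ys[i]
            w -= lr * err * xs[i]
            b -= lr * err
    # Map the coefficients back to the original units.
    slope = w * y_std / x_std
    intercept = y_mean + b * y_std - slope * x_mean
    return slope, intercept

# Synthetic stand-in for (id, questions); the real train.csv is not used here.
rng = np.random.default_rng(42)
ids = np.arange(1, 501)
y = 50 + 0.8 * ids + rng.normal(0, 5, size=ids.size)

slope_ls, intercept_ls = np.polyfit(ids, y, 1)   # closed-form least squares
slope_sgd, intercept_sgd = fit_sgd(ids, y)       # hand-written SGD
print(slope_ls, intercept_ls)
print(slope_sgd, intercept_sgd)
```

With a constant learning rate the SGD estimate keeps jittering slightly around the least-squares solution, so the two results agree only approximately.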
Weekday interaction regression model (Python)
Average MAPE of this model's predictions: 0.04531
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Load the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")
# Pull out the targets: questions and answers
q_train = train.pop('questions')
a_train = train.pop('answers')
# Parse date as a datetime, extract the day of week, then one-hot encode it
train['date'] = pd.to_datetime(train['date'])
train['dayofweek'] = train['date'].dt.dayofweek
train = pd.get_dummies(train, columns=['dayofweek'])
test['date'] = pd.to_datetime(test['date'])
test['dayofweek'] = test['date'].dt.dayofweek
test = pd.get_dummies(test, columns=['dayofweek'])
# Add the id x day-of-week interaction terms, 7 in total
# (requires every weekday to occur in both train and test, so that all 7 dummy columns exist)
for i in range(7):
    train['id_dayofweek_%s'%i] = train['id'] * train['dayofweek_%s'%i]
    test['id_dayofweek_%s'%i] = test['id'] * test['dayofweek_%s'%i]
# Drop the date column
train.drop('date', axis=1, inplace=True)
test.drop('date', axis=1, inplace=True)
# Build multivariate linear regression models and predict
# Predict questions
reg = LinearRegression()
reg.fit(train, q_train)
q_pred = reg.predict(test)
# Predict answers
reg = LinearRegression()
reg.fit(train, a_train)
a_pred = reg.predict(test)
# Write the predictions to my_LR_prediction.csv
submit['questions'] = q_pred
submit['answers'] = a_pred
submit.to_csv('my_LR_prediction.csv', index=False)
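To make the feature construction concrete: the weekday dummies give each day of the week its own intercept, and the id x weekday products give each day its own slope in id. The sketch below builds the same 14 columns on a tiny made-up frame. The helper `add_weekday_features` is hypothetical; unlike `pd.get_dummies`, it always creates all seven weekday columns, even when a weekday is missing from the data.

```python
import pandas as pd

def add_weekday_features(df):
    """Return a copy with weekday dummies and id x weekday interaction columns."""
    out = df.copy()
    out['date'] = pd.to_datetime(out['date'])
    dow = out['date'].dt.dayofweek  # Monday=0 ... Sunday=6
    for i in range(7):
        # Dummy: 1 on weekday i, else 0 (a per-weekday intercept).
        out['dayofweek_%s' % i] = (dow == i).astype(int)
        # Interaction: id on weekday i, else 0 (a per-weekday slope).
        out['id_dayofweek_%s' % i] = out['id'] * out['dayofweek_%s' % i]
    return out.drop(columns='date')

# Made-up rows: 2017-01-02 is a Monday, so the weekdays are 0, 1, 2.
demo = pd.DataFrame({'id': [1, 2, 3],
                     'date': ['2017-01-02', '2017-01-03', '2017-01-04']})
features = add_weekday_features(demo)
print(features.columns.tolist())
print(features[['id', 'dayofweek_1', 'id_dayofweek_1']])
```

The column order differs from the script above, which is harmless as long as train and test go through the same function.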
Linear regression and k-nearest-neighbors hybrid model (Python)
Average MAPE of this model's predictions: 0.03170
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
# Load the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")
# Build a nonlinear feature
cols_lr = ['id', 'sqrt_id']
train['sqrt_id'] = np.sqrt(train['id'])
test['sqrt_id'] = np.sqrt(test['id'])
# Build day-of-week, month and year features
train['date'] = pd.to_datetime(train['date'])
train['d_w'] = train['date'].dt.dayofweek
train['d_m'] = train['date'].dt.month
train['d_y'] = train['date'].dt.year
test['date'] = pd.to_datetime(test['date'])
test['d_w'] = test['date'].dt.dayofweek
test['d_m'] = test['date'].dt.month
test['d_y'] = test['date'].dt.year
cols_knn = ['d_w', 'd_m', 'd_y']
# Fit a linear model on ['id', 'sqrt_id'] to predict questions
reg = LinearRegression()
reg.fit(train[cols_lr], train['questions'])
q_fit = reg.predict(train[cols_lr])
q_pred = reg.predict(test[cols_lr])
# Fit a linear model on ['id', 'sqrt_id'] to predict answers
reg = LinearRegression()
reg.fit(train[cols_lr], train['answers'])
a_fit = reg.predict(train[cols_lr])
a_pred = reg.predict(test[cols_lr])
# Compute the training residuals for questions and answers
q_diff = train['questions'] - q_fit
a_diff = train['answers'] - a_fit
# Use the training residuals as the new targets and fit kNN models on cols_knn
reg = KNeighborsRegressor()
reg.fit(train[cols_knn], q_diff)
q_pred_knn = reg.predict(test[cols_knn])
reg = KNeighborsRegressor()
reg.fit(train[cols_knn], a_diff)
a_pred_knn = reg.predict(test[cols_knn])
# Write the predictions (linear trend + kNN residual correction) to my_Lr_Knn_prediction.csv
submit['questions'] = q_pred + q_pred_knn
submit['answers'] = a_pred + a_pred_knn
submit.to_csv('my_Lr_Knn_prediction.csv', index=False)
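The two-stage idea in the script above (a linear trend, then a kNN model fitted to the trend's residuals) can be wrapped in a single function and tried on a hold-out split. The sketch below is an illustration only: `hybrid_predict` and `mape` are hypothetical helpers, the data are synthetic (a linear trend with a weekend dip and noise), and holding out the last 50 rows is an assumption about how one might validate locally, not something the source describes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def hybrid_predict(train, test, target, cols_lr, cols_knn):
    """Linear trend on cols_lr plus a kNN correction fitted to its residuals."""
    lr = LinearRegression().fit(train[cols_lr], train[target])
    residual = train[target] - lr.predict(train[cols_lr])
    knn = KNeighborsRegressor().fit(train[cols_knn], residual)
    return lr.predict(test[cols_lr]) + knn.predict(test[cols_knn])

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs((y_true - np.asarray(y_pred)) / y_true)))

# Synthetic stand-in for train.csv: 400 consecutive days.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({'id': np.arange(1, n + 1),
                   'date': pd.date_range('2015-01-01', periods=n, freq='D')})
df['sqrt_id'] = np.sqrt(df['id'])
df['d_w'] = df['date'].dt.dayofweek
df['d_m'] = df['date'].dt.month
df['d_y'] = df['date'].dt.year
# Made-up target: trend, a dip of 30 on weekends, Gaussian noise.
df['questions'] = 100 + 2.0 * df['id'] - 30 * (df['d_w'] >= 5) + rng.normal(0, 3, n)

cols_lr = ['id', 'sqrt_id']
cols_knn = ['d_w', 'd_m', 'd_y']
fit_part, holdout = df.iloc[:350], df.iloc[350:]

pred = hybrid_predict(fit_part, holdout, 'questions', cols_lr, cols_knn)
holdout_mape = mape(holdout['questions'], pred)
print(holdout_mape)
```

`KNeighborsRegressor()` uses its default of 5 neighbors and unscaled Euclidean distance, as in the script above, so a one-unit difference in year counts the same as a one-unit difference in weekday.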