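Experience and Tips from a Top 0.3% Finish in a Kaggle Competition

WpOh_rgznai100 · Source: lq · 2019-07-13 07:29

Introduction: When you first get into data competitions, the sophisticated techniques on display can be intimidating. Experts from every field gather there, the range of methods is dizzying, and newcomers can feel as small and helpless as mayflies. This article shares the experience and techniques a participant used to reach the top 0.3% of a Kaggle competition. Let's get started!

Top 0.3% Model Overview

Competition Task and Goal

Each row in the dataset describes the characteristics of a house

Given these features, predict the sale price of each house

Models are evaluated by the RMSE (root mean squared error) between the logarithm of the predicted price and the logarithm of the actual price. Computing the RMSE on the log scale means that errors on cheap houses and errors on expensive houses affect the score about equally.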
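To make the log-scale argument concrete, here is a minimal sketch in plain Python. The prices are hypothetical and the helper name `rmsle` is only chosen for this illustration; it is not the competition's official scoring code.

```python
import math

def rmsle(y_true, y_pred):
    """RMSE between the natural logs of the actual and predicted prices."""
    squared = [(math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices: both predictions are 10% too high.
cheap = rmsle([100_000], [110_000])
expensive = rmsle([1_000_000], [1_100_000])

# On the raw scale the expensive house is off by 100,000 and the cheap one by 10,000,
# but on the log scale both errors equal log(1.1).
print(cheap, expensive)
```

Because only the ratio between prediction and truth matters, a 10% miss costs the same whether the house is cheap or expensive.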

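Key Details of Model Training

Cross-validation: 12-fold cross-validation is used

Models: seven models are trained (ridge, SVR, gradient boosting, random forest, XGBoost and LightGBM regressors, plus the stacked regressor that combines them)

Stacking: a StackingCVRegressor meta-learner is trained, with XGBoost as the meta-regressor

Blending: every trained model overfits to some degree, so the final predictions come from a blend of these models, which gives more robust predictions

Model Performance

As the figure below shows, the blended model scores best, with an RMSE of about 0.075 on the training data. This blended model is used for the final predictions.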

In[1]:

from IPython.display import Image
Image("../input/kernel-files/model_training_advanced_regression.png")

Output[1]:

Now let's get started for real!

In[2]:

# Essentials
import numpy as np
import pandas as pd
import datetime
import random

# Plots
import seaborn as sns
import matplotlib.pyplot as plt

# Models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.svm import SVR
from mlxtend.regressor import StackingCVRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Stats
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# Misc
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

pd.set_option('display.max_columns', None)

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore")
pd.options.display.max_seq_items = 8000
pd.options.display.max_rows = 8000

import os
print(os.listdir("../input/kernel-fi

Output[2]:

['model_training_advanced_regression.png']

In[3]:

# Read in the dataset as a dataframe
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
train.shape, test.shape

Output[3]:

((1460, 81), (1459, 80))

EDA

Goal

Each row in the dataset describes the characteristics of a house

Given these features, predict the sale price of each house

Visualizing the raw data

In[4]:

# Preview the data we're working with
train.head()

Output[4]:

SalePrice: exploring the target variable

In[5]:

sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 7))
# Check the new distribution
sns.distplot(train['SalePrice'], color="b");
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)
plt.show()

In[6]:

# Skew and kurt
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())

Skewness: 1.882876

Kurtosis: 6.536282

The available features: a deeper look

Data visualization

In[7]:

# Finding numeric features
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric = []
for i in train.columns:
    if train[i].dtype in numeric_dtypes:
        if i in ['TotalSF', 'Total_Bathrooms','Total_porch_sf','haspool','hasgarage','hasbsmt','hasfireplace']:
            pass
        else:
            numeric.append(i)

# visualising some more outliers in the data values
fig, axs = plt.subplots(ncols=2, nrows=0, figsize=(12, 120))
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)
sns.color_palette("husl", 8)
for i, feature in enumerate(list(train[numeric]), 1):
    if(feature=='MiscVal'):
        break
    plt.subplot(len(list(numeric)), 3, i)
    sns.scatterplot(x=feature, y='SalePrice', hue='SalePrice', palette='Blues', data=train)

    plt.xlabel('{}'.format(feature), size=15,labelpad=12.5)
    plt.ylabel('SalePrice', size=15, labelpad=12.5)

    for j in range(2):
        plt.tick_params(axis='x', labelsize=12)
        plt.tick_params(axis='y', labelsize=12)

    plt.legend(loc='best', prop={'size': 10})

plt.show()

Explore how these features correlate with each other and with SalePrice

In[8]:

corr = train.corr()
plt.subplots(figsize=(15,12))
sns.heatmap(corr, vmax=0.9, cmap="Blues", square=True)

Output[8]:

Select a few features and visualize their relationship with SalePrice

Input[9]:

data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=train['OverallQual'], y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

Input[10]:

data = pd.concat([train['SalePrice'], train['YearBuilt']], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=train['YearBuilt'], y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=45);

Input[11]:

data = pd.concat([train['SalePrice'], train['TotalBsmtSF']], axis=1)
data.plot.scatter(x='TotalBsmtSF', y='SalePrice', alpha=0.3, ylim=(0,800000));

Input[12]:

data = pd.concat([train['SalePrice'], train['LotArea']], axis=1)
data.plot.scatter(x='LotArea', y='SalePrice', alpha=0.3, ylim=(0,800000));

Input[13]:

data = pd.concat([train['SalePrice'], train['GrLivArea']], axis=1)
data.plot.scatter(x='GrLivArea', y='SalePrice', alpha=0.3, ylim=(0,800000));

Input[14]:

# Remove the Ids from train and test, as they are unique for each row and hence not useful for the model
train_ID = train['Id']
test_ID = test['Id']
train.drop(['Id'], axis=1, inplace=True)
test.drop(['Id'], axis=1, inplace=True)
train.shape, test.shape

Output[14]:

((1460, 80), (1459, 79))

Visualizing the distribution of SalePrice

Input[15]:

sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 7))
# Check the new distribution
sns.distplot(train['SalePrice'], color="b");
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)
plt.show()

The plot shows that SalePrice is skewed to the right. Because many machine learning models work poorly on data that is far from normally distributed, we transform the target with log(1+x) to correct the skew.

Input[16]:

# log(1+x) transform
train["SalePrice"] = np.log1p(train["SalePrice"])

Plot SalePrice again

Input[17]:

sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 7))
# Check the new distribution
sns.distplot(train['SalePrice'], fit=norm, color="b");

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print(' mu = {:.2f} and sigma = {:.2f} '.format(mu, sigma))

# Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)
plt.show()

mu = 12.02 and sigma = 0.40

The plot shows that SalePrice is now approximately normally distributed.
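The effect of the log(1+x) transform can also be checked numerically instead of by eye. The sketch below uses only the standard library and synthetic right-skewed "prices" (an assumption; the parameters loosely mirror the mu and sigma printed above, and this is not the competition data). The `skewness` helper is written out here for illustration.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(0)
# Synthetic right-skewed prices (hypothetical data).
prices = [random.lognormvariate(12.0, 0.4) for _ in range(5000)]
log_prices = [math.log1p(p) for p in prices]

skew_before = skewness(prices)      # clearly positive: long right tail
skew_after = skewness(log_prices)   # close to zero after the transform

# expm1 inverts log1p, which is how log-scale predictions are mapped back to prices.
recovered = [math.expm1(v) for v in log_prices]
print(skew_before, skew_after)
```

The same inverse, `np.expm1`, is what turns the model's log-scale predictions back into prices at submission time.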

Input[18]:

# Remove outliers
# Note: SalePrice was log-transformed above, so the raw-price thresholds below are compared with log values
train.drop(train[(train['OverallQual']<5) & (train['SalePrice']>200000)].index, inplace=True)
train.drop(train[(train['GrLivArea']>4500) & (train['SalePrice']<300000)].index, inplace=True)
train.reset_index(drop=True, inplace=True)

Input[19]:

# Split features and labels
train_labels = train['SalePrice'].reset_index(drop=True)
train_features = train.drop(['SalePrice'], axis=1)
test_features = test

# Combine train and test features in order to apply the feature transformation pipeline to the entire dataset
all_features = pd.concat([train_features, test_features]).reset_index(drop=True)
all_features.shape

Output[19]:

(2917, 79)

Filling missing values

Input[20]:

# determine the threshold for missing values
def percent_missing(df):
    data = pd.DataFrame(df)
    df_cols = list(pd.DataFrame(data))
    dict_x = {}
    for i in range(0, len(df_cols)):
        dict_x.update({df_cols[i]: round(data[df_cols[i]].isnull().mean()*100,2)})

    return dict_x

missing = percent_missing(all_features)
df_miss = sorted(missing.items(), key=lambda x: x[1], reverse=True)
print('Percent of missing data')
df_miss[0:10]

Percent of missing data

Output[20]:

[('PoolQC', 99.69),

('MiscFeature', 96.4),

('Alley', 93.21),

('Fence', 80.43),

('FireplaceQu', 48.68),

('LotFrontage', 16.66),

('GarageYrBlt', 5.45),

('GarageFinish', 5.45),

('GarageQual', 5.45),

('GarageCond', 5.45)]

Input[21]:

# Visualize missing values
sns.set_style("white")
f, ax = plt.subplots(figsize=(8, 7))
sns.set_color_codes(palette='deep')
missing = round(train.isnull().mean()*100,2)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar(color="b")
# Tweak the visual presentation
ax.xaxis.grid(False)
ax.set(ylabel="Percent of missing values")
ax.set(xlabel="Features")
ax.set(title="Percent missing data by feature")
sns.despine(trim=True, left=True)

Next, we fill the missing values column by column.

Input[22]:

# Some of the non-numeric predictors are stored as numbers; convert them into strings
all_features['MSSubClass'] = all_features['MSSubClass'].apply(str)
all_features['YrSold'] = all_features['YrSold'].astype(str)
all_features['MoSold'] = all_features['MoSold'].astype(str)

Input[23]:

def handle_missing(features):
    # the data description states that NA refers to typical ('Typ') values
    features['Functional'] = features['Functional'].fillna('Typ')
    # Replace the missing values in each of the columns below with their mode
    features['Electrical'] = features['Electrical'].fillna("SBrkr")
    features['KitchenQual'] = features['KitchenQual'].fillna("TA")
    features['Exterior1st'] = features['Exterior1st'].fillna(features['Exterior1st'].mode()[0])
    features['Exterior2nd'] = features['Exterior2nd'].fillna(features['Exterior2nd'].mode()[0])
    features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])
    features['MSZoning'] = features.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))

    # the data description states that NA refers to "No Pool"
    features["PoolQC"] = features["PoolQC"].fillna("None")
    # Replacing the missing values with 0, since no garage = no cars in garage
    for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
        features[col] = features[col].fillna(0)
    # Replacing the missing values with None
    for col in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
        features[col] = features[col].fillna('None')
    # NaN values for these categorical basement features mean there's no basement
    for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
        features[col] = features[col].fillna('None')

    # Group by neighborhood, and fill in missing values with the median LotFrontage of the neighborhood
    features['LotFrontage'] = features.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))

    # We have no particular intuition around how to fill in the rest of the categorical features
    # So we replace their missing values with None
    objects = []
    for i in features.columns:
        if features[i].dtype == object:
            objects.append(i)
    features.update(features[objects].fillna('None'))

    # And we do the same thing for numerical features, but this time with 0s
    numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric = []
    for i in features.columns:
        if features[i].dtype in numeric_dtypes:
            numeric.append(i)
    features.update(features[numeric].fillna(0))
    return features

all_features = handle_missing(all_features)
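The group-wise fill used for LotFrontage is the least obvious step in handle_missing, so here is a minimal sketch of it on a toy frame. The column names mirror the competition's, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two neighborhoods, one missing LotFrontage in each.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 30.0, 40.0, np.nan],
})

# transform() returns a Series aligned with the original rows,
# so the result can be assigned straight back to the column.
df["LotFrontage"] = (
    df.groupby("Neighborhood")["LotFrontage"]
      .transform(lambda s: s.fillna(s.median()))
)
print(df)
```

Each gap is filled with the median of its own neighborhood (70 for A, 35 for B) rather than a single global median, which keeps the imputed value close to what similar houses look like.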

Input[24]:

# Let's make sure we handled all the missing values
missing = percent_missing(all_features)
df_miss = sorted(missing.items(), key=lambda x: x[1], reverse=True)
print('Percent of missing data')
df_miss[0:10]

Output[24]:

Percent of missing data

[('MSSubClass', 0.0),

('MSZoning', 0.0),

('LotFrontage', 0.0),

('LotArea', 0.0),

('Street', 0.0),

('Alley', 0.0),

('LotShape', 0.0),

('LandContour', 0.0),

('Utilities', 0.0),

('LotConfig', 0.0)]

The result above shows that all missing values have been filled.

Fixing skewed features

Input[25]:

# Fetch all numeric features
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric = []
for i in all_features.columns:
    if all_features[i].dtype in numeric_dtypes:
        numeric.append(i)

Input[26]:

# Create box plots for all numeric features
sns.set_style("white")
f, ax = plt.subplots(figsize=(8, 7))
ax.set_xscale("log")
ax = sns.boxplot(data=all_features[numeric], orient="h", palette="Set1")
ax.xaxis.grid(False)
ax.set(ylabel="Feature names")
ax.set(xlabel="Numeric values")
ax.set(title="Numeric Distribution of Features")
sns.despine(trim=True, left=True)

Input[27]:

# Find skewed numerical features
skew_features = all_features[numeric].apply(lambda x: skew(x)).sort_values(ascending=False)

high_skew = skew_features[skew_features > 0.5]
skew_index = high_skew.index

print("There are {} numerical features with Skew > 0.5 :".format(high_skew.shape[0]))
skewness = pd.DataFrame({'Skew' :high_skew})
skew_features.head(10)

Output[27]:

There are 25 numerical features with Skew > 0.5 :

MiscVal 21.939672

PoolArea 17.688664

LotArea 13.109495

LowQualFinSF 12.084539

3SsnPorch 11.372080

KitchenAbvGr 4.300550

BsmtFinSF2 4.144503

EnclosedPorch 4.002344

ScreenPorch 3.945101

BsmtHalfBath 3.929996

dtype: float64

We use scipy's boxcox1p function to apply the Box-Cox transformation and bring the skewed features closer to a normal distribution.

Input[28]:

# Normalize skewed features
for i in skew_index:
    all_features[i] = boxcox1p(all_features[i], boxcox_normmax(all_features[i] + 1))
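As a small illustration of what this loop does to a single column, the sketch below applies the same two scipy calls to one synthetic right-skewed feature (the data is an assumption made up for the example) and compares the skewness before and after.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax, skew

rng = np.random.default_rng(0)
# Synthetic non-negative, right-skewed feature (hypothetical data).
x = rng.lognormal(mean=3.0, sigma=1.0, size=2000)

# boxcox_normmax needs strictly positive input, hence the + 1,
# which matches the shift that boxcox1p applies internally.
lam = boxcox_normmax(x + 1)
x_bc = boxcox1p(x, lam)  # ((1 + x)**lam - 1) / lam, or log1p(x) when lam == 0

skew_raw = skew(x)
skew_bc = skew(x_bc)
print(lam, skew_raw, skew_bc)
```

The lambda is estimated separately for every feature, so each column gets the power transform that suits its own shape instead of one fixed log transform.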

Input[29]:

# Let's make sure we handled all the skewed values
sns.set_style("white")
f, ax = plt.subplots(figsize=(8, 7))
ax.set_xscale("log")
ax = sns.boxplot(data=all_features[skew_index], orient="h", palette="Set1")
ax.xaxis.grid(False)
ax.set(ylabel="Feature names")
ax.set(xlabel="Numeric values")
ax.set(title="Numeric Distribution of Features")
sns.despine(trim=True, left=True)

The plot shows that all of the features now look approximately normally distributed.

Creating some useful features

Machine learning models have trouble recognizing complex patterns on their own, so we use our intuition about the data to build informative features that help the models learn more effectively.

# Indicator features. Note that the Has* flags below are 1 when the area is 0, i.e. when the feature is absent
all_features['BsmtFinType1_Unf'] = 1*(all_features['BsmtFinType1'] == 'Unf')
all_features['HasWoodDeck'] = (all_features['WoodDeckSF'] == 0) * 1
all_features['HasOpenPorch'] = (all_features['OpenPorchSF'] == 0) * 1
all_features['HasEnclosedPorch'] = (all_features['EnclosedPorch'] == 0) * 1
all_features['Has3SsnPorch'] = (all_features['3SsnPorch'] == 0) * 1
all_features['HasScreenPorch'] = (all_features['ScreenPorch'] == 0) * 1
all_features['YearsSinceRemodel'] = all_features['YrSold'].astype(int) - all_features['YearRemodAdd'].astype(int)
all_features['Total_Home_Quality'] = all_features['OverallQual'] + all_features['OverallCond']
all_features = all_features.drop(['Utilities', 'Street', 'PoolQC',], axis=1)

# Aggregate area, age and bathroom features
all_features['TotalSF'] = all_features['TotalBsmtSF'] + all_features['1stFlrSF'] + all_features['2ndFlrSF']
all_features['YrBltAndRemod'] = all_features['YearBuilt'] + all_features['YearRemodAdd']
all_features['Total_sqr_footage'] = (all_features['BsmtFinSF1'] + all_features['BsmtFinSF2'] +
                                     all_features['1stFlrSF'] + all_features['2ndFlrSF'])
all_features['Total_Bathrooms'] = (all_features['FullBath'] + (0.5 * all_features['HalfBath']) +
                                   all_features['BsmtFullBath'] + (0.5 * all_features['BsmtHalfBath']))
all_features['Total_porch_sf'] = (all_features['OpenPorchSF'] + all_features['3SsnPorch'] +
                                  all_features['EnclosedPorch'] + all_features['ScreenPorch'] +
                                  all_features['WoodDeckSF'])

# Replace zero (or negative) values with fixed constants
all_features['TotalBsmtSF'] = all_features['TotalBsmtSF'].apply(lambda x: np.exp(6) if x <= 0.0 else x)
all_features['2ndFlrSF'] = all_features['2ndFlrSF'].apply(lambda x: np.exp(6.5) if x <= 0.0 else x)
all_features['GarageArea'] = all_features['GarageArea'].apply(lambda x: np.exp(6) if x <= 0.0 else x)
all_features['GarageCars'] = all_features['GarageCars'].apply(lambda x: 0 if x <= 0.0 else x)
all_features['LotFrontage'] = all_features['LotFrontage'].apply(lambda x: np.exp(4.2) if x <= 0.0 else x)
all_features['MasVnrArea'] = all_features['MasVnrArea'].apply(lambda x: np.exp(4) if x <= 0.0 else x)
all_features['BsmtFinSF1'] = all_features['BsmtFinSF1'].apply(lambda x: np.exp(6.5) if x <= 0.0 else x)

# Presence flags
all_features['haspool'] = all_features['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
all_features['has2ndfloor'] = all_features['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
all_features['hasgarage'] = all_features['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
all_features['hasbsmt'] = all_features['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
all_features['hasfireplace'] = all_features['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

Feature transformations

Taking the logarithm or the square of numeric features creates additional features, which can surface potentially useful predictors.

def logs(res, ls):
    # Append a '<name>_log' column holding log(1.01 + x) for every feature in ls
    m = res.shape[1]
    for l in ls:
        res = res.assign(newcol=pd.Series(np.log(1.01+res[l])).values)
        res.columns.values[m] = l + '_log'
        m += 1
    return res

log_features = ['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF',
                'TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea',
                'BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr',
                'TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF',
                'EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal','YearRemodAdd','TotalSF']

all_features = logs(all_features, log_features)

def squares(res, ls):
    # Append a '<name>_sq' column holding x*x for every feature in ls
    m = res.shape[1]
    for l in ls:
        res = res.assign(newcol=pd.Series(res[l]*res[l]).values)
        res.columns.values[m] = l + '_sq'
        m += 1
    return res

squared_features = ['YearRemodAdd', 'LotFrontage_log',
                    'TotalBsmtSF_log', '1stFlrSF_log', '2ndFlrSF_log', 'GrLivArea_log',
                    'GarageCars_log', 'GarageArea_log']
all_features = squares(all_features, squared_features)
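The two helpers above rename columns by writing into `res.columns.values`. A more direct way to get the same kind of columns is sketched below; this is an alternative formulation with a hypothetical helper name and toy data, not the author's code.

```python
import numpy as np
import pandas as pd

def add_log_and_square(df, log_cols, square_cols):
    """Return a copy of df with '<col>_log' and '<col>_sq' columns appended."""
    out = df.copy()
    for col in log_cols:
        # The 1.01 offset follows the article's code and keeps zeros finite.
        out[col + "_log"] = np.log(1.01 + out[col])
    for col in square_cols:
        # Runs after the log loop, so '<col>_log' names are valid here too.
        out[col + "_sq"] = out[col] ** 2
    return out

# Hypothetical toy frame.
toy = pd.DataFrame({"LotArea": [0.0, 99.0], "GrLivArea": [10.0, 20.0]})
toy2 = add_log_and_square(toy, ["LotArea"], ["GrLivArea"])
print(toy2)
```

Assigning by column name avoids relying on column positions and leaves the input frame untouched.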

Encoding categorical features

Categorical features are encoded numerically so that the machine learning models can handle them.

all_features = pd.get_dummies(all_features).reset_index(drop=True)
all_features.shape

(2917, 379)

all_features.head()

all_features.shape

(2917, 379)

# Remove any duplicated column names
all_features = all_features.loc[:,~all_features.columns.duplicated()]
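For readers who have not used these two pandas idioms, here is a minimal sketch on a hypothetical three-row frame: `pd.get_dummies` one-hot encodes the string column, and the `columns.duplicated()` mask drops repeated column names.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RM", "RL"],
})

# Numeric columns pass through; each category becomes its own indicator column.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# Simulate a duplicated column name, then keep only the first occurrence of each name.
dup = pd.concat([encoded, encoded[["LotArea"]]], axis=1)
deduped = dup.loc[:, ~dup.columns.duplicated()]
print(dup.shape, deduped.shape)
```

Encoding train and test together, as the article does, guarantees that both end up with the same indicator columns.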

Recreating the training and test sets

X = all_features.iloc[:len(train_labels), :]
X_test = all_features.iloc[len(train_labels):, :]
X.shape, train_labels.shape, X_test.shape

((1458, 378), (1458,), (1459, 378))

Visualizing some of the features in the training set

# Finding numeric features
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric = []
for i in X.columns:
    if X[i].dtype in numeric_dtypes:
        if i in ['TotalSF', 'Total_Bathrooms','Total_porch_sf','haspool','hasgarage','hasbsmt','hasfireplace']:
            pass
        else:
            numeric.append(i)

# visualising some more outliers in the data values
fig, axs = plt.subplots(ncols=2, nrows=0, figsize=(12, 150))
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)
sns.color_palette("husl", 8)
for i, feature in enumerate(list(X[numeric]), 1):
    if(feature=='MiscVal'):
        break
    plt.subplot(len(list(numeric)), 3, i)
    sns.scatterplot(x=feature, y='SalePrice', hue='SalePrice', palette='Blues', data=train)

    plt.xlabel('{}'.format(feature), size=15,labelpad=12.5)
    plt.ylabel('SalePrice', size=15, labelpad=12.5)

    for j in range(2):
        plt.tick_params(axis='x', labelsize=12)
        plt.tick_params(axis='y', labelsize=12)

    plt.legend(loc='best', prop={'size': 10})

plt.show()

Model training

Key details of model training

Cross-validation: 12-fold cross-validation is used

Models: seven models are trained (ridge, SVR, gradient boosting, random forest, XGBoost and LightGBM regressors, plus the stacked regressor that combines them)

Stacking: a StackingCVRegressor meta-learner is trained, with XGBoost as the meta-regressor

Blending: every trained model overfits to some degree, so the final predictions come from a blend of these models, which gives more robust predictions

Setting up cross-validation and defining the error metric

# Setup cross validation folds
kf = KFold(n_splits=12, random_state=42, shuffle=True)

# Define error metrics
# (the labels are already on the log scale, so a plain RMSE here is the RMSLE of the prices)
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

def cv_rmse(model, X=X):
    rmse = np.sqrt(-cross_val_score(model, X, train_labels, scoring="neg_mean_squared_error", cv=kf))
    return (rmse)
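To show what cv_rmse returns, the following self-contained sketch runs the same scikit-learn calls on synthetic regression data. The data, the `_demo` names and the choice of a plain Ridge model are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): 120 rows, 5 features, a linear signal plus small noise.
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=120)

kf_demo = KFold(n_splits=12, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so the MSE comes back negated;
# flip the sign before taking the square root.
neg_mse = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=kf_demo)
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

The result is one RMSE per fold, which is why the article reports both a mean and a standard deviation for every model.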

Setting up the models

# Light Gradient Boosting Regressor
lightgbm = LGBMRegressor(objective='regression',
                         num_leaves=6,
                         learning_rate=0.01,
                         n_estimators=7000,
                         max_bin=200,
                         bagging_fraction=0.8,
                         bagging_freq=4,
                         bagging_seed=8,
                         feature_fraction=0.2,
                         feature_fraction_seed=8,
                         min_sum_hessian_in_leaf=11,
                         verbose=-1,
                         random_state=42)

# XGBoost Regressor
# ('reg:linear' is called 'reg:squarederror' in later XGBoost releases)
xgboost = XGBRegressor(learning_rate=0.01,
                       n_estimators=6000,
                       max_depth=4,
                       min_child_weight=0,
                       gamma=0.6,
                       subsample=0.7,
                       colsample_bytree=0.7,
                       objective='reg:linear',
                       nthread=-1,
                       scale_pos_weight=1,
                       seed=27,
                       reg_alpha=0.00006,
                       random_state=42)

# Ridge Regressor
ridge_alphas = [1e-15, 1e-10, 1e-8, 9e-4, 7e-4, 5e-4, 3e-4, 1e-4, 1e-3, 5e-2, 1e-2, 0.1, 0.3, 1, 3, 5, 10, 15, 18, 20, 30, 50, 75, 100]
ridge = make_pipeline(RobustScaler(), RidgeCV(alphas=ridge_alphas, cv=kf))

# Support Vector Regressor
svr = make_pipeline(RobustScaler(), SVR(C=20, epsilon=0.008, gamma=0.0003))

# Gradient Boosting Regressor
gbr = GradientBoostingRegressor(n_estimators=6000,
                                learning_rate=0.01,
                                max_depth=4,
                                max_features='sqrt',
                                min_samples_leaf=15,
                                min_samples_split=10,
                                loss='huber',
                                random_state=42)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=1200,
                           max_depth=15,
                           min_samples_split=5,
                           min_samples_leaf=5,
                           max_features=None,
                           oob_score=True,
                           random_state=42)

# Stack up all the models above, optimized using xgboost
stack_gen = StackingCVRegressor(regressors=(xgboost, lightgbm, svr, ridge, gbr, rf),
                                meta_regressor=xgboost,
                                use_features_in_secondary=True)
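The stacking idea can be tried without mlxtend, XGBoost or LightGBM. The sketch below swaps in scikit-learn's own `StackingRegressor` with three small base models and a ridge meta-learner on synthetic data. It is a stand-in under those assumptions, not the article's StackingCVRegressor setup, and every `_demo` name is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): the target depends on the first two features.
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=150)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("svr", SVR(C=10.0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-learner; the article uses XGBoost here
    cv=5,              # the meta-learner is trained on out-of-fold base predictions
    passthrough=True,  # also pass the original features to the meta-learner
)
stack.fit(X_demo, y_demo)
stack_pred = stack.predict(X_demo[:5])
print(stack_pred)
```

Here `passthrough=True` plays the role that `use_features_in_secondary=True` plays in the article's code: the meta-learner sees the raw features as well as the base models' predictions.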

Training the models

Compute the cross-validation score of each model

scores = {}

score = cv_rmse(lightgbm)
print("lightgbm: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['lgb'] = (score.mean(), score.std())

lightgbm: 0.1159 (0.0167)

score = cv_rmse(xgboost)
print("xgboost: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['xgb'] = (score.mean(), score.std())

xgboost: 0.1364 (0.0175)

score = cv_rmse(svr)
print("SVR: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['svr'] = (score.mean(), score.std())

SVR: 0.1094 (0.0200)

score = cv_rmse(ridge)
print("ridge: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['ridge'] = (score.mean(), score.std())

ridge: 0.1101 (0.0161)

score = cv_rmse(rf)
print("rf: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['rf'] = (score.mean(), score.std())

rf: 0.1366 (0.0188)

score = cv_rmse(gbr)
print("gbr: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['gbr'] = (score.mean(), score.std())

gbr: 0.1121 (0.0164)

Fitting the models

print('stack_gen')
stack_gen_model = stack_gen.fit(np.array(X), np.array(train_labels))

stack_gen

print('lightgbm')
lgb_model_full_data = lightgbm.fit(X, train_labels)

lightgbm

print('xgboost')
xgb_model_full_data = xgboost.fit(X, train_labels)

xgboost

print('Svr')
svr_model_full_data = svr.fit(X, train_labels)

Svr

print('Ridge')
ridge_model_full_data = ridge.fit(X, train_labels)

Ridge

print('RandomForest')
rf_model_full_data = rf.fit(X, train_labels)

RandomForest

print('GradientBoosting')
gbr_model_full_data = gbr.fit(X, train_labels)

GradientBoosting

Blending the models and making the final predictions

# Blend models in order to make the final predictions more robust to overfitting
def blended_predictions(X):
    return ((0.1 * ridge_model_full_data.predict(X)) +
            (0.2 * svr_model_full_data.predict(X)) +
            (0.1 * gbr_model_full_data.predict(X)) +
            (0.1 * xgb_model_full_data.predict(X)) +
            (0.1 * lgb_model_full_data.predict(X)) +
            (0.05 * rf_model_full_data.predict(X)) +
            (0.35 * stack_gen_model.predict(np.array(X))))

# Get final predictions from the blended model
blended_score = rmsle(train_labels, blended_predictions(X))
scores['blended'] = (blended_score, 0)
print('RMSLE score on train data:')
print(blended_score)

RMSLE score on train data:

0.07537440195302639
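The blend is a fixed weighted average of the seven fitted models. The sketch below reproduces just that arithmetic with the weights from blended_predictions; the short model labels and the per-model prediction values are hypothetical stand-ins for real model outputs.

```python
import numpy as np

# Weights copied from blended_predictions; the keys are labels chosen for this sketch.
weights = {"ridge": 0.10, "svr": 0.20, "gbr": 0.10, "xgb": 0.10,
           "lgb": 0.10, "rf": 0.05, "stack": 0.35}

def blend(predictions, weights):
    """Weighted sum of per-model prediction arrays, keyed by model label."""
    return sum(weights[name] * np.asarray(pred) for name, pred in predictions.items())

# Hypothetical log-price predictions for two houses; each model differs by 0.01 from the previous one.
preds = {name: np.array([12.0, 11.5]) + 0.01 * i for i, name in enumerate(weights)}

blended = blend(preds, weights)
total_weight = sum(weights.values())
print(total_weight, blended)
```

Because the weights sum to 1, the blended value always lies between the smallest and largest individual prediction, which is what dampens any single model's overfitting.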

Comparing model performance

# Plot the predictions for each model
sns.set_style("white")
fig = plt.figure(figsize=(24, 12))

ax = sns.pointplot(x=list(scores.keys()), y=[score for score, _ in scores.values()], markers=['o'], linestyles=['-'])
for i, score in enumerate(scores.values()):
    ax.text(i, score[0] + 0.002, '{:.6f}'.format(score[0]), horizontalalignment='left', size='large', color='black', weight='semibold')

plt.ylabel('Score (RMSE)', size=20, labelpad=12.5)
plt.xlabel('Model', size=20, labelpad=12.5)
plt.tick_params(axis='x', labelsize=13.5)
plt.tick_params(axis='y', labelsize=12.5)

plt.title('Scores of Models', size=20)

plt.show()

The plot shows that the blended model scores best, with an RMSE of about 0.075, and it is the model used for the final predictions. Note that this figure is computed on the training data, whereas the scores of the individual models are cross-validation averages, so the two are not strictly comparable.

Submitting the predictions

# Read in sample_submission dataframe
submission = pd.read_csv("../input/house-prices-advanced-regression-techniques/sample_submission.csv")
submission.shape

(1459, 2)

# Append predictions from blended models
# (expm1 undoes the log1p transform applied to SalePrice earlier)
submission.iloc[:,1] = np.floor(np.expm1(blended_predictions(X_test)))

# Fix outlier predictions
q1 = submission['SalePrice'].quantile(0.0045)
q2 = submission['SalePrice'].quantile(0.99)
submission['SalePrice'] = submission['SalePrice'].apply(lambda x: x if x > q1 else x*0.77)
submission['SalePrice'] = submission['SalePrice'].apply(lambda x: x if x < q2 else x*1.1)
submission.to_csv("submission_regression1.csv", index=False)

# Scale predictions
submission['SalePrice'] *= 1.001619
submission.to_csv("submission_regression2.csv", index=False)
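The post-processing above packs two ideas into a few lines: mapping log-scale predictions back to prices, and nudging the most extreme predictions further outward. A minimal sketch on hypothetical predictions, using the same thresholds and factors, makes the effect easy to inspect.

```python
import numpy as np
import pandas as pd

# Hypothetical blended predictions on the log1p scale, evenly spaced for clarity.
log_pred = pd.Series(np.linspace(11.0, 13.0, 201))
price = np.floor(np.expm1(log_pred))  # back to whole currency units

q_low = price.quantile(0.0045)
q_high = price.quantile(0.99)

# Same rule as the article: values at or below the low quantile are multiplied by 0.77,
# values at or above the high quantile are multiplied by 1.1, everything else is kept.
adjusted = price.where(price > q_low, price * 0.77)
adjusted = adjusted.where(adjusted < q_high, adjusted * 1.1)
print(price.iloc[[0, 100, 200]].tolist(), adjusted.iloc[[0, 100, 200]].tolist())
```

Only the tails move: the lowest prediction shrinks, the highest grows, and the bulk of the predictions stays unchanged. The specific factors are the author's hand-tuned choices rather than something derived from the data in this article.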


Original title: How I Fought My Way into the Top 0.3% of a Kaggle Competition

Source: WeChat ID rgznai100 (WeChat official account: rgznai100). Please credit the source when reposting.
