How I learned to stop worrying and love Random Forests (Part 2)


Hello and welcome back to our exploration of Random Forests! Today I'm going to sprinkle a little unsupervised learning into our analysis. I'm also going to try a few interpretation techniques that are useful not only for getting to know your data better, but also for explaining why your model gives the results it does.

Removing Redundant Features with Unsupervised Learning

One thing that makes a model harder to interpret is having variables with very similar meanings. Let's try to remove redundant features using some unsupervised learning; specifically, hierarchical (or agglomerative) clustering. We'll do this by taking advantage of Spearman's rank correlation coefficient. Spearman's correlation is the same thing as converting two variables to rank values and then running a standard Pearson's correlation on those ranked variables. By 'rank values' I simply mean sorting each variable's values in increasing order; a value's place in that ordering is its 'rank'. Pearson's correlation then measures the linear correlation between these rank variables. One can think of this as checking which features' ranks increase together, and grouping the features whose rank orderings are correlated in a similar fashion. Let's see how we do this analysis in Python.
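First, a quick sanity check that Spearman's correlation really is just Pearson's correlation applied to ranks, on some made-up data:

import numpy as np
import scipy.stats

# Two made-up variables that are monotonically (but not linearly) related
a = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0])
b = np.array([2.0, 3.0, 5.0, 8.0, 13.0, 21.0])

rho, _ = scipy.stats.spearmanr(a, b)             # Spearman's rho directly
ranked_a = scipy.stats.rankdata(a)
ranked_b = scipy.stats.rankdata(b)
r, _ = scipy.stats.pearsonr(ranked_a, ranked_b)  # Pearson's r on the ranks

print(rho, r)  # identical: both 1.0 here, since the two rank orderings match exactly

With that intuition in hand, here's the clustering analysis itself: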

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc

# Pairwise Spearman rank correlations between the kept features
corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)
# Turn correlation into a condensed distance matrix (distance = 1 - correlation)
corr_condensed = hc.distance.squareform(1 - corr)
# Agglomerative clustering with average linkage
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16, 10))
dendrogram = hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=16)
plt.show()

[png: dendrogram from hierarchical clustering of the df_keep features]

# Same dendrogram for the second feature set (df_keep2, which includes some one-hot-encoded columns)
corr = np.round(scipy.stats.spearmanr(df_keep2).correlation, 4)
corr_condensed = hc.distance.squareform(1 - corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16, 10))
dendrogram = hc.dendrogram(z, labels=df_keep2.columns, orientation='left', leaf_font_size=16)
plt.show()

[png: dendrogram from hierarchical clustering of the df_keep2 features]

It doesn't look like there's much redundancy in either feature set, though one worrisome aspect is the big blue split that occurs in both dendrograms. There seem to be two major groupings in each set of data, which leads me to believe that some of these features are closely linked. Intuitively this makes sense, e.g. GarageArea and GarageCars must capture the same aspect of the data, namely how big the garage is.
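We can check that intuition directly with a quick Spearman correlation between the two garage columns:

# How strongly do the two garage-size features track each other?
rho, p = scipy.stats.spearmanr(df_keep['GarageArea'], df_keep['GarageCars'])
print(rho)

A rank correlation close to 1 would confirm that the two columns are mostly telling the same story about garage size.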

display_all(df_keep.describe(include='all').T)
count mean std min 25% 50% 75% max
OverallQual 1460.0 6.099315 1.382997 1.0 5.00 6.0 7.00 10.0
GrLivArea 1460.0 1515.463699 525.480383 334.0 1129.50 1464.0 1776.75 5642.0
YearBuilt 1460.0 1971.267808 30.202904 1872.0 1954.00 1973.0 2000.00 2010.0
GarageArea 1460.0 472.980137 213.804841 0.0 334.50 480.0 576.00 1418.0
TotalBsmtSF 1460.0 1057.429452 438.705324 0.0 795.75 991.5 1298.25 6110.0
GarageCars 1460.0 1.767123 0.747315 0.0 1.00 2.0 2.00 4.0
1stFlrSF 1460.0 1162.626712 386.587738 334.0 882.00 1087.0 1391.25 4692.0
ExterQual 1460.0 3.539726 0.693995 1.0 3.00 4.0 4.00 4.0
YearRemodAdd 1460.0 1984.865753 20.645407 1950.0 1967.00 1994.0 2004.00 2010.0
FullBath 1460.0 1.565068 0.550916 0.0 1.00 2.0 2.00 3.0
LotArea 1460.0 10516.828082 9981.264932 1300.0 7553.50 9478.5 11601.50 215245.0
GarageQual 1460.0 4.594521 1.262078 0.0 5.00 5.0 5.00 5.0
2ndFlrSF 1460.0 346.992466 436.528436 0.0 0.00 0.0 728.00 2065.0
BsmtFinSF1 1460.0 443.639726 456.098091 0.0 0.00 383.5 712.25 5644.0
BsmtQual 1460.0 3.178767 0.998402 0.0 3.00 3.0 4.00 4.0
FireplaceQu 1460.0 1.969178 2.037956 0.0 0.00 2.0 3.00 5.0
OverallCond 1460.0 5.575342 1.112799 1.0 5.00 5.0 6.00 9.0
Fireplaces 1460.0 0.613014 0.644666 0.0 0.00 1.0 1.00 3.0
LotFrontage 1460.0 69.863699 22.027677 21.0 60.00 69.0 79.00 313.0
CentralAir 1460.0 1.934932 0.246731 1.0 2.00 2.0 2.00 2.0
PavedDrive 1460.0 2.856164 0.496592 1.0 3.00 3.0 3.00 3.0
ExterCond 1460.0 4.733562 0.731807 1.0 5.00 5.0 5.00 5.0
Neighborhood 1460.0 13.251370 6.013735 1.0 8.00 13.0 18.00 25.0

display_all(df_keep2.describe(include='all').T)
count mean std min 25% 50% 75% max
OverallQual 1460.0 6.099315 1.382997 1.0 5.00 6.0 7.00 10.0
GrLivArea 1460.0 1515.463699 525.480383 334.0 1129.50 1464.0 1776.75 5642.0
YearBuilt 1460.0 1971.267808 30.202904 1872.0 1954.00 1973.0 2000.00 2010.0
TotalBsmtSF 1460.0 1057.429452 438.705324 0.0 795.75 991.5 1298.25 6110.0
GarageArea 1460.0 472.980137 213.804841 0.0 334.50 480.0 576.00 1418.0
ExterQual_TA 1460.0 0.620548 0.485417 0.0 0.00 1.0 1.00 1.0
1stFlrSF 1460.0 1162.626712 386.587738 334.0 882.00 1087.0 1391.25 4692.0
GarageCars 1460.0 1.767123 0.747315 0.0 1.00 2.0 2.00 4.0
LotArea 1460.0 10516.828082 9981.264932 1300.0 7553.50 9478.5 11601.50 215245.0
BsmtFinSF1 1460.0 443.639726 456.098091 0.0 0.00 383.5 712.25 5644.0
GarageYrBlt 1460.0 1978.589041 23.997022 1900.0 1962.00 1980.0 2001.00 2010.0
YearRemodAdd 1460.0 1984.865753 20.645407 1950.0 1967.00 1994.0 2004.00 2010.0
2ndFlrSF 1460.0 346.992466 436.528436 0.0 0.00 0.0 728.00 2065.0
GarageQual_TA 1460.0 0.897945 0.302824 0.0 1.00 1.0 1.00 1.0
OverallCond 1460.0 5.575342 1.112799 1.0 5.00 5.0 6.00 9.0
LotFrontage 1460.0 69.863699 22.027677 21.0 60.00 69.0 79.00 313.0
Fireplaces 1460.0 0.613014 0.644666 0.0 0.00 1.0 1.00 3.0
Neighborhood 1460.0 13.251370 6.013735 1.0 8.00 13.0 18.00 25.0
ExterCond_Fa 1460.0 0.019178 0.137198 0.0 0.00 0.0 0.00 1.0
BsmtUnfSF 1460.0 567.240411 441.866955 0.0 223.00 477.5 808.00 2336.0
FullBath 1460.0 1.565068 0.550916 0.0 1.00 2.0 2.00 3.0

Let's look at the partial dependence to see if we can find any interactions between the features.

Partial Dependence Plots

Now we're at the point where we want to really understand the data better by taking advantage of our trained model. We are going to use something called partial dependence: we'll take the important features and look at how they relate to the dependent variable. First, let's look at a few univariate relationships.

df_keep2.to_feather('tmp/house-prices-ohe-trim')
fi2[:10]
cols imp
0 OverallQual 0.368414
1 GrLivArea 0.173185
2 YearBuilt 0.085010
3 TotalBsmtSF 0.056902
9 GarageArea 0.042945
7 1stFlrSF 0.035206
6 BsmtFinSF1 0.031514
4 GarageCars 0.023275
10 LotArea 0.022124
12 YearRemodAdd 0.021555

df_raw = pd.read_feather('tmp/house-prices-raw')
x_all = get_sample(df_raw, 500)

ggplot provides a great way of looking at univariate relationships. By passing se=True (standard error), it also shows the confidence interval around the smoother. loess stands for locally weighted regression, which is the idea of fitting lots of little mini regressions across the range of the data.
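If you want to see that smoother on its own, outside of ggplot, statsmodels ships a standalone lowess implementation; a minimal sketch:

from statsmodels.nonparametric.smoothers_lowess import lowess

# Returns the (x, smoothed y) pairs from the locally weighted regressions, sorted by x;
# frac controls how much of the data each little local regression sees.
sm = lowess(x_all['SalePrice'], x_all['OverallQual'], frac=0.5)
plt.plot(sm[:, 0], sm[:, 1])
plt.show()

Here, though, we'll stick with ggplot: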

ggplot(x_all, aes('OverallQual', 'SalePrice'))+stat_smooth(se=True, method='loess')

[png: loess smooth of SalePrice vs. OverallQual]


Looks like something occurs at an OverallQual of 6. The slope of the log of the sale price shifts to a higher value.

ggplot(x_all, aes('GrLivArea', 'SalePrice'))+stat_smooth(se=True, method='loess')

[png: loess smooth of SalePrice vs. GrLivArea]

ggplot(x_all, aes('YearBuilt', 'SalePrice'))+stat_smooth(se=True, method='loess')

[png: loess smooth of SalePrice vs. YearBuilt]


While all our previous univariate relationships were basically straight lines, here we have some odd behavior. The problem is that when you look at a univariate relationship like this, there is a whole lot of collinearity going on, and a whole lot of interactions being hidden. For example, why did the price drop? Is it actually because houses built between 1920 and 1960 are less valuable? Or is it because most of them were also sold during that time, and there was perhaps a recession then?

So what we really want to be able to ask is: what's the relationship between SalePrice and YearBuilt, all other things being equal? "All other things being equal" basically means: if we sold the exact same house, to the exact same person, in the exact same sale, but it had been built in 1990 vs. 1920, what would the difference in price have been? We can answer this with a partial dependence plot (PDP for short). The analysis takes a random sample of 500 rows from the training data, sets every row's YearBuilt to a given value, and uses our random forest model to predict the new SalePrice for each row. We repeat this across the entire range of YearBuilt values.
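To make the mechanics concrete, here's a minimal hand-rolled sketch of that procedure, assuming m2 is the forest trained on df_keep2 earlier and get_sample is the sampling helper from Part 1:

# Hand-rolled partial dependence for YearBuilt:
# for each candidate year, pretend every sampled house was built that year,
# predict, and average the predictions.
x_sample = get_sample(df_keep2, 500)
years = np.sort(x_sample['YearBuilt'].unique())

pdp_vals = []
for yr in years:
    x_tmp = x_sample.copy()
    x_tmp['YearBuilt'] = yr
    pdp_vals.append(m2.predict(x_tmp).mean())

plt.plot(years, pdp_vals)
plt.xlabel('YearBuilt')
plt.ylabel('mean prediction')
plt.show()

In practice we'll let the pdpbox library do this for us, since it also plots the individual per-row lines: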

from pdpbox import pdp
from plotnine import *

# 500 random training rows to compute the partial dependence over
x = get_sample(df_keep2, 500)

def plot_pdp(feat, model_feat, clusters=None, feat_name=None):
    # Isolate the partial dependence of a single feature and plot it,
    # optionally clustering the individual per-row lines
    feat_name = feat_name or feat
    p = pdp.pdp_isolate(m2, x, model_feat, feat)
    return pdp.pdp_plot(p, feat_name, plot_lines=True,
                        cluster=clusters is not None, n_cluster_centers=clusters)

model_feat = x.columns.values.tolist()
plot_pdp('YearBuilt', model_feat)

[png: partial dependence plot for YearBuilt]

Each blue line represents one of the 500 sampled rows, traced across the range of YearBuilt, while the black line is the average of all of them, i.e. the partial dependence itself. While this isn't a completely straight line, it appears much more linear than the U shape we saw above. Hence it would appear that the high SalePrice of the older houses isn't driven by the year itself; perhaps these are historical houses that carry some special value that isn't represented in the data.

While we're at it, let's see if we can better understand that kink that was in the OverallQual curve.

plot_pdp('OverallQual',model_feat)

[png: partial dependence plot for OverallQual]

plot_pdp('OverallQual',model_feat,clusters = 5)

[png: partial dependence plot for OverallQual with 5 clusters]

All other things being equal, there does indeed appear to be something special about an OverallQual of 6. Most likely this feature is simply shorthand for a number of other features. Let's keep exploring the data.

plot_pdp('GrLivArea',model_feat)

[png: partial dependence plot for GrLivArea]

The above plot makes it seem like GrLivArea should definitely be log-transformed. After making that change and rerunning the feature importance, GrLivArea jumped from ~0.15 to ~0.24! Definitely seems like a good move.
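For reference, the change itself is just something like the following; this is a sketch that assumes the split_vals/n_trn helpers, the y_trn target, and the rf_feat_importance function carried over from Part 1:

# Log-transform GrLivArea (its minimum is 334, so a plain log is safe)
df_keep2['GrLivArea'] = np.log(df_keep2['GrLivArea'])

# Re-fit the same forest on the transformed data and recompute importances
X_train, X_valid = split_vals(df_keep2, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)
m2 = RandomForestRegressor(n_estimators=40, min_samples_leaf=1, max_features=0.6,
                           n_jobs=-1, oob_score=True)
m2.fit(X_train, y_train)
fi2 = rf_feat_importance(m2, df_keep2)

Now let's see the new plot: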

plot_pdp('GrLivArea',model_feat)

[png: partial dependence plot for GrLivArea after the log transform]

plot_pdp('TotalBsmtSF',model_feat)

[png: partial dependence plot for TotalBsmtSF]

Another one that could benefit from a log transform. In fact, it looks like a number of the SF and Area variables could benefit from it.
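One caveat if you try this on the other area columns: several of them (TotalBsmtSF, BsmtFinSF1, 2ndFlrSF) contain zeros, so np.log1p is the safer transform; a quick sketch:

# log1p = log(1 + x), which copes with the zero entries in these columns
for c in ['TotalBsmtSF', 'BsmtFinSF1', '2ndFlrSF', '1stFlrSF', 'LotArea']:
    df_keep2[c] = np.log1p(df_keep2[c])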

df_trn2, y_trn, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7)
X_train, X_valid = split_vals(df_trn2, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=1, max_features=0.6, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

Tree Interpreter