Can decision trees or random forests handle missing data directly, or is preprocessing required?
I. Ways for decision trees / random forests (RF) to handle missing data directly:
1. CART can use surrogate splits, but according to the Random Survival Forests paper, surrogate splits are not recommended for RF (a small sketch of the surrogate-split idea follows this list):
"Although surrogate splitting works well for trees, the method may not be well suited for forests. Speed is one issue. Finding a surrogate split is computationally intensive and may become infeasible when growing a large number of trees, especially for fully saturated trees used by forests. Further, surrogate splits may not even be meaningful in a forest paradigm. RF randomly selects variables when splitting a node and, as such, variables within a node may be uncorrelated, and a reasonable surrogate split may not exist. Another concern is that surrogate splitting alters the interpretation of a variable, which affects measures such as VIMP."
2. Use C4.5 instead of CART. C4.5 does not use the missing values directly when computing information gain: the gain is computed from the cases whose value is known and scaled by the fraction of known cases, and cases with missing values are passed down the branches with fractional weights.
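To make the surrogate-split idea in item 1 concrete, here is a minimal NumPy sketch on made-up data, with a hard-coded primary split. The helper `best_surrogate_threshold` is my own illustrative function, not how CART actually implements surrogates: given the primary split, it searches the other features for the threshold whose left/right assignment agrees best with the primary split, and that surrogate would then be used to route rows whose primary feature is missing.

```python
import numpy as np

def best_surrogate_threshold(feature, go_left):
    """Find the threshold on `feature` whose split best reproduces the
    left/right assignment (`go_left`) of the primary split, judged by
    agreement rate (both split directions are tried)."""
    known = ~np.isnan(feature)
    x, y = feature[known], go_left[known]
    best_thr, best_agree = None, -np.inf
    for thr in np.unique(x):
        left = x <= thr
        agree = max(np.mean(left == y), np.mean(left != y))
        if agree > best_agree:
            best_thr, best_agree = thr, agree
    return best_thr, best_agree

# Toy data: the primary split is "x0 <= 0.5", and some x0 values are missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # x1 is strongly correlated with x0
X[rng.random(200) < 0.2, 0] = np.nan             # knock out 20% of x0

known_x0 = ~np.isnan(X[:, 0])
go_left = X[known_x0, 0] <= 0.5                  # outcome of the primary split

for j in (1, 2):
    thr, agree = best_surrogate_threshold(X[known_x0, j], go_left)
    print(f"surrogate on x{j}: threshold={thr:.3f}, agreement={agree:.2f}")
# Rows with x0 missing would be routed by the surrogate with the highest
# agreement (here x1, since it mimics x0 well); x2 stays close to chance.
```

This search is exactly the extra per-node cost the quoted passage complains about: in a forest it has to be repeated for every node of every tree, over candidate variables that may be uncorrelated with the primary split variable.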
II. Alternatively, preprocess the data with imputation:
1. Fill with the column average/median/mode; or, following the original RF approach, fill with a weighted average/median/mode, where the weights are the proximities (similarities) between the point with the missing value and the other data points. See the imputation sketch after this list.
2. Use a more sophisticated algorithm to estimate the missing values, e.g. SVDmiss in R, which alternates between computing an SVD and refilling the missing entries; or missRanger and missForest, which alternate between imputation and fitting a random forest.
3. Another option is "on the fly imputation" (OTFI): missing entries are filled in randomly with values observed at other data points, but the imputed values are not used when evaluating splits. A small sketch of this random draw also follows the list.
4. The paper "Handling missing data in trees: surrogate splits or statistical imputation?" finds that imputation is computationally cheap and performs well.
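To illustrate items 1 and 2, here is a minimal scikit-learn sketch on synthetic data: a plain mean fill, plus an iterative, forest-based fill in the spirit of missForest/missRanger. Those are R packages; `IterativeImputer` with a `RandomForestRegressor` is only a rough Python analogue, and the data, parameters, and error comparison below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:, 3] = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=300)   # x3 depends on x0, x1
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan             # 15% of entries missing

# Item 1: simple column-wise mean imputation.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Item 2 (missForest-style): iteratively regress each feature on the others
# with a random forest and refill the missing entries.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_rf = rf_imputer.fit_transform(X_missing)

# Compare reconstruction error on the entries that were actually missing;
# the forest-based fill can exploit the x0, x1 -> x3 relationship.
mask = np.isnan(X_missing)
print("mean imputation RMSE:", np.sqrt(np.mean((X_mean[mask] - X[mask]) ** 2)))
print("RF-iterative RMSE:   ", np.sqrt(np.mean((X_rf[mask] - X[mask]) ** 2)))
```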
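And a tiny sketch of the random draw behind OTFI (item 3), again on synthetic data. The standalone function `otfi_style_draw` is illustrative only: in the actual algorithm this draw happens inside each node as the forest grows, and the drawn values are kept out of the split statistics; the sketch only shows what "fill randomly with observed values of the same feature" means.

```python
import numpy as np

def otfi_style_draw(X, rng):
    """Replace every missing entry with a value drawn at random from the
    observed values of the same feature. In the real OTFI procedure the
    draw happens node by node while the forest is grown, and the drawn
    values are used only to route cases, never to evaluate splits."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        if missing.any():
            col[missing] = rng.choice(col[~missing], size=missing.sum())
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X[rng.random(X.shape) < 0.25] = np.nan
print(np.isnan(otfi_style_draw(X, rng)).sum())   # 0: every gap is filled
```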
My take is that imputation is usually the better choice: imputation and training are two independent steps, the imputed data set is stable and easier to analyze, and training is computationally cheaper. For a single decision tree you can get away without imputation, but a random forest needs it. A minimal imputation-then-train sketch follows.
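A minimal end-to-end sketch of that workflow, assuming scikit-learn and synthetic data (the data set and parameter choices are arbitrary): imputation and forest training are kept as two explicit pipeline steps, so the imputed data can be inspected on its own, and during cross-validation the imputer is re-fit only on the training folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic classification data with 10% of the values knocked out at random.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Step 1: impute; step 2: train the forest. Wrapping both in a pipeline keeps
# them separate and avoids leaking statistics from the validation folds into
# the imputer during cross-validation.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy:", scores.mean().round(3))
```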