Can decision trees or random forests handle missing data directly, or is preprocessing required?
I. Ways for decision trees / random forests (RF) to handle missing data directly:
1. CART can use surrogate splits, but according to the Random Survival Forests paper, surrogate splits are not recommended for RF (a small sketch of the surrogate-split idea follows this list):
"Although surrogate splitting works well for trees, the method may not be well suited for forests. Speed is one issue. Finding a surrogate split is computationally intensive and may become infeasible when growing a large number of trees, especially for fully saturated trees used by forests. Further, surrogate splits may not even be meaningful in a forest paradigm. RF randomly selects variables when splitting a node and, as such, variables within a node may be uncorrelated, and a reasonable surrogate split may not exist. Another concern is that surrogate splitting alters the interpretation of a variable, which affects measures such as VIMP."
2. Use C4.5 instead of CART. C4.5 does not use the missing values directly when computing information gain: the gain is computed from the cases whose value is known and scaled by the fraction of known cases, and cases with missing values are passed down the branches with fractional weights.
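To make the surrogate-split idea in item 1 concrete, here is a minimal NumPy sketch on made-up data, with a hard-coded primary split. The helper `best_surrogate_threshold` is my own illustrative function, not how CART actually implements surrogates: given the primary split, it searches the other features for the threshold whose left/right assignment agrees best with the primary split, and that surrogate would then be used to route rows whose primary feature is missing.

```python
import numpy as np

def best_surrogate_threshold(feature, go_left):
    """Find the threshold on `feature` whose split best reproduces the
    left/right assignment (`go_left`) of the primary split, judged by
    agreement rate (both split directions are tried)."""
    known = ~np.isnan(feature)
    x, y = feature[known], go_left[known]
    best_thr, best_agree = None, -np.inf
    for thr in np.unique(x):
        left = x <= thr
        agree = max(np.mean(left == y), np.mean(left != y))
        if agree > best_agree:
            best_thr, best_agree = thr, agree
    return best_thr, best_agree

# Toy data: the primary split is "x0 <= 0.5", and some x0 values are missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # x1 is strongly correlated with x0
X[rng.random(200) < 0.2, 0] = np.nan             # knock out 20% of x0

known_x0 = ~np.isnan(X[:, 0])
go_left = X[known_x0, 0] <= 0.5                  # outcome of the primary split

for j in (1, 2):
    thr, agree = best_surrogate_threshold(X[known_x0, j], go_left)
    print(f"surrogate on x{j}: threshold={thr:.3f}, agreement={agree:.2f}")
# Rows with x0 missing would be routed by the surrogate with the highest
# agreement (here x1, since it mimics x0 well); x2 stays close to chance.
```

This search is exactly the extra per-node cost the quoted passage complains about: in a forest it has to be repeated for every node of every tree, over candidate variables that may be uncorrelated with the primary split variable.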
II. Alternatively, preprocess the data with imputation:
1. Fill with the column average/median/mode; or, following the original RF approach, fill with a weighted average/median/mode, where the weights are the proximities (similarities) between the point with the missing value and the other data points. See the imputation sketch after this list.
2. Use a more sophisticated algorithm to estimate the missing values, e.g. SVDmiss in R, which alternates between computing an SVD and refilling the missing entries; or missRanger and missForest, which alternate between imputation and fitting a random forest.
3. Another option is "on the fly imputation" (OTFI): missing entries are filled in randomly with values observed at other data points, but the imputed values are not used when evaluating splits. A small sketch of this random draw also follows the list.
4. The paper "Handling missing data in trees: surrogate splits or statistical imputation?" finds that imputation is computationally cheap and performs well.
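To illustrate items 1 and 2, here is a minimal scikit-learn sketch on synthetic data: a plain mean fill, plus an iterative, forest-based fill in the spirit of missForest/missRanger. Those are R packages; `IterativeImputer` with a `RandomForestRegressor` is only a rough Python analogue, and the data, parameters, and error comparison below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:, 3] = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=300)   # x3 depends on x0, x1
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan             # 15% of entries missing

# Item 1: simple column-wise mean imputation.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Item 2 (missForest-style): iteratively regress each feature on the others
# with a random forest and refill the missing entries.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_rf = rf_imputer.fit_transform(X_missing)

# Compare reconstruction error on the entries that were actually missing;
# the forest-based fill can exploit the x0, x1 -> x3 relationship.
mask = np.isnan(X_missing)
print("mean imputation RMSE:", np.sqrt(np.mean((X_mean[mask] - X[mask]) ** 2)))
print("RF-iterative RMSE:   ", np.sqrt(np.mean((X_rf[mask] - X[mask]) ** 2)))
```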
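And a tiny sketch of the random draw behind OTFI (item 3), again on synthetic data. The standalone function `otfi_style_draw` is illustrative only: in the actual algorithm this draw happens inside each node as the forest grows, and the drawn values are kept out of the split statistics; the sketch only shows what "fill randomly with observed values of the same feature" means.

```python
import numpy as np

def otfi_style_draw(X, rng):
    """Replace every missing entry with a value drawn at random from the
    observed values of the same feature. In the real OTFI procedure the
    draw happens node by node while the forest is grown, and the drawn
    values are used only to route cases, never to evaluate splits."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        if missing.any():
            col[missing] = rng.choice(col[~missing], size=missing.sum())
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X[rng.random(X.shape) < 0.25] = np.nan
print(np.isnan(otfi_style_draw(X, rng)).sum())   # 0: every gap is filled
```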
My take is that imputation is usually the better choice: imputation and training are two independent steps, the imputed data set is stable and easier to analyze, and training is computationally cheaper. For a single decision tree you can get away without imputation, but a random forest needs it. A minimal imputation-then-train sketch follows.
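A minimal end-to-end sketch of that workflow, assuming scikit-learn and synthetic data (the data set and parameter choices are arbitrary): imputation and forest training are kept as two explicit pipeline steps, so the imputed data can be inspected on its own, and during cross-validation the imputer is re-fit only on the training folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic classification data with 10% of the values knocked out at random.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Step 1: impute; step 2: train the forest. Wrapping both in a pipeline keeps
# them separate and avoids leaking statistics from the validation folds into
# the imputer during cross-validation.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy:", scores.mean().round(3))
```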