Decision Tree¶
Inductive bias: prefer smaller trees.
Decision trees

Classification of an instance starts at the root and proceeds down to a leaf node. Each node of the tree tests one attribute of the instance, and each branch descending from a node corresponds to one possible value of that attribute.

Essentially, a decision tree represents a disjunction (over the different root-to-leaf paths) of conjunctions (of the attribute-value tests along a single path from root to leaf). For example, a tree encoding \((\mathrm{Outlook} = \mathrm{Sunny} \wedge \mathrm{Humidity} = \mathrm{Normal}) \vee (\mathrm{Outlook} = \mathrm{Overcast})\) classifies an instance as positive whenever either conjunction holds.
Characteristics of suitable problems

The first three points are fairly obvious.

- Instances are described by attribute-value pairs; attribute values may be discrete, or continuous values split into intervals.
- The target function has discrete output values, e.g. Boolean classification, or classification with more than two output values.
- Disjunctive descriptions may be required.
The following two points deserve further discussion.

- The training data may contain errors.
- The training data may contain instances with missing attribute values.
The ID3 Algorithm¶

ID3 is a greedy algorithm.

The core question is which attribute to test at each node. How do we measure how well an attribute separates the examples? ID3 uses information gain.
Information Gain¶
For a given attribute, the samples can be partitioned into different groups. To measure the quality of such a partition we take inspiration from entropy.
Entropy of a random variable
Assume \(\xi\) is a discrete random variable with distribution

\[
P(\xi = x_k) = p_k, \qquad k = 1, 2, \cdots, K.
\]

Then its entropy is defined by

\[
H(\xi) = -\sum_{k=1}^{K} p_k \log p_k ,
\]

with the convention \(0 \log 0 = 0\).
Note that entropy does not depend directly on the values \(x_k\), but only on their probabilities \(p_k\).

In practice, given a sample set \(S\), let \(K\) denote the number of classes and let \(p_k\) be the proportion of instances belonging to class \(k\) (the maximum-likelihood estimate of the class probability); the formula above then gives the empirical entropy \(H(S)\).
The range of entropy is

\[
0 \leq H(\xi) \leq \log K.
\]

The minimum \(0\) is attained when all instances belong to a single class; the maximum \(\log K\) is attained when the instances are distributed uniformly over the \(K\) classes. The upper bound follows from Jensen's inequality applied to the concave function \(\log\):

\[
H(\xi) = \sum_{k=1}^{K} p_k \log \frac{1}{p_k} \leq \log \sum_{k=1}^{K} p_k \cdot \frac{1}{p_k} = \log K.
\]
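To make the two extremes concrete, here is a minimal sketch of the empirical entropy computation; the function name `entropy` and the choice of a base-2 logarithm are assumptions for illustration, not fixed by the text above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy H(S) of a list of class labels, using the
    maximum-likelihood estimate p_k = |S_k| / |S| and 0 * log 0 = 0."""
    total = len(labels)
    return sum(
        -(count / total) * log2(count / total)
        for count in Counter(labels).values()
    )

print(entropy(["+", "+", "+", "+"]))  # 0.0: all instances in a single class
print(entropy(["+", "+", "-", "-"]))  # 1.0 = log2(2): uniform over K = 2 classes
```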
Conditional Entropy
Assume two random variables \(X, Y\) have a joint distribution

\[
P(X = x_i, Y = y_k) = p_{ik}, \qquad i = 1, \cdots, n, \; k = 1, \cdots, K.
\]

Given \(X\), the entropy of the random variable \(Y\) is

\[
H(Y \mid X) = \sum_{i=1}^{n} P(X = x_i) \, H(Y \mid X = x_i),
\qquad
H(Y \mid X = x_i) = -\sum_{k=1}^{K} P(Y = y_k \mid X = x_i) \log P(Y = y_k \mid X = x_i).
\]

In practice, conditioning on \(X\) means conditioning on an attribute \(A\) that takes \(n\) different values, while the data set still has \(K\) classes. The conditional entropy is the entropy of the class label once the value of that attribute is known.
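Expanding the definition in terms of the joint distribution gives the equivalent form

\[
H(Y \mid X) = -\sum_{i=1}^{n} \sum_{k=1}^{K} P(X = x_i, Y = y_k) \log P(Y = y_k \mid X = x_i),
\]

which is the expected remaining uncertainty about \(Y\) after observing \(X\).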
Information gain
The information gain of attribute \(A\) on data set \(S\) is defined by

\[
\operatorname{Gain}(S, A) = H(S) - H(S \mid A),
\]

where \(H(S \mid A)\) is the empirical conditional entropy, computed as follows.
The data set \(S\) contains \(|S|\) samples, of which \(|S_k|\) belong to class \(k\) \((k = 1, \cdots, K)\).

A given attribute \(A\) takes \(n\) values \(\{x_i\}_{1 \leq i \leq n}\) and partitions \(S\) into \(n\) subsets \(S_{Ai}\) containing \(|S_{Ai}|\) samples each \((i = 1, \cdots, n)\). Letting \(X\) be the value of \(A\) for a sample drawn from \(S\), we have the estimate

\[
P(X = x_i) \approx \frac{|S_{Ai}|}{|S|}, \qquad i = 1, \cdots, n.
\]

Letting \(Y\) be the class of a sample in \(S\), which takes \(K\) values, we likewise have

\[
P(Y = y_k) \approx \frac{|S_k|}{|S|}, \qquad k = 1, \cdots, K.
\]

For \(X = x_i\), let \(S_{ik} = S_{Ai} \cap S_k\); then

\[
P(Y = y_k \mid X = x_i) \approx \frac{|S_{ik}|}{|S_{Ai}|},
\]

so the empirical conditional entropy is

\[
H(S \mid A) = \sum_{i=1}^{n} \frac{|S_{Ai}|}{|S|} H(S_{Ai})
= -\sum_{i=1}^{n} \frac{|S_{Ai}|}{|S|} \sum_{k=1}^{K} \frac{|S_{ik}|}{|S_{Ai}|} \log \frac{|S_{ik}|}{|S_{Ai}|}.
\]
Using an expectation here means that once the values of an attribute partition the sample set, the subset entropies are combined as a weighted average to express the overall reduction in entropy.

Note that the subset entropy \(H(S_{Ai})\) plays the role of a conditional entropy: when the recursion continues, the entropies of the remaining attributes are computed with respect to the subset \(S_{Ai}\).
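A minimal sketch of the \(\operatorname{Gain}(S, A)\) computation above. The representation of a sample as a dict of attribute names to values, and the tiny `Outlook` data set at the end, are assumptions made up purely for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Empirical entropy of a collection of class labels (base-2 log)."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(samples, attribute, labels):
    """Gain(S, A) = H(S) - sum_i |S_Ai|/|S| * H(S_Ai)."""
    # Group the class labels by the value attribute A takes on each sample.
    subsets = defaultdict(list)
    for sample, label in zip(samples, labels):
        subsets[sample[attribute]].append(label)
    conditional = sum(
        (len(part) / len(labels)) * entropy(part) for part in subsets.values()
    )
    return entropy(labels) - conditional

# Made-up toy data: the split on "Outlook" is pure, so the gain equals H(S) = 1.0.
samples = [{"Outlook": "Sunny"}, {"Outlook": "Sunny"},
           {"Outlook": "Overcast"}, {"Outlook": "Rain"}]
labels = ["-", "-", "+", "+"]
print(information_gain(samples, "Outlook", labels))  # 1.0
```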
In practice the weights \(\frac{|S_{Ai}|}{|S|}\) make information gain favor attributes with many values: splitting \(S\) into many small subsets drives each \(H(S_{Ai})\), and hence \(H(S \mid A)\), toward zero, so such attributes look artificially good. This motivates the information gain ratio below.
Information gain ratio
Define the entropy of \(S\) with respect to attribute \(A\)

\[
H_A(S) = -\sum_{i=1}^{n} \frac{|S_{Ai}|}{|S|} \log \frac{|S_{Ai}|}{|S|};
\]

then the information gain ratio is

\[
\operatorname{GainRatio}(S, A) = \frac{\operatorname{Gain}(S, A)}{H_A(S)}.
\]

We do not give a theoretical analysis of this criterion here.
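A small self-contained sketch of the gain ratio: given a gain computed as above and the subset sizes \(|S_{Ai}|\), the split entropy \(H_A(S)\) grows when an attribute fragments the data into many small parts, which damps that attribute's gain. The function names here are assumptions of this sketch.

```python
from math import log2

def split_entropy(subset_sizes):
    """H_A(S): entropy of S with respect to attribute A, given the |S_Ai|."""
    total = sum(subset_sizes)
    return sum(-(s / total) * log2(s / total) for s in subset_sizes)

def gain_ratio(gain, subset_sizes):
    """GainRatio(S, A) = Gain(S, A) / H_A(S)."""
    return gain / split_entropy(subset_sizes)

print(split_entropy([6, 6]))     # 1.0: a balanced two-way split
print(split_entropy([1] * 12))   # log2(12) ~ 3.585: a many-way split is penalized
print(gain_ratio(1.0, [6, 6]))   # 1.0
```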
The algorithm is applied recursively. A leaf label is fixed when the subset of examples is empty or its entropy is \(0\); attributes already tested higher up in the tree are excluded from consideration.
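Below is a minimal recursive sketch of this procedure, under the assumptions that a sample is a dict from attribute names to values and that `attributes` is a set of attribute names; the helper names are mine. Branches are created only for attribute values that actually occur in the current subset, so the empty-subset case mentioned above does not arise here; handling it would mean labelling such a leaf with the parent's majority class.

```python
from collections import Counter, defaultdict
from math import log2

def _entropy(labels):
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def _gain(samples, attribute, labels):
    groups = defaultdict(list)
    for s, y in zip(samples, labels):
        groups[s[attribute]].append(y)
    conditional = sum(len(g) / len(labels) * _entropy(g) for g in groups.values())
    return _entropy(labels) - conditional

def id3(samples, labels, attributes):
    """Grow a tree: each internal node is {"test": A, "branches": {value: subtree}}."""
    if len(set(labels)) == 1:      # entropy 0: pure node becomes a leaf
        return labels[0]
    if not attributes:             # nothing left to test: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: _gain(samples, a, labels))
    node = {"test": best, "branches": {}}
    for value in {s[best] for s in samples}:
        idx = [i for i, s in enumerate(samples) if s[best] == value]
        node["branches"][value] = id3(
            [samples[i] for i in idx],
            [labels[i] for i in idx],
            attributes - {best},   # attributes tested higher up are excluded
        )
    return node
```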
Strengths and weaknesses of ID3

Strengths | Weaknesses |
---|---|
The hypothesis space contains all decision trees, i.e. a complete space of finite discrete-valued functions of the available attributes. | After training, ID3 maintains only a single current hypothesis, unlike the candidate-elimination algorithm over version spaces, which maintains all hypotheses consistent with the training examples. |
Each step uses all of the current training examples and makes statistically based decisions, which reduces sensitivity to errors in individual examples; Find_S and candidate elimination, by contrast, are very sensitive to individual erroneous examples. | The search performs no backtracking, so it can converge to a locally optimal solution rather than a globally optimal one. |
Inductive Bias¶

Approximate inductive bias: shorter trees are preferred over longer ones, and trees that place attributes with high information gain closer to the root are preferred.
ID3 algorithm | Candidate-elimination algorithm |
---|---|
Preference bias (search bias): a preference for certain hypotheses over others, without restricting the set of hypotheses that can be represented. | Restriction bias (language bias): restricts the hypothesis space itself. |
Occam's Razor¶

Prefer the simplest hypothesis that fits the data.

ID3's inductive bias toward shorter decision trees is one basis for arguing that it generalizes.
Common Problem: Overfitting¶

- Noise and coincidental regularities lead to more complex decision trees.
Solutions

- Stopping tree growth early: difficult to do well.
- Post-pruning: grow the tree, then prune it according to some criterion.
- The training and validation set approach

The training set can be misled by noise, random errors, and coincidental regularities, but the validation set is unlikely to show the same fluctuations. This requires the validation set to be large enough; typically one third of the data is reserved for validation.
Post-pruning methods

- Reduced-error pruning (a sketch follows below)
- Rule post-pruning
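Here is a sketch of reduced-error pruning on a tree in the same nested-dict shape as the `id3` sketch above, under the additional assumption that every internal node also stores a `"majority"` key holding the majority training label at that node (the `id3` sketch would have to be extended to record it). A subtree is replaced by that majority leaf whenever the leaf is at least as accurate on the node's share of the validation set.

```python
def classify(tree, sample):
    """Follow the tree to a leaf; unseen attribute values fall back to
    the node's stored majority label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(sample[tree["test"]], tree["majority"])
    return tree

def reduced_error_prune(tree, val_samples, val_labels):
    """Bottom-up reduced-error pruning against a held-out validation set."""
    if not isinstance(tree, dict) or not val_labels:
        return tree
    # Prune the children first, handing each branch the validation
    # examples that reach it.
    for value in list(tree["branches"]):
        idx = [i for i, s in enumerate(val_samples) if s[tree["test"]] == value]
        tree["branches"][value] = reduced_error_prune(
            tree["branches"][value],
            [val_samples[i] for i in idx],
            [val_labels[i] for i in idx],
        )
    # Keep whichever is at least as accurate on this node's validation share:
    # the (already pruned) subtree or a single majority-label leaf.
    subtree_correct = sum(classify(tree, s) == y
                          for s, y in zip(val_samples, val_labels))
    leaf_correct = sum(y == tree["majority"] for y in val_labels)
    return tree["majority"] if leaf_correct >= subtree_correct else tree
```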