Decision Tree¶

归纳偏置: 优先选择较小的树

决策树

根开始，到叶子节点，对实例进行分类。树上的节点对应每个实例的属性的测试，后继分支代表该属性的一个可能值。

本质上，它代表实例属性值的合取(conjuction, 从根到叶子节点)的析取式(disjunction, 一个根节点的不同的可能性)

适用问题特征

下面三点比较显然。

实例由 "属性-对" 组成，属性对应的值可以式离散的，也可以是分区间连续的。
目标函数有离散的输出值，如布尔型分类，或两个以上输出值的分类器。
需要析取的描述

下面两点值得探讨。

实例集可以包含错误？？
实例集可以包含缺少属性值的实例？？

ID3 算法¶

使用贪婪算法。

核心问题是选择每个节点要测试的属性。如何衡量一个属性具有最好的选择能力？这里使用信息增益。

信息增益¶

For a specific characteristic, we could categorize samples into different kinds. We take an inspiration from entropy.

Entropy of a random variable

Assume \(\xi\) is a discrete random variable, with its distribution

\[ P(\xi=x_k)=p_k, (i=1,2,\cdots, K). \]

Then its entropy is defined by

\[ H(\xi)=-\sum_{k=1}^K p_k\log_2 p_k. \]

Note that entropy does not depend direcly on the value of \(x_k\), but its probability \(p_k\).

In practive, usually given a sample set \(S\), we let \( K \) denotes the number of states, and \( p_k \) represents the proportion of instances for a certain classification value (estimated using maximum likelihood), then the above calculation gives empirical entropy.

The range of entropy is

\[ 0\leq H(S)\leq \log_2 K. \]

The maximum value of entropy is estimated using Jensen's inequality. The minimum value is \( 0 \), achieved when all instances belong to a single state, while the maximum is attained when instances are equally distributed among all states.

Conditional Entropy

Assume two random variables \(X,Y\) has a unified distribution

\[ P(X=x_i, Y=y_k)=p_{ik},\quad i=1,\cdots, n; \quad k=1,\cdots, K. \]

Given \(X\), the entropy of random variable \(Y\) is

\[ \begin{align} H(Y\mid X)&=\sum_{i=1}^n P(X=x_i) H(Y\mid X=x_i)\\ &=\sum_{i=1}^n p_i \left(-\sum_{k=1}^K p_{ik}\log_2 p_{ik}\right) \end{align} \]

In practice, given \(X\) means given a characteristic \(A\), and it has \(n\) different options. The data set has still \(K\) categories. The conditional entropy means the entropy under a certain characteristic.

Gain of info

The gain of infomation of characteristic \(A\) on data set \(S\), is defined by

\[ g(D,A)=H(S)-H(S\mid A) \]

here \(H(S\mid A)\) is the empirical conditional entropy. We could calculate it as follows.

Data set \(S\) has \(|S|\) number of samples with \(|S_k|\) for category \(k\), \((k=1,\cdots,K)\).

For a given characteristic \(A\), with \(n\) values or options, namely \(\{x_i\}_{1\leq i\leq n}\). \(A\) partitions \(S\) into another \(n\) parts \(S_{Ai}\), with \(|S_{Ai}|\) number of samples, \((i=1,\cdots, n)\). So let \(X\) to be values of \(A\) for a sample in \(S\), then we have an estimation

\[ P(X=x_i)\approx \frac{|S_{Ai}|}{|S|}, \quad i=1,\cdots,n. \]

Let \(Y\) to be the categories of a sample in \(S\), which has \(K\) values, then

\[ H(Y)\approx H(S)=-\sum_{k=1}\frac{|S_k|}{|S|}\log_2 \frac{|S_k|}{|S|}. \]

For \(X=x_i\), let \(S_{ik}=S_{A_i}\cap S_k\), then

\[ H(Y \mid X)\approx H(S\mid A)=\sum_{i=1}^n \frac{|S_{Ai}|}{|S|} \left(-\sum_{k=1}^K \frac{|S_{ik}|}{|S_{Ai}|}\log_2 \frac{|S_{ik}|}{|S_{Ai}|}\right). \]

In Chinese

这里用期望，表明一个属性的不同值对做出样例集划分后，要加权求平均来表达总体的熵减情况。

\[ g(S,A) = H(S) - \sum_{v\in value(A)}\frac{|S_v|}{|S|}H(S_v) \]

注意后面的熵\(H(S_v)\)是条件熵，基于\(S_v\)来计算后续别的属性的熵.

Usually \(\frac{|S_{Ai}|}{|S|}\) causes the algorithm to prefer the characteristic which has more samples in \(S\).

Information gain ratio

Define the entropy with respect to characteristic \(A\)

\[ H_A(S)=-\sum_{i=1}^n \frac{|S_{Ai}|}{|S|}\log_2 \frac{|S_{Ai}|}{|S|}, \]

then information gain ratio is

\[ g_R(D,A)=\frac{g(D, A)}{H_A(D)}. \]

We currently do not have theoretical analysis for this.

递归调用。当样例个数为0或熵为0时，确定叶子节点的属性。已被树的较高节点测试的属性被排除在外。

ID3 算法的优劣

优势	劣势
假设空间包含所有的决策树，时关于现有属性的有限离散值函数的一个完整空间。	训练结束后，ID3只维护单一的当前假设。不像变形空间的候选消除算法能够维护与训练样例一致的所有假设
每一步都使用当前的所有训练样例，基于统计，能够降低对个别训练样例错误的敏感性，不像Find_S算法/候选消除算法对个别数据的错误非常敏感	搜索不进行回溯，容易收敛到局部最优答案？？

归纳偏置¶

近似的归纳偏置: 较短的树比较长的树优先。信息增益高的属性更靠近根节点的树优先。

ID3算法	候选消除算法
优选偏置，搜索偏置，对于某种假设胜过其他假设的优选，并不考虑最终可列举的假设种类	限定偏置，语言偏置，限定假设的空间

奥坎姆剃刀(Occam's razor)¶

优先选择拟合数据的最简单的假设。

ID3算法优选较短决策树的归纳偏置是泛化的一种依据。

常见问题: 过度拟合¶

噪声、巧合会导致决策树更加复杂。

解决方案

及早停止树增长: 较为困难
后修剪法: 采用一定的准则进行修剪。

||| |:---:|

训练和验证集法

训练集可能因噪声、随即错误和巧合规律性误导，但验证集不太可能表现出同样的波动。这样求验证集要足够大，一般⅓的数据用于验证。

后修剪法

错误率降低修剪
规则后修剪