LightGBM: A Highly Efficient Gradient Boosting Decision Tree
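Abstract: Gradient boosting decision tree (GBDT) is a popular machine learning algorithm with quite a few effective implementations, such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, their efficiency and scalability are still unsatisfactory when the feature dimension is high and the data size is large. A major reason is that, for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time-consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of the data instances with small gradients and use only the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain a quite accurate estimate of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., features that rarely take nonzero values simultaneously) to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve a quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.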
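1. Introduction
Gradient boosting decision tree (GBDT) [1] is a widely used machine learning algorithm, owing to its efficiency, accuracy, and interpretability. GBDT achieves state-of-the-art performance in many machine learning tasks, such as multi-class classification [2], click prediction [3], and learning to rank [4]. In recent years, with the emergence of big data (in terms of both the number of features and the number of instances), GBDT has been facing new challenges, especially in the trade-off between accuracy and efficiency. Conventional implementations of GBDT need, for every feature, to scan all the data instances to estimate the information gain of all possible split points. Their computational complexity is therefore proportional to both the number of features and the number of instances, which makes these implementations very time-consuming on big data.
To tackle this challenge, a straightforward idea is to reduce the number of data instances and the number of features. However, this turns out to be highly non-trivial. For example, it is unclear how to perform data sampling for GBDT. While some works propose sampling data according to their weights to speed up the training process of boosting [5, 6, 7], they cannot be directly applied to GBDT, since there are no sample weights in GBDT at all. In this paper, we propose two novel techniques towards this goal, as elaborated below.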

Gradient-based One-Side Sampling (GOSS). While there is no native weight for data instances in GBDT, we notice that data instances with different gradients play different roles in the computation of information gain. In particular, according to the definition of information gain, instances with larger gradients (i.e., under-trained instances) contribute more to the information gain. Therefore, when downsampling the data instances, in order to retain the accuracy of the information gain estimate, we should keep the instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles) and randomly drop only instances with small gradients. We prove that such a treatment can lead to a more accurate gain estimate than uniformly random sampling at the same target sampling rate, especially when the value of information gain has a large range.
Exclusive Feature Bundling (EFB). In real applications, although there are usually a large number of features, the feature space is quite sparse, which makes it possible to design a nearly lossless approach to reduce the number of effective features. Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. Examples include one-hot features (e.g., one-hot word representations in text mining). We can safely bundle such exclusive features. To this end, we design an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (taking features as vertices and adding an edge between every two features that are not mutually exclusive), and solving it with a greedy algorithm that has a constant approximation ratio.
We call the new GBDT algorithm with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM can accelerate the training process by over 20 times while achieving almost the same accuracy.
The remainder of this paper is organized as follows. First, we review GBDT algorithms and related work in Sec. 2. Then, we introduce the details of GOSS in Sec. 3 and EFB in Sec. 4. Our experiments with LightGBM on public datasets are presented in Sec. 5. Finally, we conclude the paper in Sec. 6.

2. Preliminaries
2.1 GBDT and Its Complexity Analysis
GBDT is an ensemble model of decision trees, which are trained in sequence [1]. In each iteration, GBDT learns a decision tree by fitting the negative gradients (also known as residual errors).
The main cost in GBDT lies in learning the decision trees, and the most time-consuming part of learning a decision tree is finding the best split points. One of the most popular algorithms for finding split points is the pre-sorted algorithm [8, 9], which enumerates all possible split points on the pre-sorted feature values. This algorithm is simple and can find the optimal split points; however, it is inefficient in both training speed and memory consumption. Another popular algorithm is the histogram-based algorithm [10, 11, 12], as shown in Alg. 1. Instead of finding the split points on the sorted feature values, the histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Since the histogram-based algorithm is more efficient in both memory consumption and training speed, we develop our work on its basis.
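The bucketing step can be illustrated with a minimal sketch. The equal-width binning rule and the helper names `make_bin_edges` and `to_bins` are assumptions made for this example; the text does not say how bin boundaries are chosen.

```python
from bisect import bisect_right


def make_bin_edges(values, num_bins):
    """Equal-width upper edges between min and max (an assumed binning rule)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + width * k for k in range(1, num_bins)]


def to_bins(values, edges):
    """Map each continuous value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


values = [0.0, 0.5, 2.5, 4.9, 7.5, 10.0]
edges = make_bin_edges(values, num_bins=4)
bins = to_bins(values, edges)
```

After this step, training only ever sees the small integer bin indices, which is what makes histogram accumulation cheap in both time and memory.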
As shown in Alg. 1, the histogram-based algorithm finds the best split points based on the feature histograms. It costs O(#data×#feature) for histogram building and O(#bin×#feature) for split point finding. Since #bin is usually much smaller than #data, histogram building dominates the computational complexity. If we can reduce #data or #feature, we will be able to substantially speed up the training of GBDT.
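The two costs can be seen in a small sketch for a single feature. This is an illustration under stated assumptions, not the paper's Alg. 1: the split score (squared gradient sum over count, added for both sides) is an assumed variance-style criterion, and the function names are hypothetical.

```python
def build_histogram(bins, grads, num_bins):
    """Accumulate gradient sums and counts per bin for one feature: O(#data)."""
    grad_sum = [0.0] * num_bins
    count = [0] * num_bins
    for b, g in zip(bins, grads):
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan the bin boundaries for the best split: O(#bin).

    The score S_L^2 / n_L + S_R^2 / n_R is an assumed criterion for this sketch.
    """
    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_bin = float("-inf"), None
    left_g, left_n = 0.0, 0
    for b in range(len(grad_sum) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue  # a split must leave instances on both sides
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain


bins = [0, 0, 1, 1, 2, 2]
grads = [-1.0, -1.2, -0.9, 1.0, 1.1, 0.8]
hist = build_histogram(bins, grads, num_bins=3)
split_bin, gain = best_split(*hist)
```

The first function touches every instance once, while the second only loops over bins, so repeating both for every feature gives the O(#data×#feature) and O(#bin×#feature) terms quoted above.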

2.2 Related Work
There have been quite a few implementations of GBDT in the literature, including XGBoost [13], pGBRT [14], scikit-learn [15], and gbm in R [16]. Scikit-learn and gbm in R implement the pre-sorted algorithm, and pGBRT implements the histogram-based algorithm. XGBoost supports both the pre-sorted algorithm and the histogram-based algorithm. As shown in [13], XGBoost outperforms the other tools, so we use XGBoost as our baseline in the experiment section.

To reduce the size of the training data, a common approach is to downsample the data instances. For example, in [5], data instances are filtered out if their weights are smaller than a fixed threshold. SGB [20] uses a random subset to train the weak learners in every iteration. In [6], the sampling ratio is dynamically adjusted during training. However, all these works except SGB [20] are based on AdaBoost [21] and cannot be directly applied to GBDT, since there are no native weights for data instances in GBDT. Though SGB can be applied to GBDT, it usually hurts accuracy and thus is not a desirable choice.

Similarly, to reduce the number of features, it is natural to filter out weak features [22, 23, 7, 24]. This is usually done by principal component analysis or projection pursuit. However, these approaches rely heavily on the assumption that features contain significant redundancy, which might not always be true in practice (features are usually designed with their own unique contributions, and removing any of them may affect the training accuracy to some degree).

The large-scale datasets used in real applications are usually quite sparse. GBDT with the pre-sorted algorithm can reduce the training cost by ignoring the features with zero values [13]. However, GBDT with the histogram-based algorithm does not have an efficient sparse optimization. The reason is that the histogram-based algorithm needs to retrieve feature bin values (refer to Alg. 1) for each data instance, regardless of whether the feature value is zero. It is therefore highly desirable for GBDT with the histogram-based algorithm to be able to exploit this sparsity effectively.

To address the limitations of previous works, we propose two novel techniques called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). More details are introduced in the next sections.
3. Gradient-based One-Side Sampling
In this section, we propose a novel sampling method for GBDT that achieves a good balance between reducing the number of data instances and keeping the accuracy of the learned decision trees.

3.1 Algorithm Description
In AdaBoost, the sample weight serves as a good indicator of the importance of data instances. However, in GBDT there are no native sample weights, and thus the sampling methods proposed for AdaBoost cannot be directly applied. Fortunately, we notice that the gradient of each data instance in GBDT provides useful information for data sampling. That is, if an instance is associated with a small gradient, the training error for this instance is small and it is already well trained. A straightforward idea is to discard the data instances with small gradients. However, doing so changes the data distribution, which hurts the accuracy of the learned model. To avoid this problem, we propose a new method called Gradient-based One-Side Sampling (GOSS).

GOSS keeps all the instances with large gradients and performs random sampling on the instances with small gradients. To compensate for the effect on the data distribution, GOSS introduces a constant multiplier for the data instances with small gradients when computing the information gain (see Alg. 2). Specifically, GOSS first sorts the data instances by the absolute value of their gradients and selects the top a×100% instances. Then it randomly samples b×100% instances from the rest of the data. After that, GOSS amplifies the sampled data with small gradients by a constant $\frac{1-a}{b}$ when calculating the information gain. By doing so, we put more focus on the under-trained instances without changing the original data distribution by much.
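The sampling step described above can be sketched as follows. The function name `goss_sample` and the fixed random seed are assumptions for this example. The sketch also assumes that b is a fraction of the full dataset, so that the factor (1 − a)/b restores the total weight of the small-gradient part.

```python
import random


def goss_sample(grads, a, b, seed=0):
    """Return (indices, weights) chosen by a GOSS-style sampling step.

    a: fraction of instances kept for having the largest |gradient|.
    b: fraction of all instances drawn at random from the remainder.
    """
    n = len(grads)
    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    top_n = int(a * n)
    rand_n = int(b * n)
    top = order[:top_n]
    rest = order[top_n:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(rand_n, len(rest)))
    # Small-gradient samples are amplified so they stand in for the whole remainder.
    amplify = (1.0 - a) / b
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


grads = [0.01 * i for i in range(100)]
idx, w = goss_sample(grads, a=0.2, b=0.1)
```

With 100 instances, a = 0.2 and b = 0.1, the sketch keeps the 20 largest-gradient instances at weight 1 and 10 random others at weight 8, so the weights sum to 20 + 10 × 8 = 100, the original data size.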

3.2 Theoretical Analysis
GBDT uses decision trees to learn a function from the input space $\mathcal{X}^s$ to the gradient space $\mathcal{G}$ [1]. Suppose that we have a training set with n i.i.d. instances $\{x_1, \cdots, x_n\}$, where each $x_i$ is a vector with dimension s in the space $\mathcal{X}^s$. In each iteration of gradient boosting, the negative gradients of the loss function with respect to the output of the model are denoted as $\{g_1, \cdots, g_n\}$. The decision tree model splits each node at the most informative feature (the one with the largest information gain). For GBDT, the information gain is usually measured by the variance after splitting, which is defined as below.
4. Exclusive Feature Bundling
In this section, we propose a novel method to effectively reduce the number of features.

High-dimensional data are usually very sparse. The sparsity of the feature space makes it possible to design a nearly lossless approach to reduce the number of features. Specifically, in a sparse feature space, many features are mutually exclusive, i.e., they never take nonzero values simultaneously. We can safely bundle exclusive features into a single feature (which we call an exclusive feature bundle). With a carefully designed feature scanning algorithm, we can build the same feature histograms from the feature bundles as from the individual features. In this way, the complexity of histogram building changes from O(#data×#feature) to O(#data×#bundle), where #bundle << #feature. We can then significantly speed up the training of GBDT without hurting the accuracy. In the following, we show how to achieve this in detail. There are two issues to be addressed. The first is to determine which features should be bundled together. The second is how to construct the bundle.

Theorem 4.1 The problem of partitioning features into the smallest number of exclusive bundles is NP-hard.

Proof: We reduce the graph coloring problem [25] to our problem. Since the graph coloring problem is NP-hard, our conclusion follows.

Given any instance G = (V, E) of the graph coloring problem, we construct an instance of our problem as follows. Take each row of the incidence matrix of G as a feature, which gives an instance of our problem with |V| features. It is easy to see that an exclusive bundle of features in our problem corresponds to a set of vertices with the same color, and vice versa.
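The correspondence used in the proof can be checked on a small graph. The sketch below assumes the usual convention of one incidence-matrix row per vertex and one column per edge; the helper names `incidence_rows` and `exclusive` are hypothetical.

```python
def incidence_rows(num_vertices, edges):
    """One 0/1 row per vertex, one column per edge."""
    rows = [[0] * len(edges) for _ in range(num_vertices)]
    for col, (u, v) in enumerate(edges):
        rows[u][col] = 1
        rows[v][col] = 1
    return rows


def exclusive(row_a, row_b):
    """True when the two features are never nonzero in the same column."""
    return all(not (x and y) for x, y in zip(row_a, row_b))


# Path graph 0 - 1 - 2: vertices 0 and 2 are not adjacent.
edges = [(0, 1), (1, 2)]
features = incidence_rows(3, edges)
```

Two rows share a nonzero column exactly when their vertices are joined by an edge, so the features of vertices 0 and 2 are exclusive and can share a bundle (one color), while vertex 1 needs a bundle of its own.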

For the first issue, we prove in Theorem 4.1 that finding the optimal bundling strategy is NP-hard, which indicates that an exact solution cannot be found in polynomial time (unless P = NP). To find a good approximation algorithm, we first reduce the optimal bundling problem to the graph coloring problem by taking features as vertices and adding an edge between every two features that are not mutually exclusive. We then use a greedy algorithm for graph coloring, which can produce reasonably good results (with a constant approximation ratio), to produce the bundles. Furthermore, we notice that there are usually quite a few features that, although not 100% mutually exclusive, also rarely take nonzero values simultaneously. If our algorithm allows a small fraction of conflicts, we can get an even smaller number of feature bundles and further improve the computational efficiency. By simple calculation, randomly polluting a small fraction of feature values affects the training accuracy by at most $O([(1-\gamma)n]^{-2/3})$ (see Proposition 2.1 in the supplementary materials), where γ is the maximal conflict rate in each bundle. So, if we choose a relatively small γ, we can achieve a good balance between accuracy and efficiency.

Based on the above discussions, we design an algorithm for exclusive feature bundling, as shown in Alg. 3. First, we construct a graph with weighted edges, whose weights correspond to the total conflicts between features. Second, we sort the features by their degrees in the graph in descending order. Finally, we check each feature in the ordered list, and either assign it to an existing bundle with a small conflict (controlled by γ) or create a new bundle. The time complexity of Alg. 3 is $O(\#feature^2)$, and it is processed only once before training. This complexity is acceptable when the number of features is not very large, but may still suffer if there are millions of features. To further improve the efficiency, we propose a more efficient ordering strategy that does not build the graph: ordering by the count of nonzero values, which is similar to ordering by degree, since more nonzero values usually lead to a higher probability of conflicts. Since we only alter the ordering strategy in Alg. 3, the details of the new algorithm are omitted to avoid duplication.
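The greedy procedure can be sketched roughly as below, using the graph-free ordering by nonzero count. This is an illustration under assumptions rather than a transcription of Alg. 3: the name `greedy_bundle` is hypothetical, and `max_conflicts` is an absolute per-bundle budget standing in for γ times the number of instances.

```python
def greedy_bundle(features, max_conflicts):
    """Group feature columns into bundles with a bounded number of conflicts.

    features: list of columns, each a list with one value per data instance.
    max_conflicts: per-bundle budget of rows where a newly added feature is
        nonzero on a row the bundle already occupies.
    """
    nonzero = [{r for r, v in enumerate(col) if v != 0} for col in features]
    # Visit features with many nonzero values first (stable for ties).
    order = sorted(range(len(features)), key=lambda f: len(nonzero[f]), reverse=True)

    bundles = []    # feature indices per bundle
    used_rows = []  # rows already holding a nonzero value, per bundle
    conflicts = []  # conflicts accumulated so far, per bundle
    for f in order:
        for k in range(len(bundles)):
            cnt = len(nonzero[f] & used_rows[k])
            if conflicts[k] + cnt <= max_conflicts:
                bundles[k].append(f)
                used_rows[k] |= nonzero[f]
                conflicts[k] += cnt
                break
        else:
            # No existing bundle can take this feature: open a new one.
            bundles.append([f])
            used_rows.append(set(nonzero[f]))
            conflicts.append(0)
    return bundles


features = [
    [1, 0, 0, 2, 0, 0],
    [0, 3, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 5],
]
bundles = greedy_bundle(features, max_conflicts=0)
```

With a budget of zero conflicts, the densest feature (index 2) opens the first bundle and is later joined by feature 3, while features 0 and 1 never overlap and share the second bundle, so four features collapse into two.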

For the second issue, we need a good way of merging the features in the same bundle in order to reduce the corresponding training complexity. The key is to ensure that the values of the original features can be identified from the feature bundles. Since the histogram-based algorithm stores discrete bins instead of continuous feature values, we can construct a feature bundle by letting exclusive features reside in different bins. This can be done by adding offsets to the original values of the features. For example, suppose we have two features in a feature bundle. Originally, feature A takes values from [0, 10) and feature B takes values from [0, 20). We then add an offset of 10 to the values of feature B so that the refined feature takes values from [10, 30). After that, it is safe to merge features A and B, and use a feature bundle with range [0, 30) to replace the original features A and B. The detailed algorithm is shown in Alg. 4.
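The offset trick from the example can be sketched as follows. The function name `merge_bundle` is hypothetical, and the sketch assumes that bin 0 is the shared "zero" bin of every feature, so only nonzero bins are shifted.

```python
def merge_bundle(columns, num_bins):
    """Merge exclusive binned features into one column by offsetting bin indices.

    columns: binned feature columns of equal length; bin 0 means "zero".
    num_bins: number of bins of each feature, used to compute the offsets.
    """
    offsets = [0]
    for nb in num_bins[:-1]:
        offsets.append(offsets[-1] + nb)
    merged = [0] * len(columns[0])
    for col, off in zip(columns, offsets):
        for i, b in enumerate(col):
            if b != 0:
                merged[i] = b + off  # on a conflict, the later feature wins
    return merged, offsets


feature_a = [3, 0, 0, 9]  # bins in [0, 10)
feature_b = [0, 5, 0, 0]  # bins in [0, 20)
merged, offsets = merge_bundle([feature_a, feature_b], num_bins=[10, 20])
```

In the merged column, a value below 10 can only have come from feature A and a value of 10 or more from feature B (subtract the offset to recover it), which is why the original features stay identifiable.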

The EFB algorithm can bundle many exclusive features into far fewer dense features, which effectively avoids unnecessary computation for zero feature values. Actually, we can also optimize the basic histogram-based algorithm to ignore zero feature values by using a table for each feature to record the data with nonzero values. By scanning the data in this table, the cost of histogram building for a feature changes from O(#data) to O(#non_zero_data). However, this method needs additional memory and computation cost to maintain these per-feature tables throughout the whole tree growth process. We implement this optimization in LightGBM as a basic function. Note that this optimization does not conflict with EFB, since we can still use it when the bundles are sparse.
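A per-feature nonzero table might look like the sketch below. The helper names are hypothetical, and filling the zero bin by subtraction from the node's total gradient is an assumption of this example, not something stated in the text.

```python
def build_nonzero_table(bins):
    """Record (row, bin) pairs for the rows whose bin is not the zero bin."""
    return [(i, b) for i, b in enumerate(bins) if b != 0]


def sparse_histogram(table, grads, num_bins, total_grad):
    """Build a gradient histogram by touching only the nonzero rows.

    Cost is O(#non_zero_data); the zero bin is derived from the node total.
    """
    hist = [0.0] * num_bins
    for i, b in table:
        hist[b] += grads[i]
    hist[0] = total_grad - sum(hist[1:])
    return hist


bins = [0, 0, 2, 0, 1, 0, 0, 2]
grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
table = build_nonzero_table(bins)
hist = sparse_histogram(table, grads, num_bins=3, total_grad=sum(grads))
```

Only three of the eight rows are visited here, yet the histogram matches what a dense scan would produce. The table itself has to be kept in sync as nodes are split, which is the extra memory and computation cost mentioned above.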

5. Experiments
6. Conclusion
7. References
[1] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
[2] Ping Li. Robust logitboost and adaptive base class (abc) logitboost. arXiv preprint arXiv:1203.3491, 2012.
[3] Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pages 521–530. ACM, 2007.
[4] Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11(23-581):81, 2010.
[5] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000.
[6] Charles Dubout and François Fleuret. Boosting with maximum adaptive sampling. In Advances in Neural Information Processing Systems, pages 1332–1340, 2011.
[7] Ron Appel, Thomas J Fuchs, Piotr Dollár, and Pietro Perona. Quickly boosting decision trees-pruning underachieving features early. In ICML (3), pages 594–602, 2013.
[8] Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. Sliq: A fast scalable classifier for data mining. In International Conference on Extending Database Technology, pages 18–32. Springer, 1996.
[9] John Shafer, Rakesh Agrawal, and Manish Mehta. Sprint: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544–555. Citeseer, 1996.
[10] Sanjay Ranka and V Singh. Clouds: A decision tree classifier for large datasets. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference, pages 2–8, 1998.
[11] Ruoming Jin and Gagan Agrawal. Communication and memory efficient parallel decision tree construction. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 119–129. SIAM, 2003.
[12] Ping Li, Christopher JC Burges, Qiang Wu, JC Platt, D Koller, Y Singer, and S Roweis. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, volume 7, pages 845–852, 2007.
[13] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
[14] Stephen Tyree, Kilian Q Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th international conference on World wide web, pages 387–396. ACM, 2011.
[15] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
[16] Greg Ridgeway. Generalized boosted models: A guide to the gbm package. Update, 1(1):2007, 2007.
[17] Huan Zhang, Si Si, and Cho-Jui Hsieh. Gpu-acceleration for large-scale tree boosting. arXiv preprint arXiv:1706.08359, 2017.
[18] Rory Mitchell and Eibe Frank. Accelerating the xgboost algorithm using gpu computing. PeerJ Preprints, 5:e2911v1, 2017.
[19] Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, and Tieyan Liu. A communication-efficient parallel algorithm for decision tree. In Advances in Neural Information Processing Systems, pages 1271–1279, 2016.
[20] Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
[21] Michael Collins, Robert E Schapire, and Yoram Singer. Logistic regression, adaboost and bregman distances. Machine Learning, 48(1-3):253–285, 2002.
[22] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.
[23] Luis O Jimenez and David A Landgrebe. Hyperspectral data analysis and supervised feature reduction via projection pursuit. IEEE Transactions on Geoscience and Remote Sensing, 37(6):2653–2667, 1999.
[24] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.
[25] Tommy R Jensen and Bjarne Toft. Graph coloring problems, volume 39. John Wiley & Sons, 2011.
[26] Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597, 2013.
[27] Allstate claim data, https://www.kaggle.com/c/ClaimPredictionChallenge.
[28] Flight delay data, https://github.com/szilard/benchm-ml#data.
[29] Hsiang-Fu Yu, Hung-Yi Lo, Hsun-Ping Hsieh, Jing-Kai Lou, Todd G McKenzie, Jung-Wei Chou, Po-Han Chung, Chia-Hua Ho, Chun-Fu Chang, Yin-Hsuan Wei, et al. Feature engineering and classifier ensemble for kdd cup 2010. In KDD Cup, 2010.
[30] Kuan-Wei Wu, Chun-Sung Ferng, Chia-Hua Ho, An-Chun Liang, Chun-Heng Huang, Wei-Yuan Shen, Jyun-Yu Jiang, Ming-Hao Yang, Ting-Wei Lin, Ching-Pei Lee, et al. A two-stage ensemble of diverse models for advertisement ranking in kdd cup 2012. In KDD Cup, 2012.
[31] Libsvm binary classification data, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
[32] Haijian Shi. Best-first decision tree learning. PhD thesis, The University of Waikato, 2007.