[Paper translation] Scikit-learn: Machine Learning in Python


Original paper: https://arxiv.org/pdf/1201.0490v4


Scikit-learn: Machine Learning in Python

Abstract

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.org.

Keywords: Python, supervised learning, unsupervised learning, model selection

1. Introduction

The Python programming language is establishing itself as one of the most popular languages for scientific computing. Thanks to its high-level interactive nature and its maturing ecosystem of scientific libraries, it is an appealing choice for algorithmic development and exploratory data analysis (Dubois, 2007; Milmann and Avaizis, 2011). Yet, as a general-purpose language, it is increasingly used not only in academic settings but also in industry.

Scikit-learn harnesses this rich environment to provide state-of-the-art implementations of many well-known machine learning algorithms, while maintaining an easy-to-use interface tightly integrated with the Python language. This answers the growing need for statistical data analysis by non-specialists in the software and web industries, as well as in fields outside of computer science, such as biology or physics. Scikit-learn differs from other machine learning toolboxes in Python for various reasons: i) it is distributed under the BSD license; ii) it incorporates compiled code for efficiency, unlike MDP (Zito et al., 2008) and pybrain (Schaul et al., 2010); iii) it depends only on numpy and scipy to facilitate easy distribution, unlike pymvpa (Hanke et al., 2009), which has optional dependencies such as R and shogun; and iv) it focuses on imperative programming, unlike pybrain, which uses a data-flow framework. While the package is mostly written in Python, it incorporates the C++ libraries LibSVM (Chang and Lin, 2001) and LibLinear (Fan et al., 2008), which provide reference implementations of SVMs and generalized linear models with compatible licenses. Binary packages are available on a rich set of platforms, including Windows and any POSIX platform. Furthermore, thanks to its liberal license, it has been widely distributed as part of major free software distributions such as Ubuntu, Debian, Mandriva, NetBSD and Macports, and in commercial distributions such as the “Enthought Python Distribution”.

2. Project Vision

Code quality. Rather than providing as many features as possible, the project’s goal has been to provide solid implementations. Code quality is ensured with unit tests (as of release 0.8, test coverage is 81%) and the use of static analysis tools such as pyflakes and pep8. Finally, we strive to use consistent naming for the functions and parameters used throughout, with strict adherence to the Python coding guidelines and numpy-style documentation.

BSD licensing. Most of the Python ecosystem is licensed with non-copyleft licenses. While such a policy is beneficial for adoption of these tools by commercial projects, it does impose some restrictions: we are unable to use some existing scientific code, such as the GSL.

Bare-bone design and API. To lower the barrier of entry, we avoid framework code and keep the number of different objects to a minimum, relying on numpy arrays for data containers.

Community-driven development. We base our development on collaborative tools such as git, github and public mailing lists. External contributions are welcome and encouraged.

Documentation. Scikit-learn provides a $\sim300$-page user guide including narrative documentation, class references, a tutorial, installation instructions, as well as more than 60 examples, some featuring real-world applications. We try to minimize the use of machine-learning jargon, while maintaining precision with regard to the algorithms employed.

3. Underlying Technologies

Numpy: the base data structure used for data and model parameters. Input data is presented as numpy arrays, thus integrating seamlessly with other scientific Python libraries. Numpy’s view-based memory model limits copies, even when binding with compiled code (Van der Walt et al., 2011). It also provides basic arithmetic operations.

Scipy: efficient algorithms for linear algebra, sparse matrix representation, special functions and basic statistical functions. Scipy has bindings for many Fortran-based standard numerical packages, such as LAPACK. This is important for ease of installation and portability, as providing libraries around Fortran code can prove challenging on various platforms.

Cython: a language for combining C in Python. Cython makes it easy to reach the performance of compiled languages with Python-like syntax and high-level operations. It is also used to bind compiled libraries, eliminating the boilerplate code of Python/C extensions.
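
The view-based memory model mentioned above can be illustrated with a small sketch. It is not taken from the paper; the array contents and variable names are made up for illustration, and only standard numpy calls are used.

```python
import numpy as np

# A small 2-D array standing in for an input data matrix (samples x features).
X = np.arange(12, dtype=np.float64).reshape(4, 3)

# Basic slicing returns views: no data is copied, only offsets and strides change.
first_column = X[:, 0]
every_other_row = X[::2]
views_share_memory = (
    np.shares_memory(X, first_column) and np.shares_memory(X, every_other_row)
)

# Writing through a view is visible in the parent array.
first_column[0] = -1.0
write_visible_in_parent = bool(X[0, 0] == -1.0)

# Integer-array ("fancy") indexing, by contrast, has to allocate a copy.
picked_rows = X[[0, 2]]
fancy_index_copies = not np.shares_memory(X, picked_rows)

print(views_share_memory, write_visible_in_parent, fancy_index_copies)
```

Because slices are views, a column or a subset of rows can be handed to other code without duplicating the underlying buffer, which is the property the paragraph above relies on.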

4. Code Design

| | scikit-learn | mlpy | pybrain | pymvpa | mdp | shogun |
|---|---|---|---|---|---|---|
| Support Vector Classification | 5.2 | 9.47 | 17.5 | 11.52 | 40.48 | 5.63 |
| Lasso (LARS) | 1.17 | 105.3 | - | 37.35 | - | - |
| Elastic Net | 0.52 | 73.7 | - | 1.44 | - | - |
| k-Nearest Neighbors | 0.57 | 1.41 | - | 0.56 | 0.58 | 1.36 |
| PCA (9 components) | 0.18 | - | - | 8.93 | 0.47 | 0.33 |
| k-Means (9 clusters) | 1.34 | 0.79 | ⋆ | - | 35.75 | 0.68 |
| License | BSD | GPL | BSD | BSD | BSD | GPL |

-: Not implemented. ⋆: Does not converge within 1 hour.

Table 1: Time in seconds on the Madelon data set for various machine learning libraries exposed in Python: MLPy (Albanese et al., 2008), PyBrain (Schaul et al., 2010), pymvpa (Hanke et al., 2009), MDP (Zito et al., 2008) and Shogun (Sonnenburg et al., 2010). For more benchmarks see http://github.com/scikit-learn.

Objects specified by interface, not by inheritance. To facilitate the use of external objects with scikit-learn, inheritance is not enforced; instead, code conventions provide a consistent interface. The central object is an estimator, which implements a fit method, accepting as arguments an input data array and, optionally, an array of labels for supervised problems. Supervised estimators, such as SVM classifiers, can implement a predict method. Some estimators, which we call transformers (for example, PCA), implement a transform method, returning modified input data. Estimators may also provide a score method, which is an increasing evaluation of goodness of fit: a log-likelihood, or a negated loss function. The other important object is the cross-validation iterator, which provides pairs of train and test indices to split input data, for example K-fold, leave-one-out, or stratified cross-validation.

Model selection. Scikit-learn can evaluate an estimator’s performance or select parameters using cross-validation, optionally distributing the computation to several cores. This is accomplished by wrapping an estimator in a GridSearchCV object, where the “CV” stands for “cross-validated”. During the call to fit, it selects the parameters on a specified parameter grid, maximizing a score (the score method of the underlying estimator). predict, score, or transform are then delegated to the tuned estimator. This object can therefore be used transparently as any other estimator. Cross-validation can be made more efficient for certain estimators by exploiting specific properties, such as warm restarts or regularization paths (Friedman et al., 2010). This is supported through special objects, such as the LassoCV. Finally, a Pipeline object can combine several transformers and an estimator to create a combined estimator to, for example, apply dimension reduction before fitting. It behaves as a standard estimator, and GridSearchCV can therefore tune the parameters of all steps.
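
The estimator, cross-validation iterator, Pipeline, and GridSearchCV conventions described in this section can be sketched as follows. This is a minimal illustration, not code from the paper: it assumes a recent scikit-learn release in which KFold and GridSearchCV are importable from sklearn.model_selection, and the synthetic data, step names ("reduce", "clf"), and parameter grid are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data standing in for a real supervised problem.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Cross-validation iterator: yields (train_indices, test_indices) pairs.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(cv.split(X))
train_idx, test_idx = folds[0]

# A transformer (PCA) followed by a supervised estimator (SVC),
# combined into a single object that itself behaves as an estimator.
pipe = Pipeline([("reduce", PCA()), ("clf", SVC())])

# Parameters of every step are addressed as "<step name>__<parameter>".
param_grid = {
    "reduce__n_components": [2, 5, 10],
    "clf__C": [0.1, 1.0, 10.0],
}

# fit() cross-validates each grid point and refits the best combination;
# predict() and score() are then delegated to that tuned pipeline.
search = GridSearchCV(pipe, param_grid, cv=cv, n_jobs=1)
search.fit(X, y)

predictions = search.predict(X)
print(search.best_params_)
print(search.score(X, y))
```

Setting n_jobs to a value greater than 1 distributes the grid evaluation over several cores, which corresponds to the optional parallelism mentioned in the text.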

5. High-level yet Efficient: Some Trade Offs

While scikit-learn focuses on ease of use, and is mostly written in a high-level language, care has been taken to maximize computational efficiency. In Table 1, we compare computation time for a few algorithms implemented in the major machine learning toolkits accessible in Python. We use the Madelon data set (Guyon et al., 2004), with 4400 instances and 500 attributes. The data set is quite large, but small enough for most algorithms to run.
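
A timing comparison of this kind can be sketched as below. This is not the authors' benchmark script: the data is a synthetic stand-in with the same shape as Madelon (the real data set is not loaded), the helper time_fit is a hypothetical name introduced here, and the measured times will differ from Table 1 depending on hardware and library versions.

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA


def time_fit(estimator, X, y=None):
    """Return the wall-clock seconds spent in estimator.fit (hypothetical helper)."""
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start


# Synthetic stand-in with Madelon's shape: 4400 instances, 500 attributes.
X, y = make_classification(
    n_samples=4400,
    n_features=500,
    n_informative=5,
    n_redundant=15,
    random_state=0,
)

# Two of the unsupervised rows of Table 1, configured as in the table.
timings = {
    "PCA (9 components)": time_fit(PCA(n_components=9), X),
    "k-Means (9 clusters)": time_fit(
        KMeans(n_clusters=9, n_init=1, random_state=0), X
    ),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```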

SVM. While all of the packages compared call libsvm in the background, the performance of scikit-learn can be explained by two factors. First, our bindings avoid memory copies and have up to 40% less overhead than the original libsvm Python bindings. Second, we patch libsvm to improve efficiency on dense data, use a smaller memory footprint, and better use memory alignment and pipelining capabilities of modern processors. This patched version also provides unique features, such as setting weights for individual samples.
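
Per-sample weights can be passed to an SVM classifier as in the following sketch. The toy data is made up for illustration, and the example assumes a current scikit-learn release, where SVC.fit accepts a sample_weight argument that rescales the penalty C for each sample.

```python
import numpy as np
from sklearn.svm import SVC

# Four well-separated 1-D points plus one point (0.5) labelled against its neighbours.
X = np.array([[0.0], [1.0], [3.0], [4.0], [0.5]])
y = np.array([0, 0, 1, 1, 1])

# A weight of zero removes the influence of the suspicious last sample;
# the other samples keep the default penalty.
weights = np.array([1.0, 1.0, 1.0, 1.0, 0.0])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y, sample_weight=weights)

# With the outlier ignored, the boundary sits between 1.0 and 3.0,
# so a query at 0.5 falls on the class-0 side.
prediction_near_outlier = clf.predict([[0.5]])[0]
print(prediction_near_outlier)
```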

LARS. Iteratively refining the residuals instead of recomputing them gives performance gains of 2–10 times over the reference R implementation (Hastie and Efron, 2004). Pymvpa uses this implementation via the Rpy R bindings and pays a heavy price in memory copies.

Elastic Net. We benchmarked the scikit-learn coordinate descent implementation of Elastic Net. It achieves the same order of performance as the highly optimized Fortran version glmnet (Friedman et al., 2010) on medium-scale problems, but performance on very large problems is limited since we do not use the KKT conditions to define an active set.

kNN. The k-nearest neighbors classifier implementation constructs a ball tree (Omohundro, 1989) of the samples, but uses a more efficient brute force search in large dimensions.

PCA. For medium to large data sets, scikit-learn provides an implementation of a truncated PCA based on random projections (Rokhlin et al., 2009).

k-means. scikit-learn’s k-means algorithm is implemented in pure Python. Its performance is limited by the fact that numpy’s array operations take multiple passes over data.
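
The two neighbor-search strategies mentioned for kNN can be compared with a short sketch. It assumes a current scikit-learn release, where KNeighborsClassifier exposes the strategy through its algorithm parameter ("ball_tree" or "brute"); the synthetic data and split sizes are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic low-dimensional data; the first 200 rows train, the rest are queries.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train = X[:200], y[:200]
X_test = X[200:]

# Same classifier, two exact search strategies over the training samples.
tree_knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")
brute_knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
tree_knn.fit(X_train, y_train)
brute_knn.fit(X_train, y_train)

# Both strategies search exactly, so they should agree on the predicted labels;
# they differ only in how much work each query costs.
same_predictions = np.array_equal(
    tree_knn.predict(X_test), brute_knn.predict(X_test)
)
print(same_predictions)
```

The tree pays an up-front construction cost to prune distance computations, which tends to help in low dimensions; in high dimensions pruning becomes ineffective and the brute-force search is usually the faster choice, as the text notes.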

6. Conclusion

Scikit-learn exposes a wide variety of machine learning algorithms, both supervised and unsupervised, using a consistent, task-oriented interface, thus enabling easy comparison of methods for a given application. Since it relies on the scientific Python ecosystem, it can easily be integrated into applications outside the traditional range of statistical data analysis. Importantly, the algorithms, implemented in a high-level language, can be used as building blocks for approaches specific to a use case, for example, in medical imaging (Michel et al., 2011). Future work includes online learning, to scale to large data sets.

References

Pedregosa, Varoquaux, Gramfort et al.
