[Paper Translation] Scikit-learn: Machine Learning in Python


Original paper: https://arxiv.org/pdf/1201.0490v4


Scikit-learn: Machine Learning in Python

Abstract

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.org.

Keywords: Python, supervised learning, unsupervised learning, model selection

1. Introduction

The Python programming language is establishing itself as one of the most popular languages for scientific computing. Thanks to its high-level interactive nature and its maturing ecosystem of scientific libraries, it is an appealing choice for algorithmic development and exploratory data analysis (Dubois, 2007; Millman and Aivazis, 2011). Yet, as a general-purpose language, it is increasingly used not only in academic settings but also in industry.

Scikit-learn harnesses this rich environment to provide state-of-the-art implementations of many well-known machine learning algorithms, while maintaining an easy-to-use interface tightly integrated with the Python language. This answers the growing need for statistical data analysis by non-specialists in the software and web industries, as well as in fields outside of computer science, such as biology or physics. Scikit-learn differs from other machine learning toolboxes in Python for various reasons: i) it is distributed under the BSD license; ii) it incorporates compiled code for efficiency, unlike MDP (Zito et al., 2008) and pybrain (Schaul et al., 2010); iii) it depends only on numpy and scipy to facilitate easy distribution, unlike pymvpa (Hanke et al., 2009), which has optional dependencies such as R and shogun; and iv) it focuses on imperative programming, unlike pybrain, which uses a data-flow framework. While the package is mostly written in Python, it incorporates the C++ libraries LibSVM (Chang and Lin, 2001) and LibLinear (Fan et al., 2008), which provide reference implementations of SVMs and generalized linear models with compatible licenses. Binary packages are available on a rich set of platforms including Windows and any POSIX platform. Furthermore, thanks to its liberal license, it has been widely distributed as part of major free software distributions such as Ubuntu, Debian, Mandriva, NetBSD and Macports, and in commercial distributions such as the “Enthought Python Distribution”.

2. Project Vision

Code quality. Rather than providing as many features as possible, the project’s goal has been to provide solid implementations. Code quality is ensured with unit tests (as of release 0.8, test coverage is 81%) and the use of static analysis tools such as pyflakes and pep8. Finally, we strive to use consistent naming for functions and parameters through strict adherence to the Python coding guidelines and numpy style documentation.

BSD licensing. Most of the Python ecosystem is licensed with non-copyleft licenses. While such a policy is beneficial for adoption of these tools by commercial projects, it does impose some restrictions: we are unable to use some existing scientific code, such as the GSL.

Bare-bone design and API. To lower the barrier of entry, we avoid framework code and keep the number of different objects to a minimum, relying on numpy arrays for data containers.

Community-driven development. We base our development on collaborative tools such as git, github and public mailing lists. External contributions are welcome and encouraged.

Documentation. Scikit-learn provides a $\sim 300$-page user guide including narrative documentation, class references, a tutorial, installation instructions, as well as more than 60 examples, some featuring real-world applications. We try to minimize the use of machine-learning jargon, while maintaining precision with regard to the algorithms employed.
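
To make the conventions above concrete, here is a minimal sketch of a function documented in the numpy docstring style, operating on a plain numpy array and checked by a small unit test. The helper `center_columns` and its test are hypothetical illustrations and are not taken from the scikit-learn code base.

```python
import numpy as np


def center_columns(X):
    """Subtract the column-wise mean from a 2-D array.

    Hypothetical helper (not part of scikit-learn); it only illustrates
    the numpy docstring layout and the use of plain arrays as data
    containers.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Input data, one row per sample.

    Returns
    -------
    X_centered : ndarray of shape (n_samples, n_features)
        New array whose columns each have zero mean.
    """
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)


def test_center_columns():
    # A small unit test of the kind a test runner would collect.
    X = np.array([[1.0, 10.0], [3.0, 30.0]])
    X_centered = center_columns(X)
    assert X_centered.shape == X.shape
    assert np.allclose(X_centered.mean(axis=0), 0.0)


test_center_columns()
```

The data here is an ordinary array with one row per sample, so no dedicated dataset object is needed, which is the point of the bare-bone design described above.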

3. Underlying Technologies

Numpy: the base data structure used for data and model parameters. Input data is presented as numpy arrays, thus integrating seamlessly with other scientific Python libraries. Numpy’s view-based memory model limits copies, even when binding with compiled code (Van der Walt et al., 2011). It also provides basic arithmetic operations.

Scipy: efficient algorithms for linear algebra, sparse matrix representation, special functions and basic statistical functions. Scipy has bindings for many Fortran-based standard numerical packages, such as LAPACK. This is important for ease of installation and portability, as providing libraries around Fortran code can prove challenging on various platforms.

Cython: a language for combining C in Python. Cython makes it easy to reach the performance of compiled languages with Python-like syntax and high-level operations. It is also used to bind compiled libraries, eliminating the boilerplate code of Python/C extensions.
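
The view-based memory model mentioned for Numpy can be illustrated with a short sketch. It assumes only that numpy is installed; the array contents are made up, and the example shows general numpy behavior, not code from scikit-learn.

```python
import numpy as np

# A 2-D array standing in for an input data set (values are made up).
data = np.arange(12, dtype=np.float64).reshape(4, 3)

# Basic slicing returns a view: no data is copied.
first_column = data[:, 0]
assert np.shares_memory(data, first_column)

# Writing through the view changes the original array.
first_column[0] = -1.0
assert data[0, 0] == -1.0

# The transpose is also a view; only shape and strides differ.
transposed = data.T
assert np.shares_memory(data, transposed)
assert transposed.strides == data.strides[::-1]

# Fancy (integer-array) indexing, by contrast, makes a copy.
picked_rows = data[[0, 2]]
assert not np.shares_memory(data, picked_rows)

# A copy is made here because the transposed view is not C-contiguous.
contiguous = np.ascontiguousarray(transposed)
assert contiguous.flags["C_CONTIGUOUS"]
assert not np.shares_memory(data, contiguous)
```

Slices and transposes reuse the same buffer, so passing them between Python code and compiled routines avoids copying unless a particular memory layout is required.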

4. Code Design

| | scikit-learn | mlpy | pybrain | pymvpa | mdp | shogun |
|---|---|---|---|---|---|---|
| Support Vector Classification | 5.2 | 9.47 | 17.5 | 11.52 | 40.48 | 5.63 |
| Lasso (LARS) | 1.17 | 105.3 | - | 37.35 | - | - |
| Elastic Net | 0.52 | 73.7 | - | 1.44 | - | - |
| k-Nearest Neighbors | 0.57 | 1.41 | - | 0.56 | 0.58 | 1.36 |
| PCA (9 components) | 0.18 | - | - | 8.93 | 0.47 | 0.33 |
| k-Means (9 clusters) | 1.34 | 0.79 | ⋆ | - | 35.75 | 0.68 |
| License | BSD | GPL | BSD | BSD | BSD | GPL |

-: Not implemented. ⋆: Does not converge within 1 hour.

Table 1: Time in seconds on the Madelon data set for various machine learning libraries exposed in Python: MLPy (Albanese et al., 2008), PyBrain (Schaul et al., 2010), pymvpa (Hanke et al., 2009), MDP (Zito et al., 2008) and Shogun (Sonnenburg et al., 2010). For more benchmarks see http://github.com/scikit-learn.

Objects specified by interface, not by inheritance. To facilitate the use of external objects with scikit-learn, inheritance is not enforced; instead, code conventions provide a consistent interface. The central object is an estimator, that implements a fit method, accepting as
arguments an input data array and, optionally, an array of labels for supervised problems. Supervised estimators, such as SVM classifiers, can implement a predict method. Some estimators, that we call transformers, for example, PCA, implement a transform method, returning modified inp