Tutorial on Automated Machine Learning using MLBox


Introduction

Recently a friend and I were working on a practice problem. My friend, Shubham, after a long 8 hours of hard work and coding, managed to score 1153 and reach rank 219. Here is the leaderboard:

Whereas I needed just 8 lines of code to rank higher than him:

How did I do it?

What if I told you that there is a ready-made library called MLBox that does most of the heavy lifting in machine learning for you, so that you only have to write a minimum of code? From missing-value imputation to feature engineering of categorical features using entity embeddings, MLBox has it all.

In those 8 lines of MLBox code, I also performed hyper-parameter optimisation and tested more than 50 models at lightning speed. Sounds great, right? By the end of this article, you will be able to do the same.


Table of Contents

  1. What is MLBox?
  2. Comparison of MLBox with other machine learning libraries
  3. Installing MLBox
  4. Layout/Pipeline of MLBox
  5. Building a machine learning regressor using MLBox
  6. Basic understanding of Drift
  7. Basic understanding of Entity Embedding
  8. Pros and cons of MLBox
  9. End Notes

1. What is MLBox?

According to the developers of MLBox,

“MLBox is a powerful automated machine learning library with the following features:

  • Fast reading and distributed data preprocessing/cleaning/formatting
  • Highly robust feature selection and leak detection
  • Accurate hyper-parameter optimisation in high-dimensional space
  • State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM, ...)
  • Prediction with model interpretation”

2. Comparison of MLBox with Other Machine Learning Libraries

Compared to other libraries, MLBox focuses on the following three points:

  1. Drift identification – a method to make the distribution of the training data more similar to that of the test data.
  2. Entity embedding – a categorical-feature encoding technique inspired by word2vec.
  3. Hyper-parameter optimisation

We will discuss each of these in more detail below to give you a clearer picture.


3. Installing MLBox

MLBox currently runs only on Linux. MLBox was developed primarily in Python 2, and support for Python 3 was added only recently. Here we will install the latest development version of MLBox, 3.0-dev. The installation steps for Linux are as follows:

  1. Create a conda environment with Python 3.x and Anaconda using the command below.
    conda create -n Python3 python=3 anaconda    # "Python3" is the name of the environment we create
  2. Activate the Python 3 environment.
    source activate Python3
  3. Download MLBox.
    curl -OL https://github.com/AxeldeRomblay/mlbox/tarball/3.0-dev
  4. Extract the archive.
    sudo tar -xzvf 3.0-dev
  5. Enter the extracted directory.
    cd AxeldeRomblay-MLBox-2befaee
  6. Install MLBox.
    cd python-package
    cd dist
    pip install *.whl
  7. Install the other required libraries.
    pip install lightgbm
    pip install xgboost
    pip install hyperopt
    pip install tensorflow
    pip install keras
    pip install matplotlib
    pip install ipyparallel
    pip install pandas
  8. Verify that the installation works.
    python
    import mlbox
    If mlbox imports without errors, the installation was successful.
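
As a quick sanity check, you can also verify in one shell command that the companion libraries from step 7 import cleanly. This is a minimal sketch; the import names used are the standard ones for these packages:

    python -c "import mlbox, lightgbm, xgboost, hyperopt, tensorflow, keras, matplotlib, ipyparallel, pandas; print('all imports OK')"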

Note – This library is still under very active development, so it may be unstable. For example, two days before this writing it ran fine under Python 2.7 but would not run under Python 3.6; at the time of writing, the situation is reversed: the Python 2.7 setup fails while 3.0-dev under Python 3 works fine. If you find any bugs or have suggestions, feel free to open an issue or comment on GitHub.


4. Layout/Pipeline of MLBox

The complete pipeline:


The whole MLBox pipeline is divided into three parts/sub-packages:

  1. Pre-processing
  2. Optimisation
  3. Prediction

Each part is described in detail below.
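
The three stages map one-to-one onto the sub-package imports used throughout the rest of this article:

    # one MLBox sub-package per pipeline stage
    from mlbox.preprocessing import *   # 1. pre-processing: reading/cleaning files, drift removal
    from mlbox.optimisation import *    # 2. optimisation: hyper-parameter search
    from mlbox.prediction import *      # 3. prediction: fitting and predicting on the test set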


Pre-Processing

All the functionality inside this sub-package can be accessed via the command:
from mlbox.preprocessing import *

This sub-package provides two major pieces of functionality.

  1. Reading and cleaning a file

    This package supports reading a wide variety of file formats such as CSV, Excel, HDF5, and JSON, but in this article we will focus primarily on the most common ".csv" format. Follow the steps below to read a CSV file.

    Step 1: Create an object of the Reader class with the separator as a parameter. "," is the separator in the case of a CSV file.
    s=","
    r=Reader(s)   # initialising the object of the Reader class

    Step 2: Make a list of the train and test file paths, and identify the name of the target variable.
    path=["path of the train csv file","path of the test csv file "]
    target_name="name of the target variable in the train file"

    Step 3: Perform the cleaning operation and create cleaned train and test files.
    data=r.train_test_split(path,target_name)
    The cleaning steps performed above are:
    - deleting unnamed columns
    - removing duplicates
    - extracting month, year and day of the week from a Date column


  2. Removing the Drifting Variables

    Drifting variables are explained in a later section. To remove them, follow the steps below.

    Step 1: Create an object of class Drift_thresholder
    dft=Drift_thresholder()

    Step 2: Use the fit_transform method of the created object to remove the drifting variables.
    data=dft.fit_transform(data)
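
Putting the whole pre-processing stage together (the file paths and target name below are placeholders):

    # read + clean the files, then drop the drifting variables
    r = Reader(sep=",")
    data = r.train_test_split(["train.csv", "test.csv"], "target")
    data = Drift_thresholder().fit_transform(data)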

 




Optimisation

All the functionality inside this sub-package can be accessed via the command:
from mlbox.optimisation import *

This is the section where the library scores the most points. The hyper-parameter optimisation in MLBox is built on the hyperopt library, which is very fast, and you can optimise almost anything with it, from the choice of missing-value imputation method to the depth of an XGBoost model. The library creates a high-dimensional space of the parameters to be optimised and chooses the combination of parameters that gives the best validation score.

Below is a table of the four broad optimisations performed in the MLBox library; the terms to the right of the hyphen are the ones that can be optimised over different values.

Missing Values Encoder (ne) – numerical_strategy (when the column to be imputed is continuous, e.g. mean, median), categorical_strategy (when the column to be imputed is categorical, e.g. NaN values)

Categorical Values Encoder (ce) – strategy (method of encoding categorical variables, e.g. label_encoding, dummification, random_projection, entity_embedding)

Feature Selector (fs) – strategy (different methods for feature selection, e.g. l1, variance, rf_feature_importance), threshold (the percentage of features to be discarded)

Estimator (est) – strategy (different algorithms that can be used as estimators, e.g. LightGBM, XGBoost), **params (parameters specific to the algorithm being used, e.g. max_depth, n_estimators)

Let us take an example and create a hyper-parameter space to be optimised. Here are all the parameters that I want to optimise:

Algorithm to be used – LightGBM
LightGBM max_depth – [3,5,7,9]
LightGBM n_estimators – [250,500,700,1000]
Feature selection – [variance, l1, random forest feature importance]
Missing values imputation – numerical (mean, median), categorical (NaN values)
Categorical values encoder – label encoding, entity embedding and random projection

Let us now create our hyper-parameter space. Before that, remember that the hyper-parameter space is a dictionary of key-value pairs, where each value is itself a dictionary given by the syntax
{"search":strategy,"space":list}, where strategy can be either "choice" or "uniform" and list is the list of values.

import numpy as np   # for np.NaN below

space={'ne__numerical_strategy':{"search":"choice","space":['mean','median']},
'ne__categorical_strategy':{"search":"choice","space":[np.NaN]},
'ce__strategy':{"search":"choice","space":['label_encoding','entity_embedding','random_projection']},
'fs__strategy':{"search":"choice","space":['l1','variance','rf_feature_importance']},
'fs__threshold':{"search":"uniform","space":[0.01, 0.3]},
'est__max_depth':{"search":"choice","space":[3,5,7,9]},
'est__n_estimators':{"search":"choice","space":[250,500,700,1000]}}

Now let us choose the best combination from the above space using the following steps:

Step 1: Create an object of class Optimiser, which takes 'scoring' and 'n_folds' as parameters. scoring is the metric against which we want to optimise our hyper-parameter space, and n_folds is the number of cross-validation folds.
Scoring values for classification – "accuracy", "roc_auc", "f1", "log_loss", "precision", "recall"
Scoring values for regression – "mean_absolute_error", "mean_squared_error", "median_absolute_error", "r2"
opt=Optimiser(scoring="accuracy",n_folds=5)

Step 2: Use the optimise method of the object created above, which takes the hyper-parameter space, the dictionary created by train_test_split, and the number of iterations as parameters. This method returns the best hyper-parameters from the hyper-parameter space.
best=opt.optimise(space,data,40)
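
Putting the optimisation stage together (space is the dictionary defined above and data comes from the pre-processing stage; to my understanding, the returned value is a plain dictionary of the winning parameter values):

    opt = Optimiser(scoring="accuracy", n_folds=5)   # metric + number of CV folds
    best = opt.optimise(space, data, 40)             # 40 hyperopt iterations
    # best maps the keys of space to the chosen values; it is what gets
    # passed on to the prediction stage next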




Prediction

All the functionality in this sub-package can be accessed via the command below.
from mlbox.prediction import *

This sub-package predicts on the test dataset using the best hyper-parameters found by the optimisation sub-package. To predict on the test dataset, go through the following steps.

Step 1: Create an object of class Predictor
pred=Predictor()

Step 2: Use the fit_predict method of the object created above, which takes the set of best hyper-parameters and the dictionary created by train_test_split as parameters.
pred.fit_predict(best,data)

The above method saves the feature importances, the drift variable coefficients, and the final predictions into a separate folder named 'save'.
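
As a hypothetical follow-up, you can load the saved predictions back for inspection. The 'save' folder and the predictions CSV come from the description above; the exact filename depends on the target variable's name, so the glob pattern here is an assumption:

    import glob
    import pandas as pd

    # pick up the predictions CSV written by fit_predict into ./save
    pred_file = glob.glob("save/*predictions*.csv")[0]
    predictions = pd.read_csv(pred_file)
    print(predictions.head())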




5. Building a Machine Learning Regressor using MLBox

We will now build a machine learning regressor, complete with hyper-parameter optimisation, in just 7 lines of code. We will work on the Big Mart Sales problem. Download the train and test files and save them in the same folder. With MLBox, we can submit our first predictions without even looking at the data. You can find the code for making predictions on this problem below.


# coding: utf-8

# importing the required libraries
import numpy as np
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *

# reading and cleaning the train and test files
df=Reader(sep=",").train_test_split(['/home/nss/Downloads/mlbox_blog/train.csv',
                                     '/home/nss/Downloads/mlbox_blog/test.csv'],
                                    'Item_Outlet_Sales')

# removing the drift variables
df=Drift_thresholder().fit_transform(df)

# setting the hyperparameter space
space={'ne__numerical_strategy':{"search":"choice","space":['mean','median']},
'ne__categorical_strategy':{"search":"choice","space":[np.NaN]},
'ce__strategy':{"search":"choice","space":['label_encoding','entity_embedding','random_projection']},
'fs__strategy':{"search":"choice","space":['l1','variance','rf_feature_importance']},
'fs__threshold':{"search":"uniform","space":[0.01, 0.3]},
'est__max_depth':{"search":"choice","space":[3,5,7,9]},
'est__n_estimators':{"search":"choice","space":[250,500,700,1000]}}

# calculating the best hyper-parameter
best=Optimiser(scoring="mean_squared_error",n_folds=5).optimise(space,df,40)

# predicting on the test dataset
Predictor().fit_predict(best,df)

The above code achieved rank 108 (top 1%) on the public leaderboard, without even having to open the train and test files. I think that is amazing!

Below is the plot of feature importances as computed by LightGBM.



6. Basic Understanding of Drift

Drift is not a common topic, but it is a very important one that deserves an article of its own. Still, I will briefly explain what Drift_thresholder does.

In general, we assume that the train and test datasets are created by the same generative process, but this assumption is quite strong and we rarely see this behaviour in the real world: the data-generating process may change. For example, in a sales prediction model, customer behaviour changes over time, so the data being generated will differ from the data that was used to build the model. This is called drift.

Another point to note is that in a dataset, both the independent features and the dependent feature may drift. When the distribution of the independent features changes, it is called covariate shift, and when the relationship between the independent and dependent features changes, it is called concept shift. MLBox deals with covariate shift.

The general algorithm for detecting drift is as follows:
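
Here is a minimal sketch of the usual approach, sometimes called adversarial validation, assuming scikit-learn is available. It illustrates the general idea rather than MLBox's exact implementation:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def drift_score(train, test, col):
        """How well can a classifier tell train rows from test rows using one
        (numeric) feature? An AUC near 0.5 means the two distributions match
        (no drift); an AUC near 1.0 means the feature drifts strongly."""
        X = pd.concat([train[[col]], test[[col]]], axis=0).values
        y = np.r_[np.zeros(len(train)), np.ones(len(test))]   # 0 = train row, 1 = test row
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

Features whose drift score exceeds a threshold would be discarded, which is conceptually what Drift_thresholder's fit_transform does.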




7. Basic Understanding of Entity Embedding

Entity embeddings owe their existence to word2vec embeddings in the sense that they work the same way word vectors do. For example, we know that with word vector representations we can do things like the following (the classic example being king - man + woman ≈ queen).

 

In a similar sense, categorical variables can be encoded to create new, informative features. Their effect became evident to the world in Kaggle's Rossmann Sales competition, where a team used entity embeddings along with a neural network and came third without performing any significant feature engineering. The entire code and the research paper on entity embeddings that resulted from the competition can be found here. The entity embeddings were able to capture the relationships between the German states, as shown below.

I do not want to bog you down with a full explanation of entity embeddings here; the topic deserves its own article. In MLBox, you can use entity embedding as a black box for encoding categorical variables.
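
For intuition, here is a minimal sketch of an entity-embedding layer for a single integer-encoded categorical column, using Keras (installed in the setup section above). The sizes n_categories and embedding_dim are illustrative choices, not MLBox internals:

    from tensorflow.keras.layers import Input, Embedding, Flatten, Dense
    from tensorflow.keras.models import Model

    n_categories = 50    # number of distinct levels of the categorical variable
    embedding_dim = 8    # size of the learned dense representation

    inp = Input(shape=(1,), dtype="int32")          # one integer-encoded category per row
    emb = Embedding(input_dim=n_categories,
                    output_dim=embedding_dim)(inp)  # trainable lookup table of vectors
    vec = Flatten()(emb)                            # shape (batch, embedding_dim)
    out = Dense(1)(vec)                             # head for the downstream task
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="mse")

After training, the weights of the Embedding layer are the entity embeddings, and they can be reused as informative numeric features for the category.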




8. Pros and Cons of MLBox

This library has its own set of pros and cons.

The pros are –

  1. Automatic task identification, i.e. classification or regression
  2. Basic pre-processing while reading the data
  3. Removal of drifting variables
  4. Extremely fast and accurate hyper-parameter optimisation
  5. A wide variety of feature selection methods
  6. Minimal lines of code
  7. Feature engineering via entity embeddings

The cons are-

  1. It is still under active development, and things may break or change at any point in time.
  2. No support for unsupervised learning.
  3. Only basic feature engineering; you still have to create your own features.
  4. A purely mathematical feature selection method, which may remove variables that make sense from the business perspective.
  5. Not yet a truly automated machine learning library.

So, I suggest you weigh the pros and cons before making this your mainstream library for Machine Learning.




9. End Notes

I was excited to try this library out as soon as I saw it announced on GitHub. I spent several days working through it and have distilled that work here so that you can get started right away. I have to say I am impressed by the library, and I will keep exploring it. With just 8 lines of code I reached the top 1% without spending any dedicated time on data processing or hyper-parameter optimisation, leaving me more time to engineer features and test them in practice. Feel free to comment if you need any help or have suggestions.
