[PYTHON] Run XGBoost on Bash on Ubuntu on Windows

(Added how to install the R-package: http://qiita.com/TomokIshii/items/b43321448ab9fa21dc10#%E8%BF%BD%E8%A8%98r-package-%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB 2016/9/2)

The Anniversary Update for Windows 10 has been released, making Bash on Ubuntu on Windows available. There are already reports that "TensorFlow worked too!", so this time I decided to set up an environment for XGBoost (Python-package). XGBoost (eXtreme Gradient Boosting) is a library that implements gradient boosting. It is written in C++, but it can also be used from Python, R, Julia, and Java.

I remember that installing XGBoost on Windows a while ago took a lot of time, starting with the MinGW installation. Hoping for an improvement, I set out intending to write an article along the lines of "This time, installation is so easy!", but I ran into some trouble, so I will describe what happened.

(The environment used this time: Windows 10 ver. 1607, Bash on Ubuntu on Windows (Windows Subsystem for Linux), Python 3.5.2, pyenv, miniconda3-4.0.5, xgboost ver. 0.6.)

Installing basic tools

If you install Bash on Ubuntu on Windows by following Microsoft's blog article, you get an Ubuntu 14.04 LTS environment, but it contains almost none of the programs needed for development. Therefore, I first installed the basic tools.

git installation

sudo apt-get install git

gcc, g++ installation

sudo apt-get install gcc
sudo apt-get install g++
$ gcc --version
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So, gcc 4.8.4 is included.

make installation

Install make because it is not included by default.

sudo apt-get install make

XGBoost build and Python related installation

Now that gcc, g++, and the other tools have been installed, build XGBoost.

XGBoost build

git clone --recursive https://github.com/dmlc/xgboost 
cd xgboost
make -j4

When I tried this earlier in the native Windows 10 environment, the build took a lot of effort, but this time it completed on the first try. The xgboost/lib directory contained libxgboost.a and libxgboost.so.
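The check for the build products can also be scripted. The following is a minimal sketch, assuming the libraries end up in a lib directory under the repository root as described above; find_built_libs is a hypothetical helper, and the demonstration runs against a temporary directory with empty placeholder files rather than a real build.

```python
import tempfile
from pathlib import Path


def find_built_libs(repo_root):
    """Return the sorted names of libxgboost.* files under <repo_root>/lib."""
    lib_dir = Path(repo_root) / "lib"
    if not lib_dir.is_dir():
        return []
    return sorted(p.name for p in lib_dir.glob("libxgboost.*"))


# Self-contained demonstration on a throwaway directory that mimics
# the layout described above (the files are empty placeholders).
with tempfile.TemporaryDirectory() as tmp:
    lib = Path(tmp) / "lib"
    lib.mkdir()
    (lib / "libxgboost.a").touch()
    (lib / "libxgboost.so").touch()
    demo_libs = find_built_libs(tmp)
    print(demo_libs)
```

Against a real checkout, calling find_built_libs("xgboost") from the parent directory should list the same two files if the build succeeded.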

pyenv installation

Since Python 2 may still be needed in some cases, I set up pyenv so that I can switch between Python 2 and 3. (Reference) https://github.com/yyuu/pyenv#installation

pyenv requires adding PATH-related environment variables; I edited .bashrc with vi, which is included by default.

miniconda installation

Use pyenv to select the Python to install. The list of available versions can be displayed as follows.

pyenv install -l

Here it is hard to decide between the Anaconda series (full package) and the miniconda series (minimal package); this time I chose miniconda3-4.0.5.

pyenv install miniconda3-4.0.5

After that, install the modules required for numerical calculation.

conda install numpy scipy scikit-learn
conda install ipython

XGBoost Python-package installation

Everything had gone smoothly up to this point, but I stumbled at the very end. First, I typed the command given in the XGBoost documentation:

cd python-package; sudo python setup.py install

This fails with messages saying that various things are missing. The fix is to install python-setuptools. (This was properly stated in the XGBoost documentation.)

sudo apt-get install python-setuptools

After this, I retried the previous step:

sudo python setup.py install

I expected this to fix the problem, but it failed with a "command not found" error. The cause is a mismatch: the command above runs at the system level (with root privileges), whereas the pyenv environment is set up at the user level. (By default, pyenv seems to switch between environments by placing shims under $HOME/.pyenv.)
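The mismatch can be reproduced in miniature. The sketch below assumes that sudo runs with a PATH that does not contain pyenv's shims directory (on Ubuntu, sudo typically resets PATH). The directory names are stand-ins created in a temporary folder, and resolve is a hypothetical helper built on the standard shutil.which.

```python
import os
import shutil
import stat
import tempfile


def resolve(cmd, path_dirs):
    """Look up cmd the way a shell would, using only the given directories."""
    return shutil.which(cmd, path=os.pathsep.join(path_dirs))


with tempfile.TemporaryDirectory() as tmp:
    shims = os.path.join(tmp, ".pyenv", "shims")  # stand-in for $HOME/.pyenv/shims
    system_bin = os.path.join(tmp, "usr", "bin")  # stand-in for a system directory
    os.makedirs(shims)
    os.makedirs(system_bin)

    # Create a fake executable named "python" in the shims directory only.
    fake_python = os.path.join(shims, "python")
    with open(fake_python, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(fake_python, os.stat(fake_python).st_mode | stat.S_IXUSR)

    user_lookup = resolve("python", [shims, system_bin])  # user shell: shims first
    sudo_lookup = resolve("python", [system_bin])         # sudo-like PATH: no shims

    print("user PATH:", user_lookup)
    print("sudo PATH:", sudo_lookup)
```

With the shims directory on the search path the lookup succeeds; without it nothing is found, which corresponds to the "command not found" error seen under sudo.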

Therefore, install again at the user level.

python setup.py install

The installation completed successfully (or so I thought). I then tested xgboost with predict_first_ntree.py from the distributed demo code.

$ python predict_first_ntree.py
OMP: Error #100: Fatal system error detected.
OMP: System error #22: Invalid argument
Aborted (core dumped)

This was an unexpected failure with no obvious cause. The only clue is the word "OMP". A web search shows that OMP stands for OpenMP (Open Multi-Processing). The component that seems related is MKL (Math Kernel Library), Intel's numerical library, which had been installed with Miniconda as a prerequisite for NumPy and SciPy.

MKL improves the performance of numerical libraries such as NumPy and SciPy, but it is also a frequent source of trouble when maintaining an environment. Previously, on a native Ubuntu machine (not a virtual environment), the deep learning frameworks Theano and TensorFlow suddenly started failing at the same time (I do not know what triggered it), and a hurried investigation showed that MKL was the cause.

This time, it seems that MKL does not work because Ubuntu is running as a virtual environment (Bash on Ubuntu on Windows). I therefore tried replacing the MKL-related libraries. The replacement is done by installing nomkl, which removed the MKL library and installed openblas instead.

conda install nomkl
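To confirm that the swap took effect, the package list can be inspected. The following is a rough sketch that assumes the usual `conda list` layout, with the package name in the first column and header lines starting with `#`. blas_related_packages and current_conda_list are hypothetical helpers, and the sample text is an illustrative placeholder, not output captured from a real environment.

```python
import shutil
import subprocess

BLAS_RELATED = {"mkl", "nomkl", "openblas"}


def blas_related_packages(conda_list_text):
    """Return the BLAS-related package names found in `conda list` output."""
    found = set()
    for line in conda_list_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comments
        name = line.split()[0]
        if name in BLAS_RELATED:
            found.add(name)
    return found


def current_conda_list():
    """Run `conda list` if conda is on PATH; return '' otherwise."""
    if shutil.which("conda") is None:
        return ""
    result = subprocess.run(["conda", "list"], stdout=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout


# Illustrative input only: names and versions are placeholders, not real output.
sample = """\
# packages in environment at /path/to/env:
#
nomkl      0.0   0
numpy      0.0   py35_nomkl_0
openblas   0.0   0
"""
print(sorted(blas_related_packages(sample)))
```

In a real environment, blas_related_packages(current_conda_list()) would be expected to contain nomkl and openblas but not mkl after the replacement.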

After that, I ran predict_first_ntree.py, a demonstration of a simple binary classification problem. (The beginning of the distributed code has been modified.)

import os
# check os environment
if os.name == 'nt':  # Windows case ... add mingw lib path
    mingw_path = 'C:\\usr\\mingw-w64\\x86_64-5.4.0-win32-seh-rt_v5-rev0\\mingw64\\bin'
    os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
elif os.name == 'posix':  # Linux case ... nothing to set up
    pass

import numpy as np
import xgboost as xgb

# load data
dtrain = xgb.DMatrix('./data/agaricus.txt.train')
dtest = xgb.DMatrix('./data/agaricus.txt.test')
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
watchlist  = [(dtest,'eval'), (dtrain,'train')]
num_round = 3
bst = xgb.train(param, dtrain, num_round, watchlist)
print ('start testing prediction from first n trees')
# predict using first 1 tree
label = dtest.get_label()
ypred1 = bst.predict(dtest, ntree_limit=1)
# by default, we predict using all the trees
ypred2 = bst.predict(dtest)
print ('error of ypred1=%f' % (np.sum((ypred1>0.5)!=label) /float(len(label))))
print ('error of ypred2=%f' % (np.sum((ypred2>0.5)!=label) /float(len(label))))

Below are the calculation results.

$ python predict_first_ntree.py
[15:25:08] 6513x127 matrix with 143286 entries loaded from ./data/agaricus.txt.train
[15:25:08] 1611x127 matrix with 35442 entries loaded from ./data/agaricus.txt.test
[0]     eval-error:0.042831     train-error:0.046522
[1]     eval-error:0.021726     train-error:0.022263
[2]     eval-error:0.006207     train-error:0.007063
start testing prediction from first n trees
error of ypred1=0.042831
error of ypred2=0.006207

With this change, the script ran normally. It turns out that the lack of MKL support in this environment (Bash on Ubuntu on Windows) had already been reported on Qiita. (I only learned of it afterwards.)

(Qiita article-Pikkaman V) http://qiita.com/PikkamanV/items/d308927c395d6e687a6a (Source) https://scivision.co/anaconda-python-with-windows-subsystem-for-linux/
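For reference, the error figures printed by the test script are the fraction of test rows whose prediction, thresholded at 0.5, differs from the label. Below is a minimal pure-Python sketch of the same calculation; error_rate is a hypothetical helper and the input values are made up for illustration.

```python
def error_rate(predictions, labels, threshold=0.5):
    """Fraction of rows whose thresholded prediction differs from the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    wrong = sum((p > threshold) != bool(y) for p, y in zip(predictions, labels))
    return wrong / float(len(labels))


# Made-up values for illustration; they are not taken from the agaricus data.
preds = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 0, 1]
print("error=%f" % error_rate(preds, labels))
```

Here the last two rows are misclassified, so the sketch reports an error rate of 0.5.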

Summary and impressions

I have only just started using Bash on Ubuntu on Windows, but I have high expectations for it. The lack of MKL support is a pity, but compared with the earlier situation, where Windows needed "that" library plus "this" compiler plus "some other" tool, it looks like a big improvement.

Also, since XGBoost itself has just been upgraded to ver.0.6 (skipping ver.0.5), I would like to continue studying and deepen my understanding of XGBoost.

Docker for Windows has also been released, so the programming environment on Windows is becoming more interesting (although Docker may serve a different purpose than Bash on Ubuntu on Windows). (It may also become more of a hassle.)

(Addition) Installation of R-package

{devtools} is required as a prerequisite package. In addition, the C libraries that the R package {devtools} depends on are not included in the initial state of Bash on Ubuntu, so I had to install about two packages with sudo apt-get install.

Once {devtools} is installed, launch the R interpreter and run:

library(devtools)
install('xgboost/R-package')

This should have been enough. ('xgboost/R-package' is a relative path, so adjust it according to the current directory.)

The result of executing the above script is as follows.

> library(devtools);install('R-package')
Installing xgboost
trying URL 'https://cran.rstudio.com/src/contrib/Matrix_1.2-7.1.tar.gz'
Content type 'application/x-gzip' length 1805890 bytes (1.7 MB)
==================================================
downloaded 1.7 MB

sh: 1: /bin/gtar: not found
sh: 1: /bin/gtar: not found
Error in system(cmd, intern = TRUE) : error in running command
In addition: Warning message:
In utils::untar(src, exdir = target, compressed = "gzip") :
  ‘/bin/gtar -xf '/tmp/RtmplJiuv1/Matrix_1.2-7.1.tar.gz' -C '/tmp/RtmplJiuv1/devtools24847e356e71'’ returned error code 127

This error message says that /bin/gtar does not exist when extracting Matrix_1.2-7.1.tar.gz. Bash on Ubuntu has /bin/tar, and I would like the installer to use it, but the installation script does not seem to be written that way. Creating a symbolic link as root would work, but when I searched the web I found a workaround on Stack Overflow.

Error in untar( ) while using R

Sys.setenv(TAR = '/bin/tar')

After setting the above in the R interpreter, I ran install('R-package') again and the installation completed.
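Before setting TAR, it can help to confirm which tar binary is actually available. The sketch below does this in Python rather than R, to match the rest of this article. pick_tar is a hypothetical helper that wraps the standard shutil.which, and the candidate names are simply the two mentioned above.

```python
import shutil


def pick_tar(candidates=("gtar", "tar"), which=shutil.which):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    return None


print(pick_tar())
```

On a system like the one described here, where gtar is missing, this would be expected to fall through to tar, and the returned path is the value to pass to Sys.setenv(TAR = ...).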
