[LINUX] Before the introduction to machine learning. ~ Technology required for machine learning other than machine learning ~

Introduction

The title is a nod to "[Before Introduction to C Programming](https://www.amazon.co.jp/C%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%9F%E3%83%B3%E3%82%B0%E5%85%A5%E9%96%80%E4%BB%A5%E5%89%8D-%E6%9D%91%E5%B1%B1-%E5%85%AC%E4%BF%9D/dp/4839920648/)" by Yukio Murayama. In other words, this article does not teach machine learning itself. Instead, it covers the many surrounding skills that machine learning requires.

First of all, I would like to introduce myself.

Career

I belonged to an artificial intelligence laboratory in both my undergraduate and graduate years. At first I did a lot of research based on Boltzmann machines and the like. For some reason, while I was a graduate student, a company hired me as a part-time researcher. I wrote up the results of that research as my master's thesis and graduated.

Before I got into machine learning

I could write programs reasonably well, but **I could not even search for words I did not know**. I constantly wanted to learn new technology, yet I spent my days struggling because I had no idea what to search for.

After joining the laboratory, I discovered Qiita and learned about Python. That is how I was able to learn machine learning.

It is all thanks to Qiita. I could not search because I did not know the words, but I gradually picked up the vocabulary there, and as a result I was able to learn machine learning.

Basically, that is the whole story of "how I learned machine learning". The rest is something of a bonus, but I hope you will read it anyway. (Like the gum that comes packed with a toy.)

Before the introduction to machine learning.

This time, as thanks to Qiita, I will list the tools that never appear in machine learning textbooks but are indispensable for machine learning, with a brief explanation of each. Please bear with me.

Linux edition

Machine learning and Linux are inseparable. A Mac alone does not have enough power, and setting up Python on Windows is a hassle.

As a result, machine learning more or less forces you to use Linux. This section explains how to work with it, useful commands, and the knowledge you need.

Working with Linux on Windows

Recent versions of Windows 10 have a feature called *Windows Subsystem for Linux*. With it, you can use a pseudo Linux environment on Windows.

It is an official Microsoft tool, and you can find out how to install it by googling.

SSH

SSH stands for Secure Shell. Think of it as a way to log in to a remote server. In other words, it lets you access remote servers.

The nice thing about SSH is that you can keep using the Mac interface while leaving the computation itself to a Linux machine. It also does not matter whether your own machine runs Mac or Windows. Either works.

Security precautions

You may want to expose SSH on your home server or laboratory server to the outside. In that case, you will basically be fine if you observe the following points in sshd_config.

- Set PermitRootLogin to no.
- Set PasswordAuthentication to no.
  - (Even with no, you can still log in with a password when you sit directly in front of the machine.)
- Set up public key authentication.

With public key authentication, you can log in to the remote machine securely without a password. Let's use public key authentication. If you expose the server to the internet, there is no other choice.

I will not explain the mechanism in detail here. Simply put:

- Running ssh-keygen creates a public/private key pair.
- Register the public key in authorized_keys on the server and configure sshd_config.
- On the connecting side, add a line such as `IdentityFile ~/.ssh/id_rsa` to ~/.ssh/config.

Then you can log in safely. Please google the details.

tmux

When you run a computation over a remote SSH connection and the network drops partway through a long job, **the computation is lost and you are back to nothing.**

You want the state to survive even if SSH is cut off, right? That is exactly what tmux makes possible.

tmux has the concept of a session, which keeps the pseudo terminal alive as a process even after the SSH connection is gone.

On Ubuntu 18.04 LTS, tmux is probably already included, so you likely do not need to install it.

Launch a new session

To start a tmux session, run

tmux new -s session_name

You can choose any session_name you like.

Every tmux operation starts with pressing the prefix key. The prefix is ctrl + b by default, but I recommend changing it to ctrl + a because it is much more comfortable to press.

End the session

If you want to get rid of the session itself, run logout inside it.

Return to the original shell while keeping the session

This operation is called detach. Press prefix, then d.

Return to the session after being disconnected from SSH

Run tmux a (short for tmux attach) in the terminal.

Split the screen

tmux can also split the screen. Pressing prefix, then % splits the screen into left and right panes. Pressing prefix, then " splits it into top and bottom panes.

Show clock

In fact, you can also display the clock. You can do it with tmux clock-mode.

Whole cheat sheet

You can display the list of key bindings with prefix, ?. This article is also a good reference: https://qiita.com/nmrmsys/items/03f97f5eabec18a3a18b

~/.tmux.conf

tmux can also be customized in many ways. There are all kinds of settings, but I referred to the following two articles.

- Learn from the master. Basic settings of tmux.conf
- Show if Prefix key is pressed in tmux

The settings I always use are as follows.

# Change the prefix key to C-a
set -g prefix C-a

# Unbind the default C-b key binding
unbind C-b

# Reload the config file with prefix, r
bind r source-file ~/.tmux.conf \; display "Reloaded!"

# Press C-a twice to send C-a to the program running inside tmux
bind C-a send-prefix

# Split the pane vertically (side by side) with |
bind | split-window -h

# Split the pane horizontally (top and bottom) with -
bind - split-window -v

# Use a 256-color terminal
set -g default-terminal "screen-256color"

# Show in the status line whether the prefix key is pressed
set-option -g status-left '#[fg=cyan,bg=#303030]#{?client_prefix,#[reverse],} #H[#S] #[default]'

Basically, this is enough.

htop

htop is a tool for monitoring system resources.

htop

With it you can see how heavily the CPU is actually being used.

nvtop

nvtop is the GPU version of htop.

nvtop

If htop exists for the CPU, nvtop exists for the GPU, so to speak. It also lets you check whether the GPU is actually being used.

On Ubuntu 19.04 you can install it with `apt`, but otherwise you basically need to build it from source.

vi/vim

You will very likely need to edit files on Linux, for example over SSH. vi and vim are what you use in that situation. The difference is simple: vi + various extra features = vim. Working with plain vi is quite a struggle.

You can open a file with the vi or vim command.

Basically, you will be fine if you remember the following.

normal mode

Cursor movement, undo, and search are all done in normal mode. Here is a list of features that are useful to know.

:q       Quit
:q!      Force quit (discard changes)
:w       Save (overwrite the file)
:100     Move to line 100
/word    Search for word (press n to jump to the next match)
u        Undo (like ctrl + z on Windows)
dd       Delete the current line (like ctrl + x on Windows)
yy       Copy the current line (like ctrl + c)
p        Paste (like ctrl + v)
h j k l  Move the cursor ← ↓ ↑ → (on a Mac with Japanese input, typing zh gives ←)

insert mode

Press `i` or `O` to enter insert mode. In insert mode, you can type characters. Press `ESC` to return to normal mode. (Was the physical ESC key brought back on the Mac thanks to Vim users...?)

If you google, you will find as many explanations of the other operations as you could want. Please look them up.

Other commands

find

It literally finds files.

find [start_dir]

In practice, running

find ~/ | grep <part of the file name you want to find>

lets you search for where a file is located.
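For scripts, the same "list everything, then filter" idea can be sketched in Python. This is a minimal sketch, not a replacement for find: the helper name `find_paths` is made up for illustration, it matches a plain substring rather than a grep-style pattern, and it matches against the path relative to the start directory.

```python
import tempfile
from pathlib import Path


def find_paths(start_dir, keyword):
    """Yield paths under start_dir whose relative path contains keyword.

    Hypothetical helper, roughly analogous to `find start_dir | grep keyword`,
    but using a plain substring test instead of a regular expression.
    """
    start = Path(start_dir)
    for path in start.rglob("*"):  # walks the whole tree recursively
        if keyword in str(path.relative_to(start)):
            yield path


# Small demo on a throwaway directory so nothing on the real disk is touched.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data").mkdir()
    (Path(tmp) / "data" / "train.tsv").write_text("x")
    (Path(tmp) / "notes.txt").write_text("y")
    for hit in find_paths(tmp, "train"):
        print(hit.name)
```

Because `find_paths` is a generator, results are produced one at a time, which keeps memory use low on very large directory trees.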

tree

Shows files in a tree format. I always use this when I want to understand the overall structure. The drawback is that it floods the terminal with output.

wc

Counts the number of lines in a file (wc -l). It is useful when you want to know how many lines a tsv file has, or in combination with find.
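If you need the same count inside a Python script, a rough equivalent of wc -l can be sketched as follows. The function name `count_lines` and the chunk size are assumptions for illustration. Like wc -l, it counts newline characters, and it reads in chunks so that even a huge file never has to fit in memory.

```python
import tempfile
from pathlib import Path


def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters in a file, reading chunk_size bytes at a time."""
    total = 0
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


# Demo on a tiny made-up tsv file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.tsv"
    sample.write_text("id\tvalue\n1\t10\n2\t20\n", encoding="utf-8")
    print(count_lines(sample))  # 3
```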

df/du

These measure disk usage. df shows the usage of each file system as a whole, while du shows the size of individual files and directories.

df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             16G     0   16G   0% /dev
tmpfs           3.2G  1.5M  3.2G   1% /run
/dev/sdb3       916G   33G  837G   4% /
tmpfs            16G   88K   16G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup

The -h option displays sizes with human-readable units. The Filesystem column shows the device under /dev and its specific name. /dev/sdb3 is a partition on concrete hardware such as an SSD. Such devices are basically named sd[x][n]. Please google the details.

On the other hand, if you want to see individual sizes, the du command is the one to use. For example, to list the size of everything directly under your home directory, run

du -hs ~/*

It shows the size of each file and directory there, so you can see which ones are the heaviest.
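The same per-entry summary can be sketched in Python when you want to process the numbers further. This is only an approximation: `tree_size` and `sizes_by_entry` are hypothetical helper names, and they add up apparent file sizes, whereas du reports allocated disk blocks, so the figures can differ somewhat.

```python
import os
import tempfile


def tree_size(path):
    """Total apparent size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):  # skip symlinks so targets are not counted
                total += os.path.getsize(file_path)
    return total


def sizes_by_entry(directory):
    """Map each entry directly under directory to its size, largest first."""
    sizes = {}
    for entry in os.scandir(directory):
        if entry.is_dir(follow_symlinks=False):
            sizes[entry.name] = tree_size(entry.path)
        elif entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat().st_size
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


# Demo on a throwaway directory with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "logs"))
    with open(os.path.join(tmp, "logs", "big.log"), "wb") as f:
        f.write(b"x" * 2000)
    with open(os.path.join(tmp, "small.txt"), "wb") as f:
        f.write(b"x" * 10)
    for name, size in sizes_by_entry(tmp).items():
        print(f"{size:>8}  {name}")
```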

grep

Use this when you want to extract only the relevant lines from a large amount of output such as logs.

find ~/ | grep filename

The output is passed through a pipe, and only the lines matching filename are extracted. It also supports regex, described later.

cat

Outputs the contents of a file as they are. Combined with a pipe, as in

cat /var/log/auth.log | grep sudo

you can search inside a file.

less/head/tail

When someone sends you a ridiculously huge 120GB tsv, running vim logfile.tsv takes painfully long. (Please send a tsv of that size as parquet in the first place!)

In such a case, the less command reads only part of the file and displays it on the screen. head displays the first few lines. tail displays the last few lines.
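The idea of peeking at a huge file without loading it all can also be sketched in Python. The helper names `head` and `tail` below are chosen for illustration. Note that this `tail` still streams through the whole file once and only keeps the last n lines in memory, whereas the real tail command can jump to the end of the file.

```python
import tempfile
from collections import deque
from itertools import islice
from pathlib import Path


def head(path, n=10):
    """Return the first n lines; stops reading after n lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def tail(path, n=10):
    """Return the last n lines; streams the file, holding at most n lines in memory."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


# Demo on a small made-up file with 100 rows.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "logfile.tsv"
    sample.write_text("".join(f"row {i}\n" for i in range(100)), encoding="utf-8")
    print(head(sample, 3))  # ['row 0', 'row 1', 'row 2']
    print(tail(sample, 2))  # ['row 98', 'row 99']
```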

jq

It pretty-prints JSON so that it is easy to read. For more information, see the article "Introduction to daily use of jq command".
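When jq is not installed, the basic pretty-printing part can be approximated with Python's standard json module. This sketch covers only the formatting, not jq's query language, and the function name `pretty` is made up for illustration.

```python
import json


def pretty(json_text):
    """Re-serialize a JSON string with indentation, similar in spirit to `jq .`."""
    data = json.loads(json_text)
    # ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable instead of \u escapes
    return json.dumps(data, indent=2, ensure_ascii=False)


# Made-up one-line JSON, as you might find in a log.
print(pretty('{"model": "xgboost", "params": {"depth": 6}, "scores": [0.91, 0.93]}'))
```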

sed

It replaces strings. s/a/b/g means "replace every a with b". Among engineers, the s/a/b/g notation is often used as shorthand even in places like Slack. It is a kind of common language.
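To make the s/a/b/g notation concrete, here is a rough Python counterpart built on re.sub. It is a sketch only: the function name `substitute` and its `first_only` flag are assumptions, and Python's regular expression syntax is not identical to the one sed uses by default.

```python
import re


def substitute(pattern, replacement, text, first_only=False):
    """Replace matches of pattern in text.

    first_only=False behaves like sed's s/pattern/replacement/g (every match);
    first_only=True behaves like s/pattern/replacement/ (first match only).
    """
    count = 1 if first_only else 0  # count=0 means "replace all" in re.sub
    return re.sub(pattern, replacement, text, count=count)


print(substitute("a", "b", "banana"))                   # bbnbnb
print(substitute("a", "b", "banana", first_only=True))  # bbnana
```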

python edition

Recommended for people who think, "I know Python, and I know how to install it, but how should I manage versions on Linux...?"

version management

pyenv

It installs Python versions on a per-user basis. For details, see the article "[Permanent preservation version] Put pyenv + venv in ubuntu [Don't hesitate anymore]".

After installing pyenv, install the Python version you want with pyenv install python-version. Basically, I recommend picking the simplest plain version rather than a bundle such as anaconda, for example pyenv install 3.6.9. That way, Python goes into your personal folder and does not pollute other users' environments.

venv

venv is Python's tool for creating isolated environments in which you manage packages. A good approach is to select the base Python with pyenv, then create an environment with venv and use pip inside it. For details, see the article "[Permanent preservation version] Put pyenv + venv in ubuntu [Don't hesitate anymore]".

IDE edition

vscode

vscode has an SSH feature that automatically reads your public key settings, and it feels good to edit files on a remote server over SSH with your private key. On the other hand, it mysteriously struggles to pick up the packages of each venv, so my impression is that it is not very well suited to coding. (If you know how to get code completion while coding with venv + vscode, I would love to hear about it!)

Jupyter Notebook

Jupyter Notebook is an IDE that runs in a web browser. You basically start it on a remote server, and writing notebooks this way is convenient when you want the remote server to do only the computation.

Google Colaboratory

Colaboratory is an IDE environment that Google prepares for you. Its selling point is that you do not have to do anything like building an environment. You can write Python using Google's resources. Thankfully, GPU and TPU resources are available too. The basics are the same as Jupyter Notebook. The only difference is that Google manages the resources.

Convenient library

tqdm

It displays a progress bar. Its strength is that you can see at a glance how far deep learning or other heavy processing has progressed. Data science routinely processes enormous files, and jobs take anywhere from a few minutes to (sometimes) 30 hours. With a progress bar, you can go play games on your Switch or do anything else in the meantime. Essential for machine learning.

Please refer to the official documentation for usage. It also has features such as multiprocess support and numerical monitoring.

pandas

It is a tool that handles tsv and parquet files like tables. It is almost indispensable for data science. If you do Kaggle, you will learn to use it whether you like it or not.

matplotlib

It draws graphs. If you do Kaggle, you will run into it whether you like it or not. Alternatives include seaborn and plotly.

Pickle

It saves almost any Python object.

Sometimes you want to save the state partway through, save a model made with Keras, or save an XGBoost model that took a long time to train. In such cases, pickle saves the whole object. A pickled object keeps its methods as well, so you can load it back later and use it for prediction right away.
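A minimal sketch of the save-and-restore round trip looks like this. The dictionary stands in for a trained model, and the helper names `save_object` and `load_object` as well as the file name are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as f:  # pickle files must be opened in binary mode
        pickle.dump(obj, f)


def load_object(path):
    """Load a previously pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A stand-in for a trained model: any ordinary Python object works the same way.
model_state = {"weights": [0.1, 0.2, 0.3], "epoch": 12}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_state.pkl"
    save_object(model_state, path)
    restored = load_object(path)

print(restored == model_state)  # True
```

Loading a pickle can execute arbitrary code, so only load files that come from a source you trust.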

Other good concepts to remember

regex

It means regular expression. If you want to find and extract phone numbers from a large amount of text, a regular expression handles it nicely. For a phone number, the pattern

\d{3,4}[-]?\d{3,4}[-]?\d{4}

will pick it up. (If you are not used to it, it looks like a cryptic string.)
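Here is a small sketch of using that pattern from Python's re module. The sample sentence and its phone numbers are invented for the example, and `find_phone_numbers` is a hypothetical helper name.

```python
import re

# Pattern from the text: 3-4 digits, optional hyphen, 3-4 digits, optional hyphen, 4 digits.
PHONE_PATTERN = re.compile(r"\d{3,4}[-]?\d{3,4}[-]?\d{4}")


def find_phone_numbers(text):
    """Return every substring of text that matches PHONE_PATTERN."""
    return PHONE_PATTERN.findall(text)


# Made-up sample text; the numbers are illustrative only.
sample = "Call 090-1234-5678 or 0312345678 before Friday."
print(find_phone_numbers(sample))  # ['090-1234-5678', '0312345678']
```

The pattern is deliberately loose, so it also matches other digit runs of the right length, such as part of a long ID number. Tighten it if you need stricter validation.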

Docker

It is not a so-called virtual machine, but it isolates middleware such as MySQL so that it does not pollute your environment. Without knowledge of Docker, you end up installing MySQL directly on the host machine and running into annoying **"ah, it failed"** situations.

Docker splits services such as MySQL and nginx into units called containers. You can throw containers away and create new ones as many times as you like, which makes building environments very easy.

For details, searching for Docker turns up a huge amount of information, so I suggest referring to that.

Local Forwarding / Port Forwarding

If you want a convenient setup where a huge desktop PC with a strong GPU does only the computation while you code and give instructions from your Mac, I recommend port forwarding + Jupyter Notebook.

I wrote an article about port forwarding before, so I will point you to it: "Summary of access method to Jupyter Notebook (Lab) on remote server that any data scientist can pass".

Reverse lookup dictionary by case

case1. I want to use a MacBook, but I also want a GPU. What should I do?

answer

Do your everyday work on the MacBook, and prepare a separate strong PC.

Any strong PC will do. Buying a gaming GPU machine is fine, but in most cases using GCP or AWS is cheaper or easier.

Whatever you choose, the basic things to do are: **configure sshd on the GPU server, set up port forwarding, and display Jupyter on localhost. Then you are ok.**

case2. I want to put a strong Ubuntu PC at home and access it over SSH. I only need it inside the house.

answer

Install Ubuntu on the strong PC, and use the router's DHCP settings to give that PC a fixed address.

Routers usually identify machines by MAC address, so add a rule such as **"DHCP must always assign 172.168.1.22 to this MAC address"**. After that, configure SSH, send over your public key, and SSH to 172.168.1.22.

case3. How can I SSH into my home server from outside, over the Internet?

answer

It is better to do this only after you have gained some knowledge. There are many ways to do it.

First, acquire security knowledge by googling around sshd_config. Then, on the router, configure which external port is forwarded to which internal port. It depends on your environment and provider, but your external IP address usually changes, so use a technique called **DDNS**. With it, you can reach the server from outside through a fixed domain name.

case4. I don't have the money to prepare a strong PC, but can I do machine learning?

Use Colaboratory! It even has a GPU!

Basically, Colab is a good choice. It is free. However, if you use it too much, your session gets cut off or becomes very slow. When that happens, use GCP. It does not cost much, so you will be fine.

case5. I want to build a GPU server shared by everyone. How should I set up the environment?

Let's use pyenv + venv! That is all you need.

Since pyenv can be installed without sudo rights, permission management is easy. It does depend on packages installed through apt, but the basic ones are usually enough. (A build runs at install time, so it will not work without them.) It is convenient because you do not have to hand out sudo rights.

case6. I want to manage notebooks with something like GitHub. What should I do?

If you use Colab's built-in feature, you can commit + push, and the diffs are easy to read.

You can do it by following this article: "Did you know that Colaboratory can push to GitHub and show diffs right on the screen?"

case7. Is there a good way to handle model version management and the other tedious parts of research?

GCP and AWS have one.

AI Platform on GCP, SageMaker on AWS!

case8. How can I collect learning data?

Scraping, Kaggle Datasets, papers, the GCP / AWS annotation services, and so on.

It depends. For research, there are usually original papers in which people compete for scores on a single dataset, so you can refer to those. For independent projects, searching for **crawling / scraping** will show you how to do it. If you really want to create a new dataset, the cloud providers also offer annotation features. Why not use them?

In conclusion

If you have a "what should I do in this situation?" question, I will add answers to the cases here as far as I can. Please do not hesitate to ask. Thank you for reading this far.
