[PYTHON] The struggles of a machine learning amateur who changed jobs to a non-IT company and was assigned to an AI project

Introduction

This article is the Day 8 entry in the "How did you learn machine learning? by Nikkei xTECH Business AI ②" Advent Calendar 2019. I hope it will be helpful for beginners who are about to enter the market.

Programming experience

I studied C/C++ and MATLAB in university classes, and learned Python on SoloLearn. For JavaScript and HTML, I went through the syntax on Progate, the standard choice, and then read the language references.

How I started learning machine learning

When I was asked about my programming experience at my job interview, I mentioned that I had used Python a little. The response was "Honna, eh aiyana!" (verbatim). After that reaction, I joined the company with an uneasy feeling, and was then assigned to a new project for business development using AI. I had been swept along by the dubious equation "Python = AI". Although I had written Python, I was an amateur at machine learning, so I took this as my opportunity to start studying.

[Edited on December 09, 2019] I had included a reference here based on a vague memory of seeing an article by Mr. Mas * door * rise, but my memory was completely wrong. I apologize.

Learning method

I will list them in the order in which I started.

I am not good at gaining knowledge from books (especially when learning something new), so I decided to attend seminars and study sessions first. Also, since the assigned project was regression prediction on time-series data, I narrowed my focus to time-series data rather than image processing.

Seminars I attended

***I am not trying to flatter Nikkei at all***, but at the time the notion that "AI = image processing" dominated, and seminars centered on time-series data were really valuable. Really, almost everything else was MNIST and semantic segmentation ... (I think those are good themes for beginners; my situation was simply different because my learning was entirely task-driven.)

The seminar started from the basics of machine learning and deep learning (for example, that regression and classification are different things). I learned there that CNN is the basic form for image models, that RNN is the basic form for recurrent, sequence-oriented models, and that newer families such as generative models are emerging. Working through the calculations by hand was a good way to understand what actually happens inside the processing. ~~I was surprised that many of the participants could not do matrix calculations~~

The time-series material, which was my original reason for attending, was mainly about anomaly detection for machine tools. It did not exactly match my own task, but the instructor took some time after the lecture and gave me a lot of valuable advice.

Another thing I noticed is that people do not ask many questions. I do not know whether the other participants already knew everything or understood too little to ask, but it meant I could use most of the question time myself, which felt good. Since you pay a lot of money to participate, whoever takes home the knowledge wins.

Study sessions I attended

I attended these because I had no knowledge of how to handle time-series data (trend, seasonality, and so on) in the first place. I use connpass and TECH PLAY to search for study sessions. Besides theoretical learning, I also went to listen to talks at the Edge Deep Learning Summit 2019 hosted by LeapMind.

I have not attended them yet, but there are many interesting groups such as the Data Preprocessing Study Group and the Mokumoku Group, so I would like to attend more sessions from now on.

Books I bought / want

Reading Kaggle Kernels

I knew there were competitions, but while I was studying I was not thinking of entering one myself, so I focused on **reading Kernels**. I picked Kernels from competitions whose theme involved time series or regression prediction, and looked at how other people wrangle their data and what process they follow when facing a task. This is more basic than algorithms: I was trying to find out what steps to take when facing data (where to look, what to do). What I felt here is that, even after attending study sessions and reading books, I still did not know much about things like **interpreting the statistics of the variables I have**. **I felt that I lacked the foundation that comes before machine learning or deep learning.** As I mention in the competition section below, I still feel that way.

Taking on a competition on SIGNATE

For the competition itself, I chose the Japanese platform SIGNATE instead of Kaggle. It was quite recent: Takeda's "AI Drug Discovery: Pharmacokinetic Parameter Prediction". While there are many image classification and image generation competitions, the drug discovery and land competitions were valuable because they were regression problems.

The memorable first submission of the first competition of my life got off to a humbling start: it could not be scored because of a format error in the CSV file ... lol. After that, I fixed only the format and resubmitted, and my score went nowhere. In the end, my ranking settled as shown in the figure below.

[Figure: screenshot of the final ranking, スクリーンショット 2019-12-04 22.12.45.png]

The result is not very good, but I think I gained a lot from it. What I strongly felt in this competition was that I had few data preprocessing techniques in my repertoire. I implemented the essentials, such as one-hot encoding of categorical variables, but I was particularly weak at feature generation: ideas like "There is some relationship between these variables!" or "Could the score improve if I converted this into such a feature?" did not come to me. I made four submissions, but I got stuck at **"What should I do from here?"**. As mentioned in the section on books I want to buy, I think "Kaggle's winning data analysis technology" is exactly what I need.
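As a rough illustration of the two steps mentioned here, the following is a minimal sketch of one-hot encoding plus one hand-made feature using pandas. The data and column names (`route`, `dose_mg`, `weight_kg`) are hypothetical and have nothing to do with the actual competition data; the ratio feature is just one example of combining two variables.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "route": ["oral", "iv", "oral", "iv"],
    "dose_mg": [10.0, 5.0, 20.0, 10.0],
    "weight_kg": [50.0, 50.0, 80.0, 40.0],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["route"], dtype=int)

# A simple generated feature: the ratio of two existing variables.
encoded["dose_per_kg"] = encoded["dose_mg"] / encoded["weight_kg"]

print(encoded)
```

Whether a ratio like this helps depends entirely on the task; the point is only that new columns derived from existing ones can be tried cheaply and checked against the validation score.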

Also, as with programming itself, I think it is **best to study with clear guidelines**. If, unlike me, you do not have an assigned project as a starting point and are about to begin studying, I think a **competition is a very meaningful environment** because it comes with guidelines (tasks). If you work on it seriously, you will inevitably start collecting knowledge about the task, so I recommend it as a starting point for expanding your knowledge.

Present and future

As for the regression project I was assigned after joining the company, I have written it up in an article, with the variables made dimensionless and anonymized and the timestamps reassigned. Please read it if you like. I have tried two types of models, a two-layer LSTM + fully connected model and XGBoost, but I am struggling with the problem that the model's output lags behind the input ([LSTM for time series prediction - keras issues](https://github.com/keras-team/keras/issues/2856)) ... I would be very grateful for advice from anyone with relevant knowledge.
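One way to check whether a model's predictions are merely a delayed copy of the series is to compare them with the true values at several shifts. The sketch below uses NumPy only; the helper names `rmse` and `best_lag` are made up for this illustration, and the data is a synthetic sine wave with a persistence forecast (repeat the previous observation) standing in for a model. It is not the author's data or model.

```python
import numpy as np


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


def best_lag(y_true, y_pred, max_lag=5):
    """Return (k, errors): the shift k in 0..max_lag at which y_pred[t]
    best matches y_true[t - k], plus the RMSE for every shift tried."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    errors = [rmse(y_true[: n - k], y_pred[k:]) for k in range(max_lag + 1)]
    return int(np.argmin(errors)), errors


# Synthetic stand-in data: a sine wave, and a "model" that simply
# repeats the previous observation (persistence forecast).
t = np.arange(200)
y_true = np.sin(2 * np.pi * t / 50)
y_persist = np.concatenate([y_true[:1], y_true[:-1]])

k, errors = best_lag(y_true, y_persist)
print("best-matching lag:", k)
print("RMSE at lag 0:", errors[0])
```

If a trained model's predictions also match best at a lag greater than zero, and its lag-0 error is no better than the persistence forecast, that may indicate the model has mostly learned to echo its most recent input rather than to predict.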

Also, as mentioned above, know-how in data preprocessing and feature generation is very important, and I do not have enough of it at present. So it is very helpful that well-known Kagglers publish blogs and books and are active in Kernels, sharing what they have learned from experience.

In addition, what I have studied is specialized in regression, because my work was a regression problem. As a result, I know almost nothing about image-related methods or preprocessing. My understanding is at the level of "Wouldn't a CNN do the job?" I do not currently have an image-related project (anomaly detection or classification), but I think it could be useful in my future work and I would like to learn more. I will also take on the image-themed competitions that I have skipped until now.

First of all, the Kaggle book.

To those who are about to start studying

**Since this is Qiita, I will also talk about something other than a "poem": recommended programming environments**

My Python environment is set up on-premises with ~~Anaconda~~ vanilla Python. After breaking my environment, I joined the vanilla camp. I have updated last year's Advent Calendar article about this, but for someone who manages their environment carefully, Anaconda is the better choice.

Jupyter Notebook/Lab

**Jupyter Notebook is convenient.** I do not think I will ever be able to leave it. I really recommend running programs in Jupyter Notebook to anyone starting to study machine learning (or starting from Python itself). As was the case for me, if you are new to Python, you often cannot tell whether the processing you just implemented performed the transformation you wanted (especially anything involving `axis` in NumPy). In that respect, because Jupyter is divided into cells, you can run the PDCA cycle for preprocessing implementation and data shaping faster (you can do this in a REPL too, but Notebook has a lower barrier; this is a purely qualitative impression). In addition, the extended **JupyterLab** has been announced, and development is expected to shift to it in the future (Reference). Please try it as well.
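As an example of the kind of quick check that a notebook cell makes easy, here is a minimal sketch of how `axis` behaves in NumPy. The array is arbitrary toy data.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3): 2 rows, 3 columns

# axis=0 collapses the rows, leaving one value per column.
col_sums = x.sum(axis=0)

# axis=1 collapses the columns, leaving one value per row.
row_sums = x.sum(axis=1)

print(x.shape, col_sums.shape, row_sums.shape)
print(col_sums)
print(row_sums)
```

A useful rule of thumb is that the axis you pass is the one that disappears from the shape, so printing `.shape` before and after each step in its own cell catches most mistakes.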

Install Jupyter Notebook

py -m pip install jupyter

Install Jupyter Lab

py -m pip install jupyterlab

Google Colaboratory

Google Colaboratory is recommended for people who cannot secure computing resources in the first place. It is essentially a Jupyter Notebook environment running on Google's servers. As explained in "Google Colaboratory Overview and Usage Procedure (TensorFlow and GPU can be used)", anyone with a Google account and an internet connection can use it for free. Recently, it has become possible to select a TPU in addition to a GPU. Since the libraries required for machine learning and deep learning are already installed, once you [link it with Google Drive](https://qiita.com/shoji9x9/items/0ff0f6f603df18d631ab#google-drive%E3%82%92%E3%83%9E%E3%82%A6%E3%83%B3%E3%83%88%E3%81%99%E3%82%8B%E6%96%B9%E6%B3%95) you can start analyzing and predicting from your data right away. As long as you pay attention to the [time limits](https://qiita.com/shoji9x9/items/0ff0f6f603df18d631ab#90%E5%88%86%E3%83%AB%E3%83%BC%E3%83%AB%E3%81%A812%E6%99%82%E9%96%93%E3%83%AB%E3%83%BC%E3%83%AB), you get a comfortable computing environment.

In closing

Thank you to everyone who has read this far. I apologize for the length.

People say that AI is entering a period of disillusionment, but I think the demand for engineers is increasing. Also, we human beings **steadily lose market value as we age**, so in that sense too I think it is good to increase what we can do.

I hope that this article will serve as a reference for beginners who are about to enter the market. (This is the second time I have said it; the first was at the beginning.)

References / links
