[PYTHON] How the Information Systems Department (beginners) can start data science

Chapter 1: Introduction

Purpose of the article

Big words such as "Digital Transformation (DX)", "data-driven management", and "AI utilization" are flying around these days. Many people in Information Systems Departments are in a bind because executives, having seen a competitor's somewhat hyped press release, have told them to work on the same thing. As a consultant, I sometimes talk to such people about the topic in the title, so I have summarized it briefly here. Of course, you can spend money from the start and place an order with a vendor or hire a data scientist. Personally, though, I recommend that Information Systems staff first get some hands-on familiarity with data science, try it themselves, and only then decide on a direction such as outsourcing.

I intend to write plainly, but if you come across words or terms you do not know, please look them up as you read.

Target case

This article targets organizations that do no data science at all. (It excludes organizations that have already built models, in-house or through outsourcing, or that use AutoML tools such as DataRobot.) Picture an organization that, after receiving various recommendations, has introduced a BI tool and gotten as far as visualization, or one that bought a tool, tried it, and could not make good use of it. The target data is structured data; unstructured data such as documents and photographs is not covered.

Is this kind of activity meaningful?

Some of you may be wondering this. From what I have seen, every company has a fair amount of data, even if it is not clean. Collecting data is essential for data science, and fortunately, handling in-house system data is the Information Systems Department's specialty, so in many cases this part goes smoothly. Once you have the data, you can often set up a valuable use case and get results. Even if the results are not good, the effort is worthwhile because it deepens your understanding of data science.

Where Japanese companies currently stand on data utilization

What about other companies? Many of you probably wonder about that. My main clients are mostly in manufacturing and distribution, and my gut feeling is that companies above the 300 billion scale have started working on it. In the 100 to 300 billion range it depends on the company, and below that it is often untouched. By industry, distribution is further along, and my impression is that manufacturing is lagging behind. Overall, there is also a big gap in enthusiasm between companies that are putting serious effort into it and companies that are doing nothing at all. The important thing is to try it first!

Chapter 2: Required Knowledge

First, you need to acquire the necessary knowledge, which falls roughly into four categories: "data science overview and use cases," "domain knowledge," "IT knowledge," and "statistical knowledge."

Data science overview and use cases

First, get a feel for the big picture. "AI will somehow solve it!" gets you nowhere. You need to understand what can actually be done. Many things are possible, but weighing "ease of understanding (approachability)" and "usefulness", start by understanding the outline and use cases of "classification" and "regression". Catch up on this knowledge and these use cases online. Analyzing data that changes over time raises the difficulty, so it can be left for later.

Domain knowledge

This is the business knowledge specific to your theme and industry. Since it concerns your own company, I don't think you need to study it specially. (Of course, on-site interviews will probably be needed later for deeper analysis.)

IT knowledge

IT knowledge divides roughly into the "hardware side" and the "software side". The hardware side is the knowledge needed to prepare an environment: you can build a local environment on your own PC, set up a server in the cloud, or run on a SaaS service. (For most readers, I don't think any catch-up is needed here.)

On the software side, the basis is Python. (R is fine if you prefer it.) Knowledge of SQL is also required for data collection and processing. Code is basically written in Jupyter Notebook, but recently tools that let you build things visually in a UI (SageMaker Studio, Watson Studio, etc.) have become available for free or at low cost, so those who are allergic to code can use one of them instead.

Specifically, I think a good way to get a concrete picture is to work through the examples in the first half of "Python Practical Data Analysis 100 Knock" and "[Kaggle Start Book Starting with Python](https://www.amazon.co.jp/dp/4065190061/ref=cm_sw_em_r_mt_dp_U_PWniFb0KVRHC6)". Dealing with large amounts of data also requires knowledge such as distributed processing, but start with small data of tens of thousands to hundreds of thousands of records.

Statistical knowledge

This is the highest hurdle, and if you start from here you will get discouraged. AutoML is now commonplace, and with tools such as "Amazon SageMaker Autopilot" and "[IBM Watson Studio AutoAI](https://www.ibm.com/jp-ja/cloud/watson-studio/autoai)" you can try data science without knowledge of statistics, so this article assumes you will use one of them. They have free tiers, so get something running first, referring to Qiita articles and the like. While working through the books above and AutoML, I think it is good to build up your knowledge by looking up terms you do not understand.

Summary

Before thinking about your own company's case, study the knowledge above until it really sinks in. I think about 30 hours is enough.

It will also help your career, so study hard!

Practice

You should now have some knowledge, so next comes practice. The general work steps for data science are use case definition, data preparation and cleansing, modeling, and evaluation.

Use case definition

First, take a quick look at what data your company has. Then look for use cases out in the world and recall the problems you have heard about when talking with your business departments. From these, form a use case hypothesis along the lines of "maybe we could do this." You may get stuck here, but there is bound to be something. Do your best and think it through.

Data preparation, cleansing

Data preparation basically means collecting data from various places into a single table. As you will understand once you have studied the material in Chapter 2, you prepare what you want to predict (the objective variable) and what is likely to influence it (the explanatory variables). As an example, suppose you want to send promotional emails for an e-commerce site cost-effectively. The base data is the email sending history. The objective variable is "a flag indicating whether a purchase was made on the EC site within one month, or the number of purchases". Explanatory variables might include the recipient's age, gender, and cumulative past purchase amount, so imagine candidates like these and collect what you can.
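
As a rough illustration of the email example above, the following sketch uses Python's built-in sqlite3 module to join three small tables into one row per sent email. The table names, column names, and sample values are all made up for illustration; your own systems will differ, and the one-month window is just one way to define the objective variable.

```python
import sqlite3

# Hypothetical tables and sample data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, age INTEGER, gender TEXT);
CREATE TABLE email_history (customer_id INTEGER, sent_date TEXT);
CREATE TABLE purchases (customer_id INTEGER, purchase_date TEXT, amount INTEGER);

INSERT INTO customers VALUES (1, 34, 'F'), (2, 51, 'M'), (3, 27, 'F');
INSERT INTO email_history VALUES
    (1, '2020-06-01'), (2, '2020-06-01'), (3, '2020-06-01');
INSERT INTO purchases VALUES
    (1, '2020-03-10', 5000),
    (1, '2020-06-15', 3000),
    (2, '2020-01-20', 12000),
    (3, '2020-08-05', 2000);
""")

QUERY = """
SELECT
    e.customer_id,
    c.age,
    c.gender,
    -- explanatory variable: cumulative purchase amount before the email
    COALESCE(SUM(CASE WHEN p.purchase_date < e.sent_date
                      THEN p.amount END), 0) AS past_amount,
    -- objective variable: purchases within one month after the email
    COUNT(CASE WHEN p.purchase_date >= e.sent_date
                AND p.purchase_date < date(e.sent_date, '+1 month')
               THEN 1 END) AS purchase_count
FROM email_history AS e
JOIN customers AS c ON c.customer_id = e.customer_id
LEFT JOIN purchases AS p ON p.customer_id = e.customer_id
GROUP BY e.customer_id, e.sent_date, c.age, c.gender
ORDER BY e.customer_id
"""

rows = conn.execute(QUERY).fetchall()

# Append a 0/1 purchase flag derived from the purchase count.
training_rows = [row + (1 if row[-1] > 0 else 0,) for row in rows]

for row in training_rows:
    print(row)
```

Each resulting row holds the explanatory variables (age, gender, past purchase amount) followed by the objective variables (purchase count and purchase flag), which is the single-table shape described above.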

As long as you have defined the use case, I don't think this work is a very high hurdle.

Modeling

This is the highest hurdle. Normally it requires specialized knowledge such as feature design, model selection and ensembling, and hyperparameter tuning. This time, however, we assume AutoML, so all you have to do is feed in the data and wait. Note: since these are basically cloud services, sanitize personal information and confidential information before uploading! In the example above, this step creates a model that predicts whether a recipient will buy, based on the recipient's attributes.
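
One possible way to sanitize a table before uploading is sketched below using only the Python standard library. The column names, the choice of which columns to drop or hash, and the key value are assumptions for illustration; what counts as personal or confidential information should follow your own company's rules, and a keyed hash is pseudonymization rather than full anonymization.

```python
import hashlib
import hmac

# Hypothetical column names; adjust to your own table.
DROP_COLUMNS = {"name", "email"}   # direct identifiers: remove entirely
HASH_COLUMNS = {"customer_id"}     # keys: replace with a keyed hash
SECRET_KEY = b"replace-with-a-secret-kept-in-house"  # never upload this


def pseudonymize(value, key=SECRET_KEY):
    """Return a shortened HMAC-SHA256 digest in place of the raw value."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def sanitize(rows):
    """Drop identifying columns and hash key columns in a list of dict rows."""
    cleaned = []
    for row in rows:
        out = {}
        for column, value in row.items():
            if column in DROP_COLUMNS:
                continue
            if column in HASH_COLUMNS:
                value = pseudonymize(value)
            out[column] = value
        cleaned.append(out)
    return cleaned


# Made-up sample row for illustration.
sample_rows = [
    {"customer_id": 1, "name": "Taro", "email": "taro@example.com",
     "age": 34, "purchased": 1},
]
cleaned_rows = sanitize(sample_rows)
print(cleaned_rows)
```

The same input always produces the same hashed ID, so rows can still be matched back in-house by anyone who holds the key, while the uploaded file no longer contains names or email addresses.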

Evaluation

This requires some knowledge of statistics. Look up the jargon so that you understand the results AutoML returns. If the results look good, try predicting the future with real data. If that also looks good, share it with your boss or business department. Few people will respond negatively, and I think this is the first step toward getting data science to take root.
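
Much of the jargon in a report for a classification model comes down to a few counts. The sketch below computes accuracy, precision, recall, and F1 by hand for a purchase flag. The sample labels are invented, and real AutoML tools may report additional or differently named metrics.

```python
def classification_metrics(actual, predicted):
    """Compute basic metrics for binary labels (1 = purchased, 0 = not)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: did each recipient actually buy, and what was predicted?
actual = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

In the email example, precision is the share of recipients predicted to buy who actually bought, and recall is the share of actual buyers that the model found. Which one matters more depends on the cost of sending an email versus the cost of missing a buyer.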

Summary

The content has become a little abstract, but I hope you now have a rough picture of what to do. From now on, IT will decide the fate of companies, and the position of the Information Systems Department is changing. On the other hand, I see many organizations that cannot break out of being a system maintenance unit and have a gap with the direction society is moving. (As a consultant, I feel this as a challenge too.) I hope someone reads this article and takes action. If you have any questions, I will be happy to answer them.

Finally

I started posting on Qiita to organize my own knowledge of machine learning code, but this time I tried summarizing what I talk about in DX consulting. If it is well received, I will keep writing, so if you found it useful, please give it an LGTM or follow me. Next, I plan to dig a little deeper into studying use cases. My summary article is ↓ Summary of knowledge required for implementing machine learning in Python
