[PYTHON] Sorting out the difference between "statistics" and "machine learning" shows why many operating companies cannot use "machine learning"!

What is the difference between statistics and machine learning, after all? And why can't I start making money today from machine learning predictions?

This is probably something everyone wonders about when they start studying machine learning. A second question also arises: why can't many operating companies put it to use in their daily work today? There are plenty of documents on this, but I had a hard time understanding them, so I reorganized the material in my own way. I combined information from several sources and added a lot of my own opinion. This article explains why many companies cannot start using machine learning approaches to drive their business today.

First, I summarized the differences in thinking and orientation between statistics and machine learning in a table.

[Figure: 統計か機械学習かVer2.png (table comparing statistics and machine learning, Ver. 2)]

The two are related, as many people have pointed out, but their end goals differ. **"Machine learning" makes predictions and judgments, but why it reached them is generally a black box. In "statistics", what matters is justifying why a prediction or judgment was reached (it is a discipline for establishing reasons), so the reasoning is a white box.**

Why are so many companies "not able to use a machine learning approach"?

**"Statistics", whose purpose is to sort out factors, suits social science problems, while "machine learning" suits natural science prediction and automated processing by robots.** That is what I concluded from the table, and I think reality bears it out. Suppose the theme is to identify the factors that move sales and then devise measures against those factors to raise sales. That is a "statistics" problem (a social science problem) of sorting out degrees of influence, not a "machine learning" problem. (Incidentally, some BI tools offer impact analysis as an AI feature, but you have to specify the items that might have an influence, and the underlying calculation belongs to "statistics". Recommended-product displays on EC sites often just show combinations that other people bought, which is closer to a simple query than to a machine learning prediction.)

On the other hand, typhoon track prediction, earthquake prediction (though I think it is still impossible), and robot-like processing such as image recognition and voice recognition are machine learning approaches. Take typhoon track and intensity prediction: temperature, seawater temperature, the jet stream, and so on may all have an influence, but as long as the predicted track and intensity are right, nobody asks whether the model is statistically elegant. (Temperature and seawater temperature move together and cause multicollinearity, so from a statistical modelling point of view it is preferable not to use both at once.) If the results are right, the reasons do not matter to the people concerned and are of no interest to them.

**Many operating companies are interested in "social science" areas such as sales, marketing, finance, and human resources, and often not in "natural science" or "robots". (Marketing automation is about automation, so that area could be called robot-like, but it is limited.)**
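The multicollinearity remark in the typhoon example can be made concrete with a small sketch. Everything here is an assumption made for illustration: the data is synthetic, the variable names (`air_temp`, `sea_temp`, `jet_speed`) are hypothetical, and the variance inflation factor (VIF) is just one common diagnostic, not something the table above prescribes. With exactly two predictors, VIF reduces to 1 / (1 - r^2), where r is their correlation.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def vif_pair(xs, ys):
    """VIF for a model with exactly these two predictors: 1 / (1 - r^2)."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)


# Synthetic data (assumption): sea temperature closely follows air temperature,
# while jet stream speed is generated independently of both.
rng = random.Random(0)
air_temp = [rng.uniform(20.0, 35.0) for _ in range(200)]
sea_temp = [0.8 * t + 2.0 + rng.gauss(0.0, 0.5) for t in air_temp]
jet_speed = [rng.uniform(0.0, 100.0) for _ in range(200)]

vif_air_sea = vif_pair(air_temp, sea_temp)
vif_air_jet = vif_pair(air_temp, jet_speed)

print(f"VIF, air vs sea temperature: {vif_air_sea:.1f}")
print(f"VIF, air temperature vs jet stream: {vif_air_jet:.2f}")
```

A frequently quoted rule of thumb treats a VIF above roughly 10 as a sign that the coefficients of the two variables cannot be interpreted separately. That matters when the goal is to explain factors ("statistics"), and much less when only the accuracy of the prediction counts ("machine learning").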

Furthermore, why are so many operating companies "not able to use statistics or machine learning"?

● **There is no data.** That is all there is to it. An ordinary operating company does not have clean datasets like those in Kaggle competitions.

**(1) There is no attribute data for what should be the main factors.** For example, companies do not hold past, current, and future attribute information about the customers who buy their goods or services. Attributes change over time: there are past attributes at a past point and current attributes now. Take a credit card company: automating credit screening with the latest customer information is relatively easy, but predicting what will happen 10 years later is almost impossible. Educational background, family structure, and annual income change, yet the latest attributes are not always maintained correctly. As a result, even credit card companies, which hold large volumes of personal information with many attributes, know little about who is actually using their cards (what attributes those people have). Information that does not need updating once obtained, such as gender and age, is always current, but it is far too limited to describe a customer. In that state it is impossible to connect past, present, and future consumption trends.
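What "past attributes at a past point" would require in practice can be sketched as follows. This is only an illustration under assumptions: the customer ID, the attribute names (`income_band`, `household`), and the dates are all made up, and the idea is simply to keep a dated history of attributes instead of overwriting the latest value.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical attribute history: for each customer, a list of
# (valid_from, attributes) records sorted by valid_from.
attribute_history = {
    "C001": [
        (date(2015, 4, 1), {"income_band": "low", "household": "single"}),
        (date(2019, 6, 1), {"income_band": "middle", "household": "couple"}),
    ],
}


def attributes_as_of(history, customer_id, when):
    """Return the attributes that were valid on `when`, or None if unknown."""
    records = history.get(customer_id, [])
    valid_from_dates = [valid_from for valid_from, _ in records]
    # Index of the first record that starts after `when`.
    position = bisect_right(valid_from_dates, when)
    if position == 0:
        return None  # no record covers this date
    return records[position - 1][1]


# A purchase in 2017 is matched with the 2015 attributes,
# a purchase in 2021 with the 2019 attributes.
purchase_2017 = attributes_as_of(attribute_history, "C001", date(2017, 1, 15))
purchase_2021 = attributes_as_of(attribute_history, "C001", date(2021, 3, 1))
print(purchase_2017, purchase_2021)
```

If only the latest record is stored, the 2017 purchase would be analyzed with 2019 attributes, and that is exactly the break between past and present described above.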

**(2) Transaction data and master data are not linked.** Master data (product numbers and so on) changes, so the past, present, and future do not connect.
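A minimal sketch of what linking them would involve, assuming a record of code changes exists at all (in many companies it does not). The product codes, the change table, and the sales figures below are hypothetical.

```python
from collections import defaultdict

# Hypothetical record of product code changes: old code -> newer code.
code_changes = {"A-100": "A-100v2", "A-100v2": "P-0001"}

# Hypothetical current product master.
current_master = {"P-0001": "Green tea 500ml", "P-0002": "Black tea 500ml"}

# Hypothetical transactions: (period, product code at the time, quantity).
transactions = [
    ("2018-03", "A-100", 120),
    ("2020-07", "A-100v2", 80),
    ("2023-01", "P-0001", 150),
    ("2023-01", "P-0002", 60),
    ("2017-11", "Z-999", 40),  # code with no recorded successor
]


def resolve_code(code, changes):
    """Follow the chain of code changes to the most recent code."""
    seen = set()
    while code in changes and code not in seen:  # `seen` guards against cycles
        seen.add(code)
        code = changes[code]
    return code


def sales_by_current_code(rows, changes, master):
    """Total quantities per current code; return rows that cannot be linked."""
    totals = defaultdict(int)
    unlinked = []
    for period, code, quantity in rows:
        current = resolve_code(code, changes)
        if current in master:
            totals[current] += quantity
        else:
            unlinked.append((period, code, quantity))
    return dict(totals), unlinked


totals, unlinked = sales_by_current_code(transactions, code_changes, current_master)
print(totals)
print(unlinked)
```

Rows that end up in `unlinked` are the part of the history that cannot be connected to the present. Without the change table, every renumbered product falls into that bucket.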

**(3) Even for in-house information, strategies and measures cannot all be converted into data for statistics or machine learning.** It is obvious that corporate strategies and measures (advertisements, campaigns, and so on) affect purchases and sales, but turning them into data for statistics or machine learning is almost impossible. Even if you can analyze individually whether one specific measure worked, you cannot do it for the company as a whole. Yet that is what business owners want.

Conclusion

● For people and companies with no interest in natural science, robot-like processing, or process automation, there seems to be little advantage in getting into "machine learning" (unless the company decides to steer in that direction).
● Without appropriate data to analyze, both "statistics" and "machine learning" are meaningless, and waving the flag for them is wasted effort.
● If the data that analysis presupposes has not been prepared, you have to start from acquiring and maintaining data.
● The future of data scientists who do not understand these basics and essentials is in jeopardy! Analyses that deliver no results ...
