"Apache Flink" new machine learning interface and Flink-Python module

In this blog, we'll take a closer look at Apache Flink 1.9.0, including the new machine learning interface and the flink-python module.

Most popular programming language

The image below is the RedMonk programming language ranking.

image.png

The top 10 rankings in the image above are based on popularity on GitHub and Stack Overflow.

Python is in third place. Other popular programming languages, R and Go, are ranked 15th and 16th, respectively. This high ranking is a good testament to Python's vast user base, and adding Python support to a project is an effective way to grow its audience.

Internet industry trend areas

Big data computing is one of the hottest areas in the Internet industry today. The era of single-machine computing is over, because the processing power of a single machine lags far behind data growth. Let's see why big data computing is one of the most important areas in the Internet industry.

Increasing data in the big data era

Due to the rapid development of technologies such as cloud computing, IoT, and artificial intelligence, the amount of data is increasing significantly. The image below shows that the total amount of data in the world is projected to increase from 16.1 ZB to 163 ZB in just 10 years, an increase so large that standalone servers can no longer meet the data storage and processing requirements.

image.png

The figure above measures data in zettabytes (ZB). Here, I would like to briefly list the units of data in ascending order: Bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, DB. A Byte is 8 bits, and under the binary convention each subsequent unit is 1024 times the previous one.
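To get a feel for these units, here is a minimal Python sketch of the conversion. It assumes the binary convention (a factor of 1024 between neighboring units; the decimal convention would use 1000), and the helper names `to_bytes` and `convert` are hypothetical, chosen only for this illustration. It covers the units from Byte to YB.

```python
# Units from Byte upward, smallest to largest.
UNITS = ["Byte", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
STEP = 1024  # assumption: binary convention; the decimal convention uses 1000


def to_bytes(value, unit):
    """Convert a value expressed in `unit` to bytes."""
    return value * STEP ** UNITS.index(unit)


def convert(value, src, dst):
    """Convert a value from unit `src` to unit `dst`."""
    return to_bytes(value, src) / to_bytes(1, dst)


# How many terabytes fit in one zettabyte?
print(convert(1, "ZB", "TB"))

# Growth factor for the projection quoted above (16.1 ZB to 163 ZB).
print(163 / 16.1)
```

Under this convention one zettabyte is 1024 cubed terabytes, and the projected growth is roughly tenfold.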

It is natural to question the size of global data and look into the causes. In fact, I was skeptical when I first saw the statistics. When I gathered information and looked into it, I found that global data is indeed increasing rapidly. For example, millions of photos are posted on Facebook every day, and the New York Stock Exchange generates terabytes of transaction data every day. At last year's Double 11 promotion event, the transaction value reached a record 213.5 billion yuan, and behind this success, Alibaba's internal monitoring logs alone were processed at a rate of 162 GB/s. Internet companies like Alibaba are also contributing to the rapid growth of data, as the Double 11 transaction values over the last decade show.

image.png

The value of data from data analysis

The point of big data is undoubtedly to extract its value: statistical analysis of big data can help you make informed decisions. For example, a recommendation system can analyze a buyer's long-term buying habits and purchase history to find out what the buyer likes and provide better recommendations. As mentioned earlier, a standalone server cannot handle such a large amount of data. So how can you statistically analyze all your data in a limited amount of time? For this, we have Google to thank for publishing the following three papers.

- **GFS**: In 2003, Google published a paper on the Google File System, a scalable distributed file system for large-scale, distributed, data-intensive applications.
- **MapReduce**: In 2004, Google published the MapReduce paper on distributed computing for big data. The main idea of MapReduce is to divide a task and process the resulting subtasks at the same time on multiple computing nodes, none of which needs very high processing capacity on its own. MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

image.png

- **BigTable**: In 2006, Google published a paper on BigTable.

Thanks to these three Google papers, the open source Apache community quickly built the Hadoop ecosystem around three corresponding components: HDFS, MapReduce (programming model), and HBase (NoSQL database). The Hadoop ecosystem gained the attention of academia and industry, quickly became popular, and is now widely used around the world. In 2008, Alibaba launched the Hadoop-based YARN project, making Hadoop the core technology system for distributed computing at Alibaba. By 2010, the project was running a cluster of 1,000 machines. The image below shows the development of Alibaba's Hadoop cluster.

image.png

However, to develop MapReduce applications using Hadoop, developers need to be familiar with the Java language and have a good understanding of how MapReduce works. This raises the bar for MapReduce development. To make MapReduce development easier, the open source community created several frameworks, the leading one being the Hive project. Hive's SQL-like query language (HiveQL) lets you define and write MapReduce computations in a SQL-like way. For example, a Word Count job, which used to require dozens or hundreds of lines of code, can now be implemented with a single SQL statement, significantly lowering the threshold for MapReduce development. As the Hadoop ecosystem matured, Hadoop-based distributed computing for big data became widespread throughout the industry.
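To make the Word Count example concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It is an illustration of the programming model only, not Hadoop or Hive code: the function names and the sample input are hypothetical, and everything runs sequentially rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted counts by their key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    # On a real cluster, each map_phase call could run on a different node;
    # in this sketch they simply run one after another.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical sample input.
lines = ["big data needs big clusters", "Flink processes data streams"]
print(word_count(lines))
```

The shuffle and reduce steps together amount to grouping by word and summing, which is the kind of computation a SQL-like `GROUP BY` can express declaratively in one statement.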

Maximum value and timeliness of data

Each data entry carries specific information. The timeliness of information is measured by the interval between the moment the information leaves its source and the moment it has been received, processed, transferred, and used. The shorter the interval, the more timely the information, and in general, the more timely it is, the more valuable it is. For example, in a recommendation scenario, if a buyer is shown a discounted oven a few seconds after purchasing a steamer, the buyer is likely to buy the oven as well. If the oven recommendation, based on an analysis of the steamer purchase, only appears a day later, the buyer is unlikely to buy it. This reveals one of the disadvantages of Hadoop's batch computing: its low timeliness.

Several leading real-time computing platforms were developed to meet the requirements of the big data era. In 2009, Spark was born at the AMP Lab at the University of California, Berkeley. In 2010, Nathan Marz, then at BackType, proposed the core concepts of Storm, and in the same year Flink was launched as a research project in Berlin, Germany.
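The timeliness argument above can be made concrete with a toy calculation. The following sketch compares how long each event waits for its result under a batch job that only emits results at fixed interval boundaries versus a pipeline that handles each event as it arrives. All names, event times, and the 2-second latency are hypothetical values chosen for illustration; real systems have many more sources of delay.

```python
# Hypothetical purchase events: (event time in seconds, item bought).
events = [(5, "steamer"), (40, "kettle"), (75, "blender")]


def batch_delays(events, batch_interval):
    """Delay per event when results appear only at the end of each batch."""
    delays = []
    for event_time, _item in events:
        # The batch containing this event closes at the next interval boundary.
        batch_end = (event_time // batch_interval + 1) * batch_interval
        delays.append(batch_end - event_time)
    return delays


def stream_delays(events, per_event_latency):
    """Delay per event when each event is processed as soon as it arrives."""
    return [per_event_latency for _event in events]


# A daily batch job (86400 s) versus an assumed 2 s streaming latency.
print(batch_delays(events, batch_interval=86400))
print(stream_delays(events, per_event_latency=2))
```

In this simplified model, every event in the daily batch waits almost a full day for its recommendation, while the streaming path answers within seconds.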

AlphaGo and AI

In a Go match in 2016, Google's AlphaGo defeated Lee Sedol, a 9-dan professional Go player and world Go champion, 4:1. This led more people to look at deep learning from a new perspective and sparked the AI boom. According to the definition given in Baidu Baike, artificial intelligence (AI) is a new branch of computer science that researches and develops theories, methods, technologies, applications, and systems that simulate, extend, and expand human intelligence.

Machine learning is a technique and tool for exploring artificial intelligence. Machine learning is a high priority for big data platforms such as Spark and Flink, and Spark has invested heavily in machine learning in recent years. PySpark integrates many excellent ML libraries (Pandas, for example) and offers much more comprehensive support than Flink. For this reason, Flink 1.9 introduces a new ML interface and the flink-python module to make up for these shortcomings.

So what is the relationship between machine learning and Python? Let's look at some statistics to see which language is most popular for machine learning.

IBM data scientist Jean-Francois Puget once made an interesting analysis. He tracked how job listings changed on a well-known job search site to find the most popular programming languages at the time. Searching for listings related to machine learning led him to a similar conclusion.

image.png

At that time, Python turned out to be the most popular programming language for machine learning. Although this study was done in 2016, it is enough to show that Python plays an important role in machine learning, and the RedMonk statistics mentioned above support the same conclusion.

Beyond these studies, the characteristics of Python and its existing ecosystem also explain why Python is the best language for machine learning.

Python is an interpreted, object-oriented programming language created in 1989 by Dutch programmer Guido van Rossum and first released in 1991. Interpreted languages tend to be slow, but Python's design philosophy is that there should be one, and preferably only one, obvious way to do something. When designing new Python syntax and facing many choices, Python's developers usually choose a clear syntax with little or no ambiguity. Thanks to this simplicity, Python has many users. In addition, many libraries used for machine learning have been developed in Python, such as NumPy, SciPy, and Pandas (for working with structured data). Not surprisingly, Python has become the most popular programming language for machine learning, as its rich ecosystem makes machine learning work much more convenient.

Summary

In this article, we looked at why Apache Flink has added support for the Python API. The statistics show that we have entered the era of big data, and that big data analysis is needed to extract the value of data. The need for timely data gave birth to Apache Flink, the well-known stream computing platform.

In the age of big data computing, AI is a hot development trend, and machine learning is one of the key aspects of AI. Thanks to the characteristics of the language and the strengths of its ecosystem, Python is the language of choice for machine learning. This is one of the key reasons Apache Flink plans to support the Python API: doing so is an inevitable step toward meeting the requirements of the big data era.
