Apache Flink Challenges and Opportunities

In this blog post, we discuss **Apache Flink** and its ecosystem, and the potential for something great to happen in the field of machine learning despite the many challenges.

By Jian Feng

Before discussing the Apache Flink ecosystem, let's first look at what an ecosystem is. In the IT world, an ecosystem is a community of components that derive from a common core component; by using that core component directly or indirectly, and combining it with other components, the ecosystem grows and can accomplish a broader range of more specific tasks. The Flink ecosystem, then, refers to the ecosystem that surrounds Flink as its core component.

In the big data ecosystem, Flink is a computational component: it deals only with the compute side and does not include any proprietary storage system. In many practical scenarios, however, Flink alone cannot meet your requirements. You need to consider where to read data from, where to store the data processed by Flink, how to consume that data, and how to use Flink to accomplish specialized tasks in a vertical business domain. To cover both the upstream and downstream aspects, a strong ecosystem with a higher degree of abstraction is needed.

Current status of the Flink ecosystem

Now that you understand what the ecosystem is, let's talk about the current state of the Flink ecosystem. Overall, the Flink ecosystem is still in its infancy. Today, the Flink ecosystem primarily supports a variety of upstream and downstream connectors and several types of clusters.

The list of connectors that Flink currently supports is long. To name a few: Kafka, Cassandra, [Elasticsearch](https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/elasticsearch.html), [Kinesis](https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kinesis.html), [RabbitMQ](https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/rabbitmq.html), [JDBC](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connect.html), and [HDFS](https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/filesystem_sink.html). In short, Flink supports almost all major data sources. As for clusters, Flink currently supports Standalone and YARN deployments. Given this state of the ecosystem, Flink is primarily used for computing over stream data. Using Flink in other scenarios, such as machine learning and interactive analytics, remains a relatively complex task, and the user experience in those scenarios still leaves much to be desired. Even with these challenges, there is no doubt that the Flink ecosystem has many opportunities.
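As a quick illustration of how such connectors are wired up, here is a minimal PyFlink sketch that exposes a Kafka topic as a table. Note the assumptions: it uses the SQL DDL style of connector configuration that arrived in Flink releases later than those discussed here, the topic name and broker address are placeholders, and the Kafka connector JAR must be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming table environment; the Kafka connector JAR must be on the classpath.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical topic, schema, and broker address, for illustration only.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount   DOUBLE,
        ts       TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# A simple continuous query over the Kafka-backed table.
t_env.execute_sql("SELECT order_id, amount FROM orders").print()
```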

Challenges and Opportunities in the Flink Ecosystem

While Flink serves as a big data computing platform primarily used for batch and stream processing, it has great potential for other uses as well. In my opinion, we need a stronger and more robust ecosystem to reach the full potential of Flink. To better understand Flink, you can evaluate the ecosystem from two different scaling dimensions.

1. Horizontal scaling. Horizontally, the ecosystem needs to build more complete end-to-end solutions on top of what it already has. For example, this includes connectors for the various upstream and downstream data sources, integration with downstream machine learning frameworks and BI tools, tools that simplify Flink job submission and maintenance, and notebooks that provide a more interactive analytics experience.
2. Vertical scaling. Vertically, the ecosystem needs more abstract layers that meet requirements beyond the originally intended computational scenarios. For example, the vertical ecosystem includes batch and stream computing, the [Table API](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tableApi.html) (a more advanced computational abstraction layer), [CEP](https://flink.apache.org/news/2016/04/06/cep-monitoring.html) (a complex event processing engine), [Flink ML](https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/ml/) (a more advanced computing framework for machine learning), adaptations to various cluster frameworks, and more. The figure below shows the Flink ecosystem scaled horizontally and vertically as described above.

*(Figure: the Flink ecosystem scaled horizontally and vertically)*
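As one concrete taste of that vertical abstraction, the sketch below uses the Table API (here via PyFlink, with invented data and column names) to express an aggregation declaratively instead of wiring up low-level operators by hand.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# An in-memory table with invented data, just to show the abstraction.
orders = t_env.from_elements(
    [("shirts", 3), ("shoes", 1), ("shirts", 2)],
    ["product", "quantity"],
)

# Declarative group-by/aggregate instead of hand-written MapReduce-style code.
totals = (orders.group_by(col("product"))
                .select(col("product"), col("quantity").sum.alias("total")))

totals.execute().print()
```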

Flink and Hive integration

Apache Hive is a top-level Apache project that began nearly 10 years ago. The project initially encapsulated SQL on top of MapReduce: users could write simple, familiar SQL statements instead of complex MapReduce jobs, and those SQL statements were translated into one or more MapReduce jobs. As the project evolved, Hive's computing engine became pluggable, and Hive currently supports three computing engines: MR, Tez, and [Spark](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started). Apache Hive has become the industry standard for data warehousing in the Hadoop ecosystem, and many companies have run their data warehousing systems on Hive for years.

Flink is a computing framework that integrates batch and stream processing, so naturally it needs to integrate with Hive. For example, if you use Flink to run ETL and build a real-time data warehouse, you need Hive SQL for real-time data queries.

The Flink community has already created FLINK-10556 to enable better integration with and support for Hive. Its main functions are as follows:

- Allow Flink to access Hive metadata.
- Allow Flink to access Hive table data.
- Make Flink compatible with Hive data types.
- Allow Hive UDFs to be used in Flink.
- Allow Hive SQL (including DML and DDL) to be used in Flink.

The Flink community is implementing the above features step by step. For those who want to try these features in advance, the open source Blink project developed by Alibaba Cloud may be worth a look. The open source Blink project connects Flink and Hive at both the metadata layer and the data layer. Users can use Flink SQL directly to query data in Hive and, in a real sense, switch seamlessly between Hive and Flink. For the metadata, Blink restructured the implementation of the Flink catalog and added two catalogs: the memory-based FlinkInMemoryCatalog and the HiveCatalog, which connects to the Hive Metastore. The HiveCatalog allows Flink jobs to read metadata from Hive. For the data, Blink implements HiveTableSource, which allows Flink jobs to read data directly from Hive's regular and partitioned tables. Using Blink, users can therefore apply Flink SQL to existing Hive metadata and data and carry out data processing. Alibaba will continue to improve compatibility between Flink and Hive, including support for Hive-specific queries, data types, and Hive UDFs. These improvements will gradually be contributed to the Flink community.
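The catalog-based design described above later became part of mainline Flink. As a rough sketch of the shape it takes, assuming a reachable Hive Metastore, a hive-site.xml under a placeholder path, and the Hive connector JARs on the classpath:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.catalog import HiveCatalog

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Placeholder catalog name, default database, and Hive conf directory.
catalog = HiveCatalog("myhive", "default", "/opt/hive-conf")
t_env.register_catalog("myhive", catalog)
t_env.use_catalog("myhive")

# Query an existing Hive table directly with Flink SQL
# ("some_hive_table" is a hypothetical table name).
t_env.execute_sql("SELECT * FROM some_hive_table LIMIT 10").print()
```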

Support for interactive analysis with Flink

Batch processing is also a common Flink application scenario. Interactive analysis is a big part of batch processing and is especially important for data analysts and data scientists.

When it comes to interactive analytics projects and tools, Flink itself needs further enhancements to meet the performance requirements of these workloads. Take FLINK-11199 as an example: currently, data cannot be shared across multiple jobs within the same Flink application, and the DAG of each job remains isolated. FLINK-11199 is designed to solve this problem and will provide friendlier support for interactive analysis.

In addition, an interactive analytics platform is needed so that data analysts and data scientists can use Flink more efficiently. Apache Zeppelin has done a lot in this regard. Apache Zeppelin is another top-level Apache project; it provides an interactive development environment and supports multiple programming languages, such as Scala, [Python](https://www.python.org/), and SQL. Zeppelin is also highly extensible and supports many big data engines, such as Spark, [Hive](https://hive.apache.org/), and Pig. Alibaba has put a lot of effort into better Flink support in Zeppelin. Users can write Flink code (in Scala or SQL) directly in Zeppelin, and instead of packaging code locally and running the bin/flink script to submit jobs manually, they can submit jobs directly from Zeppelin and see the results, displayed either as text or as visualizations. Visualization is especially important for SQL results. Zeppelin currently provides the following Flink support:

- Three run modes: Local, Remote, and YARN
- Scala, batch SQL, and stream SQL
- Visualization of static and dynamic tables
- Automatic association with the job URL
- Job cancellation
- Savepoints for Flink jobs
- Advanced features of ZeppelinContext, such as creating controls
- Three tutorial notes: Streaming ETL, Flink Batch Tutorial, and Flink Stream Tutorial

Some of these changes have been implemented in Flink, and some in Zeppelin. Before all of them are contributed back to the Flink and Zeppelin communities, you can use this Zeppelin Docker image to test and try these features. For details on downloading and installing the Zeppelin Docker image, see the examples in the Blink documentation. To make these features easier to try, this version of Zeppelin adds three built-in Flink tutorials: one shows an example of streaming ETL, and the other two show examples of Flink Batch and Flink Stream.

Machine learning support with Flink


As the most important computing engine in the big data ecosystem, Flink is currently used primarily in traditional segments of data computing and processing, that is, traditional business intelligence (BI) scenarios such as real-time data warehousing and real-time statistical reports. But the 21st century is the age of artificial intelligence (AI). Companies in more and more industries are choosing AI technologies to radically change the way they do business, and a big data computing engine like Flink is indispensable to this wave of change across the business world. Even though Flink was not developed specifically for machine learning, machine learning still plays an irreplaceable role in the Flink ecosystem. In the future, Flink is expected to provide three major functions to support machine learning:

- Building pipelines for machine learning
- Supporting traditional machine learning algorithms
- Enabling integration with other deep learning frameworks

Looking at the machine learning pipeline first, it is easy to assume that machine learning boils down to two main phases: training and prediction. However, training and prediction are only a small part of machine learning. Before training, the process of preparing data for a machine learning model involves tasks such as data cleaning, data transformation, and normalization; after training, evaluating the model is an equally important step, and the same is true at the prediction stage. In complex machine learning systems, combining the individual steps properly is the key to producing a robust and scalable machine learning model. FLINK-11095 is the community's ongoing work toward this goal, and Flink plays an important role throughout all of these steps in building machine learning models.
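To make the pipeline idea concrete, here is a purely illustrative Python sketch of the Estimator/Transformer/Pipeline abstraction that FLINK-11095 is driving at. The class names mirror common pipeline vocabulary; they are hypothetical stand-ins, not an actual Flink API.

```python
class Transformer:
    """A pipeline stage that maps one dataset to another (e.g. normalization)."""
    def transform(self, data):
        raise NotImplementedError

class Estimator:
    """A stage that is fitted on data and yields a Transformer (a trained model)."""
    def fit(self, data):
        raise NotImplementedError

class Pipeline:
    """Chains stages; fit() trains estimators, returning a pure transformer chain."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                model = stage.fit(data)    # training step yields a model
                fitted.append(model)
                data = model.transform(data)
            else:
                fitted.append(stage)
                data = stage.transform(data)
        return Pipeline(fitted)            # every stage is now a transformer

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data
```

The same pipeline object can then serve both training (fit) and prediction (transform), which is exactly the property that makes the steps composable and reusable.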

Currently, Flink's flink-ml module implements some traditional machine learning algorithms, but needs further improvement.

The Flink community is also actively supporting deep learning. Alibaba offers the TensorFlow on Flink project, which lets users run TensorFlow inside a Flink job to process Flink data and send the processed data to TensorFlow's Python process for deep learning training. On the programming language side, the Flink community is working on Python support. Currently, Flink only provides Java and Scala APIs, both of which are JVM-based. As a result, Flink is currently well suited to big data processing systems, but less so to data analysis and machine learning. People in the fields of data analysis and machine learning generally prefer higher-level languages such as Python and [R](https://www.r-project.org/about.html), and the Flink community plans to support these languages in the near future. Flink will support Python first, because Python has evolved rapidly in recent years alongside AI and deep learning: all the popular deep learning libraries, such as TensorFlow, PyTorch, and Keras, provide Python APIs. With Python support in Flink, users will be able to connect their entire machine learning pipelines in a single language, which should dramatically improve development.
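As a sketch of what such a single-language hand-off might look like once Python support exists (using the PyFlink Table API that later shipped, with invented columns and a placeholder training call; pandas must be installed):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Feature preparation in Flink; the columns and values are invented.
features = t_env.from_elements(
    [(0.1, 0.7, 1), (0.4, 0.2, 0)],
    ["x1", "x2", "label"],
)

# Convert the Flink table to a pandas DataFrame for a Python ML library.
df = features.to_pandas()
X, y = df[["x1", "x2"]], df["label"]
# model.fit(X, y)  # e.g. a scikit-learn or Keras model would be trained here
```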

Submitting and maintaining Flink jobs

In a development environment, Flink jobs are typically submitted with the shell command bin/flink run. In a production environment, however, this method of job submission causes a number of problems: tracking and managing job status, retrying failed jobs, starting multiple Flink jobs, and changing and resubmitting job parameters can all be difficult. These problems can of course be solved with manual intervention, but manual intervention in production is extremely dangerous, not to mention time-consuming. Ideally, every operation that can be automated should be automated, and unfortunately there is currently no suitable tool for this in the Flink ecosystem. Alibaba has already developed such a tool for internal use; it has been running in production for a long time and has proven to be a stable and reliable way to submit and maintain Flink jobs. Alibaba is now removing the components that depend on its internal infrastructure and will publish the project's source code; the project is expected to be open sourced in the first half of 2019.
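To give a flavor of what such a tool automates, here is a minimal sketch that uses Flink's standard REST API to list jobs and cancel ones in a bad state. The JobManager address is a placeholder, the "requests" library is assumed, and a production tool would add retries, alerting, and persistent state.

```python
import requests

JOBMANAGER = "http://localhost:8081"  # placeholder JobManager address

def list_jobs():
    """Return (job_id, name, state) for every job the cluster knows about."""
    jobs = requests.get(f"{JOBMANAGER}/jobs/overview").json()["jobs"]
    return [(j["jid"], j["name"], j["state"]) for j in jobs]

def cancel_job(job_id):
    """Ask the JobManager to cancel a running job."""
    requests.patch(f"{JOBMANAGER}/jobs/{job_id}", params={"mode": "cancel"})

for jid, name, state in list_jobs():
    print(jid, name, state)
    if state == "FAILING":
        cancel_job(jid)
```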

In summary, the current Flink ecosystem still has many problems, but at the same time there is plenty of room for development. The Apache Flink community is constantly making great efforts to build a stronger ecosystem so that Flink can reach its full potential.

Do you have ideas? Are you feeling inspired? Join the community and let's build a better Flink ecosystem together.
