[PYTHON] You will be an engineer in 100 days --Day 86 --Database --About Hadoop

The posts up to yesterday are here:

You will become an engineer in 100 days --Day 76 --Programming --About machine learning

You will become an engineer in 100 days --Day 70 --Programming --About scraping

You will become an engineer in 100 days --Day 66 --Programming --About natural language processing

You will become an engineer in 100 days --Day 63 --Programming --Probability 1

You will become an engineer in 100 days --Day 59 --Programming --Algorithms

You will become an engineer in 100 days --Day 53 --Git --About Git

You will become an engineer in 100 days --Day 42 --Cloud --About cloud services

You will become an engineer in 100 days --Day 36 --Database --About the database

You will become an engineer in 100 days --Day 24 --Python --Basics of the Python language 1

You will become an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will become an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will become an engineer in 100 days --Day 6 --HTML --HTML basics 1

This time, continuing the topic of databases, I will talk about Hadoop.

About big data

I already covered databases in "You will become an engineer in 100 days --Day 36 --Database --About the database", but before going further, let's start with the term big data.

The definition of big data is rather vague; some media seem to call anything even slightly large "big data". According to a definition from around 2012, data in the range of tens of terabytes to several petabytes counts as big data.

Data on the order of 1 TB is not usually called big data; a single machine is enough to handle it.

In recent years, however, data volumes have grown beyond what a single machine can handle, and that is where big data comes in.

In that case a single database on a single machine is no longer enough, and you need a number of machines.

What emerged to solve this problem is a platform for processing large-scale big data.

Hadoop overview

Hadoop is an open-source distributed processing platform that stores and processes data such as text, images, and logs at high speed.

Hadoop's defining characteristic is distributed processing across multiple machines: by spreading the data over many machines, you can scale out easily.

By processing large-scale data on multiple machines in parallel, it also achieves high speed.

Background of Hadoop's appearance

Hadoop is an open-source implementation of internal Google technology that was originally described in papers published by Google. It is based on the papers describing the technical systems Google uses to handle its enormous volumes of data:

Google File System (GFS): Google's distributed file system
Google MapReduce: Google's distributed processing technology

Hadoop itself is written in Java, is still under active development, and its trademark is the elephant.

(Image: the Hadoop elephant logo)

How Hadoop works

Hadoop is mainly composed of the following two elements:
HDFS: Hadoop Distributed File System
Hadoop MapReduce: a distributed processing framework

**HDFS (Hadoop Distributed File System)**

HDFS consists of two kinds of nodes, a master and slaves:
Master node: NameNode
Slave nodes: DataNode

The NameNode manages the file system metadata, while the DataNodes store the actual data.

When a file is stored, it is split into fixed-size chunks (blocks), and those blocks are saved on the DataNodes.

So that data is not lost even if some DataNode fails, a replica of each block is saved on multiple DataNodes. This mechanism prevents data loss.

By default the number of replicas is 3, so even if two servers fail, the data is not lost.

(Figure: HDFS architecture with a NameNode and multiple DataNodes)

Reference: NTT DATA, "What is the distributed processing technology Hadoop?" https://oss.nttdata.com/hadoop/hadoop.html

Since the NameNode is a single node, the file system becomes unusable if it fails; it can be made redundant by running it in an Active-Standby configuration.

By increasing the number of servers, you can also increase the amount of data stored.
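To make the idea of blocks and replicas concrete, here is a minimal Python sketch (a toy illustration with assumed values, not Hadoop's actual placement logic) that splits a file into fixed-size blocks and assigns each block to three of several hypothetical DataNodes. The returned block-to-node mapping roughly corresponds to the metadata a NameNode keeps, while the block contents themselves would live on the DataNodes.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # bytes; HDFS commonly uses 128 MB blocks (64 MB in older versions)
REPLICATION = 3                 # HDFS default replication factor


def place_blocks(file_size, datanodes):
    """Toy block placement: split a file into blocks and pick REPLICATION DataNodes per block.

    Real HDFS placement is rack-aware; here we simply rotate through the node list.
    """
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    ring = itertools.cycle(datanodes)
    return {
        block_id: [next(ring) for _ in range(REPLICATION)]
        for block_id in range(num_blocks)
    }


if __name__ == "__main__":
    nodes = [f"datanode{i}" for i in range(1, 6)]        # 5 hypothetical DataNodes
    for block, replicas in place_blocks(400 * 1024 * 1024, nodes).items():
        print(f"block {block}: stored on {replicas}")    # each block ends up on 3 different nodes
```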

**Hadoop MapReduce (Distributed Processing Framework)**

Another component is MapReduce. This is a mechanism for dividing data and processing it in parallel to obtain results more quickly.

MapReduce also consists of a master and slave nodes:
Master node: JobTracker
Slave nodes: TaskTracker

The JobTracker manages MapReduce jobs, assigns tasks to the TaskTrackers, and manages resources.

The TaskTrackers execute those tasks.

Even if a TaskTracker fails, the JobTracker assigns the same task to another TaskTracker, so processing can continue without starting over.

(Figure: MapReduce architecture with a JobTracker and multiple TaskTrackers)

Reference: NTT DATA, "What is the distributed processing technology Hadoop?" https://oss.nttdata.com/hadoop/hadoop.html

A MapReduce application does the following:
Job definition: defines the information needed to run the map and reduce processing
Map processing: converts the input data into key-value pairs, giving it meaning
Reduce processing: processes the data that has been aggregated for each key produced by the map step

When MapReduce processing finishes, the final result is output. The more servers perform the processing, the faster it completes, so having a large number of compute units is an advantage.
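As a concrete sketch (in Python, since that is this series' language), here is the classic word count written in the Hadoop Streaming style: the mapper turns input lines into tab-separated key-value pairs, and the reducer aggregates the values for each key. The file names mapper.py and reducer.py are just illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- map step: emit a "word<TAB>1" key-value pair for every word in the input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- reduce step: the framework sorts by key, so all counts for one word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

You can test the pair locally with a shell pipeline that imitates the shuffle-and-sort step, for example `cat input.txt | python3 mapper.py | sort | python3 reducer.py`, and then submit the same scripts to a cluster through the Hadoop Streaming jar (the jar's exact path differs by distribution).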

Features of Hadoop

I think the following three points are the main features of Hadoop.

**Server scalability**
With HDFS, if you run out of resources you can add servers to secure more data capacity and more computational resources.

**Fault tolerance**

Because you can build a Hadoop environment simply by installing the software on commodity servers, no dedicated hardware is needed. In addition, thanks to mechanisms such as HDFS replication and TaskTracker reassignment, the system keeps running even if a few machines fail.

**Processing flexibility**

The difference from conventional databases is that no schema definition is required when storing data in HDFS.

Because you only have to give the data meaning when you retrieve it, you can simply store the data first and decide how to interpret it later.
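For example, here is a minimal sketch of the schema-on-read idea (the log format and field names are assumptions for illustration): the raw line is stored as-is, and a structure is applied only when it is read back.

```python
# Schema-on-read: store the raw text untouched; apply a structure only at read time.
raw_line = '192.0.2.1 - - [14/Jun/2020:15:29:07 +0900] "GET /index.html HTTP/1.1" 200 1024'


def parse(line):
    """Give the raw log line meaning only when it is retrieved (hypothetical field layout)."""
    parts = line.split()
    return {
        "ip": parts[0],
        "method": parts[5].strip('"'),
        "path": parts[6],
        "status": int(parts[8]),
        "bytes": int(parts[9]),
    }


print(parse(raw_line))  # {'ip': '192.0.2.1', 'method': 'GET', 'path': '/index.html', 'status': 200, 'bytes': 1024}
```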

What kinds of companies have adopted it?

I first learned about Hadoop in 2012. At seminars back then there were almost no companies that had adopted it; we had already introduced it, so I think I was quite a rare case.

As for the number of machines, I believe there were 3 master NameNodes and about 20 DataNodes.

The data volume was roughly 5 billion rows of log data per month, with about 100 columns per row.

Data at this scale amounted to several terabytes; it does not fit on one machine, and there is not enough memory at aggregation time.

The largest job I have done on my own PC totaled around 200 GB of files, so as a rough guide, once your data exceeds several terabytes it is probably time to think about distributed processing.

Any place handling this much data is, I think, using some kind of distributed processing platform.

However, the companies that have that much data and the human and financial resources to make Hadoop worth using may be only a small minority.

Unless you are a company with both money and data, it is difficult to introduce and does not make much sense.

Looking at the companies that were presenting about it back then:
Yahoo
CyberAgent
NTT DATA
Hitachi Solutions
Recruit Technologies

They are all big companies. If you want to acquire distributed processing skills such as Hadoop, you may not be able to learn them unless you work at a company like these.

By the way, I'm not using it at all these days, lol.

Summary

Distributed processing is now commonplace and indispensable for handling big data.

I think Hadoop played a very large role in spreading distributed processing to a wide audience. Make sure to keep it in mind as basic knowledge.

14 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython
