The client-server model transformed the data industry because it helped break apart overloaded systems. However, the client-server architectures of the 1980s could not handle the processing that big data demands.
Along Comes Google
As an answer to some of the biggest limitations of early client-server models, such as long processing times, frequent system overload, and server bottlenecks, Google created the Google File System (GFS). GFS was groundbreaking because it ran on commodity, industry-standard hardware at massive scale. It broke data into chunks and distributed them across many nodes, so data was no longer centralized in the way it once was.
Moving forward, some of the work behind GFS was published in a white paper so that it would be available to the entire community, which created a big buzz around the ideas. That's when Doug Cutting and Mike Cafarella developed the open-source software that would eventually become Hadoop.
But what is Hadoop?
Data comes in and lives on a cluster of servers. Work is then pushed out to the servers that hold the data, so that no single centralized machine is pushed over capacity. This eliminates the bottlenecking that occurred in some of the earlier client-server models. To achieve this, Hadoop relies on two main components: HDFS and MapReduce.
HDFS is a Java-based, purpose-built distributed file system designed to handle the demands of big data. In Hadoop, data is spread across a group of machines known as a Hadoop cluster, which makes files easier and faster to find and process.
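To make that concrete, here is a minimal sketch of how an application could write a file into HDFS using Hadoop's Java FileSystem API. The cluster address and file paths are hypothetical, and the code assumes a running Hadoop cluster with the Hadoop client libraries on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Assumes a NameNode reachable at this (hypothetical) address.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the cluster; HDFS splits it into blocks
        // and replicates them across DataNodes behind the scenes.
        fs.copyFromLocalFile(new Path("/tmp/book.txt"), new Path("/data/book.txt"));

        // List what is now stored under /data.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}
```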
MapReduce is designed to process large data sets across a cluster of computers. It has two parts, the Mapper and the Reducer.
To give an example, MapReduce would be helpful if you wanted to know how many times the word "the" appears throughout a book. You would first distribute the pages of the book to mappers, each of which scans its pages and emits a count every time it finds "the". The reducers then combine those intermediate counts for each key, and you end up with the final total.
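Here is a hedged sketch of what that job could look like with the standard Hadoop MapReduce Java API. The mapper emits a count of 1 each time it sees "the", and the reducer sums those counts; the class names are only illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TheCounter {

    // Mapper: reads one line of the book at a time and emits ("the", 1)
    // for every occurrence of the word "the".
    public static class TheMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text THE = new Text("the");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                if (tokens.nextToken().equalsIgnoreCase("the")) {
                    context.write(THE, ONE);
                }
            }
        }
    }

    // Reducer: receives all the 1s emitted for "the" and adds them up
    // into a single final count.
    public static class TheReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

In a real job you would also write a small driver class that configures a Job object with these mapper and reducer classes and points it at input and output paths in HDFS.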
Why would you want to use Hadoop?
The major advantage of HDFS is its scalability. It can be leveraged to scale to clusters of up to 4,500 servers and around 200 PB of data. Storage is handled in a distributed way so that overloading does not occur.
Unstructured Data
It's also a powerful way to manage unstructured data, which by some estimates makes up upwards of 90% of all data. Examples include television scripts, emails, and blogs.
Importable Data Tools
Ecosystem tools such as HBase and Hive make it straightforward to import data into Hadoop from other databases and export it back out.
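As one illustration, Hive exposes a SQL-like interface that other applications can query over JDBC, which is one common way to pull data out of Hadoop and load it elsewhere. The sketch below assumes a HiveServer2 instance at a hypothetical address, a hypothetical word_counts table, and the Hive JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExportExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint, for illustration only.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             // Pull rows out of a Hive table so they can be loaded into another system.
             ResultSet rs = stmt.executeQuery("SELECT word, total FROM word_counts")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("total"));
            }
        }
    }
}
```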
Open Source
Hadoop, which grew out of the published GFS ideas, is open source, so anyone can use it and build on it for free.
I highly suggest you consider using Hadoop for your next big data project.