Thomas Henson


Everything You’ve Wanted to Know About HDFS Federation

March 6, 2017 by Thomas Henson

2017 might have just started, but I’ve already noticed a trend that I believe will be huge this year. Many of the people I talk with who are using Hadoop & friends are curious about HDFS Federation.

Here are a few of the questions I hear:

How can we use HDFS Federation to extend our current Hadoop environment?

Is there any way to offload some of the workload from our NameNode to help speed it up?

Or my favorite…

We originally thought we were boxed in with our Hadoop architecture but now with HDFS Federation our cluster has more flexibility.

So what is HDFS Federation? First we need to level set on how the NameNode and Namespace work in HDFS.

How the NameNode Works in Hadoop

HDFS uses a master/slave architecture: the NameNode is the leader and the DataNodes are the followers. Before data is written to HDFS, the client must first go to the NameNode, which indexes the file by recording which blocks make it up and where those blocks live. The DataNodes are responsible for storing the data blocks, but they have no clue about the other DataNodes or about which files their blocks belong to. So if the NameNode falls off the end of the earth, you’re in trouble, because what good are the data blocks without the index?
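
To make that split of responsibilities concrete, here is a minimal toy sketch in Python. It is not how HDFS is implemented: the class names, the tiny block size, and the round-robin placement are all assumptions made up for illustration, and real HDFS also replicates blocks. The point is only that the NameNode holds the index while the DataNodes hold the bytes.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Stores raw blocks by ID; knows nothing about files or other DataNodes."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


@dataclass
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""
    index: dict = field(default_factory=dict)  # path -> [(block_id, datanode)]
    next_block_id: int = 0

    def allocate(self, path, datanode_name):
        """Record a new block for `path` on the given DataNode."""
        block_id = self.next_block_id
        self.next_block_id += 1
        self.index.setdefault(path, []).append((block_id, datanode_name))
        return block_id


def write_file(namenode, datanodes, path, data, block_size=4):
    """Split data into blocks, asking the NameNode to index each one."""
    names = sorted(datanodes)
    for offset in range(0, len(data), block_size):
        # Hypothetical round-robin placement, purely for the example.
        target = names[(offset // block_size) % len(names)]
        block_id = namenode.allocate(path, target)
        datanodes[target].blocks[block_id] = data[offset:offset + block_size]


def read_file(namenode, datanodes, path):
    """Reassemble a file; impossible without the NameNode's index."""
    return b"".join(datanodes[dn].blocks[bid] for bid, dn in namenode.index[path])


datanodes = {"dn1": DataNode("dn1"), "dn2": DataNode("dn2")}
namenode = NameNode()
write_file(namenode, datanodes, "/sales/q1.csv", b"hello hdfs!")
restored = read_file(namenode, datanodes, "/sales/q1.csv")
```

Throw away `namenode.index` and the DataNodes still hold every byte, but nothing says which blocks belong to which file or in what order.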

HDFS Federation

HDFS not only stores the data, it also provides the file system that users and clients use to access that data. For example, in my Hadoop environment I have Sales and Marketing data that I want to logically separate. So I would set up two different directories and populate subdirectories in each, depending on the data. It’s just like the setup on your own workstation, where Pictures and Documents live in different directories or file folders. The key is that this structure is stored as metadata, and the NameNode in HDFS retains all of it.

HDFS Namespace

The NameNode is also responsible for the HDFS namespace in the Hadoop environment. The namespace is set at the file level, meaning all files are hierarchical and follow a tree structure. The namespace gives users the structure they need to traverse the file system. Imagine an organized toolbox with all the tools laid out in a structured way. Once the tools are used, they are put back in the same place.

Back to our Windows example: the “C” drive is the top level, and everything else on the computer resides under it. Try to create another “Program Files” directory at that level and you will get an error stating that the name already exists. However, if you drop down one level into another directory, you can create a “Program Files” there, because its path would be C:/Program Files/Program Files.
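
The same rule can be sketched as a tiny tree in Python. This is only an illustration of a hierarchical namespace, not HDFS code; the `Namespace` class and its `mkdir` method are names I made up for the example.

```python
class Namespace:
    """A toy hierarchical namespace: nested dicts keyed by directory name."""

    def __init__(self):
        self.root = {}

    def mkdir(self, path):
        """Create one directory; parents must exist and the name must be free."""
        *parents, name = [part for part in path.split("/") if part]
        node = self.root
        for part in parents:
            node = node[part]  # raises KeyError if a parent is missing
        if name in node:
            raise FileExistsError(f"{path} already exists")
        node[name] = {}


ns = Namespace()
ns.mkdir("/Program Files")
ns.mkdir("/Program Files/Program Files")  # allowed: different parent

try:
    ns.mkdir("/Program Files")  # same name at the same level
    duplicate_rejected = False
except FileExistsError:
    duplicate_rejected = True
```

A name only has to be unique among its siblings, which is why the nested “Program Files” is accepted and the second top-level one is not.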

 

[Figure: Windows namespace example]

As data pours into HDFS, the NameNode begins to outgrow its compute and storage. Just like a hermit crab moving into a bigger shell, the NameNode has to keep moving to bigger hardware (a vicious and expensive cycle). What if we could begin using a scale-out architecture without having to re-architect the entire Hadoop environment? Well, this is where HDFS Federation helps big time.

Hadoop Federation to the Rescue

A little-known change in HDFS 2.x was the addition of HDFS Federation. It is often confused with high availability (HA) in Hadoop clusters or with Secondary NameNodes. HDFS Federation, however, allows a Hadoop cluster to add another NameNode with its own namespace. This federated NameNode has access to the same DataNodes and indexes the data stored on them, but only for data written through that NameNode.

For example, I have two NameNodes in my cluster, NN1 and NN2. NN1 will support all data in hdfs/data/… and NN2 will handle the hdfs/users directory. So as data from users/applications comes into my hdfs/data namespace, NN1 will index it and place it on the DataNodes. However, if an application connects to NN1 and tries to query data in the hdfs/users directory, it will get an error saying no known directory. For the application to query data in that namespace, it needs a connection to NN2. Think of HDFS Federation as adding a new cluster, in the form of a NameNode, while still using the same DataNodes for storage.
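
Here is a rough Python sketch of the NN1/NN2 scenario. Everything in it is hypothetical: the `FederatedNameNode` class, the `route` helper, the shortened `/data` and `/users` roots, and the sample file path are stand-ins I invented, not Hadoop APIs. Hadoop’s ViewFs provides a client-side mount table that plays a similar routing role, but this sketch does not use or reproduce it.

```python
class FederatedNameNode:
    """A toy NameNode that only knows paths under its own namespace root."""

    def __init__(self, name, root):
        self.name = name
        self.root = root.rstrip("/")
        self.files = set()

    def owns(self, path):
        """True if the path falls inside this NameNode's namespace."""
        return path == self.root or path.startswith(self.root + "/")

    def create(self, path):
        if not self.owns(path):
            raise FileNotFoundError(f"{self.name}: no known directory for {path}")
        self.files.add(path)

    def lookup(self, path):
        if not self.owns(path) or path not in self.files:
            raise FileNotFoundError(f"{self.name}: no known directory for {path}")
        return f"{path} served by {self.name}"


def route(namenodes, path):
    """Hypothetical client-side mount table: find the NameNode for a path."""
    for namenode in namenodes:
        if namenode.owns(path):
            return namenode
    raise FileNotFoundError(f"no NameNode is mounted for {path}")


nn1 = FederatedNameNode("NN1", "/data")
nn2 = FederatedNameNode("NN2", "/users")
nn2.create("/users/alice/report.csv")

try:
    nn1.lookup("/users/alice/report.csv")  # asking the wrong NameNode
    wrong_namenode_failed = False
except FileNotFoundError:
    wrong_namenode_failed = True

target = route([nn1, nn2], "/users/alice/report.csv")
answer = target.lookup("/users/alice/report.csv")
```

Asking NN1 for a path under `/users` fails, while routing the same request to NN2 succeeds. Neither NameNode needs to know anything about the other’s namespace.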

Benefits of HDFS Federation

Here are a few of the immediate benefits I see being played out with HDFS Federation in the Hadoop world.

  • NameNode Dockerization – The ability to set up multiple NameNodes allows for a modular Hadoop architecture. As we start to move into a microservices world, we will see architectures that contain multiple NameNodes. Hadoop environments will have the ability to break down and spin up new NameNodes on the fly.
  • Logically Separate Namespaces – For enterprise IT shops that use chargeback, HDFS Federation gives Hadoop administrators another tool for setting up multiple environments. These environments still get the cost savings of a single Hadoop environment.
  • Ease NameNode Bottlenecks – The pain of having all data indexed through a single NameNode can be eased by creating multiple NameNodes.
  • Options for Tiering Performance – Segmenting NameNodes and namespaces by customer requirements, instead of setting up multiple complicated performance quotas, is now an option. Simply provision the NameNode specs and move each customer to a NameNode based on their initial requirements.
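
To get a feel for the bottleneck point, here is a back-of-the-envelope sketch in Python. It leans on a commonly quoted rule of thumb that each file, directory, and block costs roughly 150 bytes of NameNode memory. Treat that figure, the function name, and the file counts as assumptions for illustration, not measurements from any real cluster.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def estimated_heap_bytes(files, blocks_per_file=1, directories=0):
    """Rough NameNode memory estimate: one object per file, block, and directory."""
    objects = files + files * blocks_per_file + directories
    return objects * BYTES_PER_OBJECT


# One NameNode indexing everything (hypothetical workload of 10 million files).
single = estimated_heap_bytes(files=10_000_000)

# The same files split evenly across two federated namespaces.
per_namenode = estimated_heap_bytes(files=5_000_000)
```

Under these assumptions the single NameNode needs about 3 GB just for metadata, while each federated NameNode needs about 1.5 GB. The total does not shrink, but no single NameNode has to hold all of it.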

One of the big reasons for HDFS Federation’s uptick this year is the growing adoption of Hadoop and the sheer amount of data being analyzed. More data, more problems, and those problems show up particularly at scale.

Final Thoughts on HDFS Federation

HDFS Federation is helping solve NameNode problems at scale. Since Hadoop’s 1.x release, the NameNode has always been the soft underbelly of the architecture. It has continued to struggle with high availability, bottlenecks, and replication. The community is continually working on improving the NameNode, and HDFS Federation and the movement toward virtualized/Dockerized Hadoop are helping to mitigate these issues. As the Hadoop community continues to innovate with projects like Kudu and others, look for HDFS Federation to play a bigger role.

