
Big Data MBA Book Review

May 1, 2017 by Thomas Henson 1 Comment

Big Data MBA Book Review Video

Today’s book review, on the Big Data MBA, holds a special place in my library. I had read this book before meeting Bill Schmarzo, and after sharing a steak with him I reread it. It was already an amazing book to me, because as a developer it opened my eyes to many of the problems I’ve had on projects. Hadoop and Big Data projects are especially bad about missing the business objective. Many times the process for adopting a new framework goes down like this…

Manager/Developer 1:  “We have to start using Hadoop”

Questions the team should ask:

  • What is the business problem this project is taking on?
  • How will this help us solve a problem we are having?
  • Will this project generate more revenue? How much more?

What the team really does:

  • Quick search on StackOverflow for Hadoop related questions and answers
  • Research on best tools for using Hadoop
  • Find a Hadoop conference to attend

Boom! Now we are doing Hadoop. Forget the fact that we don’t have a business case identified yet.

One of the things stressed in the Big Data MBA is connecting a single business problem to Big Data analytics. Just as User Stories in Scrum describe the work developers will do, our business problem will describe the data used to solve it.

The Big Data MBA is a book about setting up the business objectives to tackle. Once those objectives are ferreted out and the data is identified, developers can work their magic. Understanding how to map business objectives to your technology is key for any developer/engineer; in fact, the more you understand this, the further you can go in your career. For this reason, I highly recommend this book for anyone working with Big Data.

 

Transcript

(forgive any errors; it was transcribed by a machine)
Hi folks, and welcome back to thomashenson.com. Today’s episode is all about Big Data, and it’s the book review that I’ve been wanting to do for quite some time, so stay tuned.

[Music]

So today’s book review is the Big Data MBA by Bill Schmarzo, a fellow Dell EMC employee and a person who worked at Yahoo during the early days of data analytics, ad buying, and the Hadoop era. This book focuses on the business objectives of Big Data, a lot of things that we as developers and technologists tend to overlook, and I know I have in the past. It’s all about, okay, we want to take Hadoop and be ready to implement it, but this comes back to what the business objectives are.

One of the things that I really like about this book is Bill talks about how anything over six months is really just a science experiment, and that’s really an agile principle. So if you’re in DevOps and agile software development, you’ll understand the concept of, hey, let’s find one or two small objectives that we can make a quick impact on, anywhere from 6 to 12 weeks, and then we can build on those use cases. A couple of the examples he uses are on just single products. So instead of trying to increase all your products’ revenue by 10 or 20%, let’s just pick one or two. I really like that approach, because what you can do is get everybody together, so it’s not just your developers, your business analysts, and the product owners; it’s people from marketing, your executives, everybody gets in a room with a lot of whiteboards up, and you actually sit down and talk about these objectives. If we want to increase sales of one product in two months, what are we going to do? We’ll look at what we have from a data perspective and we’ll start data mapping: this is the data that we currently have, what are some outside data sources that we can bring in, what would help us answer questions? What questions would we love to be able to answer about our customer, and is there data already out there about that?

So I really like this book, and I think anybody that’s involved in big data or data analytics should read it. It’s definitely high-level business objectives, but even for developers, your big data developers, I think it’s really important to understand those objectives. One of the big reasons a lot of projects in software development and Big Data fail is we don’t tie them to a business objective. We have a tool or widget or a framework that we want to use, and it’s great, but we’re having a hard time bringing it to the CFO or upper-level management and explaining what objective and what benefit we’re going to get out of using this tool. So for people involved in Big Data, even on the development side, I think this will help you champion those initiatives and have more successful projects too. Make sure you check out the Big Data MBA by Bill Schmarzo, and to keep in tune with more Big Data tips, make sure to subscribe to my YouTube channel or check out thomashenson.com. Thanks.

[Music]

Filed Under: Book Review Tagged With: Big Data, Book Review, Books

Isilon User Working Group Huntsville

April 24, 2017 by Thomas Henson Leave a Comment

Huntsville Isilon User Working Group

Isilon’s OneFS makes it easy to create your own data lake managed in a single namespace. It’s a great time to learn more about leveraging your data lake with all the features in OneFS.

The Isilon User Working Group

I’m excited to announce our first ever Isilon User Working Group in Huntsville. During this event Matt Russell and I will cover Isilon tips and tricks we’ve learned through blood, sweat, and tears. After a live Isilon Quick Tips demo, we will have a special demo from Varonis on protecting your Isilon data. Lastly, we will have an “Ask the Isilon Guru” session with Matt Russell. All of this great content will be provided over lunch at Rosie’s Mexican Cantina in South Huntsville. Make sure to register to ensure your seat at the first Isilon User Working Group in Huntsville.

Register Below (Limited Seats Available)

https://www.meetup.com/Isilon-User-Working-Group-UWG/

Creating Your Own Data Lake

Interested in learning about starting your own data lake? Join us at the Huntsville Isilon User Working Group to learn how to create your own data lake using Isilon OneFS. This is a great opportunity to learn more about Isilon and socialize with other Isilon users in the North Alabama area.

Preview from one of the Isilon Quick Tips to be released at the Isilon User Working Group. 

Huntsville Isilon User Working Group

Filed Under: Isilon Tagged With: Isilon, Meetup

Isilon Quick Tips: Enabling FTP in OneFS

March 10, 2017 by Thomas Henson 1 Comment

In this episode of Isilon Quick Tips, I’ll demo enabling FTP in OneFS. Isilon supports FTP, but to take advantage of it you have to enable it on your cluster. Learn how to set up FTP on your cluster in under 3 minutes with the video below.

FTP In OneFS

Transcript

(forgive any errors; it was transcribed by a machine)

Hi, welcome back to another episode of Isilon Quick Tips. In today’s episode I’m going to show you how to enable FTP on your Isilon cluster, so get ready to follow along.

To enable FTP access, the first thing we’re going to do is go to our protocols and then to our FTP settings. Once that page loads up, you can see that we only really have four options here. The first option is just to enable the FTP service. That’s something that doesn’t come enabled by default, but you can see that I already have it checked here, so now I know the FTP service is enabled and I can move data back and forth. There are a couple of different options here in the settings. One I want to point out is enable anonymous access; ninety-five percent of the time you’re not going to want to set that up, but if you did, this is where you would do it.

After we have that set up, let’s go back and look at our membership and roles. I just want to show you the account that I’m going to be using: my file system account settings and this admin user here. In your environment you might have Active Directory, which your FTP users can authenticate against; you just have to make sure you use the full domain name. But I’m going to use this admin account here.

Now we just need to pull up an FTP client. I’m going to use WinSCP, but you’re able to use anything you want. Put in the host name; I’m going to use an IP address because I don’t have my SmartConnect zones or my DNS server set up on my local machine here, but in most cases you’re going to use that SmartConnect name for the host. Once you’re logged in, we’re going to move over our PowerPoint slides. I put them in the IFS directory, and now we’ll verify they’re in our IFS share. We can see that, yes, in the IFS directory we have our slides, so our data was able to move over using the FTP service.

So that’s how to enable FTP on your Isilon cluster. Just remember, all you have to do is enable FTP, and then those users can log in using their own credentials. In a future episode I’ll cover some more options around the FTP server and some things you can customize. Thanks again for tuning in to Isilon Quick Tips, and make sure to subscribe so you never miss an episode.
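
If you’d rather verify the transfer from a terminal instead of WinSCP, a plain curl upload works against the same FTP endpoint. This is only a hedged sketch: the IP address, user, and file name below are hypothetical stand-ins for your own cluster details.

    # upload a file into the cluster's /ifs directory over FTP (hypothetical host/user)
    curl -T slides.pptx ftp://192.168.1.100/ifs/ --user admin

    # list the directory to confirm the file landed
    curl ftp://192.168.1.100/ifs/ --user admin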

Filed Under: Isilon Tagged With: Big Data, Isilon

Everything You’ve Wanted to Know About HDFS Federation

March 6, 2017 by Thomas Henson Leave a Comment

2017 might have just started, but I’ve already noticed a trend that I believe will be huge this year. Many of the people I talk with who are using Hadoop & friends are curious about HDFS Federation.

Here are a few of the questions I hear:

How can we use HDFS Federation to extend our current Hadoop environment?

Is there any way to offload some of the workloads from our NameNode to help speed it up?

Or my favorite……

We originally thought we were boxed in with our Hadoop architecture but now with HDFS Federation our cluster has more flexibility.

So what is HDFS Federation? First, we need to level-set on how the NameNode and namespace work in HDFS.

How the NameNode Works in Hadoop

The HDFS architecture is a master/slave architecture: the NameNode is the leader, with the DataNodes as the followers. Before data is ingested or moved into HDFS, it must first pass through the NameNode to be indexed. The DataNodes in HDFS are responsible for storing the data blocks, but they have no clue about the other DataNodes or data blocks. So if the NameNode falls off the end of the earth, you’re in trouble, because what good are the data blocks without the indexing?

HDFS Federation

HDFS not only stores the data, but provides the file system for users/clients to access the data inside HDFS. For example, in my Hadoop environment I have Sales and Marketing data I want to logically separate, so I would set up two different directories and populate sub-directories in each depending on the data. It’s just like the setup in your own workspace environment, where Pictures and Documents are in different directories or file folders. The key is that this structure is stored as metadata, and the NameNode in HDFS retains all of it.
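
As a quick sketch of that layout, the separation is just standard HDFS shell commands. The directory and file names here are purely illustrative:

    # create logically separate top-level directories for each business unit
    hdfs dfs -mkdir -p /data/sales /data/marketing

    # land files under the appropriate branch, then confirm the structure
    hdfs dfs -put q1_leads.csv /data/marketing/
    hdfs dfs -ls /data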

HDFS Namespace

The NameNode is also responsible for the HDFS namespace in the Hadoop environment. The namespace is set at the file level, meaning all files are hierarchical and follow a tree structure. The namespace gives users the structure they need to traverse the file system. Imagine an organized toolbox with all the tools laid out in a structured way: once the tools are used, they are put back in the same place.

Back to our Windows example: the “C” drive is the top-level directory, and everything else on the computer resides under it. Try to create another “Program Files” directory there and you will get an error stating that the file name already exists. However, if you drop down one level into another directory, you can create a “Program Files” there, because it would be C:/Program Files/Program Files.

 

Windows Namespace Example

As data accelerates into HDFS, the NameNode begins to outgrow its compute and storage. Just like a hermit crab moving into a new shell, so it goes for the NameNode (a vicious and expensive cycle). What if we could use a scale-out architecture without having to re-architect the entire Hadoop environment? Well, this is where HDFS Federation helps big time.

Hadoop Federation to the Rescue

A little-known change in HDFS 2.x was the addition of HDFS Federation. It is oftentimes confused with high availability (HA) for Hadoop clusters or with secondary NameNodes. Instead, HDFS Federation allows a Hadoop cluster to add another NameNode and namespace. This federated NameNode has access to the DataNodes and indexes the data moved to those nodes, but only when the data flows through that NameNode.

For example, I have two NameNodes in my cluster, NN1 and NN2. NN1 will support all data in hdfs/data/… and NN2 will handle the hdfs/users directory. As data from users/applications comes into my hdfs/data namespace, NN1 will index it and move it to the DataNodes. However, if an application connects to NN1 and tries to query data in the hdfs/users directory, it will get an error saying there is no known directory. For the application to query data in that namespace requires a connection to NN2. Think of HDFS Federation as adding a new cluster, in the form of a NameNode, while still using the same DataNodes for storage.
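
For readers who want to see what that looks like in practice, here is a minimal configuration sketch under stated assumptions: the nameservice IDs (ns1, ns2) and hostnames (nn1.example.com, nn2.example.com) are hypothetical, while the property names come from the stock Hadoop 2.x Federation and ViewFs settings. The ViewFs mount table stitches both namespaces into one client-side view:

    <!-- hdfs-site.xml: declare two independent NameNodes/nameservices -->
    <property>
      <name>dfs.nameservices</name>
      <value>ns1,ns2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns1</name>
      <value>nn1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns2</name>
      <value>nn2.example.com:8020</value>
    </property>

    <!-- core-site.xml: ViewFs mount table so clients see a single namespace -->
    <property>
      <name>fs.defaultFS</name>
      <value>viewfs://mycluster</value>
    </property>
    <property>
      <name>fs.viewfs.mounttable.mycluster.link./data</name>
      <value>hdfs://nn1.example.com:8020/data</value>
    </property>
    <property>
      <name>fs.viewfs.mounttable.mycluster.link./users</name>
      <value>hdfs://nn2.example.com:8020/users</value>
    </property>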

Benefits of HDFS Federation

Here are a few of the immediate benefits I see being played out with HDFS Federation in the Hadoop world.

  • NameNode Dockerization – The ability to set up multiple NameNodes allows for a modular Hadoop architecture. As we move into a microservices world, we will see architectures that contain multiple NameNodes, and Hadoop environments will have the ability to break down and spin up new NameNodes on the fly.
  • Logically Separate Namespaces – For chargeback in the IT enterprise, HDFS Federation gives Hadoop administrators another tool for setting up multiple environments. These environments still have the cost savings of a single Hadoop environment.
  • Ease NameNode Bottlenecks – The pain of having all data indexed through a single NameNode can be eliminated by creating multiple NameNodes.
  • Options for Tiering Performance – Segmenting different NameNodes and namespaces by customer requirements, instead of setting up multiple complicated performance quotas, is now an option. Simply provision the NameNode specs and move customers to a NameNode based on their initial requirements.

One of the big reasons for HDFS Federation’s uptick this year is the growing adoption of Hadoop and the sheer amount of data being analyzed. More data, more problems, and those problems are particularly problems of scale.

Final Thoughts on HDFS Federation

HDFS Federation is helping solve problems at scale with the NameNode. Since Hadoop’s 1.x code release, the NameNode has always been the soft underbelly of the architecture, struggling with high availability, bottlenecks, and replication. The community is continually working on improving the NameNode, and HDFS Federation and the movement toward virtualized/Dockerized Hadoop are helping mitigate these issues. As the Hadoop community continues to innovate with projects like Kudu and others, look for HDFS Federation to play a bigger role.

Filed Under: Hadoop Tagged With: Big Data, Data Analytics, Hadoop, HDFS

Isilon Quick Tips: Setting SMB Shares in OneFS

February 10, 2017 by Thomas Henson Leave a Comment

SMB Shares in Isilon’s OneFS

One of the key capabilities of Isilon’s OneFS is creating Server Message Block (SMB) shares for network storage. In this episode of Isilon Quick Tips, learn how to create SMB shares in OneFS.

Setting SMB Shares in OneFS

Transcript

Welcome back to another episode of Isilon Quick Tips. Today we’re actually going to map a shared drive using SMB. Think of your Windows environment: being able to set up shares for home directories, to share data, maybe to share files across an organization. Today we’re going to look at how to do that through the protocols.

From here in the OneFS interface, the first thing we’ll do is go to our protocols and then to SMB shares. As you can see, we already have one created; this one comes by default and is the IFS directory. The IFS directory is everything in Isilon; everything is under that directory. If we go back and look at our file system using the file system explorer, you can see that under our IFS directory we have a home directory, a data directory, and a couple of other ones, and we can drill down and look here.

Now I’m going to switch over and look at our file explorer, and you can see that I already have the IFS directory mapped here. This matches what we’re seeing within the OneFS web interface, so you can see that I have my data directory and my home directory, with my sub-directories within those. You’ll notice that when we mapped the directory, I used the IP address of the node I was using, but that’s only because I don’t have DNS set up on my local machine. In most instances you’ll have that DNS name to use here, your SmartConnect name.

Let’s go through and actually set up another SMB share. Say that within our directory we have a file share that we want to set up for all of our movie data; we have different movie research that we’re doing, and we want a specific share around that. It’s really simple to do. Back in our protocols, we’re going to create an SMB share. We’ll call this one “movie,” and the description is just “all of our movie research data.” We’ll set it for everyone to share, and then here we’ll just set the path. Remember, it’s under the data directory; you can see that I’ve got some files under here, IMDB movie information and a couple of other ones. The directory has already been created, so we don’t need to check this box here, and we’re going to let it apply the default ACL. You can actually set it so that it doesn’t change any existing permissions, which comes in really handy whenever you’re setting up a share on a subdirectory and you don’t want to traverse all the subdirectories under it; you just want to make that share available. We’re going to keep the account settings all the same here, and we’ll go ahead and create that share.

Now if we want to access that share, open file explorer, go to map network drive, and select a specific drive letter. Just as a reminder, if you have that SmartConnect name, you want to use it here; I don’t have SmartConnect set up for DNS, so I’m going to use my IP address. And now you have that file share, and I can start moving over my movie data, opening up my files, and sharing documents for all of our movie research. That’s how simple it is to set up an SMB share in Isilon’s OneFS. Be sure to subscribe to my channel so you can get more Isilon Quick Tips.
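
As a follow-up to the demo, mapping the new share from a Windows command prompt is a one-liner. This is a hedged sketch: the SmartConnect name and account below are hypothetical, so substitute your own share path and credentials.

    :: map the "movie" share to drive Z: (use an IP address if DNS/SmartConnect isn't set up)
    net use Z: \\isilon-smartconnect.example.com\movie /user:admin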

Filed Under: Isilon Tagged With: Isilon, Quick Tip, SMB

How to Use a Splunk Universal Forwarder

January 23, 2017 by Thomas Henson Leave a Comment

Imagine you’re a Systems Administrator responsible for keeping your company’s custom-developed application up and running. It is a critical application responsible for all the ordering and payments for your company, and it is the sole interface for your customers to buy your products.

Today that application went down for 4 hours. During that time your company lost 10 million dollars in sales. You have been called into the CIO’s office for a debriefing.

You walk into her office, and she quickly asks you, “How could this have been prevented?”

By using Splunk, and specifically by using Splunk with Universal Forwarders, to proactively monitor those critical applications.

How are you going to stay out in front of issues that may happen?

What about preventive fixes and DevOps?

Splunk is the answer to keeping you from having system crashes and pulling all-nighters. Analyzing applications with Splunk allows developers and administrators to test scenarios before going to production. How did Nissan test their website before their first ever Super Bowl commercial? Nissan used Splunk to thoroughly test it ahead of time.

What is Splunk?

Splunk is huge in the data center when it comes to analyzing log files and IT security. Splunk is an application that allows machine data to be stored, indexed, and visualized quickly. In the past, log files were parsed and stored by writing custom scripts with regular expressions to make the files human readable. Splunk simplifies all that by shipping default parsers for many common and uncommon log files and letting users start visualizing their data within the Splunk application.

Since Splunk is so easy to set up, its popularity has seen enormous growth. Recently I attended a Big Data conference where they said 70% of companies are using Splunk in some fashion, and Gartner placed Splunk in the Leaders quadrant of its 2016 Magic Quadrant.

What is a Splunk Universal Forwarder?

So how can you analyze application server log files while running the application in production?

Splunk has forwarders for sending data between different instances of Splunk. Using a forwarder allows you to move log files from one machine to another without having to write custom batch scripts or clog up bandwidth. Let’s talk about how the Splunk forwarder is used in the data center.

FedEx as a Splunk Forwarder?

FedEx is amazing at moving packages. Recently my cousin graduated from college and I wanted to send her the book The Obstacle Is the Way (seriously, check out the book). Think of the book as the data, my cousin and myself as machines, and FedEx as the forwarder. I was able to package up the book (data) and send it off to my cousin. The package was wrapped (encrypted) and the correct address (URL) was placed on it. The Splunk Universal Forwarder is like FedEx: it will deliver machine data to other instances of Splunk.

When installing forwarders, Splunk has two options to choose from depending on the use case.

What is a Splunk Light Forwarder?

The first type of Splunk forwarder is the light or universal forwarder. Think of it as a lightweight or minimal installation. The light forwarder has minimal features, and its main objective is to move data from one machine to another. No analysis or indexing. It’s even limited in the data it will parse, because its goal is to move data to a Splunk Indexer. Another thing missing from the light forwarder is the Web interface, so it’s strictly command line for this forwarder. Since the goal of the light forwarder is low impact, not having a Web interface isn’t a deal breaker. Why add features that aren’t needed if we are going to analyze the data elsewhere?

What is a Splunk Heavy Forwarder?

The second type of forwarder is called a heavy forwarder. Think of it as a full-blown instance of Splunk, similar to what we have running in a local development environment; the only difference is what we choose to disable. Remember, depending on the scenario, we want the option to have the lowest impact on the CPU of the machine we are hosting on, so the heavy forwarder allows us to disable features we aren’t going to use. Management of the heavy forwarder can be done through the Web interface, which we have been using, or from the command line like the universal forwarder.

All the Splunk forwarders have built-in enterprise features like encryption and compression. Encryption offers the ability to protect data in flight and prevent unwanted reads of log files from packet capture. The compression ratio will vary with the amount of duplicated data and white space in the log file, so if you’re looking to calculate the compression, just know it’s going to depend. Both encryption and compression are opt-in features and are not enabled by default.
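
As a rough illustration of turning both options on, here is a minimal outputs.conf sketch for a forwarder. It assumes certificates already exist under $SPLUNK_HOME/etc/auth, and the group name and indexer hostname are hypothetical:

    [tcpout]
    defaultGroup = primary_indexers

    [tcpout:primary_indexers]
    server = indexer.example.com:9997

    # opt in to compression (the receiving indexer must enable it too)
    compressed = true

    # opt in to SSL encryption for data in flight
    sslRootCAPath = $SPLUNK_HOME/etc/auth/cacert.pem
    sslCertPath = $SPLUNK_HOME/etc/auth/server.pem
    sslPassword = <certificate password>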

 Learn more about Splunk in my Analyzing Machine Data with Splunk course

Where are Universal Forwarders used?

Anywhere you don’t want to install a full-blown instance of Splunk, or in remote offices where you want to use Splunk for data analysis but also forward the data on to another instance of Splunk. Think about multiple smaller Splunk hubs that can forward data to a larger Splunk instance for a system-wide view.

Use Cases

  • Application Servers
  • Database Servers
  • Networking Infrastructure
  • Web Servers
  • Internet of Things
  • Continuous Integration and Testing
  • Detecting Insider Threats
  • Securing Networks

How to Install Splunk Universal Forwarder

Let’s look at how to set up a Splunk Universal Forwarder. Just like a full-blown Splunk instance, you have to pick the flavor of OS for the host machine. After getting the correct Splunk version you will run the default install, unless you are using the light version (which I recommend), in which case it will all be done from the command line.

For example, below are the steps for installing the light forwarder on an Ubuntu server (a consolidated command sketch follows the list):

  1. Download the specific version of Splunk for Windows, Mac, or Linux distributions – Download directly or use wget <url> to download from the command line.
  2. Install on the Ubuntu machine – Move the downloaded package to the Ubuntu /tmp directory. Once the package is in /tmp, run dpkg -i splunkforwarder-<version>.deb. The command will kick off the installation of the Splunk Universal Forwarder.
  3. Start up the Splunk Forwarder – After running the dpkg command for installation, move to the Splunk directory with cd /opt/splunkforwarder/bin. Next, start up the Splunk forwarder with ./splunk start.
  4. Set up forwarding on the Ubuntu machine – The last configuration change is to ensure log files will be forwarded through port 9997. Port 9997 is the default, but it won’t hurt to run the following command: ./splunk add forward-server hostmachineIP:9997.
  5. Configure receiving on the Splunk instance – Finally, now that the install is complete on the host machine, you will need to configure Splunk to receive the log files from the Ubuntu server. On the Splunk instance, enable receiving from the UI in Settings –> Receiving. Ensure that Splunk is listening on the default port of 9997.
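
Pulling those steps together, the whole install looks roughly like this on the Ubuntu host. The version number, indexer hostname, and monitored log path are placeholders for your own values:

    # install the downloaded forwarder package
    sudo dpkg -i /tmp/splunkforwarder-<version>.deb

    # start the forwarder and accept the license
    cd /opt/splunkforwarder/bin
    sudo ./splunk start --accept-license

    # point the forwarder at the receiving Splunk instance (9997 is the default port)
    sudo ./splunk add forward-server indexer.example.com:9997

    # tell the forwarder which log files to watch
    sudo ./splunk add monitor /var/log/myapp/app.log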

Final Thoughts on Splunk Universal Forwarder

Splunk forwarding is the secret sauce for Splunking. It allows data to be streamed in real time to the main Splunk instance with little performance impact on the host machine. Installation of Splunk Universal Forwarders is a little tricky at first, but once you get one installed, the next ones are simple.

Filed Under: Splunk Tagged With: Data Analytics, Splunk, Unstructured Data

New Video Series: Isilon Quick Tips

December 27, 2016 by Thomas Henson 4 Comments

How can I protect my data in HDFS?

What is Isilon and how does it work with HDFS?

In the coming post I will explain how Isilon makes Hadoop so much easier to manage. First I thought I’d cover the basics on Isilon in my Isilon Quick Tips series below.

Isilon Quick Tips

Hadoop Career

Over a year ago I switched teams to join Dell EMC, working on the Data Lake team. One of the platforms I work with is the Isilon scale-out NAS (Gartner’s #1 in scale-out NAS). It’s a really mind-blowing system that supports HDFS as a protocol, along with the NFS, SMB, REST, Swift, HTTP, and FTP protocols. Think of being able to move data into HDFS by just moving a file in your Windows environment. Oh, and by the way, it scales up to 90 PB of data (talk about BIG DATA).

What makes Isilon so awesome isn’t just the hardware but the software that runs it. OneFS is the software that gives Isilon its power to store data at astronomical heights. One file system, or OneFS, is key to giving developers the ability to access Hadoop data through HDFS while using other protocols. Think about not having to land data on your machine before ingesting it into HDFS. All of this is possible because OneFS treats HDFS as a protocol, not a storage system, so data can sit on Isilon but be read as HDFS.
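
To make that concrete, here is a hedged sketch of the multiprotocol flow. The SmartConnect name and paths are hypothetical, and it assumes the cluster’s HDFS root is the default /ifs, so a file written over SMB is immediately visible through the HDFS protocol with no ingest step:

    :: from Windows, drop a file onto the Isilon SMB share
    copy sales_2016.csv \\isilon-smartconnect.example.com\ifs\data\

    # from a Hadoop client, the same file is already "in HDFS"
    hdfs dfs -ls hdfs://isilon-smartconnect.example.com:8020/data/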

A huge benefit of using Isilon for HDFS storage comes when replicating data for data protection. I’ll follow up with a blog post dedicated to data protection in Hadoop in the future; just know Isilon provides that missing piece in Hadoop for replication and data protection. Want to replicate or copy over 20 PB of data? No problem, just use SyncIQ in OneFS.

Share the Isilon Knowledge

Along the way on the Data Lake team I’ve acquired some knowledge about managing Isilon clusters and wanted to get it out to the community. All these demos can be done using the Isilon Simulator on your local machine. The demos are meant to be easily consumable; all should be around 5 minutes long, with a few outliers that bump up to an hour.

Isilon Quick Tips Videos Links

  • Isilon Quick Tips: Demo using SnapShotIQ to retrieve delete files with Windows Shadow Copy
  • Isilon Quick Tips: Quick walk through on setting up a one-time SyncIQ job in OneFS
  • Isilon Quick Tips: Deep Dive into SyncIQ options for customizing your backup strategy
  • Isilon Quick Tips: Setting SmartQuotas to manage capacity on your Isilon Cluster
  • Isilon Quick Tips: Learn how to setup an NFS export in OneFS
  • Isilon Quick Tips: Changing Password through the Web interface in OneFS 8.0
  • Isilon Quick Tips: Setting Up SMB Shares in OneFS
  • Isilon Quick Tips: Enabling FTP in OneFS
  • Isilon Quick Tips: Compare Snapshots in OneFS

Be sure to subscribe to my YouTube channel to ensure that you never miss an Isilon Quick Tip or other Hadoop related tutorials. As always leave a comment or drop me an email with any ideas you have about new topics or things I’ve missed in my posts.

Filed Under: Isilon Tagged With: Isilon, OneFS, Quick Tip

Splunking on Hadoop with Hunk (Preview)

December 23, 2016 by Thomas Henson Leave a Comment

Splunking on Hadoop with Hunk

So I’ve seen a lot of people asking what my Pluralsight course, Analyzing Machine Data with Splunk, covers.

Well, for starters it covers a ton about starting out in Splunk. Admins and developers will quickly set up a Splunk development environment, then fast-forward to using Splunkbase to expand their use cases. However, the most popular portion of the course is the deep dive into Hunk.

Hunk is Splunk’s plugin that allows data to be imported from Hadoop or exported into Hadoop. Both Splunk and Hadoop are huge in analytics (big understatement here), and with Hunk, users can visualize their Hadoop data in Splunk. One of the biggest complaints about Hadoop is the poor visualization tooling available to this thriving community. Many admins are already using Splunk, so it’s no wonder Splunk is filling that gap.

In my Analyzing Machine Data with Splunk course I dig into using Hunk in the Splunking on Hadoop with Hunk module. This module is close to 40 minutes of Hunk material, from setting up Hunk to moving stock data from HDFS into Hunk. I’ve worked with Pluralsight to set up a quick 8-minute preview video of the Splunking on Hadoop with Hunk module. Check it out, and be sure to watch on Pluralsight for the full Hunk deep dive.

 

 Splunk on Hadoop with Hunk (Preview)

Never miss an update on Hadoop, Splunk, and Data Analytics.

Filed Under: Splunk Tagged With: Hadoop, HDFS, Splunk

Top 9 SPL Commands in Splunk For Splunk Ninjas

December 19, 2016 by Thomas Henson 1 Comment