Thomas Henson


Archives for May 2017

Big Data Big Questions: Do I need to know Java to become a Big Data Developer?

May 31, 2017 by Thomas Henson


Today there are so many applications and frameworks in the Hadoop ecosystem, most of which are written in Java. So does this mean anyone wanting to become a Hadoop developer or Big Data Developer must learn Java? Should you go through hours and weeks of training to learn Java to become an awesome Hadoop Ninja or Big Data Developer? Will not knowing Java hinder your Big Data career? Watch this video and find out.

Transcript Of The Video

Thomas Henson:

Hi, I’m Thomas Henson with thomashenson.com. Today, we’re starting a new series called “Big Data, Big Questions.” This is a series where I’m going to answer questions, all from the community, all about big data. So, feel free to submit your questions, and at the end of this episode, I’ll show you how. So, today, the first question I have is a very common question. A lot of people ask, “Do you need to know Java in order to be a big data developer?” Find out the answer, right after this.

So, do you need to know Java in order to be a big data developer? The simple answer is no. Maybe that was the case in early Hadoop 1.0, but even then, there were a lot of tools that were being created like Pig, and Hive, and HBase, that are all using different syntax so that you can extrapolate and kind of abstract away Java. Because the key is, if you’re a data analyst or a Hadoop administrator, most of those people aren’t going to have Java skills. So, for the community to really move forward with this big data and Hadoop, we needed to be able to say that it was a tool that not only Java developers were going to be able to use. So, that’s where Pig, and Hive, and a lot of those other tools came. Now, as we start to look into Hadoop 2.0 and Hadoop 3.0, it’s really not the case.

Now, Java is not going to hinder you, right? So, it’s going to be beneficial if you do know it, but I don’t think it’s something that you would want to run out and have to learn just to be able to become a big data developer. Then, the question is, too, when you say big data developer, what are we really talking about? So, are we talking about somebody that’s writing MapReduce jobs or writing Spark jobs? That’s where we look at it as a big data developer. Or, are we talking about maybe a data scientist, where a data scientist is probably using more like R, and Python, and some of those skills, to pull their insights back? Then, of course, your Hadoop administrators, they don’t need to know Java. It’s beneficial if they know Linux and some of the other pieces, but Java’s not really necessary.

Now, I will say, in a lot of this technology… So, if you look at getting out of the Hadoop world but start looking at Spark – Spark has Java, so you can write your Spark jobs in Java, but you can also do it in Python and Scala. So, it’s not a requirement for people to have Java. I would say that there’s a lot of developers out there that are big data developers that don’t have any Java skills, and that’s quite okay. So, don’t let that hinder you. Jump in, join an open-source community project, do something to expand your big data knowledge and become a big data developer.

Well, that’s all we have today. Make sure to submit your questions. So, I’ve got a space on my blog where you can submit the questions or just submit them here, in the comments section, and I’ll answer your big data big questions. See you again!

 

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, Hadoop, Learn Hadoop

Complete Guide to Splunk Add-Ons

May 24, 2017 by Thomas Henson

Splunk is a popular application for analyzing machine data in the data center. What happens when Splunk Administrators want to add new data sources to their Splunk environment outside the default list?

Administrators have two options:

  • First, they can import the data source using the regular expression option, which is only fun if you like regular expressions.
  • Second, they can use a Splunk Add-On or Application.

Let’s learn how Splunk Add-Ons are developed and how to install them.


How to Create Splunk Plugins

Developers have a few options for creating Splunk Applications or Add-Ons. Let’s step through the options for creating Splunk Add-Ons, going from easiest to hardest.

The first option for creating a Splunk Add-On is the dashboard editor inside the Splunk app. Using the dashboard editor, you can create custom visualizations of your Splunk data. Simply click to add custom searches, tables, and fields, then save the dashboard and test out the Splunk Application.

The second option developers have is to use XML or HTML markup inside the Splunk dashboard. Using either markup language gives developers more flexibility over the look and feel of their dashboards. Most developers with basic HTML, CSS, and XML skills will choose this option over the standard dashboard editor.
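
Simple XML is the markup behind these dashboards. As a rough sketch only, a minimal view could be written straight into an app’s views directory from the shell; the app name, view name, and search below are placeholder assumptions, not anything taken from Splunk’s documentation:

# Sketch only: app name, view name, and search are placeholders
cat > $SPLUNK_HOME/etc/apps/search/local/data/ui/views/add_on_example.xml <<'EOF'
<dashboard>
  <label>Add-On Example</label>
  <row>
    <panel>
      <table>
        <search>
          <query>index=_internal | stats count by sourcetype</query>
          <earliest>-24h@h</earliest>
          <latest>now</latest>
        </search>
      </table>
    </panel>
  </row>
</dashboard>
EOF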

The last option inside the local Splunk environment is SplunkJS. Out of all the options for creating applications in the local Splunk environment, SplunkJS allows the greatest control. Developers with intermediate JavaScript skills will find SplunkJS fairly easy to use, while those without JavaScript skills will have a more difficult time.

Finally, for developers who want the most control and flexibility in their Splunk Add-Ons, Splunk offers Application SDKs. These applications leverage the Splunk REST API and allow developers to write the application in their favorite language (see the sketch after the list below). Using an SDK is by far the most difficult option, but it also creates the ultimate Splunk Application.

Splunk Application SDK options:

  • JavaScript
  • C#
  • Python
  • Java
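
Whichever SDK you choose, it wraps the same Splunk REST API, so a quick way to see what the SDKs are doing under the hood is to call the API directly. A minimal sketch with curl, assuming a local instance on the default management port 8089 and placeholder credentials:

# Sketch: run a search through the REST API that the SDKs wrap (host and credentials are placeholders)
curl -k -u admin:changeme https://localhost:8089/services/search/jobs/export \
  -d search="search index=_internal | head 5" \
  -d output_mode=json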

What is Splunkbase

After developers create their applications, they can upload them to Splunkbase. Splunkbase is the de facto marketplace for Splunk Add-Ons and Applications. It’s a community-driven marketplace for both licensed (paid) and non-licensed (free) Splunk Add-Ons and Applications. Splunk-certified applications give users confidence that an application is secure and mature.

Think of Splunkbase as Apple’s App Store. Users download applications that run on top of iOS to extend the functionality of the iPhone, and both community and corporate developers build those apps. Just like the iOS App Store, Splunkbase offers both paid and free Applications and Add-Ons.

How to install from Splunkbase

The local Splunk environment integrates with Splunkbase, which means Splunkbase installs are seamless. Let’s walk through a scenario below of installing Splunk Analytics for Hadoop in my local Splunk environment.

Steps for Installing App from Splunkbase:

  1. First, log into the local Splunk environment
  2. Second, click Splunk Apps
  3. Next, browse for “Splunk Analytics for Hadoop”
  4. Click Install and enter login information
  5. Finally, view the App to begin using it

Another option is to install directly into the local Splunk environment: download the application package and upload it to the local Splunk environment yourself. Make sure to practice good Splunk hygiene by only downloading trusted Splunk Apps.
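
For the direct-download route, the Splunk CLI can also handle the install. A minimal sketch, assuming the package was downloaded to /tmp and Splunk lives at $SPLUNK_HOME; the file name and credentials are placeholders:

# Sketch: install a downloaded app package with the Splunk CLI (file name and credentials are placeholders)
$SPLUNK_HOME/bin/splunk install app /tmp/splunk-analytics-for-hadoop.tgz -auth admin:changeme
# Some apps need a restart before they show up in Splunk Web
$SPLUNK_HOME/bin/splunk restart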

Closing thoughts on Splunk Apps & Add-Ons

In addition to extending Splunk, Add-Ons increase the Splunk environment’s use cases. The challenge with Splunk is that as users begin working with it, they want to add new data sources. Often those new data sources are supported out of the box, but when they aren’t, Splunk’s community of App developers fills the gap. Splunk’s hockey-stick adoption comes from this ability to add new data sources, and the hunt for new insights is constantly pushing new data sources into Splunk.

Looking to learn more about Splunk? Check out my Pluralsight course Analyzing Machine Data with Splunk.

 

Filed Under: Splunk Tagged With: Data Analytics, IT Operations, Splunk

Isilon Quick Tips: Deep Dive FTP

May 17, 2017 by Thomas Henson

Deep Dive into FTP on OneFS

(Part 2 of my Isilon Quick Tips on FTP and my talk at the Huntsville Isilon User Working Group)

The FTP protocol is one of the most overlooked protocols in OneFS. On the surface there doesn’t appear to be more than a couple of options for FTP, but jump into the CLI and you will find a ton of options. One of the options I stumbled on recently was the ability to lock users to their home directories using a chroot jail.


OneFS FTP CLI Commands

List out the verbose FTP settings in OneFS:

isi ftp settings view

Command to restrict all users to their home directory after they log in:

isi ftp settings modify --chroot-local-mode=all

Options for --chroot-local-mode:
  all
  none
  all-with-exceptions
  none-with-exceptions
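
To confirm the change took effect, the settings view command from above can be filtered down to the chroot fields:

isi ftp settings view | grep -i chroot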

 

Isilon Quick Tips: FTP Deep Dive Transcript

(Excuse any errors, transcribed by a machine)

Hi, I’m Thomas Henson with thomashenson.com, and welcome back to another episode of Isilon Quick Tips. Today we’re going to talk about how to lock users down to their own directories when they’re using the FTP protocol to access data inside of OneFS. So without hesitation, let’s jump into our virtual Isilon cluster. Inside of OneFS I’ve set up a test user account; this is the account I’m going to use to access my data via FTP. To log in, I’m using WinSCP to set up an FTP connection to our Isilon cluster, and you can see that I’m dropped out here in the root directory. I can traverse around, go to the temp directory, and look at the different data directories on our Isilon cluster. Even though I don’t have access to these directories, I’m still able to see them, but I can’t actually look at the files. It’s a good thing that I can’t access the files, but what happens if we want to lock users down so they can’t even see those directories? Well, there’s this thing called isi ftp settings modify, and it gives us a lot more options than what we have from the web interface. You can see here there’s an option called chroot local mode, and by setting chroot local mode to all, we’re going to lock every user that connects down to their own home directory. Now if I look at the settings we’ve modified by using isi ftp settings view, you can see that all our local users are locked down to their own directories. If we try to log in with our test user account, we can go to our own directory, but we’re not able to traverse to any other directory. It’s as if the only directory that exists is ours, which is a good thing: we want to lock our users down so they can’t see the other files on the Isilon cluster, but we can still move data over. So if we wanted to move over our running log file, we’re able to do that. And now you can see how you can lock users down to their own directories with the FTP protocol. Make sure you subscribe so that you never miss an episode of Isilon Quick Tips. See you next time!
[Music]

Filed Under: Isilon Tagged With: Isilon, quick tips

7 Commands for Copying Data in HDFS

May 15, 2017 by Thomas Henson

What happens when you need a duplicate file in two different locations?

On the surface it’s a trivial problem: you just need to copy that file to the new location. In Hadoop and HDFS you can copy files easily; you just have to understand how you want to copy and then pick the correct command. Let’s walk through all the different ways of copying data in HDFS.

HDFS dfs or Hadoop fs?

Many commands in HDFS are prefixed with either hdfs dfs -[command] or the legacy hadoop fs -[command], although not all hadoop fs and hdfs dfs commands are interchangeable. To ease the confusion, below I have broken down both the hdfs dfs and hadoop fs copy commands. My preference is the hdfs dfs prefix over hadoop fs.
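
For everyday file operations the two prefixes behave the same, so both of these list the same example directory used throughout this post:

hadoop fs -ls /user/thenson
hdfs dfs -ls /user/thenson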

Copy Data in HDFS Examples

The example commands assume my HDFS data is located in /user/thenson and local files are in the /tmp directory (not to be confused with the HDFS /tmp directory). The example data is the loan data set from Kaggle. Using the same data set or file structure isn’t necessary; it’s just a frame of reference.


Hadoop fs Commands

Hadoop fs cp – Easiest way to copy data from one source directory to another. Use the hadoop fs -cp [source] [destination].

hadoop fs -cp /user/thenson/loan.csv /loan.csv

Hadoop fs copyFromLocal – Need to copy data from local file system into HDFS? Use the hadoop fs -copyFromLocal [source] [destination].

hadoop fs -copyFromLocal /tmp/loan.csv /user/thenson/loan.csv

Hadoop fs copyToLocal – Copying data from HDFS to local file system? Use the hadoop fs -copyToLocal [source] [destination].

hadoop fs -copyToLocal /user/thenson/loan.csv /tmp/


HDFS dfs Commands

HDFS dfs cp – Easiest way to copy data from one source directory to another. The same as using hadoop fs -cp. Use the hdfs dfs -cp [source] [destination].

hdfs dfs -cp /user/thenson/loan.csv /loan.csv

HDFS dfs copyFromLocal – Need to copy data from local file system into HDFS? The same as using hadoop fs -copyFromLocal. Use the hdfs dfs -copyFromLocal [source] [destination].

hdfs dfs -copyFromLocal /tmp/loan.csv /user/thenson/loan.csv

HDFS dfs copyToLocal – Copying data from HDFS to local file system? The same as using hadoop fs -copyToLocal. Use the hdfs dfs -copyToLocal [source] [destination].

hdfs dfs -copyToLocal /user/thenson/loan.csv /tmp/loan.csv

Hadoop Cluster to Cluster Copy

Distcp used in Hadoop – Need to copy data from one cluster to another? Use MapReduce’s distributed copy (distcp) to move data with a MapReduce job. For the command listed below, the original data exists on the namenode cluster in the /user/thenson directory and is being transferred to the newNameNode cluster. Make sure to use the full HDFS URL in the command: hadoop distcp [source] [destination].

hadoop distcp hdfs://namenode:8020/user/thenson hdfs://newNameNode:8020/user/thenson
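
At scale it is also worth knowing distcp’s -update flag, which copies only the files that are missing or changed at the destination instead of re-copying everything. A small sketch against the same two clusters:

# Incremental cluster-to-cluster copy: only missing or changed files are transferred
hadoop distcp -update hdfs://namenode:8020/user/thenson hdfs://newNameNode:8020/user/thenson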

It’s the Scale that Matters

While copying data is a simple matter in most applications, everything in Hadoop is more complicated because of the scale. When copying data in HDFS, make sure to understand the use case and scale, then choose one of the commands above.

Interested in learning more HDFS commands? Check out my Top HDFS Commands post.

Filed Under: Hadoop Tagged With: Hadoop, Hadoop Distributed File System, HDFS, HDFS Commands, Learn Hadoop

Ultimate Big Data Battle: Batch Processing vs. Streaming Processing

May 8, 2017 by Thomas Henson

Today developers are analyzing Terabytes and Petabytes of data in the Hadoop Ecosystem, and many projects are helping to accelerate this innovation. All of these projects rely on batch and streaming processing, but what is the difference between batch and streaming processing? Let’s dive into the debate around batch vs. streaming.


What is Streaming Processing in the Hadoop Ecosystem

Streaming processing typically takes place as the data enters the big data workflow. Think of streaming as processing data that has yet to enter the data lake. While the data is queued, it’s being analyzed, and as new data enters, it is read and the results are recalculated. A streaming processing job is often called a real-time application because of its ability to process quickly changing data. While streaming processing is very fast, it has yet to be truly real-time (maybe some day).

The reason streaming processing is so fast is that it analyzes the data before it hits disk, and reading data from disk incurs more latency than reading from RAM. Of course, this all comes at a cost: you are bound by how much data you can fit in memory (for now).

To understand the differences between batch and streaming processing, let’s use a real-time traffic application as an example. The traffic application is a community-driven driving application that provides real-time traffic data. As drivers report conditions on their commute, the data is processed and shared with other commuters. The data is extremely time sensitive, since finding out about a traffic stop or fender bender an hour later would be worthless information. Streaming processing is used to provide the updates on traffic conditions, estimate the time to destination, and recommend alternative routes.


What is Batch Processing in the Hadoop Ecosystem

Batch processing and Hadoop are often thought of as the same thing. All data is loaded into HDFS, and then MapReduce kicks off a batch job to process it. If the data changes, the job needs to be run again. It is step-by-step processing that can be paused or interrupted, but not changed from a data set perspective. For a MapReduce job, the data typically already exists on disk in HDFS. Since the data already lives on the DataNodes, it must be read from each disk in the cluster where it is stored, and shuffling that data and its results becomes the constraint in batch processing.

That’s not a big deal unless the batch process takes longer than the data stays valuable. Using the data lake analogy, batch analysis takes place on data in the lake (on disk), not on the streams (data feeds) entering the lake.
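
For a concrete picture of a batch job, here is a minimal sketch that runs the word count example shipped with Hadoop over data already sitting in HDFS; the jar path and version are assumptions that vary by distribution:

# Sketch: a classic batch job over data already in HDFS (jar path and version are assumptions)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar \
  wordcount /user/thenson/input /user/thenson/wordcount-output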

Let’s step back into the traffic application to see how batch is used. What happens when a user wants to find out what her commute time will be for a future trip? In that case, real-time data is less important (the further away the trip, the less it matters) and the historical data is the key to building that model. Predicting the commute can be handled by a batch engine because the data has typically already been collected.


Is Streaming Better Than Batch?

Asking if streaming is better than batch is like asking which Avenger is better. If the Hulk can tear down buildings, does that make him better than Iron Man? What about the scene in Age of Ultron when the Avengers were trying to reflect off the Vibranium core? How would all that strength have helped there? In that case, Iron Man and Thor were better than the Hulk.

Just like with the Avengers, streaming and batch are better when working together. Streaming processing is extremely well suited for cases where time matters. Batch processing shines when all the data has been collected and is ready for testing models. There is no “one is better than the other” argument to be had right now with batch vs. streaming.

The Hadoop eco-system is seeing a huge shift toward streaming and batch coupled together to provide both processing models. Both workflow types come at a cost, so analyzing data with a streaming workflow that could have been analyzed in a batch workflow adds cost to the results. Be sure the workflow matches the business objective.

Batch and Streaming Projects

Both workflows are fundamental to analyzing data in the Hadoop eco-system. Here are some of the projects and which workflow camp they fall into:

MapReduce – MapReduce  is where it all began. Hadoop 1.0 was all about storing your data in HDFS and using MapReduce to analyze that data once it was loaded into HDFS. The process could take hours or days depending on the amount of data.

Storm – Storm is a real-time analysis engine. Where MapReduce processes data in batches, Storm does its analysis on streams, as data is ingested. Once seen as the de facto streaming analysis engine, it has lost some of that momentum with the emergence of other streaming projects.

Spark – Processing engine for streaming data at scale. Most popular streaming engine in the Hadoop eco-system with the most active contributors. Does not require data to be in HDFS for analysis.

Flink – Hybrid processing engine that supports both streaming and batch processing models. Data is broken down into bounded (batch) and unbounded (streaming) sets. At its core it is a stream processing engine that incorporates batch processing.

Beam – Another hybrid engine that breaks processing into streaming and batch. Runs on both Spark and Flink. Heavy support from the Google family. A project with a great deal of optimism around it right now in the Hadoop eco-system because of its ability to run both batch and streaming processing depending on the workload.

Advice on Batch and Streaming Processing

At the end of the day, a solid developer will want to understand both workflows. It’s all going to come down to the use case and how either workflow will help meet the business objective.

Filed Under: Big Data

Big Data MBA Book Review

May 1, 2017 by Thomas Henson


Big Data MBA Book Review Video

Today’s book review, on the Big Data MBA, holds a special place in my library. I had read this book before meeting Bill Schmarzo, and after sharing a steak with him I reread it. It was already an amazing book to me because, as a developer, it opened my eyes to many of the problems I’ve had on projects. Hadoop and Big Data projects are especially bad about missing the business objective. Many times the process for using a new framework goes down like this…

Manager/Developer 1:  “We have to start using Hadoop”

Questions the team should ask:

  • What is the business purpose of taking on this project?
  • How will this help us solve a problem we are having?
  • Will this project generate more revenue? How much more?

What the team really does:

  • Quick search on StackOverflow for Hadoop related questions and answers
  • Research on best tools for using Hadoop
  • Find a Hadoop conference to attend

Boom! Now we are doing Hadoop. Forget the fact we don’t have a business case identified yet.

One of the things stressed in the Big Data MBA is connecting a single business problem to Big Data Analytics. Just like how User Stories in Scrum describe the work developers will do, our business problem will describe the data used to solve the problem.

The Big Data MBA is a book about setting up the business objectives to tackle. Once those objectives are ferreted out and the data is identified, developers can work their magic. Understanding how to map business objectives to your technology is key for any developer or engineer; in fact, the more you understand this, the further you can go in your career. For this reason, I highly recommend this book for anyone working with Big Data.

 

Transcript

(Forgive any errors; it was transcribed by a machine.)
Hi folks, and welcome back to thomashenson.com. Today’s episode is all about Big Data, and it’s a book review that I’ve been wanting to do for quite some time, so stay tuned.

[Music]

So today’s book review is the Big Data MBA by Bill Schmarzo, a fellow Dell EMC employee and a person who worked at Yahoo during the early days of data analytics and ad buying, and also the Hadoop era. This book focuses on the business objectives of Big Data, a lot of things that we as developers and technologists tend to overlook, and I know I have in the past. It’s all about, okay, we want to take Hadoop and go implement it, but this comes back to what the business objectives are. One of the things that I really like about this book is Bill talks about how anything over six months is really just a science experiment, and that’s really an agile principle. So if you’re in the DevOps and agile software development world, you’ll understand the concept of, hey, let’s find one or two small objectives that we can make a quick impact on, anywhere from 6 to 12 weeks, and then build on those use cases. A couple of the examples he uses are on just single products. So instead of trying to increase the revenue of all your products by 10 or 20 percent, he says let’s just pick one or two. I really like that approach, because what you can do is get everybody together, so it’s not just your developers, your business analysts, and the product owners; it’s people from marketing, your executives, everybody gets in a room with a lot of whiteboards up, and you actually sit down and talk about these objectives. So if we’re going to increase the revenue of one product in two months, what are we going to do? We’ll look at what we have from a data perspective and start data mapping: this is the data we currently have, what are some outside data sources we can bring in, and what would help us answer questions? What questions would we love to be able to answer about our customer, and is there data already out there about that? So I really like this book, and I think anybody that’s involved in big data or data analytics should read it. It’s definitely high-level business objectives, but even for developers, your big data developers, I think it’s really important to understand those objectives. One of the big reasons a lot of projects in software development and Big Data fail is that we don’t tie them to a business objective. We have a tool or widget or framework that we want to use, and it’s great, but we’re having a hard time bringing it to the CFO or upper-level management and explaining what objective and what benefit we’re going to get out of using this tool. So for people who are involved in Big Data, even from the development side, I think this will help you champion those initiatives and have more successful projects too. So make sure you check out the Big Data MBA by Bill Schmarzo, and to keep in tune with more Big Data tips, make sure to subscribe to my YouTube channel or check out thomashenson.com. Thanks!

[Music]

Filed Under: Book Review Tagged With: Big Data, Book Review, Books
