Big Data Lambda Architecture Explained

June 29, 2017 by Thomas Henson 2 Comments


What is Lambda Architecture?

Since Spark, Storm, and other stream processing engines entered the Hadoop ecosystem, the Lambda Architecture has been the de facto architecture for Big Data workloads with a real-time processing requirement. In this episode of Big Data Big Questions I’ll explain what the Lambda Architecture is and how developers and administrators can implement it in their Big Data workflow.

Transcript

(forgive any errors text was transcribed by a machine)

Hi folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question is: what is the Lambda Architecture, and how does that relate to our big data and Hadoop ecosystem? Find out right after this.

When we talk about the Lambda Architecture and how that’s implemented in Hadoop, we have to go back and look at Hadoop 1.0 and 2.0, when we really didn’t have a speed layer or Spark for streaming analytics. Back in the traditional days of Hadoop 1.0 and 2.0 we were using MapReduce for most of our processing. The way that would work is our data would come in, we would pull it into HDFS, and once our data was in HDFS we would run some kind of MapReduce job, whether we used Pig or Hive, wrote our own custom job, or used some of the other frameworks in the ecosystem. That was all mostly transactional, right? All our data had to be in HDFS, so we had to have a complete view of our data to be able to process it.

Later on we started looking at it and seeing that, hey, we need to be able to pull data in and process it when the data is not really complete. It’s less transactional when we have incomplete parts of the data or the data is continuing to be updated. That’s where Spark and Flink and some of the other streaming analytics and stream processing engines came in: we wanted to be able to process that data as it came in, and do it a lot faster too. We took out the need to even put it into HDFS before we start processing it, because that takes time to write. We wanted to be able to move our data and process it before it ever hit HDFS.

But we still needed to be able to process that data in batch. Some analytics we want in real time, but there are other insights, like monthly or quarterly reports, that are just better suited to batch. And when we start to talk about how we process and hold on to historical data, we use Hadoop kind of like a traditional enterprise data warehouse, just on a larger platform, with Hive, Presto, and some of the other SQL engines that work on top of Hadoop.

So the need came where we had these two different systems for processing data, and we started adopting the Lambda Architecture. With the Lambda Architecture, as your data comes in it sits in a queue, maybe Kafka or some other kind of message queue. Any data that needs to be pulled out and processed as a stream, we take and process in what we call our speed layer. In the speed layer we might use Spark or Flink to pull out some insights and push those right out to our dashboards. For the data that’s going to be used for batch, for transactional processing, or just held as historical data, we have our batch layer built on MapReduce.

So if you think about two different prongs, you have your speed layer pulling out your insights as data comes in, but your data, as it sits in the queue, also goes into HDFS. It’s still there to run Hive on top of, to hold for historical data, or maybe to run some MapReduce jobs and push results up to a dashboard. So what we have is a two-pronged approach, with your speed layer on top and your batch layer on the bottom. As that data comes in, you still have your data in HDFS, but you’re also able to pull insights from your real-time processing as the data arrives.

That’s what we’re talking about when we say Lambda Architecture: a two-layer system with a batch layer to do our MapReduce and batch jobs, and a speed layer to do our streaming analytics, whether it be through Spark, Flink, Apache Beam, or some of the other pieces. It’s a really good architecture to know, and it’s been in the industry for quite a long time, so if you’re new to the Hadoop environment you definitely want to know it and be able to reference back to it. There are some other architectures that we’ll talk about in future episodes, so make sure you subscribe so that you never miss an episode. Go right now and subscribe so that the next time we talk about an architecture you don’t miss it, and I’ll check back with you next time. Thanks, folks!
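
To make the two layers a little more concrete, here is a minimal PySpark Structured Streaming sketch of a speed layer reading from a Kafka queue. The broker address, topic name, and aggregation are hypothetical placeholders, and running it assumes the Spark Kafka connector package is on the classpath; the batch layer would be a separate job landing the same events in HDFS.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lambda-speed-layer").getOrCreate()

# Speed layer: read events straight off the Kafka queue as they arrive
# (broker and topic names here are made up for the example).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load())

# Pull out a quick insight (counts per key) and push it to a sink a
# dashboard could read; the console sink stands in for that here.
counts = events.groupBy("key").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()

# Batch layer (separate job, not shown): the same raw events also land in
# HDFS so Hive or MapReduce can run the monthly and historical reports.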

Filed Under: Streaming Analytics Tagged With: Hadoop, Hadoop Architecture, Streaming Analytics

Is Hadoop Killing the EDW?

June 27, 2017 by Thomas Henson Leave a Comment

Is Hadoop Killing the EDW? It’s a fair question, since in its 11th year Hadoop is still known as the innovative kid on the block for analyzing large data sets. If the Hadoop ecosystem can analyze large data sets, will it kill the EDW?

The Enterprise Data Warehouse has ruled the data center for the past couple of decades. One of the biggest big data questions I get is: what’s up with the EDW? Most database developers and architects want to know the future of the EDW.

In this video I will give my views on whether Hadoop is killing the EDW!

Transcript

(forgive any errors text was transcribed by a machine)

Hi, I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question is a little bit controversial: is big data killing the enterprise data warehouse? Let’s find out.

So is the death of the enterprise data warehouse coming, all because of big data? The simple answer is: in the short term and the medium term, no. But it really is hampering the growth of those traditional enterprise data warehouses, and part of the reason is the deluge of all this unstructured data. Around 80% of all the data in the data center, and in the world, is unstructured. If you think about enterprise data warehouses, they’re very structured, and they’re very structured because they need to be fast. They support our applications and our dashboards. But when it comes to trying to analyze that data, getting unstructured data into a structured form really starts to blow up the storage requirements on your enterprise data warehouse.

Part of the reason enterprise data warehouse growth is slow is that about 70% of the data in those warehouses is really cold data. Only about 30% of the data in your enterprise data warehouse is actually used, and normally that’s your newest data. That cold data is sitting there on your premium, fast storage, taking up space that carries the licensing fees for your enterprise data warehouse as well as the premium storage and hardware it’s sitting on. Couple that with the fact that, as we said, 80% of all new data created in the world is unstructured: clickstream data from Facebook and other social media platforms, video, log files, and some of the semi-structured data coming off your Fitbit or other IoT and emerging technologies. If you’re trying to pack all this data into your enterprise data warehouse, it’s going to explode that license fee and that hardware cost, and you don’t even know yet whether this data has any value.

That’s where big data, Hadoop, Spark, and that whole ecosystem come in, because we can store that unstructured data on local storage and analyze it before we need to put it into a dashboard or some kind of application. So in the long term, I think the enterprise data warehouse will start to sunset, and we’re starting to see that right now. But in the immediate term, you’re seeing a lot of people doing enterprise data warehouse offloads: they’re taking some of that 70% of cold data and moving it into a Hadoop environment to save on costs and licensing fees, but also to marry it with the new unstructured data coming in from sensors, social media, or anywhere else in the world. They’re combining that data to see if they can pull any insights from it. Once they have insights, depending on the workload, sometimes they push results back up to the enterprise data warehouse, and sometimes they use some of the newer projects to have their big data architecture support those production enterprise data warehouse applications directly.

That’s all we have for today. If you have any questions, make sure you submit them to Big Data Big Questions. You can do that in the comments below or on my website, thomashenson.com. Thanks, and I’ll see you again next time.
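
As a rough illustration of that offload pattern, here is a minimal PySpark sketch that pulls a cold table out of a warehouse over JDBC and lands it in HDFS as Parquet. The connection URL, table name, credentials, and paths are hypothetical, and it assumes the matching JDBC driver is on the Spark classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("edw-offload").getOrCreate()

# Pull a cold, rarely queried table out of the warehouse over JDBC.
cold_orders = (spark.read
               .format("jdbc")
               .option("url", "jdbc:postgresql://edw-host:5432/warehouse")
               .option("dbtable", "orders_2014")
               .option("user", "etl_user")
               .option("password", "change_me")
               .load())

# Land it in HDFS as Parquet, where it is cheap to keep and still available
# to Hive, Spark SQL, or MapReduce for historical reports.
cold_orders.write.mode("overwrite").parquet("hdfs:///data/offload/orders_2014")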

Filed Under: Big Data Tagged With: Business, EDW, Hadoop

DataWorks Summit 2017 Recap

June 19, 2017 by Thomas Henson Leave a Comment

All Things Data

Just coming off an amazing week with a ton of information in the Hadoop ecosystem. It’s been 2 years since I’ve been to this conference. Some things have changed, like the name from Hadoop Summit to DataWorks Summit. Other things stayed the same, like breaking news and extremely great content.

I’ll try to sum up my thoughts from the sessions I attended and people I talked with.

First there was an insanely great session called The Future Architecture of Streaming Analytics, put on by a very handsome Hadoop guru, Thomas Henson. It was a well-received session where I talked about how to architect streaming applications for the next 2-5 years, when we will see some 20 billion plus connected devices worldwide.


Hortonworks & IBM Partnership

Next there was breaking news about the Hortonworks and IBM partnership. The huge part of the partnership is that IBM’s BigInsights will merge with the Hortonworks Data Platform. Both IBM and Hortonworks are part of the Open Data Platform initiative.

What does this mean for the Big Data community? More consolidation of Hadoop distribution packages, but also more collaboration on the big data frameworks. This is good for the community because it allows us to focus on the open-source frameworks themselves. Now, instead of having to work through the differences between BigInsights and HDP, development effort will be poured into Spark, Ambari, HDFS, etc.

Hadoop 3.0 Community Updates

The news about updates coming with the next release of Hadoop 3.0 was great! There is a significant amount of change coming with the release, which is slated for GA on August 15, 2017. The big focus is going to be the introduction of Erasure Coding for data striping, container support for YARN, and some minor changes. Look for an in-depth look at Hadoop 3.0 in a follow-up post.

Hive LLAP

If you haven’t looked deeply at Hive in the last year or so…you’ve really missed out. Hive is really starting to mature into an EDW on Hadoop! I’m not sure how many different breakout sessions there were on Hive LLAP, but I know it was mentioned in most of the ones I attended.

The first Hive breakout session was hosted by Hortonworks co-founder Alan Gates. He walked through the latest updates and the future roadmap for Hive. The audience was also posed a question: what do we expect in a Data Warehouse?

  • Governance
  • High Performance
  • Management & Monitoring
  • Security
  • Replication & DR
  • Storage Capacity
  • Support for BI

We walked through where the Hive community stands in addressing these requirements. Hive LLAP was certainly there on the high performance front. More on that now…

Another breakout session focused on a shoot-out for the Hadoop SQL engines. Wow, this session was packed and very interesting. Here is the list of SQL engines tested in the shoot-out:

  • MapReduce
  • Presto
  • Spark SQL
  • Hive LLAP

All the tests were run using the Hive benchmark on the same hardware. Hive LLAP was the clear winner, with MapReduce the huge loser (no surprise here). Spark SQL performed really well, but there were issues using the Thrift server which might have skewed the results. Kerberos was also not implemented in the testing.

Pig Latin Updates

Of course there were sessions on Pig Latin! Yahoo presented their results on converting all Pig jobs from MapReduce to Tez. The keynote about Yahoo’s conversion rate from MapReduce jobs to Tez/Spark/etc. shows that Yahoo is still running a ton of Pig jobs. Moving to Tez has increased the speed and efficiency of the Pig jobs at Yahoo. Also, in the next few months Pig on Spark should be released.


Closing Thoughts

After missing last year’s Hadoop Summit (now DataWorks Summit), it was fun to be back. DataWorks Summit is still the premier event for Hadoop developers and admins to come and learn the new features developed by the community. This year the themes seemed to be benchmark testing, Streaming Analytics, and Big Data EDW. It’s definitely an event I will try to make again next year to keep up with the Hadoop community.

 

Filed Under: Big Data Tagged With: DataWorks Summit, Hadoop, Streaming Analytics

Big Data Big Questions: Do I need to know Java to become a Big Data Developer?

May 31, 2017 by Thomas Henson 1 Comment


Today there are so many applications and frameworks in the Hadoop ecosystem, most of which are written in Java. So does this mean anyone wanting to become a Hadoop developer or Big Data Developer must learn Java? Should you go through hours and weeks of training to learn Java to become an awesome Hadoop Ninja or Big Data Developer? Will not knowing Java hinder your Big Data career? Watch this video and find out.

Transcript Of The Video

Thomas Henson:

Hi, I’m Thomas Henson with thomashenson.com. Today, we’re starting a new series called “Big Data, Big Questions.” This is a series where I’m going to answer questions, all from the community, all about big data. So, feel free to submit your questions, and at the end of this episode, I’ll show you how. So, today, the first question I have is a very common question. A lot of people ask, “Do you need to know Java in order to be a big data developer?” Find out the answer, right after this.

So, do you need to know Java in order to be a big data developer? The simple answer is no. Maybe that was the case in early Hadoop 1.0, but even then, there were a lot of tools that were being created like Pig, and Hive, and HBase, that are all using different syntax so that you can extrapolate and kind of abstract away Java. Because the key is, if you’re a data analyst or a Hadoop administrator, most of those people aren’t going to have Java skills. So, for the community to really move forward with this big data and Hadoop, we needed to be able to say that it was a tool that not only Java developers were going to be able to use. So, that’s where Pig, and Hive, and a lot of those other tools came. Now, as we start to look into Hadoop 2.0 and Hadoop 3.0, it’s really not the case.

Now, Java is not going to hinder you, right? So, it’s going to be beneficial if you do know it, but I don’t think it’s something that you would want to run out and have to learn just to be able to become a big data developer. Then, the question is, too, when you say big data developer, what are we really talking about? So, are we talking about somebody that’s writing MapReduce jobs or writing Spark jobs? That’s where we look at it as a big data developer. Or, are we talking about maybe a data scientist, where a data scientist is probably using more like R, and Python, and some of those skills, to pull their insights back? Then, of course, your Hadoop administrators, they don’t need to know Java. It’s beneficial if they know Linux and some of the other pieces, but Java’s not really necessary.

Now, I will say, in a lot of this technology… So, if you look at getting out of the Hadoop world but start looking at Spark – Spark has Java, so you can write your Spark jobs in Java, but you can also do it in Python and Scala. So, it’s not a requirement for people to have Java. I would say that there’s a lot of developers out there that are big data developers that don’t have any Java skills, and that’s quite okay. So, don’t let that hinder you. Jump in, join an open-source community project, do something to expand your big data knowledge and become a big data developer.

Well, that’s all we have today. Make sure to submit your questions. So, I’ve got a space on my blog where you can submit the questions or just submit them here, in the comments section, and I’ll answer your big data big questions. See you again!
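
To illustrate the point from the video, here is a minimal word count, the classic first Hadoop job, written entirely in Python with PySpark and no Java at all. The HDFS path is a hypothetical example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-no-java").getOrCreate()

# Read a text file from HDFS and count the words without writing any Java.
lines = spark.read.text("hdfs:///user/thenson/books/sample.txt")

counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

# Print the first few word counts.
for word, count in counts.take(10):
    print(word, count)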

 

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, Hadoop, Learn Hadoop

7 Commands for Copying Data in HDFS

May 15, 2017 by Thomas Henson Leave a Comment

What happens when you need a duplicate file in two different locations?

It sounds like a trivial problem: you just need to copy that file to the new location. In Hadoop and HDFS you can copy files easily. You just have to understand how you want to copy and then pick the correct command. Let’s walk through all the different ways of copying data in HDFS.

HDFS dfs or Hadoop fs?

Many commands in HDFS are prefixed with either hdfs dfs -[command] or the legacy hadoop fs -[command], although not all hadoop fs and hdfs dfs commands are interchangeable. To ease the confusion, below I have broken down both the hdfs dfs and hadoop fs copy commands. My preference is to use the hdfs dfs prefix over hadoop fs.

Copy Data in HDFS Examples

The example commands assume my HDFS data is located in /user/thenson and local files are in the /tmp directory (not to be confused with the HDFS /tmp directory). The example data is a loan data set from Kaggle. Using that data set or the same file structure isn’t necessary; it’s just for a frame of reference.

Hadoop fs Commands

Hadoop fs cp – The easiest way to copy data from one source directory to another. Use hadoop fs -cp [source] [destination].
hadoop fs -cp /user/thenson/loan.csv /loan.csv
Hadoop fs copyFromLocal – Need to copy data from local file system into HDFS? Use the hadoop fs -copyFromLocal [source] [destination].
hadoop fs -copyFromLocal /tmp/loan.csv /user/thenson/loan.csv

Hadoop fs copyToLocal – Copying data from HDFS to local file system? Use the hadoop fs -copyToLocal [source] [destination].
hadoop fs -copyToLocal /user/thenson/loan.csv /tmp/


HDFS dfs Commands

HDFS dfs cp – The easiest way to copy data from one source directory to another. The same as using hadoop fs -cp. Use hdfs dfs -cp [source] [destination].
hdfs dfs -cp /user/thenson/loan.csv /loan.csv
HDFS dfs copyFromLocal – Need to copy data from the local file system into HDFS? The same as using hadoop fs -copyFromLocal. Use hdfs dfs -copyFromLocal [source] [destination].
hdfs dfs -copyFromLocal /tmp/loan.csv /user/thenson/loan.csv
HDFS dfs copyToLocal – Copying data from HDFS to local file system? The same as using hadoop fs -copyToLocal. Use the hdfs dfs -copyToLocal [source] [destination].
hdfs dfs -copyToLocal /user/thenson/loan.csv /tmp/loan.csv

Hadoop Cluster to Cluster Copy

Distcp used in Hadoop – Need to copy data from one cluster to another? Use DistCp, Hadoop’s distributed copy, which moves the data with a MapReduce job. For the command listed below, the original data exists on the namenode cluster in the /user/thenson directory and is being transferred to the newNameNode cluster. Make sure to use the full HDFS URL in the command: hadoop distcp [source] [destination].
hadoop distcp hdfs://namenode:8020/user/thenson hdfs://newNameNode:8020/user/thenson

It’s the Scale that Matters..

While copying data is a simple matter in most applications, everything in Hadoop is more complicated because of the scale. When copying data in HDFS, make sure to understand the use case and the scale, then choose one of the commands above.

Interested in learning more HDFS commands? Check out my Top HDFS Commands post.

Filed Under: Hadoop Tagged With: Hadoop, Hadoop Distributed File System, HDFS, HDFS Commands, Learn Hadoop

Everything You’ve Wanted to Know About HDFS Federation

March 6, 2017 by Thomas Henson Leave a Comment

2017 might have just started, but I’ve already noticed a trend that I believe will be huge this year. Many of the people I talk with who are using Hadoop & friends are curious about HDFS Federation.

Here are a few of the questions I hear

How can we use HDFS Federation to extend our current Hadoop environment?

Is there any way to offload some of the workload from our NameNode to help speed it up?

Or my favorite……

We originally thought we were boxed in with our Hadoop architecture but now with HDFS Federation our cluster has more flexibility.

So what is HDFS Federation? First we need to level set on how the NameNode and Namespace work in HDFS.

How the NameNode Works in Hadoop

The HDFS architecture is a master/slave architecture. The NameNode is the leader, with the DataNodes being the followers in HDFS. Before data is ingested or moved into HDFS, it must first pass through the NameNode to be indexed. The DataNodes in HDFS are responsible for storing the data blocks, but have no clue about the other DataNodes or data blocks. So if the NameNode falls off the end of the earth, you’re in trouble, because what good are the data blocks without the indexing?

HDFS Federation

HDFS not only stores the data, but also provides the file system for users/clients to access the data inside HDFS. For example, in my Hadoop environment I have Sales and Marketing data I want to logically separate. So I would set up two different directories and populate subdirectories in each depending on the data, just like the setup on your own workstation, where Pictures and Documents live in different directories or file folders. The key is that this structure is stored as metadata, and the NameNode in HDFS retains all of it.

HDFS Namespace

The NameNode is also responsible for the HDFS namespace in the Hadoop environment. The namespace is set at the file level, meaning all files are hierarchical and follow a tree structure. The namespace gives users the structure they need to traverse the file system. Imagine an organized toolbox with all the tools laid out in a structured way; once the tools are used, they are put back in the same place.

Back to our Windows example: the “C” drive is the top-level directory and everything else on the computer resides under it. Try to create another “Program Files” directory at the top level and you will get an error stating that the file name already exists. However, if you drop down one level into another directory, you can create a “Program Files” there, because it would be C:/Program Files/Program Files.

 

Windows NameSpace Example

As data accelerates into HDFS, the NameNode begins to outgrow its compute and storage. Just like a hermit crab moving into a new shell, so it goes for the NameNode (a vicious and expensive cycle). What if we could use a scale-out architecture without having to re-architect the entire Hadoop environment? Well, this is where HDFS Federation helps big time.

Hadoop Federation to the Rescue

A little-known change in HDFS 2.x was the addition of HDFS Federation. It is often confused with high availability (HA) in Hadoop clusters or with secondary NameNodes. However, HDFS Federation allows a Hadoop cluster to add another NameNode and namespace. This federated NameNode has access to the DataNodes and indexes data moved to those nodes, but only when the data flows through that NameNode.

For example, say I have two NameNodes in my cluster, NN1 and NN2. NN1 will support all data in hdfs/data/… and NN2 will handle the hdfs/users directory. As data from users/applications comes into my hdfs/data namespace, NN1 will index it and move it to the DataNodes. However, if an application connects to NN1 and tries to query data in the hdfs/users directory, it will get an error saying there is no known directory. For the application to query data in that namespace requires a connection to NN2. Think of HDFS Federation as adding a new cluster, in the form of a NameNode, while still using the same DataNodes for storage.
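
Here is a minimal sketch of what that looks like from a client’s point of view, with hypothetical NameNode hosts and paths: each read is addressed to the NameNode that owns the namespace, which is why fully qualified URIs (or a viewfs mount table) matter under Federation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federation-example").getOrCreate()

# NN1 owns the /data namespace and NN2 owns /users, so each read goes
# through the NameNode responsible for that part of the tree.
sales = spark.read.csv("hdfs://nn1:8020/data/sales/2017.csv", header=True)
activity = spark.read.csv("hdfs://nn2:8020/users/activity/2017.csv", header=True)

# Asking NN1 for a path under /users would fail with an unknown directory
# error, exactly as described above.
print(sales.count(), activity.count())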

Benefits of HDFS Federation

Here are a few of the immediate benefits I see being played out with HDFS Federation in the Hadoop world.

  • NameNode Dockerization – The ability to set up multiple NameNodes allows for new, modular Hadoop architectures. As we start to move into a microservices world, we will see architectures that contain multiple NameNodes. Hadoop environments will have the ability to tear down and spin up new NameNodes on the fly.
  • Logically Separate Namespaces – For IT chargeback, HDFS Federation gives Hadoop administrators another tool to set up multiple environments. These environments still have the cost savings of a single Hadoop environment.
  • Ease NameNode Bottlenecks – The pain of having all data indexed through a single NameNode can be eliminated by creating multiple NameNodes.
  • Options for Tiering Performance – Segmenting different NameNodes and namespaces by customer requirements, instead of setting up multiple complicated performance quotas, is now an option. Simply provision the NameNode specs and move customers to a NameNode based on their initial requirements.

One of the big reasons for HDFS Federation’s uptick this year is the growing adoption of Hadoop and the sheer amount of data being analyzed. More data, more problems, and those problems are particularly problems of scale.

 Final Thoughts on HDFS Federation

HDFS Federation is helping solve problems at scale with the NameNode. Since Hadoop’s 1.x code release, the NameNode has always been the soft underbelly of the architecture, continuing to struggle with high availability, bottlenecks, and replication. The community is continually working on improving the NameNode, and HDFS Federation along with the movement toward virtualized/Dockerized Hadoop is helping to mitigate these issues. As the Hadoop community continues to innovate with projects like Kudu and others, look for HDFS Federation to play a bigger role.

Filed Under: Hadoop Tagged With: Big Data, Data Analytics, Hadoop, HDFS

Splunking on Hadoop with Hunk (Preview)

December 23, 2016 by Thomas Henson Leave a Comment

Splunking on Hadoop with Hunk

I’ve seen a lot of people asking what my Pluralsight course, Analyzing Machine Data with Splunk, covers.

Well, for starters it covers a ton about starting out in Splunk. Admins and developers will quickly set up a Splunk development environment, then fast-forward to using Splunkbase to expand their use cases. However, the most popular portion of the course is the deep dive into Hunk.

Hunk is Splunk’s plugin that allows data to be imported from or exported to Hadoop. Both Splunk and Hadoop are huge in analytics (big understatement here), and with Hunk, users can visualize their Hadoop data in Splunk. One of the biggest complaints about Hadoop is the lack of good visualization tools to support this thriving community. Many admins are already using Splunk, so it’s no wonder Splunk is filling that gap.

In my Analyzing Machine Data with Splunk course I dig into using Hunk in the Splunking on Hadoop with Hunk module. This module is close to 40 minutes of Hunk material, from setting up Hunk to moving stock data from HDFS into Hunk. I’ve worked with Pluralsight to set up a quick 8-minute preview video of the Splunking on Hadoop with Hunk module; check it out, and be sure to watch on Pluralsight for the full Hunk deep dive.

 

 Splunk on Hadoop with Hunk (Preview)


Filed Under: Splunk Tagged With: Hadoop, HDFS, Splunk

Ultimate Big Data Podcast List

December 13, 2016 by Thomas Henson 3 Comments

My Ultimate Agile Podcast blog post was such a hit I thought it only appropriate to do one for Big Data. Who doesn’t need to data geek out while in the car, plane, train, or on the treadmill? Listening to podcasts is one of the easiest ways to keep your skills up. However, finding a curated list of podcasts on just Big Data is not easy.

The list is intended to be a resource for the Big Data/Hadoop/Data Analytics community, so I will continue to update it with new Big Data podcasts and episodes.

If you host one of the big data related podcasts below, or a new podcast, and would like to interview me on your show, reach out via Twitter, the comments, etc.


Let me know if you notice a missing podcast or broken links. Just add a comment or contact me and I will make the changes.

Since I created this list, I’m putting the episodes of podcasts I appeared on first.

Big Data Podcast List by Category

Hadoop/Spark/MapReduce

Big Data Beard Podcast – Newly released podcast exploring the trends, technology, and talented people making Big Data a big deal. Hosts are Brett Roberts, Cory Minton, Kyle Prins, Robert Hout, Keith Quebodeaux, and myself. Join us as we talk about our Big Data journeys with others in the community.

Get Up And Code 093: All About Running With Thomas Henson – In this podcast episode my friend John Sonmez and I talk about how I ran my first 1/2 marathon and the release of my Pig Latin Getting Started course. Pig Latin was one of the first languages I learned in the Hadoop ecosystem, and I was excited to be able to give back to the community with this course.
My Life for the Code 02: Big Data Niche, Pluralsight, Family, and more with Thomas Henson – Another podcast I appeared on, talking more about Pig Latin and where I see big data going over the next 10 years. Shawn and I also jump into talking about pursuing your passion (spoiler: mine is data analytics) while raising a family. We even threw in a couple of my book recommendations and teased my 2nd Pluralsight course, HDFS Getting Started.

LinkedIn’s Kafka, Digital Ocean gets deep about cloud and Red Sox data! – LinkedIn’s Kafka processing 1 trillion messages…

All Things Hadoop – My favorite episode is Hadoop and Pig with Alan Gates from Yahoo. The title alone gives you an indication of how old it is, but it’s still an awesome listen.

Puppet Podcast: Provisioning Hadoop Clusters with Puppet – Learn how to use Puppet to automate your CDH environment. Mike Arnold, the creator of the Puppet module, talks about how to deploy CDH at large scale with Puppet. If you’re virtualizing Hadoop (and you should be), then you’ll want to take note of how to speed up your deployment process in this episode. My prediction is that in the next year we will see more automation tools in the Hadoop ecosystem.

Roaring Elephant Podcast – Awesome insight from two guys working out in the field in Europe. They talk through hot topics in the Hadoop ecosystem and also share some real-world stories from the customers they speak with. A great podcast if you are just starting out on your Hadoop journey.

  • Episode 49: Thomas Henson on IoT Architectures 

TechTarget Talking Data – Quick, digestible episodes all about data: build vs. rent, Kafka, and Spark Streaming.

Data Engineering Podcast – Podcast dedicated to those who are running the Data pipelines in Big Data and Analytics workflows. Host Tobias does an amazing job keeping Data Engineers up to date with data workflows and tools to create those workflows.

Business of Big Data

Hot Aisle with Bill Schmarzo – One of my favorite podcast episodes (full disclosure: I work with both the hosts of the Hot Aisle and Bill Schmarzo) on the topic of the business of big data. Bill’s insight into what Big Data can mean for a business is something a lot of us as developers/admins lack when talking outside the walls of IT. One of the biggest reasons Hadoop projects fail is that they aren’t tied to a business objective. In this episode, learn how to tie your Hadoop project to a business objective to generate more revenue for the company, which brings in more money to expand your Hadoop cluster (win-win-win).

Cloud of Data – Wow, talk about an all-star cast of interviews; it looks like a who’s who of data CEOs. The first episode was with InfoChimps’ CEO. I actually worked at CSC during the InfoChimps acquisition, and those were some really bright data scientists.

Data Analytics/ Machine Learning

Data Skeptic – Usually a short format on specific topics in data analytics. The podcast is great; it’s about data analytics broadly, not just big data, though the two are often confused as the same thing. My favorite episodes are the algorithm explanations, because as someone who mostly stays on the software side I like to keep up with how these algorithms are used; it helps when working with the data science team.

Partially Derivative – Another great podcast on data analytics. My favorite episode was done live from Stitch Fix, my wife’s favorite product and mine too, but for a different reason. Stitch Fix is a monthly subscription company that matches a customer with their own personal stylist, but behind the dressing room curtain Stitch Fix is really a data company. Listen in to hear about all the experimentation that takes place on a daily basis at Stitch Fix, and hear how they are using machine learning to pick out clothes you’d like.

Linear Digressions – Another short, quick hit on data analytics and machine learning: genomics, how the polls handled Brexit, and election forecasting.

Data Crunch – Podcast devoted to highlighting how data analytics is changing the world. Released 1-2 times a month, with episodes coming in under 30 minutes.

Internet of Things (IoT)

Inquiring Minds: Understanding Heart Disease with Big Data – Not a podcast dedicated to IT or Big Data, but in this episode Greg Marcus talks about analyzing the heart with IoT. Think that smartwatch is just for tracking steps and sending text messages? That smartwatch could help advance the science behind heart disease by giving your doctor access to its data. A really great episode on how IoT is offering lower-cost research in healthcare and providing more data than traditional studies.

Oh, and if you are looking for quick tips on Hadoop/Data Analytics/Big Data, subscribe to my YouTube channel, which is all about getting started in Big Data. Make sure to bookmark this page to check for frequent updates to the list. As Big Data gets more popular, this list is sure to grow.

 

 

Filed Under: Big Data Tagged With: Big Data, Data Analytics, Hadoop, Hadoop Pig, Podcast

Analyzing Machine Data with Splunk

November 7, 2016 by Thomas Henson 4 Comments

My newest Pluralsight course, Analyzing Machine Data with Splunk, has just been released. It might appear to be a step outside of the Hadoop ecosystem, but read on to find out how it actually ties back.


For the past 6 months I’ve taken a deep dive into Splunk. I had a lot of questions when I first started…

Is this just like the ELK stack?

How is all this data stored?

What does the integration with Hadoop look like? (Spoiler alert: it’s awesome and it’s named Hunk.)

All I can say is I was blown away by how amazing Splunk is at data analytics. It’s no wonder Splunk is #1 for analyzing machine data in IT organizations around the world; however, it’s not just for machine data. Splunk started out analyzing log files, but because of its great dashboard tools and ability to parse different data types, it has quickly jumped outside of IT Operations.

Analyzing Machine Data with Splunk is broken into 7 different modules:

  1. What is Splunk? – The first thing we do is dive into what Splunk is: its history and who is using it. Lastly, in this module we talk about careers in Splunk and what the options are for Splunk admins/developers.
  2. Setting Up the Splunk Environment – Once we have leveled set on Splunk, it’s time to set up our own local Splunk environment. Splunk offers a few deployment options, and in this module we discuss each of them. At the end of the module we walk through setting up your own Splunk environment on Windows.
  3. Basic Splunking Techniques – In this module we are ready to dig into using our local Splunk environment to analyze log files. Basic Splunk searches and creating reports and alerts are the essential building blocks taught here. The last part of the module walks through using the Search Processing Language (SPL), which is Splunk’s search language.
  4. Splunking in the Enterprise – Next we jump into the enterprise features in Splunk. Encrypting and compressing data in flight is essential when working in the enterprise, and Splunk has you covered here. We also work through setting up scalable Splunk environments, because data is only going to grow, so let’s go ahead and be ready.
  5. Splunking for DevOps and Security – Security and DevOps are hot topics and careers right now, and Splunk plays in both fields. Security is the top use case for Splunk because it gives enterprises a 360-degree view of their IT environments. The demo in this module walks through using Splunk to analyze log4j files in DevOps.
  6. Application Development in Splunkbase – In this module we’ll dive into Splunkbase to learn how to extend the Splunk environment. Splunkbase, in simple terms, is like the App Store for iPhones. Need to import a new data source and don’t want to write your own regular expression? Check out Splunkbase. Want to develop your own custom Splunk apps using the SDK? Splunkbase has you covered there too. Learn about all the things you can do with Splunkbase in this module.
  7. Splunking on Hadoop with Hunk – Ahhhh! Now we are talking. Hadoop on Splunk = Hunk. When I started playing with Hunk it was like the first time I heard the Jay-Z / Linkin Park Collision Course album, only this was bigger. I mean, talk about two worlds colliding! Splunk provides great dashboards and tools to help ingest machine data without having to do the ETL. With Hunk you can import that data from HDFS or export it back into HDFS.

Pluralsight Course

After all this hard work and Splunk goodness, be sure to check out Analyzing Machine Data with Splunk. This course will help you learn how to leverage Splunk in your everyday IT Operations. As always, let me know any feedback you have or ideas for more courses in Data Analytics.

Filed Under: Splunk Tagged With: Big Data, DevOps, Hadoop, Hunk, IT Operations, Splunk

How I Failed Using the Pig Subtract Function

October 14, 2016 by Thomas Henson Leave a Comment