Thomas Henson


Archives for June 2017

Big Data Lambda Architecture Explained

June 29, 2017 by Thomas Henson 2 Comments


What is Lambda Architecture?

Since Spark, Storm, and other stream processing engines entered the Hadoop ecosystem, the Lambda Architecture has been the de facto architecture for Big Data workloads with a real-time processing requirement. In this episode of Big Data Big Questions I’ll explain what the Lambda Architecture is and how developers and administrators can implement it in their Big Data workflows.

Transcript

(forgive any errors text was transcribed by a machine)

Hi folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question is: what is the Lambda Architecture, and how does it relate to our big data and Hadoop ecosystem? Find out right after this.

When we talk about the Lambda Architecture and how it’s implemented in Hadoop, we have to go back and look at Hadoop 1.0 and 2.0, when we really didn’t have a speed layer or Spark for streaming analytics. Back in the traditional days of Hadoop 1.0 and 2.0, we were using MapReduce for most of our processing. The way that worked is our data would come in, we would pull it into HDFS, and once it was in HDFS we would run some kind of MapReduce job, whether we used Pig or Hive, wrote our own custom job, or used one of the other frameworks in the ecosystem. That was mostly batch, right? All our data had to be in HDFS, so we had to have a complete view of our data to be able to process it.

Later on we started looking at that and seeing, hey, we need to be able to pull data in and process it even when the data isn’t complete, right? Less batch-oriented, where we may have incomplete parts of the data, or the data is continuing to be updated. That’s where Spark, Flink, and some of the other streaming analytics and stream processing engines came in: we wanted to be able to process that data as it came in, and do it a lot faster too. So we took away the need to even land it in HDFS before we started processing it, because that write takes time. We wanted to move our data and process it before it ever hit HDFS. But we still needed batch processing, right? Some analytics we want in real time, but there are other insights, like monthly or quarterly reports, that are just better as batch. That’s even more true when we talk about how we process and hold on to historical data and use Hadoop like a traditional enterprise data warehouse, but on a larger platform basis, with Hive, Presto, and some of the other SQL engines that work on top of Hadoop.

So the need arose for these two different systems for processing data, and we started adopting the Lambda Architecture. The way the Lambda Architecture works is that as your data comes in, it sits in a queue, maybe Kafka or some other kind of message queue. Any data that needs to be pulled out and processed as a stream, we take and process in what’s called our speed layer, maybe using Spark or Flink, to pull out insights and push them right out to our dashboards. For the data that’s going to exist for batch, for the transactional processing and the historical record, we have our MapReduce layer, the batch layer.

Think of it as two prongs: your speed layer comes in up top, pulling out your insights, but your data, as it sits in the queue, also goes into HDFS, where it’s still there to run Hive on top of, to hold on to as historical data, or to run MapReduce jobs against and push up to a dashboard. So you have a two-pronged approach, with your speed layer on top and your batch layer on the bottom. As data comes in, you still have it in HDFS, but you’re also able to pull insights from your real-time processing as the data arrives. That’s what we mean when we say Lambda Architecture: a two-layer system with a batch layer to do our MapReduce and batch jobs, and a speed layer to do our streaming analytics, whether through Spark, Flink, Apache Beam, or some of the other pieces.

It’s a really good pattern to know, and it’s something that’s been in the industry for quite a long time, so if you’re new to the Hadoop environment you definitely want to know it and be able to reference back to it. There are some other architectures that we’ll talk about in future episodes, so make sure you subscribe so that you never miss an episode. Go right now and subscribe so that the next time we talk about an architecture you don’t miss it, and I’ll check back with you next time. Thanks, folks!
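The two-pronged flow described above can be sketched in a few lines of Python. This is a toy illustration of the pattern only, not a production Spark/Kafka pipeline and not any specific implementation: events land once, the speed layer updates an incremental view immediately, while the batch layer recomputes its view from the full master dataset.

```python
from collections import Counter

class ToyLambda:
    """Toy sketch of the Lambda Architecture: one incoming stream feeds
    both a speed layer (incremental view, fresh but approximate) and a
    batch layer (complete view, recomputed from the master dataset)."""

    def __init__(self):
        self.master = []            # immutable master dataset (HDFS stand-in)
        self.realtime = Counter()   # speed layer's incremental view

    def ingest(self, event):
        # Speed layer: update the real-time view as the event arrives.
        self.realtime[event] += 1
        # Batch path: append to the master dataset for later recomputation.
        self.master.append(event)

    def batch_view(self):
        # Batch layer: full recomputation over the master dataset
        # (stand-in for a MapReduce or Hive job).
        return Counter(self.master)

pipeline = ToyLambda()
for e in ["click", "view", "click"]:
    pipeline.ingest(e)

print(pipeline.realtime["click"])      # speed layer answer, available instantly
print(pipeline.batch_view()["click"])  # batch layer agrees after recomputation
```

In a real deployment the queue (Kafka), the speed layer (Spark/Flink), and the batch layer (MapReduce/Hive over HDFS) are separate systems; the point here is just that both layers are fed from the same stream and serve different latency needs.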

Filed Under: Streaming Analytics Tagged With: Hadoop, Hadoop Architecture, Streaming Analytics

Is Hadoop Killing the EDW?

June 27, 2017 by Thomas Henson Leave a Comment

Is Hadoop Killing the EDW? Fair question, since in its 11th year Hadoop is still known as the innovative kid on the block for analyzing large data sets. If the Hadoop ecosystem can analyze large data sets, will it kill the EDW?

The Enterprise Data Warehouse has ruled the data center for the past couple of decades. One of the biggest big data questions I get is what’s up with the EDW. Most database developers and architects want to know the future of the EDW.

In this video I will give my views on whether Hadoop is killing the EDW!

Transcript

(forgive any errors text was transcribed by a machine)

Hi, I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question is a little bit controversial, but here it is: is big data killing the enterprise data warehouse? Let’s find out.

So is the death of the enterprise data warehouse coming, all because of big data? The simple answer is: in the short term and the medium term, no. But big data really is hampering the growth of those traditional enterprise data warehouses, and part of the reason is the deluge of all this unstructured data. Around 80% of all the data in the data center, and in the world, is unstructured. If you think about enterprise data warehouses, they’re very structured, and they’re very structured because they need to be fast, right? They support our applications and our dashboards. But when it comes to analyzing that data, trying to force unstructured data into a structured form really starts to blow up the storage requirements on your enterprise data warehouse.

Part of the reason enterprise data warehouse growth is slow is that around 70% of the data in those warehouses is really cold data. Only about thirty percent of the data in your enterprise data warehouse is actually used, and normally that’s your newest data. So that cold data is sitting there on premium fast storage, taking up space that carries your enterprise data warehouse licensing fees on top of the premium storage and premium hardware it’s sitting on. Couple that with the fact that, as we said, 80% of all new data being created in the world is unstructured: data from Facebook and other social media platforms, video, log files, and the semi-structured data coming off your Fitbit or any of those IoT and emerging technologies. If you try to pack all that into your enterprise data warehouse, it’s just going to explode that license fee and that hardware cost, and you don’t even know yet whether the data has any value.

That’s where big data, Hadoop, Spark, and that whole ecosystem come in, because we can store that unstructured data on local storage and analyze it before we need to put it into a dashboard or some application it supports. So in the long term I think the enterprise data warehouse will start to sunset, and we’re starting to see that right now. But in the immediate term you’re still seeing a lot of people doing enterprise data warehouse offloads: they’re taking some of that 70% of cold data and transferring it to a Hadoop environment to save on costs and licensing fees, but also to marry it with the new unstructured data coming in from sensors, social media, or anywhere else in the world. They’re marrying that data to see if they can pull any insights from it. Then, once they have insights, depending on the workload, sometimes they push results back up to the enterprise data warehouse, and sometimes they use some of the newer projects in their big data architecture to support those production, enterprise-data-warehouse-type applications directly.

That’s all we have for today. If you have any questions, make sure you submit them to Big Data Big Questions. You can do that in the comments below or on my website, thomashenson.com. Thanks, and I’ll see you again next time.
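To make the 70/30 offload argument concrete, here is a back-of-the-envelope calculation in Python. The per-TB prices are made-up placeholders, chosen only to show the shape of the math; real EDW licensing and hardware costs vary wildly by vendor and deployment.

```python
# Hypothetical numbers purely for illustration.
edw_tb = 100                 # total data in the warehouse, in TB
cold_fraction = 0.70         # share of the data that is rarely queried
edw_cost_per_tb = 10_000     # premium EDW storage + licensing, $/TB (assumed)
hadoop_cost_per_tb = 1_000   # commodity Hadoop storage, $/TB (assumed)

# Everything on the premium tier vs. cold data offloaded to Hadoop.
before = edw_tb * edw_cost_per_tb
after = (edw_tb * (1 - cold_fraction)) * edw_cost_per_tb \
      + (edw_tb * cold_fraction) * hadoop_cost_per_tb

print(f"before offload: ${before:,.0f}")
print(f"after offload:  ${after:,.0f}")
print(f"savings:        {1 - after / before:.0%}")
```

Under these assumed prices the premium-tier footprint shrinks to the 30% of data that is actually hot, which is the whole economic case behind the EDW offload pattern described above.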

Filed Under: Big Data Tagged With: Business, EDW, Hadoop

DataWorks Summit 2017 Recap

June 19, 2017 by Thomas Henson Leave a Comment

All Things Data

Just coming off an amazing week with a ton of information in the Hadoop ecosystem. It’s been two years since I’ve been to this conference. Some things have changed, like the name, from Hadoop Summit to DataWorks Summit. Other things stayed the same, like breaking news and extremely good content.

I’ll try to sum up my thoughts from the sessions I attended and people I talked with.

First, there was an insanely great session called The Future Architecture of Streaming Analytics, put on by a very handsome Hadoop guru, Thomas Henson. It was a well-received session where I talked about how to architect streaming applications for the next 2-5 years, when we will see some 20 billion plus connected devices worldwide.


Hortonworks & IBM Partnership

Next there was the breaking news of the Hortonworks and IBM partnership. The huge part of the partnership is that IBM’s BigInsights will merge with the Hortonworks Data Platform. Both IBM and Hortonworks are part of the Open Data Platform initiative.

What does this mean for the Big Data community? More consolidation of Hadoop distributions, but also more collaboration on the big data frameworks. This is good for the community because it lets us focus on the open-source frameworks themselves. Instead of having to work through the differences between BigInsights and HDP, development effort will be poured into Spark, Ambari, HDFS, etc.

Hadoop 3.0 Community Updates

The updates coming with the next release of Hadoop 3.0 were great! There is a significant amount of change in the release, which is slated for GA on August 15, 2017. The big focus is the introduction of Erasure Coding for data striping, support for containers in YARN, and some minor changes. Look for an in-depth look at Hadoop 3.0 in a follow-up post.
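The storage win behind Erasure Coding comes down to simple arithmetic, sketched below. RS(6,3) is used here as the example layout since it is one of the commonly cited Reed-Solomon policies; Hadoop 3.0 supports several EC policies, so treat the specific numbers as illustrative.

```python
def storage_overhead(data_units, parity_units):
    """Raw bytes stored per byte of user data for a striped layout."""
    return (data_units + parity_units) / data_units

# 3x replication stores three full copies of every block.
replication = storage_overhead(1, 2)   # 1 data copy + 2 replicas -> 3.0x

# RS(6,3) stripes each block group as 6 data cells + 3 parity cells.
rs_6_3 = storage_overhead(6, 3)        # -> 1.5x

print(f"3x replication overhead: {replication:.1f}x")
print(f"RS(6,3) erasure coding:  {rs_6_3:.1f}x")
# Replication tolerates losing 2 copies; RS(6,3) tolerates losing
# any 3 cells of a block group -- at half the raw storage cost.
```

Halving the raw storage footprint for cold data is the main reason Erasure Coding headlines the Hadoop 3.0 release.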

Hive LLAP

If you haven’t looked deeply at Hive in the last year or so…you’ve really missed out. Hive is really starting to mature into an EDW on Hadoop!! I’m not sure how many different breakout sessions there were on Hive LLAP, but I know it was mentioned in most of the ones I attended.

The first Hive breakout session was hosted by Hortonworks co-founder Alan Gates. He walked through the latest updates and the future roadmap for Hive. The audience was also posed a question: what do we expect in a Data Warehouse?

  • Governance
  • High Performance
  • Management & Monitoring
  • Security
  • Replication & DR
  • Storage Capacity
  • Support for BI

We walked through where the Hive community stands in addressing these requirements. Hive LLAP has certainly delivered on the high performance. More on that now…

Another breakout session focused on a shootout between the Hadoop SQL engines. Wow, this session was full and very interesting. Here is the list of SQL engines tested in the shootout:

  • MapReduce
  • Presto
  • Spark SQL
  • Hive LLAP

All the tests were run using the Hive benchmark tests on the same hardware. Hive LLAP was the clear winner, with MapReduce the huge loser (no surprise there). Spark SQL performed really well, but there were issues using the Thrift server, which might have skewed the results. Kerberos was not enabled during the testing, either.

Pig Latin Updates

Of course there were sessions on Pig Latin! Yahoo presented their results from converting all Pig jobs from MapReduce to Tez. Seeing the keynote about Yahoo’s conversion rate from MapReduce jobs to Tez/Spark/etc. makes it clear that Yahoo is still running a ton of Pig jobs. Moving to Tez has increased the speed and efficiency of the Pig jobs at Yahoo. Also, in the next few months, Pig on Spark should be released.


Closing Thoughts

After missing the Hadoop Summit (now DataWorks Summit) last year, it was fun to be back. DataWorks Summit is still the premier event for Hadoop developers and admins to come and learn about new features developed by the community. This year the themes seemed to be benchmark testing, Streaming Analytics, and Big Data EDW. It’s definitely an event I will try to make again next year to keep up with the Hadoop community.

 

Filed Under: Big Data Tagged With: DataWorks Summit, Hadoop, Streaming Analytics

Isilon Quick Tips: Compare Snapshots in OneFS

June 10, 2017 by Thomas Henson Leave a Comment

Compare Snapshots in OneFS
How to Compare Snapshots in OneFS

At least once, every Isilon administrator will need to compare snapshots in OneFS. It might be a situation where a user has uploaded files to the wrong directory, or where you need to roll back to a different version of a directory. Whatever the case, OneFS has the ability to compare snapshots from the CLI.

In this episode of Isilon Quick Tips I will walk through using the CLI to view and compare snapshots in OneFS. Watch this video and learn how!

Transcript

(forgive any errors it was transcribed by a machine)

Hi, and welcome back to another episode of Isilon Quick Tips! Today we’re going to talk about how to compare snapshots, all from the CLI. Find out more right after this.

In this episode, what we want to do is look at some snapshots and see how we can compare them. You can see here from the web interface that I have a lot of snapshots, but if I wanted to compare them, how would I do that? We’ll do all of that from the command line, so let’s SSH back into our cluster.

The first thing we’re going to do is list out all of our snapshots. You can see that all of my snapshots are on this /ifs NASA directory, and each one has an ID that identifies it, plus a default name from the snapshot schedule. So if we wanted to compare a couple of these, say our first snapshot, ID 2, against ID 20, what would the difference between those two be? There’s a way we can actually compare that.

First, let’s see what information is available if we just view an individual ID. We can use `isi snapshot snapshots view` and pass in the ID number. You can also pass in the name, but my default names are very long, so with a smaller data set it’s just easier to use the ID. This gives us the path and the name, tells us how much space the snapshot is holding, when it was created, and whether it’s going to expire. But it doesn’t tell us much about what’s actually in it, because it’s just a point-in-time snapshot. So how do we compare snapshot ID 2 with ID 20 and see what data has changed?

To do that we’ll use a changelist, but first we have to kick off a job to create it. I’m going to clear the screen and type in `isi job jobs start ChangelistCreate`, passing in the old snap ID, 2, and the newer snap ID, 20. That starts the job. To list out our changelists, we use `isi_changelist_mod -l`. We have a changelist here named 2_20, and that’s the one we just created comparing ID 2 and ID 20. Sometimes you’ll see “in progress” at the end; that just means the job is still processing and you can’t view the changelist yet, so come back and check a few times. It looks like our job is complete here, so we can view it.

To view it, we just use `-a` instead of `-l`, plus that ID, so `isi_changelist_mod -a 2_20`. That gives us a lot of information comparing snapshot 2 and snapshot 20. One of the big things is the two files created here that I was looking for: this is the NASA directory, and I uploaded a facilities CSV and a report CSV, and you can see the timestamps and some other information for each. Now, looking at this you might say, man, this is kind of hard to read; what’s really the objective here? This is one way to inspect the changelist from the CLI, but for the most part this data is consumed by other applications through the Isilon OneFS API. If you’re writing some kind of process that compares these changes to drive backups, that’s where you would use it. The best way to understand all the different flags and path fields is to go back to the Isilon documentation, so that if you’re writing code against the API for a backup process or something like that, you know what each field means.

But if you just want a quick look at what changed between two different snapshots, you can definitely use this to pull out information. Like I said, the biggest thing for me was seeing the different path names; I wanted to see whether any files were different in snapshot 2 versus snapshot 20, and we were able to see that here. Be sure to subscribe so that you never miss an episode of Isilon Quick Tips, and see you next time!

Filed Under: Isilon Tagged With: Isilon, Isilon Quick Tips, OneFS, Snapshots

DataWorks Summit: Future Architecture of Streaming Analytics

June 5, 2017 by Thomas Henson 1 Comment

Ready to learn about the Future Architecture of Streaming Analytics?

Next week I will be heading to the DataWorks Summit in San Jose (formerly Hadoop Summit). The DataWorks Summit is one of the top conferences for the Hadoop ecosystem. Last year was the first DataWorks Summit I had missed in the past three years, but this year I’m back. I’m happy to announce that this year I have a breakout session.

My session will focus on the future architectures of Streaming Analytics and how those architectures will support where Streaming Analytics is headed. In the past few years the Hadoop community has focused on processing data from streaming sources with Storm, Spark, Flink, Beam, and other projects. Now, as we enter an era of massive streams of data, it’s time to focus on how we store and scale these systems. Gartner predicts that by 2020 we will reach 20.4 billion connected devices. Now more than ever we are going to need systems with auto-scaling and unlimited retention. Projects like Pravega are emerging to abstract away the storage layer in massive data analytics architectures. Stop by my session to learn about Pravega and architecture recommendations for Streaming Analytics.


Information on my session

 Future Architecture of Streaming Analytics: Capitalizing on the Analytics of Things (AoT)

The proliferation of connected devices and sensors is leading the Digital Transformation. By 2020 there will be over 20 billion connected devices. Data from these devices needs to be ingested at extreme speeds so it can be analyzed before it decays. The life cycle of the data is critical in determining what insights can be revealed and how quickly they can be acted upon.

In this session we will look at the past, present and future architecture trends of streaming analytics. Next we will look at how to turn all the data from these devices into actionable insights. We will also dive into recommendations for streaming architecture depending on the data streams and time factor of the data. Finally, we will discuss how to manage all the sensor data, understand the life cycle cost of the data, and how to scale capacity and capability easily with a modern infrastructure strategy.

When: Tuesday June 13th 3:00 PM 

Where: Ballroom C

Filed Under: Streaming Analytics Tagged With: Conference, DataWorks Summit, Streaming Analytics
