Big Data Impact of GDPR

May 7, 2018 by Thomas Henson Leave a Comment

How does GDPR Impact Data Engineers?

The General Data Protection Regulation (GDPR) goes into effect in May 2018, and many organizations are scrambling to understand how to implement these regulations. In this video we discuss the big data impact of GDPR.

Transcript – Big Data Impact of GDPR

Hi, folks! Thomas Henson here, with thomashenson.com, and today is another episode of Big Data Big Questions. Today is a very special episode. We’re going to talk a little bit more about regulation than we’ve probably talked about before.

We’re going to tackle the GDPR and what that means for big data, big data analytics, and why data engineers and even data scientists should understand the regulation and know it at least from a high level. Find out more right after this.

[Sound effects]

Welcome back. Today is a special episode. We’re going to talk about the GDPR, the General Data Protection Regulation, and we’re going to talk about what it means for a data engineer, and why you should understand it.

Just to have a high-level overview, this is going to be one of those things where understanding this regulation is really going to help you. You’re going to have meetings about it. This is such a big change for our industry. If we think about it from an IT perspective or a big data perspective, think of changes that have happened in other industries.

Think of what happened in the US with the SEC and accounting regulation around Enron and some of the other financial accounting scandals in the early 2000s, and then also think about healthcare. If you know anything about healthcare regulation, you probably know at least the HIPAA requirements. GDPR is going to be similar. It’s taking place in the EU, but the ramifications are going to be felt, I believe, everywhere, because data exists everywhere. Most companies are global companies, and however we handle and capture data, whether it be from a user in the EU or a user anywhere else in the world, we’re going to have to have those regulations and those systems in place so that we can comply.

Just from a high level: as data engineers we focus on the technology and the hardware, but remember we’ve talked before about the non-technical careers in big data, like data governance. If you’re interested in that, dive head first into the General Data Protection Regulation. Find out as much as you can, because that’s really going to make you valuable in those meetings. And if you’re looking to make a career change, maybe you’re already doing some kind of compliance work and you want to get involved in big data, here’s your opportunity. Become an expert at this, because we’re all moving fast to comply.

Just to talk a little bit about it: it’s the EU agreement on how data is processed and stored, and it’s a replacement for Data Protection Directive 95/46/EC, so it’s more stringent and more all-encompassing. You’re probably asking, “Why are we going down this route? Why is a regulation like this coming out?” If you think about it, a lot has been happening over the last few years.

How often do we hear about a data breach? There was a huge one last year, right? One affecting millions and millions of users, people’s credit cards, people’s social security numbers. Our data is constantly under attack. From a big data perspective, we hold onto data so that we can analyze it and build better products, more efficient products, better websites, better clickthrough rates on ads. There are so many different things we do with this data, but there’s also so much danger in having it.

We have to make sure that we’re protecting it, and also, from a privacy perspective, and this is where this is really going to hit, we have to allow users to opt in or opt out: knowing what’s being collected and how long it will be kept, and also giving them the ability to say, “You know what? Get rid of that data. I don’t want you to hold onto it.”

Those are some of the things you’re going to be tackling. Also, just as a note: it was approved on April 14th, 2016, and must be complied with by May 25th, 2018. We’ve got some time here, and that’s where I’m really encouraging people: even if you’re watching this video after that date and you’re wanting to get into big data on the governance side, maybe from a non-technical career, learn this. I’m serious. Just learn this. This is going to be huge. If you follow anything from Hortonworks, or Cloudera, or anybody involved in big data or even IT, you’re getting bombarded with information about it, because it’s such a big deal. And the compliance on this, like I said, is industry-shifting, just like HIPAA was, and just like some of the SEC and accounting regulations that came out in the early 2000s. I’ve got the official site listed here, so you can go straight to the EU source and read it.

Like I said, you’re going to see a ton of blog posts; there are a ton of resources out there. If you’re on the technical side, you may be wondering: okay, I’ve got to go into a meeting, and somebody’s going to ask me what we’re doing about data governance and some of the other pieces. Where can I focus? Where can I say, “Hey, give me a week or two. Let me look at some of the things we maybe weren’t doing, because maybe the way we’re protecting the data needs to be a little bit different.”

Maybe it’s the way that you’re tracking and holding onto the data, so that you can comply by deleting users’ data or opting not to track it. Or maybe it’s finding a way to mask the data, so that you’re protecting identities a little bit better. Maybe those are some of your weak points.
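To make that masking idea concrete, here’s a minimal Python sketch of one common approach, pseudonymization with a keyed hash: the same identifier always maps to the same token, so joins and counts still work, but the raw PII never lands in the data lake. The field names and secret here are made up for illustration, not taken from any particular tool.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would live in a key vault,
# not in the code or in the data platform itself.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier (email, SSN, phone) with a keyed hash."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_record(record: dict, pii_fields=("email", "ssn", "phone")) -> dict:
    """Return a copy of the record with its PII fields pseudonymized."""
    return {
        key: pseudonymize(val) if key in pii_fields and val is not None else val
        for key, val in record.items()
    }

if __name__ == "__main__":
    event = {"email": "user@example.com", "ssn": "123-45-6789", "clicks": 42}
    print(mask_record(event))  # identifiers are now opaque tokens
```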

Look into Apache Atlas, Apache Ranger, and Cloudera Navigator. Depending on the flavor of Hadoop package you’re using, whether it be Hortonworks or Cloudera, look into these three tools. They’ll give you some kind of framework, so you’re not walking into the meeting where somebody says, “Hey, we’ve got to look at how we’re complying with GDPR, and we really want to focus on data governance. What are we doing?” and sitting there saying, “I don’t know how to tackle this.”

Go learn these tools. Understand them from a high level. Implementing them is a whole different story, but you can start getting trained up and start implementing them, too. Hope this was helpful. This is something I’m sure we’ll make more videos on and be talking about constantly. Like I said, I predict this is industry-shifting regulation for IT and especially for big data, for all of us. I’m sure there will be follow-on. Other countries and other areas are starting to look at regulations: here in the US, Russia, Japan, everywhere. It’s not going to be just for the EU, and even if it were, it would still affect us. Everything’s global.

If you have any questions, make sure you put them in the comments section below, and I will answer them here on Big Data Big Questions. You can also go to my website, thomashenson.com, look for Big Questions, and send me a comment. And make sure you subscribe, so that you never miss an episode. I will see you next time.

Filed Under: Big Data Tagged With: Big Data, GDPR

Big Data Skills For Data Scientist

February 11, 2018 by Thomas Henson 1 Comment

What big data skills should data scientists understand to take advantage of the Hadoop ecosystem? In today’s episode of Big Data Big Questions we tackle the tools and frameworks Data Scientists should know in order to work in Big Data. We’ll also break down how much of Spark, Hadoop, Mahout, MADlib, and other tools Data Scientists need to understand. Lastly, I’ll give tips for Data Engineers who want to begin moving toward the Data Scientist role.

Find out about Big Data Skills for Data Science by watching the video below.

Video – Big Data Skills for Data Science

Filed Under: Big Data Tagged With: Data Science

Better Career: Hadoop Developer or Administrator?

January 16, 2018 by Thomas Henson 1 Comment

The Hadoop Ecosystem is booming and so is the demand for Hadoop Developers/Administrators.

How do you choose between a Developer or Administrator path?

Is there more demand for Hadoop Developers or Administrators?

Finding the right career path is hard and creates a lot of anxiety about how to specialize in your field. Specializing will help you land your first job or move up in your current role. In this video I help Data Engineers choose between the Hadoop Developer and Administrator paths. Watch the video to get a breakdown of the Hadoop Developer and Administrator roles.

Video – Better Career: Hadoop Developer or Administrator

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions

Big Data Beard Podcast Announcement

November 7, 2017 by Thomas Henson Leave a Comment

How do you keep up with all the news going on in the Big Data community?
Announcing the Big Data Beard Podcast, a podcast devoted to Big Data news, architecture, and the software powering the big data ecosystem. Watch the video below to learn how I feel about the new podcast.

 

Transcript – Big Data Beard Podcast Announcement

Hi, folks! Thomas Henson here with thomashenson.com. Today, I’m in a different location. Looks like a construction site, right?

That’s because changes are coming. I’m building an office right now, and at some point in the future, I’m going to have a video maybe showing that office off.

With all these changes coming, I wanted to announce another big change. That’s a new podcast for you to listen to. If you follow me on Twitter, you’ve probably heard about the Big Data Beard Podcast. Look at all these tweets.

That’s a good way to keep in touch, but the Big Data Beard Podcast just released a couple weeks ago. Big Data Bears Podcast? This is going to be epic!

Beard? Check. Talks about big data? Check. The Big Data Beard Podcast is a podcast from a group of engineers I’ve been working with. We decided we should take the conversations we have over coffee, or beers, or at conferences, start recording them, and maybe have some guests along the way. This is a great way for you to find out what’s going on in big data and data analytics, and a great way to get more information.

I’m all about learning, and all about finding ways to get more involved in the community, and find out what’s going on. This is a great way, in under an hour, once a week, to be able to be involved, have some information, and then even interact with us.

If you’d like to appear on the show, or you have any ideas for the show, post them in the comments here or send them on Twitter. However you can get in touch with me, just send those ideas along, and we’ll be sure to field those questions.

Make sure you subscribe and check out the Big Data Beard Podcast. Thanks, folks!

Filed Under: Big Data Tagged With: Big Data, Big Data Beard Podcast, Data Engineers, Podcast

Kappa Architecture Examples in Real-Time Processing

October 11, 2017 by Thomas Henson Leave a Comment

“Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

Yes, we can build models based on real-time processing, and in fact there are some you use every day…

In today’s episode of Big Data Big Questions, we explore some real-world Kappa Architecture examples. Watch the video and find out!

Video

Transcription

Hi, folks. Thomas Henson here with thomashenson.com. And today we’re going to have another episode of Big Data, Big Questions. And so, today’s episode, we’re going to focus on some examples of the Kappa Architecture. And so, stay tuned to find out more.

So, today’s question comes in from a user on YouTube, Yaso1977 . They’ve asked: “Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

And I think this question stems from the defense of either their master’s degree or their Ph.D. So, first off, Yaso1977, congratulations on standing for your defense and building your research project around this. I’m going to answer this question as best I can and put myself in your situation: if I were starting out and had to come up with a research project to defend for my Master’s or Ph.D., what would I do, and what are some of the things I would look at?

And I’m going to base most of these around the Kappa Architecture, because that’s the future, right, of streaming analytics and IoT. It’s where we’re starting to see the industry trend. Some of the examples we’re looking for aren’t going to be out there just yet, right? We still have a lot of applications and a lot of users on Lambda, and Kappa is still a little more on the cutting edge.

So, there are three main areas where I would look for those examples. The first one is IoT. In newer IoT, or Internet of Things, workflows you’re going to start to see it, and one of the reasons is that there are millions and millions of these devices out there.

You can think of any device, whether it be sensors on manufacturing equipment, smart cars, or even smartphones: information from multiple millions of users all streaming back in, with some kind of prediction modeling or analytics running on that data as it comes in.

So, in those newer workflows, you’re probably going to start to see the Kappa Architecture being implemented. I would focus first on IoT workflows.

Second is the tried and true one we’ve seen throughout Big Data since we started implementing Hadoop: fraud detection, specifically with credit cards. Look at it from a security perspective too; we just had the Equifax data breach and so many others.

So I would, for sure, look at the fraud detection work around some of the major credit card companies and see what they’re doing and what they’ve published. Because just like in our IoT example, we’re talking millions and millions, maybe even billions, of users, all with multiple transactions going on at one time. All that data needs to be processed and logged, and fraud detection needs to be pretty quick, right? Whether you’re inserting your chip card or swiping your card, you need to know whether the transaction is fraudulent the moment it’s happening.

So, it has to be done pretty quickly, and it’s definitely a streaming architecture. My bet is there are some people out there already using the Kappa Architecture for this.

And then another one is anomaly detection, which I’ll break into two different areas. The first is security, in the form of insider threats. Think of being able to catch insider threats in your organization: people who may be trying to leak data or give access to people who shouldn’t have it. Those are things that happen in real-time, and the faster you can make that decision, the faster you can predict that somebody is an insider threat or doing something malicious on your network, the less damage is going to be done to your environment.

And then there’s anomaly detection for manufacturers. We talked a little bit about IoT, but also look at manufacturing. There’s a great example here, and I’d say for your research, one of the books you’ll want to look into is Introduction to Apache Flink. There’s an example in there from the manufacturer Ericsson, who implemented the Kappa Architecture. I think it’s something like 10 to 100 terabytes of data they’re processing at one time, looking for anomalies in that workflow: are there sensors, are there certain things happening out of the norm, so they can stop a manufacturing defect or predict something that’s going to go wrong within their manufacturing area, and also predict issues externally, once users have their devices?

So, those are the three areas I would check. Definitely check out Introduction to Apache Flink; there’s a lot of talk about the Kappa Architecture in there. Use it as one of your resources to pull out some of those examples.

But remember the three areas I would really key in on: IoT; fraud detection, so look at some of the credit card companies and other fraud detection work; and anomaly detection, whether it be insider threats or manufacturers.
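For a feel of what that real-time anomaly detection looks like in code, here’s a toy Python sketch that flags readings several standard deviations away from a rolling window. In production this logic would run inside a streaming engine like Flink or Spark Streaming; the window size and threshold here are arbitrary.

```python
from collections import deque
from math import sqrt

class StreamingAnomalyDetector:
    """Flag readings that drift far from the recent rolling average."""

    def __init__(self, window_size=100, threshold=3.0):
        self.window = deque(maxlen=window_size)  # recent readings only
        self.threshold = threshold               # flag beyond N std devs

    def observe(self, value: float) -> bool:
        is_anomaly = False
        if len(self.window) >= 10:  # wait for a little history first
            mean = sum(self.window) / len(self.window)
            variance = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = sqrt(variance)
            if std > 0 and abs(value - mean) > self.threshold * std:
                is_anomaly = True
        self.window.append(value)
        return is_anomaly

detector = StreamingAnomalyDetector()
for reading in [10.1, 10.3, 9.9] * 10 + [55.0]:  # stand-in sensor feed
    if detector.observe(reading):
        print(f"anomaly detected: {reading}")
```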

So, that’s the end of today’s episode of Big Data, Big Questions. I want to thank everyone for watching. Before you leave, make sure you subscribe, so you never miss an episode or any of my Big Data Tips. I will see you next time. Thank you!

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, IoT, Kappa

Big Data Big Questions: Kappa Architecture for Real-Time

August 7, 2017 by Thomas Henson Leave a Comment

Should I Use Kappa Architecture For Real-Time Analytics?

Analytics architectures are challenging to design. If you follow the latest trends in Big Data, you’ll see a lot of different architecture patterns to choose from.

Architects have a fear of choosing the wrong pattern. It’s what keeps them up at night.

What architecture should be used for designing a real-time analytics application? Should I use the Kappa Architecture for real-time analytics? Watch this video and find out!

Video

Transcript

Hi, I’m Thomas Henson, with thomashenson.com. And today is another episode of Big Data, Big Questions. Today’s question is all about the Kappa architecture and real-time analytics. So, our question today came in from a user, and it’s going to be about how we can tackle the Kappa architecture, and is it a good fit for those real-time analytics, for sensor networks, how it all kind of works together. Find out more, right after this.

So, today’s question came in from Francisco, from Chile, and he says, “Best regards from Chile.” So, Francisco, thanks for your question, and thanks for watching. His question is: “Hi, I’m building a system for processing sensor network data in near real-time. All this time, I’ve been studying the Lambda architecture in order to achieve this. But now, I’ve run into the Kappa architecture, and I’m having trouble deciding between the two.” What he wants to do is analyze this sensor data in near real-time: as the data comes in from the sensors, he wants to obtain knowledge from it and push it out to some kind of UI, some charts and graphs. And he asks whether we have any suggestions about which of these architectures we would recommend. Well, thanks again, Francisco, for your question. Yes, I have some thoughts about how to set up that system. But first, let’s review, real quick, what we’ve covered in previous videos: what the Lambda architecture is, what the Kappa architecture is, and then how we implement them.

So, if you remember, in the Lambda architecture we have two different streams: a batch-level stream and a real-time stream. As your data comes in, it might come in through a queueing system like Kafka, an area where we’re just queueing all the data as it arrives. For real-time, you follow that real-time stream, so you might use Spark, or Flink, or some other real-time processing engine to do the analytics and push results out to your dashboards as the data comes in, right? As soon as that data comes in, you want to analyze it as quickly as you can; that’s what we call near real-time. But you also have your batch layer, for batch processing and for storing the data. Because at some point your queueing system, whether it’s Kafka or something else, is going to get very, very large, some of that data is going to be old, and you don’t need to keep it somewhere you can stream and analyze it all the time. So you want to tier it, or move that data off to, maybe, HDFS or S3 object storage. From there, you can run your distributed search, or keep it in HDFS and use Cassandra, or HBase, or some other NoSQL database running on top of Hadoop. And you can run your batch jobs there too, whether that’s traditional MapReduce or Spark’s batch-level processing. But you have two layers.

And that’s one of the challenges with the Lambda architecture: you have these two different layers, right? You’re supporting two sets of code, and for a lot of your processing, maybe you’re only using the real-time path day to day while the batch processing runs once a month. But you’re still having to support those two different code bases. And that’s why we talk about the Kappa architecture: it simplifies things. As your data comes in, you keep it in one queueing system, one storage layer. Your data comes in, you do your analytics on it, you do your real-time processing, and you push that data out to your dashboards, your web applications, or however you’re consuming it. You do your distributed search there as well: if you’re using Elasticsearch, or Solr, or some other distributed search, you can analyze that data and support real-time search too. You might use Spark or Flink for your real-time analytics, but you also want to do your batch processing. Instead of creating a whole other tier, you do that within the same queueing system. So whether you’re using Kafka, or Pravega, which is a new open-source project just released by Dell, you want all that data in one spot, so when you queue the data you know it’s going to be there, and you can do your analytics on it. You can use it for distributed search, for streaming analytics jobs, and whenever you go back to do batch or traditional processing, you know it’s in that same location too. That way there’s not as much redundancy: you’re not storing data in multiple locations and taking up more room than you really need.
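To sketch what that single-log design can look like in practice, here’s a minimal Kappa-style pipeline using Kafka and Spark Structured Streaming. The broker address, topic, and field names are placeholders, and running it needs the spark-sql-kafka connector on the classpath; treat it as an illustration of the pattern, not a reference implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

schema = (StructType()
          .add("sensor_id", StringType())
          .add("reading", DoubleType())
          .add("event_time", TimestampType()))

# One ingest log: everything reads from the same Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
          .option("subscribe", "sensor-events")               # placeholder
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Real-time view: average reading per sensor per minute.
per_minute = (events
              .withWatermark("event_time", "5 minutes")
              .groupBy(window("event_time", "1 minute"), "sensor_id")
              .avg("reading"))

# Batch-style questions replay the same log instead of hitting a second tier.
query = (per_minute.writeStream
         .outputMode("update")
         .format("console")   # stand-in for a dashboard sink
         .start())
query.awaitTermination()
```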

And this is what we call the Kappa architecture, and it’s why it’s so popular right now: it simplifies the workstream. So, when we start deciding between the two, back to Francisco’s question: Francisco, your application seems to have a real need for real-time, right? There’s a lot going on in that sensor network, and a lot of traffic coming in. This is where we break down a couple of concepts: bound and unbound data. A bound dataset is data where we know how much is going to come in, or where we wait and do the processing after the data has already arrived. When you think of bound data, think of sales orders or inventory numbers. That’s largely what we’d consider transactional data: we have all the data as it comes in, and then we run the calculation. But your data is unbound. With unbound data, you don’t know how much data is coming in, and it’s infinite; it’s not going to end. With network traffic, you don’t know how long it’s going to keep coming. You might get one terabyte at one point, you might get ten terabytes, you might scale up all in one second. And the data might come in uneven, right? Some of it might be timestamped a little earlier than other data arriving alongside it. That’s what we call unbound data.

And for unbound data, the Kappa architecture works really well. It also works really well for bound data. So, looking at your project, my recommendation is to use the Kappa architecture. Go ahead and use it, because you’re working with real-time data. And for those batch workloads, and I’m sure you’ll start having some processing pieces that are batch, you can consume those in the Kappa architecture as well. There are some things you can look into: for streaming analytics, you can use Spark Streaming, Flink, or Beam. You can also use distributed search, like Solr or Elasticsearch. All of those will work well whether you choose the Kappa architecture or the Lambda architecture. My recommendation is: go with the Kappa architecture.

Well, thanks, guys. That’s another episode of Big Data, Big Questions. Make sure you subscribe, so that you never miss an episode. If you want your question answered on Big Data, Big Questions, go to the website, put your comments below, or reach out to me on Twitter, however you want. Submit those questions, and have me answer them here on Big Data, Big Questions. Thanks again, guys.

Filed Under: Big Data Tagged With: Big Data Big Questions, Kappa, Streaming Analytics

What is a Data Lake

July 23, 2017 by Thomas Henson Leave a Comment

Explaining the Data Lake

The Enterprise space is notorious for throwing around jargon. Take the term Data Lake, for example. Does it mean there is a real lake in my data center? Because that sounds like a horrible idea. Or is a Data Lake just my Hadoop cluster?

Data Lakes, or Data Hubs, have become mainstream in the past 2 years because of the explosion in unstructured data. However, one person’s Data Lake is another’s data silo. In this video I’ll put a definition and strategy around the term data lake. Watch it to learn how to build your own Data Lake.

Transcript

Thomas Henson: Hi, I’m Thomas Henson, with thomashenson.com. And today is another episode of “Big Data, Big Questions.” So, today’s question we’re going to tackle is, “What is a data lake?” Find out more, right after this.

So, what exactly is a data lake? If you’ve ever been to a conference, or if you’ve ever heard anybody talk about big data, you’ve probably heard them use the term “data lake.” And if you haven’t heard them say data lake, they might have said data hub. So, what do we really mean when we talk about a data lake? Is that just what we call our Hadoop cluster? Well, yes and no. Really, what we’re looking for when we talk about a data lake is an area that has all of the data we want to analyze. It can be the raw data that comes in off of, maybe, some sensors, or it can be our transactional data: quarterly, yearly, historical reporting data, all in one area so that we can analyze it and bring it together.

And when we talk about the data lake, and where the term comes from, it actually comes from a blog post published some years ago (apologies for not remembering the author’s name, to give credit). The idea was that when we talk about unstructured data and data that we’re going to analyze in Hadoop, in the real world it’s more like a lake. We call it a lake because it’s uneven: you don’t know how much data is going to be in it, and it’s not a perfect circle. If you’ve ever gone to a lake, it doesn’t have a specific shape. And think about the way water comes into it: you might have some underground streams, some above-ground streams, or just some rain runoff flowing in. Compare that to what we’ve traditionally seen with structured data in our enterprise data warehouse: that’s more like bottled water, right? It’s perfect, we know the exact amount that goes into each bottle, it’s all packaged up, it’s been filtered, we know its source. That’s really structured. But in the real world, data exists unstructured, and to get to the point where we have that bottled water, that structured data packaged up tight so we can send it out, you need to take it from unstructured to structured.

And this is where the concept of the data lake comes in: being able to have all this unstructured data in the form it already exists in. Your unstructured data is there and available to all analysts. Maybe I want to run a science experiment and test out some models with historical data and some new data coming in; my counterpart, maybe in another division or just on another project, can use the same data. We all have access to that data for experiments, so when the time comes to support our enterprise dashboards or applications, we can push that data back out in a more structured form. Until then, we really want that data in one central location, so we all have access to it. Because if you’ve ever worked in some organizations, you’ll go in and they all have different data sets that may be the same. I’ve sat on enterprise data boards in different corporations and projects simply because we all had the same data but didn’t call it the same thing, and that really prohibits sharing the data.

And so, from a data lake perspective, don’t just think of your data lake as your Hadoop cluster, right? You want it to be multi-protocol, you want different ways for data to come in and be accessed, and you don’t want it to become just another data silo. That’s what we mean when we talk about a data lake or data hub: it’s our analytics platform, but it’s really where our corporation’s data lives, and it gives us the ability to share that data as well.

So, that’s our big data big question for today. Make sure you subscribe so that you never miss an episode, and also, if you have any questions that you want me to answer, go ahead and submit them. Go to thomashenson.com, big data big questions, you can submit them there, you can find me on Twitter, you can submit your questions here on YouTube – however you want, just ask those questions and I’ll do my best to get back and answer those. Thanks again.

 

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, Data Lake, Enterprise

Is Hadoop Killing the EDW?

June 27, 2017 by Thomas Henson Leave a Comment

Is Hadoop Killing the EDW? A fair question, since in its 11th year Hadoop is known as the innovative kid on the block for analyzing large data sets. If the Hadoop ecosystem can analyze large data sets, will it kill the EDW?

The Enterprise Data Warehouse has ruled the data center for the past couple of decades. One of the biggest big data questions I get is what’s up with the EDW. Most database developers and architects want to know the future of the EDW.

In this video I will give my views on whether Hadoop is killing the EDW!

Transcript

Hi, I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question is a little bit controversial: is big data killing the enterprise data warehouse? Let’s find out.

So, is the death of the enterprise data warehouse coming, all because of big data? The simple answer is: in the short term and the medium term, no. But big data really is hampering the growth of those traditional enterprise data warehouses, and part of the reason is the deluge of all this unstructured data. 80% of all the data in the data center, and in the world, is unstructured, and if you think about enterprise data warehouses, they’re very structured. They’re very structured because they need to be fast, right? They support our applications and our dashboards. But when it comes to analyzing that data, and trying to get unstructured data into a structured form, it really starts to blow up the storage requirements on your enterprise data warehouse.

Part of the reason enterprise data warehouse growth is slow is that 70% of the data in those warehouses is really cold data. Only thirty percent of the data in your enterprise data warehouse is actually used, and normally that’s your newest data. So that cold data is sitting there on premium fast storage, taking up space that carries the licensing fees for your enterprise data warehouse, on top of the premium storage and premium hardware it’s sitting on.

Couple that with the fact that 80% of all new data created in the world is unstructured: data from Facebook or any kind of social media platform, video, log files, and the semi-structured data coming off of your Fitbit or any of those IoT and emerging technologies. If you try to pack all this data into your enterprise data warehouse, it’s just going to explode that licensing fee and that hardware cost, and you don’t even know yet whether this data has any value. That’s where big data, Hadoop, Spark, and that whole ecosystem come in, because we can store that unstructured data on local storage and analyze it before we need to put it into a dashboard or some application.

So, in the long term I think the enterprise data warehouse will start to sunset, and we’re starting to see that right now. But in the immediate term, you’re still seeing a lot of people doing enterprise data warehouse offloads. They’re taking some of that 70% of cold data and transferring it into a Hadoop environment to save on costs and that licensing fee, but also to marry it with the new unstructured data coming in, whether from sensors, social media, or anywhere else in the world. They’re marrying that data to see if they can pull any insights from it. Then, once they have insights, depending on the workload, sometimes they’re pushing it back up to the enterprise data warehouse, and sometimes they’re using some of the newer projects in their big data architecture to support those production, enterprise-data-warehouse-type applications.
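For the technical folks, here’s a rough PySpark sketch of that offload pattern: pull a cold table out of the warehouse over JDBC and park it as Parquet on HDFS. The connection details, table name, and paths are placeholders, and you’d need the matching JDBC driver available to Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("edw-offload").getOrCreate()

# Pull a cold, rarely-queried table out of the EDW over JDBC.
cold_orders = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://edw-host:5432/warehouse")
               .option("dbtable", "sales_history_2010")  # part of the 70% cold data
               .option("user", "etl_user")
               .option("password", "***")
               .load())

# Land it on cheap local storage instead of premium EDW disk and license capacity.
cold_orders.write.mode("overwrite").parquet("hdfs:///lake/sales_history_2010")
```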
So, that’s all we have for today. If you have any questions, make sure you submit them to Big Data Big Questions. You can do that in the comments below or on my website, thomashenson.com. Thanks, and I’ll see you again next time.

Filed Under: Big Data Tagged With: Business, EDW, Hadoop

DataWorks Summit 2017 Recap

June 19, 2017 by Thomas Henson Leave a Comment

All Things Data

Just coming off an amazing week with a ton of information in the Hadoop Ecosystem. It’s been 2 years since I’ve been to this conference. Some things have changed, like the name, from Hadoop Summit to DataWorks Summit. Other things stayed the same, like breaking news and extremely great content.

I’ll try to sum up my thoughts from the sessions I attended and people I talked with.

First there was an insanely great session called The Future Architecture of Streaming Analytics, put on by a very handsome Hadoop guru, Thomas Henson. It was a well-received session where I talked about how to architect streaming applications for the next 2-5 years, when we will see some 20 billion plus connected devices worldwide.

Hortonworks & IBM Partnership

Next there was breaking news about the Hortonworks and IBM partnership. The huge part of the partnership is that IBM’s BigInsights will merge with Hortonworks Data Platform. Both IBM and Hortonworks are part of the Open Data Platform initiative.

What does this mean for the Big Data community? More consolidation of Hadoop distribution packages, but more collaboration on the big data frameworks. This is good for the community because it lets us focus on the open-source frameworks themselves. Instead of having to work through the differences of BigInsights vs. HDP, development will be poured into Spark, Ambari, HDFS, etc.

Hadoop 3.0 Community Updates

The updates coming with the next release of Hadoop 3.0 were great! There is a significant amount of change coming with the release, which is slated for GA August 15, 2017. The big focus is the introduction of Erasure Coding for data striping, support for containers in YARN, and some minor changes. Look for an in-depth look at Hadoop 3.0 in a follow-up post.

Hive LLAP

If you haven’t looked closely at Hive in the last year or so… you’ve really missed out. Hive is maturing into an EDW on Hadoop! I’m not sure how many different breakout sessions there were on Hive LLAP, but I know it was mentioned in most of the ones I attended.

The first Hive breakout session was hosted by Hortonworks co-founder Alan Gates. He walked through the latest updates and future roadmap for Hive. The audience was also posed a question: what do we expect in a Data Warehouse?

  • Governance
  • High Performance
  • Management & Monitoring
  • Security
  • Replication & DR
  • Storage Capacity
  • Support for BI

We walked through where the Hive community stands in addressing these requirements. Hive LLAP was certainly there on the high-performance front. More on that now…

Another breakout session focused on a shoot-out between the Hadoop SQL engines. Wow, this session was full and very interesting. Here is the list of SQL engines tested in the shoot-out:

  • MapReduce
  • Presto
  • Spark SQL
  • Hive LLAP

All the tests were run using the Hive benchmark tests on the same hardware. Hive LLAP was the clear winner, with MapReduce the huge loser (no surprise here). Spark SQL performed really well, but there were issues using the Thrift server which might have skewed the results. Kerberos was not implemented in the testing either.

Pig Latin Updates

Of course there were sessions on Pig Latin! Yahoo presented their results from converting all Pig jobs from MapReduce to Tez jobs. The keynote about Yahoo’s conversion rate from MapReduce jobs to Tez/Spark/etc. jobs showed that Yahoo is still running a ton of Pig jobs. Moving to Tez has increased the speed and efficiency of the Pig jobs at Yahoo. Also, in the next few months Pig on Spark should be released.

Closing Thoughts

After missing last year, it was fun to be back at the Hadoop Summit, now DataWorks Summit. It’s still the premier event for Hadoop developers and admins to come learn the new features developed by the community. This year the theme seemed to be benchmark testing, with a mix of Streaming Analytics and Big Data EDW. It’s definitely an event I will try to make again next year to keep up with the Hadoop community.

 

Filed Under: Big Data Tagged With: DataWorks Summit, Hadoop, Streaming Analytics

Big Data Big Questions: Do I need to know Java to become a Big Data Developer?

May 31, 2017 by Thomas Henson 1 Comment

Today there are so many applications and frameworks in the Hadoop ecosystem, most of which are written in Java. So does this mean anyone wanting to become a Hadoop developer or Big Data Developer must learn Java? Should you go through hours and weeks of training to learn Java to become an awesome Hadoop Ninja or Big Data Developer? Will not knowing Java hinder your Big Data career? Watch this video and find out.

Transcript Of The Video

Thomas Henson:

Hi, I’m Thomas Henson with thomashenson.com. Today, we’re starting a new series called “Big Data, Big Questions.” This is a series where I’m going to answer questions, all from the community, all about big data. So, feel free to submit your questions, and at the end of this episode, I’ll show you how. So, today, the first question I have is a very common question. A lot of people ask, “Do you need to know Java in order to be a big data developer?” Find out the answer, right after this.

So, do you need to know Java in order to be a big data developer? The simple answer is no. Maybe that was the case in early Hadoop 1.0, but even then, there were a lot of tools that were being created like Pig, and Hive, and HBase, that are all using different syntax so that you can extrapolate and kind of abstract away Java. Because the key is, if you’re a data analyst or a Hadoop administrator, most of those people aren’t going to have Java skills. So, for the community to really move forward with this big data and Hadoop, we needed to be able to say that it was a tool that not only Java developers were going to be able to use. So, that’s where Pig, and Hive, and a lot of those other tools came. Now, as we start to look into Hadoop 2.0 and Hadoop 3.0, it’s really not the case.

Now, Java is not going to hinder you, right? So, it’s going to be beneficial if you do know it, but I don’t think it’s something that you would want to run out and have to learn just to be able to become a big data developer. Then, the question is, too, when you say big data developer, what are we really talking about? So, are we talking about somebody that’s writing MapReduce jobs or writing Spark jobs? That’s where we look at it as a big data developer. Or, are we talking about maybe a data scientist, where a data scientist is probably using more like R, and Python, and some of those skills, to pull their insights back? Then, of course, your Hadoop administrators, they don’t need to know Java. It’s beneficial if they know Linux and some of the other pieces, but Java’s not really necessary.

Now, I will say, with a lot of this technology… if you look beyond the Hadoop world at Spark: Spark has Java, so you can write your Spark jobs in Java, but you can also do it in Python and Scala. So, it’s not a requirement for people to have Java. I would say there are a lot of big data developers out there that don’t have any Java skills, and that’s quite okay. So, don’t let that hinder you. Jump in, join an open-source community project, do something to expand your big data knowledge and become a big data developer.
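To show how little Java is actually required, here’s the classic word count written entirely in Python with PySpark; the input path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-no-java").getOrCreate()

# The canonical Hadoop example, no Java in sight.
counts = (spark.sparkContext.textFile("hdfs:///data/sample.txt")  # hypothetical path
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```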

Well, that’s all we have today. Make sure to submit your questions. So, I’ve got a space on my blog where you can submit the questions or just submit them here, in the comments section, and I’ll answer your big data big questions. See you again!

 

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, Hadoop, Learn Hadoop

Ultimate Big Data Battle: Batch Processing vs. Streaming Processing

May 8, 2017 by Thomas Henson 2 Comments

Today developers are analyzing terabytes and petabytes of data in the Hadoop ecosystem, and there are many projects helping to accelerate this innovation. All of these projects rely on batch and streaming processing, but what is the difference between the two? Let’s dive into the debate around batch vs. streaming.

What is Streaming Processing in the Hadoop Ecosystem

Streaming processing typically takes place as the data enters the big data workflow. Think of streaming as processing data that has yet to enter the data lake. While the data is queued, it’s being analyzed, and as new data enters, it is read and the results are recalculated. A streaming processing job is often called a real-time application because of its ability to process quickly changing data. While streaming processing is very fast, it has yet to be truly real-time (maybe some day).

The reason streaming processing is so fast is that it analyzes the data before it hits disk. Reading data from disk incurs more latency than reading from RAM. Of course, this all comes at a cost: you are bound by how much data you can fit in memory (for now…).

To understand the differences between batch and streaming processing, let’s use a real-time traffic application as an example. The traffic application is a community-driven driving application that provides real-time traffic data. As drivers report conditions on their commute, the data is processed and shared with other commuters. The data is extremely time sensitive, since finding out about a traffic stop or fender bender an hour later would be worthless information. Streaming processing is used to provide the updates on traffic conditions, estimate time to destination, and recommend alternative routes.
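Here’s a toy Python sketch of that streaming model using the traffic app: each report is processed the instant it arrives, and stale reports are dropped rather than analyzed. The report fields are invented for illustration.

```python
import time
from collections import defaultdict

def incident_stream():
    """Stand-in for driver reports arriving one at a time."""
    yield from [
        {"road": "I-565", "type": "fender bender", "ts": time.time()},
        {"road": "I-565", "type": "traffic stop", "ts": time.time()},
        {"road": "US-72", "type": "slowdown", "ts": time.time()},
    ]

def update_conditions(report, live_counts, max_age_secs=3600):
    """Process each report as it arrives; hour-old reports are worthless."""
    if time.time() - report["ts"] > max_age_secs:
        return live_counts  # too stale to help any commuter
    live_counts[report["road"]] += 1
    return live_counts

live_counts = defaultdict(int)
for report in incident_stream():
    live_counts = update_conditions(report, live_counts)
    print(dict(live_counts))  # push updated conditions to commuters here
```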


What is Batch Processing in the Hadoop Ecosystem

Batch processing and Hadoop are often thought of as the same thing. All data is loaded into HDFS, and then MapReduce kicks off a batch job to process it. If the data changes, the job needs to be run again. It’s step-by-step processing that can be paused or interrupted, but not changed from a data set perspective. For a MapReduce job, the data typically already exists on disk in HDFS. Since the data lives on the DataNodes, it must be read from each disk in the cluster where it’s contained. Shuffling this data and its results becomes the constraint in batch processing.

That’s not a big deal unless the batch process takes longer than the data holds its value. Using the data lake analogy, batch processing analysis takes place on data in the lake (on disk), not in the streams (data feeds) entering the lake.

Let’s step back into the traffic application to see how batch is used. What happens when a user wants to find out what her commute time will be for a future trip? In that case real-time data will be less important (the further away from the commute time, the less so), and the historic data will be the key to building that model. Predicting the commute can be handled by a batch engine because the data has typically already been collected.
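And here’s the matching batch-style sketch: the commute prediction scans the full collected data set at rest, with made-up records standing in for history already sitting in the lake.

```python
from statistics import mean

# Historical records already "in the lake" (on disk), not streaming in.
history = [
    {"route": "home->office", "weekday": "Mon", "minutes": 34},
    {"route": "home->office", "weekday": "Mon", "minutes": 41},
    {"route": "home->office", "weekday": "Tue", "minutes": 29},
]

def predict_commute(records, route, weekday):
    """Batch-style prediction: one pass over the whole collected data set."""
    samples = [r["minutes"] for r in records
               if r["route"] == route and r["weekday"] == weekday]
    return mean(samples) if samples else None

print(predict_commute(history, "home->office", "Mon"))  # -> 37.5
```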

Is Streaming Better Than Batch?

Asking if streaming is better than batch is like asking which Avenger is better. If the Hulk can tear down buildings, does that make him better than Iron Man? What about the Avengers in Age of Ultron, when they were trying to reflect off the Vibranium core? How would all that strength have helped there? In that case Iron Man and Thor were better than the Hulk.

Just like with the Avengers, streaming and batch are better when working together. Streaming processing is extremely well suited for cases when time matters. Batch processing shines when all the data has been collected and is ready for testing models. There is no “one is better than the other” argument right now with batch vs. streaming.

The Hadoop eco-system is seeing a huge shift toward streaming and batch coupled together to provide both processing models. Both workflow types come at a cost, so analyzing data with a streaming workflow that could be analyzed in a batch workflow adds cost to the results. Be sure the workflow matches the business objective.

Batch and Streaming Projects

Both workflows are fundamental to analyzing data in the Hadoop eco-system. Here are some of the projects and which workflow camp they fall into:

MapReduce – MapReduce is where it all began. Hadoop 1.0 was all about storing your data in HDFS and using MapReduce to analyze that data once it was loaded into HDFS. The process could take hours or days depending on the amount of data.

Storm – Storm is a real-time analysis engine. Where MapReduce processes data in batches, Storm does analysis in streams, as data is ingested. Once seen as the de facto streaming analysis engine, it has lost some of that momentum with the emergence of other streaming projects.

Spark – Processing engine for streaming data at scale. Most popular streaming engine in the Hadoop eco-system with the most active contributors. Does not require data to be in HDFS for analysis.

Flink – Hybrid processing engine that uses both streaming and batch processing models. Data is broken down into bound (batch) and unbound (streaming) sets. At its core it is a stream processing engine that incorporates batch processing.

Beam – Another hybrid engine breaking processing into streaming and batch. Runs on both Spark and Flink. Heavy support from the Google family. A project with a heavy amount of optimism in the Hadoop eco-system right now because of its ability to run both batch and streaming processing depending on the workload.

Advice on Batch and Streaming Process

At the end of the day, a solid developer will want to understand both workflows. It all comes down to the use case and how either workflow will help meet the business objective.

Filed Under: Big Data

Ultimate Big Data Podcast List

December 13, 2016 by Thomas Henson 3 Comments

My Ultimate Agile Podcast blog post was such a hit I thought it only appropriate to do one for Big Data. Who doesn’t need to geek out on data when in the car, plane, train, or on the treadmill? Listening to podcasts is one of the easiest ways to keep your skills up. However, finding a curated list of podcasts on just Big Data is not easy.

The list is intended to be a resource for the Big Data/Hadoop/Data Analytics community, so I will continue to update it with new Big Data podcasts and episodes.

If you host one of the big data related podcasts below, or a new podcast, and would like to interview me on your show, reach out via Twitter, the comments, etc.

Let me know if you notice a missing podcast or broken links. Just add a comment or contact me and I will make the changes.

Since I created this list, I’m putting the episodes of the podcasts I appeared on first.

Big Data Podcast List by Category

Hadoop/Spark/MapReduce

Big Data Beard Podcast – Newly released podcast exploring the trends, technology, and talented people making Big Data a big deal. Hosts are Brett Roberts, Cory Minton, Kyle Prins, Robert Hout, Keith Quebodeaux, and myself. Join us as we talk about our Big Data journey with others in the community.

Get Up And Code 093: All About Running With Thomas Henson – In this podcast episode my friend John Sonmez and I talk about how I ran my first 1/2 marathon and the release of my Pig Latin Getting Started course. Pig Latin was one of the first languages I learned in the Hadoop ecosystem, and I was excited to give back to the community with this course.
My Life for the Code 02: Big Data Niche, Pluralsight, Family, and more with Thomas Henson – Another podcast I appeared on, talking more about Pig Latin and where I see big data going over the next 10 years. Shawn and I also jump into pursuing your passion (spoiler: mine is data analytics) while raising a family. We even threw in a couple of my book recommendations and teased my 2nd Pluralsight course, HDFS Getting Started.

LinkedIn’s Kafka, Digital Ocean gets deep about cloud and Red Sox data! – LinkedIn’s Kafka processing 1 trillion messages…

All Things Hadoop – Favorite episode: Hadoop and Pig from Alan Gates at Yahoo. The title alone gives you an indication of how old it is, but it’s still an awesome listen.

Puppet Podcast: Provisioning Hadoop Clusters with Puppet – Learn how to automate your CDH environment with Puppet. Mike Arnold, the creator of the Puppet module, talks about how to deploy CDH at large scale with Puppet. If you’re virtualizing Hadoop (and you should be) then you’ll want to take notes in this episode on how to speed up your deployment process. My prediction is that in the next year we will see more automation tools in the Hadoop ecosystem.

Roaring Elephant Podcast – Awesome insight from two guys working out in the field in Europe. They talk through hot topics in the Hadoop ecosystem and also share some real-world stories from the customers they speak with. A great podcast if you are just starting out on your Hadoop journey.

  • Episode 49: Thomas Henson on IoT Architectures 

TechTarget Talking Data – Quick, short, digestible episodes all about data: build vs. rent, Kafka, and Spark Streaming.

Data Engineering Podcast – Podcast dedicated to those who run the data pipelines in Big Data and analytics workflows. Host Tobias does an amazing job keeping Data Engineers up to date on data workflows and the tools to create them.

Business of Big Data

Hot Aisle with Bill Schmarzo – One of my favorite podcast episodes (full disclosure: I work with both the hosts of the Hot Aisle and Bill Schmarzo) on the topic of the business of big data. Bill’s insight into what Big Data can mean for a business is something a lot of us developers/admins lack when talking outside the walls of IT. One of the biggest reasons Hadoop projects fail is that they aren’t tied to a business objective. In this episode, learn how to tie your Hadoop project to a business objective to generate more revenue for the company, which brings in more money to expand your Hadoop cluster (win-win-win).

Cloud of Data – Wow, talk about an all-star cast of interviews; it looks like a who’s who of data CEOs. The first episode was with InfoChimps’ CEO; I actually worked at CSC during the InfoChimps acquisition. Those were some really bright data scientists.

Data Analytics/ Machine Learning

Data Skeptic – Usually short-format episodes on specific topics in data analytics. The podcast is great: it’s about data analytics broadly, not just big data, which is often confused with it. My favorite episodes are the algorithm explanations, because as someone who mostly stays on the software side, I like to keep up with how these algorithms are used; it helps when working with the data science team.

Partially Derivative – Another great podcast on data analytics. My favorite episode was done live from Stitch Fix, my wife’s favorite product and mine too, but for a different reason. Stitch Fix is a monthly subscription company that matches a customer with their own personal stylist, but behind the dressing room curtain Stitch Fix is really a data company. Listen in to hear about all the experimentation that takes place on a daily basis at Stitch Fix, and about how they are using machine learning to pick out clothes you’d like.

Linear Digressions – More short, quick hits on data analytics: machine learning on genomics, how the polls got Brexit wrong, and election forecasting.

Data Crunch – Podcast devoted to highlighting how data analytics is changing the world. Released 1-2 times a month, with episodes coming in under 30 minutes.

Internet of Things (IoT)

Inquiring Minds: Understanding Heart Disease with Big Data – Not a podcast dedicated to IT or Big Data, but in this episode Greg Marcus talks about analyzing the heart with IoT. Think that smartwatch is just for tracking steps and sending text messages? That smartwatch could help advance the science behind heart disease by giving your doctor access. A really great episode on how IoT is enabling lower-cost research in healthcare and providing more data than traditional studies.

Oh, and if you are looking for quick tips on Hadoop/Data Analytics/Big Data, subscribe to my YouTube channel, which is all about getting started in Big Data. Make sure to bookmark this page to check for frequent updates to the list. As Big Data gets more popular, this list is sure to grow.

 

 

Filed Under: Big Data Tagged With: Big Data, Data Analytics, Hadoop, Hadoop Pig, Podcast
