Data Engineers: Python VS. C#

June 18, 2019 by Thomas Henson 2 Comments

Python VS. C#

Which Is Better: Python or C#?

Getting into wars over different programming languages is a no-no in the world of programming. However, I recently had a question on Big Data Big Questions about which is better for Data Engineers: Python or C#. So, in the spirit of examining the differences through the lens of Data Engineering, I decided to weigh in.

Python has long been used in Data Analytics for building Machine Learning models. C# is an object-oriented programming language developed by Microsoft and used widely in all ranges of applications. Both have a ton of community support and a large user base, but which one is better? In this episode of Big Data Big Questions I break down both Python and C# for Data Engineers. Make sure to watch the video to find out my thoughts on which is better in Data Engineering.

Transcript – Data Engineers: Python VS. C#

Hi folks! Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question, we’re going to do a comparison between Python and C#. It’s a question that I’ve had coming in, and it’s also something that’s a passion of mine, because I used to be a C# developer back in the day. Then, over the last four or five, maybe six years, I’ve learned Python. I thought it would be good to go through some of that, especially if you’re just starting out. Maybe you’re in high school, or maybe you’re in college, or maybe you’re even looking to make a jump into data engineering or machine learning engineering, and you’re like, “Hey, man, there’s C# out here. There’s Python. What are some of the differences? What should I learn?” Find out right after this.

Today’s episode of Big Data Big Questions, I wanted to go through some of the differences between Python and C#. First thing, we’ll start off with C#. C# was developed by Microsoft; I think it was released in 2000. It’s an object-oriented programming language. You see it a lot. I used it, for instance, back when I was doing ASP.NET. There are a lot of things you can use it for. It relies on the .NET Framework. You have to have the .NET Framework to be able to go. They are on version 7.0. I used it a lot for web application development, but you can do a lot of different things with it and build out really complex and awesome applications, whether it be desktop, web, or mobile. They’ve just got so much of a community that there are a lot of different things you can do with it. Another thing, too, one of the comparisons to it: it looks just like Java. That’s another reason I rotated to it. One of the first languages I learned, I think, was VB, but I did a lot of stuff around Java, and when I graduated from college, I thought I was going to be a Java developer for a long time. I really got ingrained in that community. Fast forward to being a web developer, and transitioning to C# was a really natural process for me. Like I said, heavy community, heavy packages, and frameworks, and things to be able to use. You see it a lot with Microsoft. If you’re doing C#, you’ve probably used Visual Studio or VS Code. They’ve got a couple of different IDEs for development and everything like that.

Python. If you’ve been following this channel, you’ve probably seen a ton of videos that we’ve done around Python. Python was developed in 1991. It’s on version 3. We talked about C# being on version 7; I wouldn’t put a lot into that, because C# came out in 2000 and Python’s been around since 1991. Both are heavily active. It’s object-oriented, just like C#. You see it a lot in data analytics, for sure, but there are a lot of other frameworks you can use for things like web development. Pretty much, you can do anything you want with Python. You do have to install Python and have it running on your machine. Sometimes that can be a little bit clunky, especially in a Windows environment, but it’s something you can download, start playing with, and have going on your machine in probably less than five minutes. Maybe I should do a video on that, but you can go ahead, download it, be up and running, and start running your own code. Huge community support. There’s a ton of things out there for it. Like I said, even in our book review we talked about some of the books for data engineers, and I think there were two or three Python books that I showed there, too. Heavy use there. There’s a lot of involvement from data analytics, whether it be data scientists or machine learning engineers, and just like with TensorFlow or PyTorch, a lot of the deep learning frameworks that we’ve talked about on this channel have Python APIs.
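
To give a sense of how quick that on-ramp is, here is a minimal sketch of the kind of five-minute script I’m talking about, using nothing but Python 3’s standard library (the events.csv file and its duration column are made-up stand-ins):

```python
# Sketch: summarize a CSV using only the Python standard library.
import csv

with open("events.csv", newline="") as f:
    rows = list(csv.DictReader(f))

durations = [float(r["duration"]) for r in rows]
print(f"{len(rows)} rows, average duration {sum(durations) / len(durations):.2f}")
```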

The question is, you’re a data engineer just starting out, which one should you learn? I’m going to go through three different questions, where we’ll talk about what you should learn and which one is better. I hate doing “which one is better,” because each one is a different tool, and some tools are better at certain things, or have more functionality for certain tasks. Let’s jump right into it.

Which one is easier to learn? Err! I’m having to catch myself there, because I’m biased as far as C#, just having been a part of that community. Like I said, my first language was VB, which was similar, and then I did a ton of work in Java. C# on the surface looked a little easier, but the way I’m going to judge this is which one is easier to get up and started with from a data engineering or data analytics perspective. I’m going to have to give it to Python. Like I said, it can be a little clunky when you’re first installing it, but if you build out a Linux machine, especially if you’re on Red Hat, you can do yum install python and then start scripting away on some of the code there. I’ll also give it to Python just because of how much is available from a data analytics perspective. Number two, I’m a data engineer. I’m a machine learning engineer. Which one should I learn today? Which one would I start off with if I had to choose, only choose between Python and C#? I would probably go with Python, right? Go ahead and learn Python. I would encourage anybody watching this channel to jump into that community. There’s a ton of books out there, which we’ve talked about on this channel, where you can go and learn how to do data analytics from that perspective. Python’s going to get the win there. Which one do I enjoy coding in more? Personal preference, man, I think C# will always have that win for me. Like I said, this is a data engineering channel, but I started off as a web developer. I really like Visual Studio, and I know there are some plugins for VS Code, so you can use that as your IDE for Python and everything like that. There’s something about C# and that language that I really found comfortable, and it probably will always have a special place in my heart. Like I said, just coming from a Java perspective and everything like that, I’ll give that the win. The overall win between the three categories: if you’re a data engineer or a machine learning engineer, and you have to start somewhere, I’d say start with Python. Go through some of the tutorials. I’ve got some on this channel, and I’ve got some on my blog, but get started there. I hope you enjoyed this. Tell me what you think. Did I miss something on the differences? Would you have chosen C# as something to start off with? Do you like Python better than C#? Like I said, C# has a special place in my heart. Let me know in the comments section below, or if you have any questions you want me to answer on the show, put them in there, and then make sure you subscribe and ring that bell, so you never miss an episode of Big Data Big Questions.

Filed Under: Data Engineers Tagged With: Big Data Big Questions, Data Engineers, Python

Tableau For Data Science?

May 15, 2019 by Thomas Henson Leave a Comment

Big Data Big Questions

Tableau is huge for interacting with data and empowering users to find insight in their data. So does this mean Tableau is the primary tool for Data Scientists? In this episode of Big Data Big Questions we tackle the question “Is Tableau used for Data Science?”

Tableau For Data Science

What Is Tableau?

Tableau is business intelligence software that allows users to visualize and drill down into data. Users lean on Tableau heavily for the visualization portion of Data Science projects. The sources for the data can be databases, CSVs, or almost any source with structured data. So if Tableau is for analyzing and visualizing data, is it a tool specifically for Data Scientists? Watch the video below to find out Tableau’s role in the world of Data Science.

Transcript – Tableau For Data Science?

Hi folks! Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question comes in from a user, and it’s around data science, and Tableau, and how those go together. But, before we jump into the question, if you have a question about data engineering, IT, data science, anything related to IT, or just want to throw a question at me, put it in the comments section below or reach out to me on Twitter with #BigDataBigQuestions. Or go to thomashenson.com/big-questions. There are a ton of ways to get your questions answered right on this show; all you have to do is type away and ask.

Now, let’s jump into today’s question. Today’s question comes in from a YouTube viewer, and it’s about, hey, in data science, do you use Tableau? You can see the question here as it pertains to this. We started this show doing questions around data engineering, but now we’re really jumping toward, hey, what’s going on from a data science perspective, and encompassing all of it. Today, we’re going to talk about where Tableau is used, right? A lot of people use Tableau. It’s really, really popular. But, is that really a tool that a data scientist is going to use? Whether or not you’re aspiring to get into data science, should you invest your time as a data engineer or a data scientist in learning that tool?

My thoughts on Tableau are that it’s really good for getting information out to users who aren’t necessarily data scientists. They could be end users. They could be analysts. They could be somebody who just has a stake in the business. I’ve used it at a lot of different corporations, companies, and organizations that I’ve worked at, and really what I see is that those tools are more for the end user, for visualization. They fall more in the data visualization bucket. We’ve talked about the three tiers of work: you have your data scientist, your data engineer, and your data visualization specialist, the person who’s making sure that, hey, at the end of the day, it’s great that we have all these algorithms that can predict whatever we’re trying to look at in our data, but if we can’t sell that and can’t convey it to the people who need the data to make a decision, then it’s just an experiment, it’s just us having fun doing research.

When it comes to an end product or being able to really sell your point, data visualization, I think, is the bucket that Tableau fits in, more than traditional data science. I could be wrong. Let me know if I am here in the comments section below, but let me talk a little bit about my use case and where I’ve seen it. Like I said, I’ve used it in a lot of different organizations that I’ve worked with or even contracted with. I’ll give you an example of one of the main use cases. Let’s say that you’re a YouTube creator. I’m not saying YouTube uses Tableau, this is just an example; I don’t want to give away too much insider information. If you have a YouTube channel, think about wanting to see the videos that are coming in. You’re a user. You’re a publisher, a creator. You want to know: here are all the videos that you have. Here’s how long they’re watched. Here are all the demographics from behind the scenes that you can pull. Maybe the times that they were watched. How long they were watched, so on this video here, if people drop out after 30 seconds, I did something wrong there, versus how many people watch through to the end of it. What you would do is take all this information, aggregate all this data, and maybe even pull some insights. Like, hey, what’s your average? We can do some real simple things, or you can do some complex things, too. Tableau is where you’re going to give the end user that access.
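
To make that concrete, here’s a rough sketch of the kind of aggregation that would sit behind a dashboard like that, using pandas (the file and column names are hypothetical, not anything YouTube actually exposes):

```python
# Sketch: per-video viewing stats from a hypothetical export file.
import pandas as pd

views = pd.read_csv("views.csv")  # columns: video_id, watch_seconds, finished

summary = views.groupby("video_id").agg(
    total_views=("watch_seconds", "size"),
    avg_watch_seconds=("watch_seconds", "mean"),
    completion_rate=("finished", "mean"),
)
print(summary.sort_values("total_views", ascending=False).head(10))
```

A summary table like this is exactly the kind of thing you’d then hand to Tableau for the visualization layer.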

At least, that’s what I’ve seen a lot of. There’s a big need to be able to do that and be able to pull that data. I wouldn’t say that a data scientist wouldn’t, per se, use Tableau as a tool, but it wouldn’t be their only tool. Maybe it’s the way they aggregate and look at large amounts of data before they go in and start to pick and choose. I’m sure there are some modules out there that are incorporating machine learning and deep learning. I will say, if you’re really looking to jump in from an AI perspective, it’s not just going to be about Tableau. I’m not saying that you shouldn’t get up to speed on Tableau, but if I were a brand-new person graduating high school or college, or somebody looking to move their career into data science, my choice would not be to jump in and learn Tableau. I would start learning a little bit more about Python, and algorithms, and maybe R, or some of the other higher-level languages used for machine learning and deep learning, versus saying, “Hey, this is the tool that’s going to take me there.” Now, if you’re a data visualization person, or you want to get into big data from that perspective, there are a lot of things you can use Tableau to do. You might add it to your bucket. As far as what we talk about on this show, how to accelerate your career or how to break into the big data realm, this is not one of those tools where I’m going to say, hey, this is the only choice you have. It’s probably not going to be the one that makes the most sense. It’s not going to be the game changer, like, hey, this person’s certified in Tableau or is a Tableau wizard. If you’re applying for a job that’s all around Tableau, then, definitely. But as far as really wanting to get deep into data science, Tableau is one of those things: you’ll probably use it or come across tools similar to it, but it’s not going to be your mainstay, where you’re writing your algorithms and doing your analytics.

That’s all for today. If you have any questions, make sure you put them in the comments section here below, and then make sure you click subscribe to follow this channel, so that you never miss an episode of Big Data Big Questions.

[Music]

Filed Under: Data Engineers, Video Tagged With: Big Data Big Questions, Data, Data Science

Why Data Engineers Should Care About IoT

June 4, 2018 by Thomas Henson Leave a Comment

Data Engineers Should Care About IoT

Why Data Engineers Should Care About IoT

The Internet of Things has been around for a few years but has hit an all-time high for buzzword status. Is IoT important for Data Engineers and Machine Learning Engineers to understand? Gartner predicts that by 2020 there will be over 20 billion connected devices worldwide. The data from these devices will be included in current and emerging big data workflows. Data Engineers and Machine Learning Engineers will need to understand how to quickly process this data and merge it with existing data sources. Learn why Data Engineers should care about IoT in this episode of Big Data Big Questions.

Transcript

Hi folks, Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, I want to tackle IoT for data engineers. I’m going to explain why IoT, or the Internet of Things, matters for data engineers, and how it’s going to affect our careers, how it’s going to affect our day-to-day jobs, and honestly, just the data that we’re going to manage. Find out more, right after this.
[Sound effects]

Today’s question is, what is IoT, and how does it affect data engineers? We’ve probably all seen the buzzword, or the concept, of the Internet of Things, but what does that really mean? Is it just these little dash buttons that we have? Is this? Wait a minute. Is that ordering something?

Is this what IoT is, or is it the whole ecosystem and concept around it? First things first. IoT, or the Internet of Things, is the concept of all these connected devices, right? It’s not something that is, I will say, brand new. It’s something that’s been out there for a while, and when we really get down to it, it starts with a sensor. We have these sensors, these cheap sensors.

We’ve had them for a long time, but what we haven’t had is all these devices connected with an IP address to the Internet, that can send the data. That’s the big part of the concept. It’s not just about the sensor, it’s about being able to move the data from the sensor.

This gives us the ability to manage things in the physical world, bring the data back, do some analytics on it, and even push data back out. The cool thing is, generally, IoT devices are, I would say, economical or cheap devices that have an IP address and can just pull in information. Think about a sensor: if you have a smartwatch that’s connected to the Internet, it can feed information up to you. That’s where some of it all started. These dash buttons: I can have them installed all around my house and push a button whenever I need something. Or start to look at what we’re talking about with smart refrigerators. Smart refrigerators can take pictures and have images of all the content that’s in your refrigerator, so if you’re at the store, you look, and you’re like, “Hey, you know, what am I…? Do I need that ranch dressing? Yeah? Let me check in my refrigerator, here.”

Also, a sensor could be inside the refrigerator and tell you if something’s going wrong. Maybe the ice maker is blocked. Maybe you need a new water filter in your refrigerator, and the refrigerator knows that, because it has a sensor in it. It can send information to wherever it needs to go, order that water filter for you, and send it to your home, so you don’t even have to remember, “Hey, has it been 90 days? Or was it 60 days? Is it time to change it?” You’re going to forget. You’re going to let it go over, but now you can have this sensor that’s going to tell you, and it’s going to order that for you. That’s the concept. It’s not just about the sensor. It’s about that ecosystem.
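
Just as a sketch of that ecosystem idea, here’s roughly what the refrigerator sensor’s side of the conversation could look like in Python, assuming MQTT as the transport and the paho-mqtt package (the broker address, topic, and field names are all made up for illustration):

```python
# Sketch: a fridge sensor reporting its water-filter status to a broker.
import json
import time

import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("broker.example.com", 1883)  # hypothetical MQTT broker

reading = {"device_id": "fridge-42", "water_filter_days": 87, "ts": time.time()}
client.publish("home/fridge/status", json.dumps(reading))
client.disconnect()
```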

It’s about being able to move the data. For data engineers, what does this mean? Why do we care?

There are a lot of predictions out there about IoT and where it’s going. One of the big ones is Gartner’s prediction that by 2020 we will have over 20 billion of these devices. Not just the dash buttons, but think of all these sensors, all these things with IP addresses connected to the Internet. What does that mean, from a data perspective? Some of the predictions I’ve seen say 44 zettabytes of data, between new data coming in and the data that already exists. Think about it. What is a zettabyte? It’s not a petabyte. It’s bigger than a petabyte; a zettabyte is a million petabytes.

How are we going to manage all this data, when right now we’re still managing terabytes and petabytes and being like, “Man! This is a lot of data!”? That’s why it’s important for data engineers: IoT is contributing to this deluge of data. How does all that affect us, and what are some of the concepts? When we start talking about IoT, and sensors, and having this data on the edge, we need to be able to pull information back, but also to push information out. What does that start to say?

As we’ve talked more and more about real-time analytics, this is where we’re going to start to see it really take hold. As soon as we can get that data, analyze it, and push information back out, that’s what’s going to help us. Think about it with automated cars and a lot of the things going on out in the physical world, where we have sensors and devices talking to devices: streaming analytics is going to be huge in IoT.
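
To make the streaming idea concrete, here’s a minimal sketch of real-time analytics on sensor data with PySpark Structured Streaming, using Spark’s toy socket source so it runs with no other infrastructure (the host, port, and comma-separated reading format are all made up for illustration):

```python
# Sketch: running average temperature per device over a live stream.
# Assumes a local Spark install; the socket source is for demos only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iot-streaming").getOrCreate()

lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Each incoming line is expected to look like "device_id,temp_c".
readings = lines.select(
    F.split("value", ",").getItem(0).alias("device_id"),
    F.split("value", ",").getItem(1).cast("double").alias("temp_c"),
)

avg_temp = readings.groupBy("device_id").agg(F.avg("temp_c").alias("avg_temp"))

query = avg_temp.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```

While the job runs, you can feed it test lines with something like nc -lk 9999.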

The question becomes, if you’re looking to get involved in IoT, what are some of the projects? What are some of the things you can do to contribute and be a part of this IoT revolution? I would look into some of the messaging queues. Look at Pravega, look at Kafka, even look at RabbitMQ and some of the other messaging queues, because think about it: 20 billion devices, maybe more, by 2020. As the data from these devices comes in, it has to come into a queue. It has to be stored somewhere before it can be processed and before we can analyze it. I would look into the storage aspect of that.
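
As a sketch of that queueing side, here’s roughly what landing a device reading in a Kafka topic looks like with the kafka-python package (the broker address, topic name, and payload fields are illustrative, assuming a local Kafka install):

```python
# Sketch: one sensor reading landing in a Kafka topic before processing.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("sensor-readings", {"device_id": "thermostat-7", "temp_c": 21.5})
producer.flush()  # make sure the message actually leaves the client
```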

Also, know how to do the processing. Look at some of the stream processing frameworks, whether it be Apache Beam, Flink, or Spark. I would look into those if you’re looking to get involved in IoT. If you have any questions, make sure you submit them in the comments section below, or go to thomashenson.com/big-questions. Submit your questions, and I’ll try to answer them on here.

Until next time, see you again.


Filed Under: IoT Tagged With: Big Data Big Questions, Data Engineer, IoT

Better Career: Hadoop Developer or Administrator?

January 16, 2018 by Thomas Henson 1 Comment

Hadoop Developer or Administrator

The Hadoop Ecosystem is booming and so is the demand for Hadoop Developers/Administrators.

How do you choose between the Developer and Administrator paths?

Is there more demand for Hadoop Developers or Administrators?

Finding the right career path is hard and creates a lot of anxiety about how to specialize in your field. To land your first job or move up in your current role, specializing will help. In this video I help Data Engineers choose a path between Hadoop Developer and Administrator. Watch the video to get a breakdown of the Hadoop Developer and Administrator roles.

Video – Better Career: Hadoop Developer or Administrator

Transcript

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions

Non-Technical Careers in Big Data

January 13, 2018 by Thomas Henson Leave a Comment

Non-Technical Careers in Big Data

Big Data Career Without Coding?

Do all career options in Big Data demand skills in coding or administration? Big Data projects are in high demand right now, but the skill sets for these projects come from different backgrounds. If you want to get involved with Big Data but don’t have a technical background, watch the video to learn your options.

Options discussed in Non-Technical Careers in Big Data:

  1. Data Governance
    1. How timely is the data?
    2. What is the source of all this data? Garbage in, garbage out
    3. Explain one of my first jobs in IT
  2. Project Management
    1. Agile Development
    2. Scrum Master
    3. DevOps
  3. Compliance & Security
    1. Huge Data Lakes need securing
    2. Huge potential with GDPR (General Data Protection Regulation) - Plug Alan Gates Interview
    3. How many different breaches do we hear about on a daily basis?

Video – Non-Technical Careers in Big Data

Transcript

Hi, folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Welcome back to the new year. Our first thing that we’re going to tackle today in our first episode of Big Data Big Questions for 2018 is going to be non-technical jobs or career options inside of big data.

It was submitted by one of our YouTube users. You can find out more right after this.

Today’s question comes in from YouTube. Remember, if you have any questions around big data or anything that you want to ask and you want me to answer, you can submit those in our YouTube comments below on any of the videos, or you can go to my website at thomashenson.com/bigdataquestions. You can submit any questions there, and I’ll answer them as best I can on air, and give you my advice on the Hadoop community, or big data, or data engineers, or any questions that you have.

Today’s question comes in from YouTube, and it’s from Shahzad Khan. He says, “I work as a change manager, and I don’t know anything about Java or Hadoop, but I want to learn this technology. Is it all right for me to learn, since I’m not into coding? Also, I’ve never been involved in a development team, please suggest.”

Great question. Thanks for the comment and thanks for watching. Continue to watch. My first thought when I look at this is, we’ve talked about this before; I’ve had a couple of other videos where we’ve talked about how you don’t have to know Java to be involved in Hadoop. If you have any questions around that, you can check into those. Really, I want to frame this question a little bit differently: just because you want to be involved in big data, and in the community and all the things that are happening, you don’t necessarily have to have a technical role.

There are three roles that I want to talk about that are non-technical from the aspect of coding and Hadoop administration, where you can still be involved in data or even big data. These aren’t just specifically for big data; they also apply to data analytics.

The first one is data governance. When we talk about data governance, we talk about, what’s the flow of data? Where did the data originate? Everybody’s probably heard the adage of garbage in, garbage out. Where’s your data coming from? Can you trust, and can you automate, the data that’s coming in? Data governance is about where that data comes from, but it’s also about how timely that data is. You’re really involved with the sourcing of the data. I remember one of my first jobs: we had a couple of different applications, and the heads of each application got together, and we were all there to talk about the different ways we named things in our own databases. If you think about it, we were trying to merge everything into an enterprise data warehouse. This is a little more old school, but it still happens in big data, when we have these different data sources.

You might have an instance where data in one data set is named or keyed differently than data in a separate data set, but you want to be able to merge them. Data governance is where you can help find where the data’s coming from and be a part of that, so that’s one option. I would look into data governance if you still wanted to be involved in big data but didn’t have the technical skills, or the desire to build them.
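
As a tiny sketch of that naming problem, here’s what the governance decision looks like in practice: two sources carry the same field under different keys, and somebody has to pick the canonical one before the merge (all names here are hypothetical):

```python
# Sketch: reconciling two sources that key the same customer differently.
import pandas as pd

crm = pd.DataFrame({"cust_no": [101, 102], "region": ["east", "west"]})
billing = pd.DataFrame({"customer_id": [101, 102], "balance": [250.0, 80.0]})

# The governance decision: "customer_id" is the agreed canonical key.
crm = crm.rename(columns={"cust_no": "customer_id"})

merged = crm.merge(billing, on="customer_id", how="inner")
print(merged)
```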

Another one is project management. We always need good project managers. Project managers are the workhorses who really help bring the developers, the data scientists, the front-end developers, everybody together, and really get the project going, making sure that we’re communicating. If you’re interested in project management, you can do that from a non-technical perspective. One thing, though: I’ve got some material on my website where I went through the scrum master training. Think of agile development. Just like traditional application development, big data needs agile project managers as well.

Then, look at the scrum master training, but also look at DevOps and see if there are any DevOps certifications or anything in that background you can bring to help manage these teams. Project management is the second one, and then the big one, the next one, is compliance and security. We always need compliance and we always need security, especially now with the maturity of the Hadoop community and how much Hadoop is being used in the enterprise. There’s always compliance around it. Think of HIPAA, think of some of the SEC compliance here in America. Then, you can also think of GDPR, the General Data Protection Regulation. I would look at that regulation.

That’s something that’s really interesting to me, and if I were somebody non-technical who was interested in compliance or security, that is one area I would start to look at, because I think there’s going to be a growing need. Anytime there’s any kind of regulation, and this isn’t a political statement in any way, but anytime there’s any kind of regulation or change in regulation, there are a lot of things that go on behind the scenes as far as interpreting it and making sure your enterprise is in compliance, or if you’re working for some kind of public institution, you want to make sure you’re doing that. Anytime something like that happens, if you can become an expert and move to that, it would be huge as well.

The same goes for securing the data. It’s an ongoing, probably overused joke: how many data breaches have you heard about? There’s one every day. Big data is not immune to that. In fact, we’re a larger target. Think about the three Vs.

Volume: how much data do we have in your Data Lake? Big data has big data, right? You need to be able to secure it. Those are the three areas I would look at for non-technical jobs if you still want to be involved in data: data governance, project management, and compliance and security. That’s all for today. Thanks for tuning in. Make sure you subscribe, so you never miss an episode. I will see you again on Big Data Big Questions.

Filed Under: Career Tagged With: Big Data, Big Data Big Questions, career

Data Engineer vs. Data Scientist

December 27, 2017 by Thomas Henson 2 Comments


Data Engineer vs. Data Scientist

What’s the Difference Between the Data Engineer & Data Scientist

The Data Scientist has been one of the top trending career choices for the past 3-4 years, but where is the love for the Data Engineers? In reality I think most people are confused about the roles in Big Data. Data Scientist and Data Engineer are used interchangeably, but the roles require different skill sets. In this video I break down the differences between the Data Engineer and the Data Scientist.

Transcript – Data Engineer vs. Data Scientist

Thomas: Hi, folks. I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data, Big Questions. And so today, we’re going to be tackling the differences between a data scientist and a data engineer. Find out more right after this.

Thomas: So, do you want to become a data engineer, or do you want to become a data scientist? This is something we see a lot: it’s all about the data scientists, and big data, and then data engineering. But what’s the difference between the two roles, and why is it that since 2012 data scientists have been called the sexiest career in IT? There’s been a lot of publication about it. There’s been a lot of information about it. We’re all talking about machine learning, and smart cars, and the battle between Tesla and Apple for machine learning engineers and data scientists. But what about the data engineer? What are the differences? Well, I’ll tell you. Recently, InformationWeek had a survey out for 2018 on the highest paying salaries in IT. Data engineer came in at number five. The only other roles above the data engineer were all C-level positions. So, think of your CIO, your CSO, your CTO. Data engineers are getting a lot of love, too. But what are the differences between those two roles? We’ll break it down, first jumping into what a data scientist does.

So, what does a data scientist do in their day-to-day work? Well, one of the things that they do is they evaluate data. So, that’s kind of a given. But how do they do that? They use different algorithms. They look at different results from the data. So, say that we’re trying to find a way for a web application to have more users that are engaged with it. How do I create more engaging content for my users and for my customers and give them something of value? Well, a data scientist would be able to look at different variables. So, maybe we get in a room, and everyone comes up with some variables and says, “Okay, how can we improve user retention? Does this piece of content work? We’ve got some testing on these other pieces. Here’s some of our historical data.”

And so the data scientist, what they’ll do is they’ll evaluate all those data points and tell you which ones are going to be the most relevant. And they’ll do that by using algorithms. So, maybe they’ll use SVD (Singular Value Decomposition) to find out which variables are going to make the most sense for us to have more engaging content and a web application that makes users want to stay and engage with it longer. And so that’s where their role is. Now, they’re not going to be the ones writing the MapReduce jobs or doing some of the Spark jobs. We really want them just evaluating the data and helping us build different data models that are going to give us the results we’re looking for.
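
For a feel of what that looks like in code, here’s a minimal sketch of using SVD to see which directions in a data set carry the most variance, using NumPy (the data is random and the “engagement variables” are pretend; a real model would be far more involved):

```python
# Sketch: rank components of a data set by variance explained via SVD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 users, 5 hypothetical engagement variables
X -= X.mean(axis=0)            # center each variable first

U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)
print("Variance explained per component:", np.round(explained, 3))
```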

So, if we can just increase our user retention time or increase the engagement with our content, our web application is going to be more popular. In our example, that’s what we want. We want our data scientists really evaluating and finding correlations between data, and also eliminating correlations. So, this variable that we predicted was going to be key to engagement for our web application’s users? It really doesn’t make a difference. And so it gives our developers, and our engineers, and our product and marketing team things to look at and say, “Hey, these are the variables we need to focus on, and this is what’s going to give our web application the desired results we’re looking for: increase that user retention time, increase the engagement for our users in our web application.” So, that’s our data scientist.

Now, on the flip side, what is our data engineer going to do? Our data engineer, they’re the ones that are going to say, “We’ve got this data here. We’re moving the data maybe into our Hadoop cluster, or wherever we’re storing it for analytics.” And so they’re the ones that are really moving that data there. They’re also writing those MapReduce jobs or Spark jobs. They’re doing the development portion of big data. So, our data scientists are over here saying, “This is the data that I need.” The data engineer is saying, “Oh, we have the data. What kind of format…? How should we clean the data? How fast do you need the data, too?” So, how much speed is a concern for some of these variables, being able to ferret out some of the details, and being able to improve that product a little bit faster to get it to the users.

And so that’s where you’re going to see the data engineer. They’re also going to be the ones that are managing and configuring our Hive and HBase deployments and doing some of the technical debt work that we’ve talked about before with making sure that we have a strategy for backup, making sure we have a strategy for high availability. So, this product that we’ve got here for our web application, we want to make sure that we’re still feeding our data in, and our data models are feeding our data back to our data scientists. But then we’re also pushing out those results from what the data scientists have given us, too. So, you kind of see two distinct roles.

So, our data engineer, they’re going to be involved in the tech. They’re going to be the ones that are really building those systems out. Whereas our data scientists, they’re involved with the data. They’re involved with the technology as far as what tools are going to help them be able to [INAUDIBLE 00:05:20], use different algorithms, and be able to say, “This data point really makes a difference where this other data point may not be making as much of a difference.” They’re going to be using those tools for that. But basically, they’re involved in the data. And you see the data engineers involved in the technology, implementing, and kind of using that strategy.

So, I’m not saying one is better than the other, but I may be a little bit biased because I’m a data engineer and like data engineering. But two different skillsets, two very important skillsets, two amazingly great career choices right now in IT, two of probably the highest paying individual contributor roles in IT right now. So, you can’t go wrong either way. If you’re looking for more tips and more information about being a data engineer, make sure you subscribe to this channel. We explore how to do different things. If you have any questions like this, make sure you submit them: use #bigdatabigquestions on Twitter, go to my website, thomashenson.com, Big Questions, and submit your questions there, or put them in the comments below here. I’ll answer them on YouTube the best I can. Make sure you subscribe. Thanks again for tuning in, and I’ll see you next time.

Show Notes

Singular Value Decomposition

Big Data Beard Podcast Episode 13: A Lesson in Data Monetization from the Dean of Big Data

Salary for Highest Paying Tech Jobs

Filed Under: Data Engineers Tagged With: Big Data Big Questions, Data Engineer

What’s New in Hadoop 3.0?

December 20, 2017 by Thomas Henson 1 Comment

New in Hadoop 3.0

Major Hadoop Release!

Hadoop 3.0 has dropped! There is a lot of excitement in the Hadoop community for a 3.0 release. Now is the time to find out what’s new in Hadoop 3.0 so you can plan an upgrade to your existing Hadoop clusters. In this video I explain the major changes in Hadoop 3.0 that every Data Engineer should know.

Transcript – What’s New in Hadoop 3.0?

Hi, folks. I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data, Big Questions. In today’s episode, we’re going to talk about some exciting new changes in Hadoop 3.0 and why Hadoop has decided to go with a major release in Hadoop 3.0, and what all is in it. Find out more right after this.

Thomas: So, today I wanted to talk to you about the changes that are coming in Hadoop 3.0. It’s already gone through the alpha, and now we’re actually in the beta phase, so you can go out there, download it, and play with it. But what are these changes in Hadoop 3.0, and why did we go with such a major release? What all is in this one? There are two major ones that we’re going to talk about, but let me talk about some of the other ones that are involved with this change, too. The first one is more support for containerization. If you go to the Hadoop 3.0 website, you can actually go through some of the documentation and see where they’re starting to support some of the Docker pieces. This is just more evidence of the containerization of the world. We’ve seen it with Kubernetes. There are a lot of other pieces out there with Docker. It’s almost like a buzzword to some extent, but it’s really, really been popularized.

It’s a really cool change, too, when you think about it. Because if we go back to when we were in Hadoop 1.0 and even 2.0, it’s kind of been a third rail to say, “Hey, we’re going to virtualize Hadoop.” And now we’re fast forwarding and switching to some of the containers, so that’s going to be some really cool change that’s coming. Obviously there are going to be more and more changes that happen [INAUDIBLE 00:01:37], but this is really laying some of the foundation for that support for Docker and some of the other major container players out there in the IT industry.

Another big change that we’re starting to see… Once again, I won’t say it’s a monumental change, but it’s more evidence of support for the cloud. The first piece is expanded support for Azure Data Lake. So, think of the unstructured data there, maybe some of our HDFS components. And then there are also some big changes for Amazon’s AWS S3. With S3, they’re actually going to allow for easier management of your metadata with DynamoDB, which is a huge NoSQL database used on the AWS platform. So, those are two of, I would say, the minor changes. Those changes alone probably wouldn’t have pushed it to be Hadoop 3.0 or a major release.

The two major changes deal with the way that we store data and the way that we protect our data for disaster recovery, when you start thinking of those enterprise features that you need to have. The first one is support for more than two namenodes. We’ve had support since Hadoop 2.0 for a standby namenode. Before we had a standby namenode, or even a secondary namenode, if your namenode went down, your Hadoop cluster was all the way down, right?

Because that’s where all your metadata is stored, and it knows what data is allocated on each of the datanodes. Once we were able to have that standby namenode and that shared journal, if one namenode went down, you could have another one. But when we start thinking about fault tolerance and disaster recovery for enterprises, we probably want to be able to expand that out. And so this is one of the ways that we’re actually going to tackle that in the enterprise.

So, we’ll be able to support more than two namenodes. If you do some calculations, one of the examples is that if you have three namenodes and five shared journal nodes, you can actually take two losses of a namenode. So, you could lose two namenodes, and your Hadoop cluster would still be up and running, still able to run your MapReduce jobs, or if you’re using Spark or something like that, you’d still have access to your Hadoop cluster. And so that’s a huge change when we start to think about where we’re going with enterprise adoption. You’re seeing a lot of features and requests coming from enterprise customers saying, “Hey, this is the way that we do DR. We’d like to have more fault tolerance built in.” And you’re starting to see that.

So, that was a huge change. One caveat around that: it supports those extra namenodes, but they’re still in standby mode. So, it’s not what we would talk about with HDFS federation. It’s not supporting three or four different namenodes over different portions of HDFS. I’ve actually got a blog post that you can check out about HDFS federation, where I see that going, and how that’s a little bit different, too. And then the huge change… I’ve seen some of the results on this before it even came out to the alpha. I think they did some testing at Yahoo Japan. But it’s about using erasure coding for storing the data. Think about how we store data in HDFS. Remember the default of three, so three-times replication. As data comes in through your namenode, it’s moved to one of your datanodes, and then two [INAUDIBLE 00:05:04] copies are moved to a different rack on two different datanodes. And so that’s to give you that fault tolerance there. If you lose one datanode, you’re able to get your data from a separate rack and still run your MapReduce jobs or your Spark jobs, or whatever you’re trying to do with your data. Maybe you’re just trying to pull it back.

That’s how we traditionally stored it. If you needed more protection, you just bumped it up. But that’s really inefficient. Sometimes we would talk about that being 200% overhead for one portion of your data. But really, it’s more than that, because most customers will have a DR cluster, and they have it triple replicated over there, too. So, think about it: in our Hadoop cluster, we have it triple replicated. In our DR Hadoop cluster, we have it triple replicated. Oh, and the data may exist somewhere else as the source data outside of your Hadoop clusters. That’s seven copies of the data. And how efficient is that for data that’s maybe mostly archive? Or maybe it’s compliance data. You want to keep it in your Hadoop cluster.

Maybe you run [INAUDIBLE 00:06:03] over it once a year. Maybe not. Maybe it’s just something you want to hold on to so that if you do want to run a job, you can. What erasure coding is going to do is give you the ability to store that at a different rate. Instead of having to triple replicate it, what erasure coding basically does is say, “Okay, if we have data, we’re going to break it into six different data blocks, and then we’re going to store three [INAUDIBLE 00:06:27],” versus when we’re doing triple replication, think of having 12. And so the ability to break that data down and pull the data back from the [INAUDIBLE 00:06:36] is going to give you a better ratio for how you store that data and what your efficiency rate is, too.

So, instead of 200%, maybe it’s going to be closer to 125 or 150. It’s just going to depend as you scale. Just something to look forward to. But it’s really cool, because it gives you the ability to, one, store more data: bring in more data, hold on to it, and not think so much about how it’s going to take up three times the space just based on how big the file is. It gives you the ability to hold on to more data and take somewhat more of a risk and be like, “Hey, I don’t know that we need that data right now, but let’s hold on to it, because we know that we can use erasure coding and store it at a different rate. And then if it’s something we need later on, we can bring it back.” So, think of erasure coding as more of an archive tier for your data in HDFS.
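
To put rough numbers on that, here’s the back-of-the-envelope math, assuming the RS(6,3) layout described above (six data blocks plus three parity blocks) against plain three-times replication:

```python
# Sketch: storage footprint of 3x replication vs. RS(6,3) erasure coding.
data_blocks, parity_blocks = 6, 3

replication_total = data_blocks * 3          # 18 blocks stored
erasure_total = data_blocks + parity_blocks  # 9 blocks stored

print(f"3x replication: {replication_total / data_blocks:.0%} of raw size")  # 300%
print(f"RS(6,3):        {erasure_total / data_blocks:.0%} of raw size")      # 150%
```

That 150% total footprint is where the “closer to 125 or 150” figure comes from.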

And so those are the major changes in Hadoop 3.0. I just wanted to talk to you guys about that and get it out there. Feel free to send me any questions. If you have any questions for Big Data, Big Questions, go to my website, put it on Twitter with #bigdatabigquestions, or put it in the comments section below. I’ll answer those questions here for you. And as always, make sure you subscribe so you never miss an episode. Always talking big data, always talking big questions, and maybe some other tidbits in there, too. Until next time. See everyone then. Thanks.

Show Notes

Hadoop 3.0 Alpha Notes

Hadoop Summit Slides on Japan Yahoo Hadoop 3.0 Testing

DynamoDB NoSQL Database on AWS

Kubernetes 


Filed Under: Hadoop Tagged With: Big Data Big Questions, Hadoop, HDFS

Kappa Architecture Examples in Real-Time Processing

October 11, 2017 by Thomas Henson Leave a Comment

Kappa Architecture

“Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

Yes, we can build models based on real-time processing, and in fact there are some you use every day…

In today’s episode of Big Data Big Questions, we will explore some real-world Kappa Architecture examples. Watch this video and find out!

Video

Transcription

Hi, folks. Thomas Henson here with thomashenson.com. And today we’re going to have another episode of Big Data, Big Questions. And so, today’s episode, we’re going to focus on some examples of the Kappa Architecture. And so, stay tuned to find out more.

So, today’s question comes in from a user on YouTube, Yaso1977. They’ve asked: “Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

And so, I think this user’s question stems from the defense of either their master’s degree or their Ph.D. So, first off, Yaso1977, congratulations on standing for your defense and creating your research project around this. I’m going to answer this question as best I can and put myself in your situation: if I was starting out and had to come up with a research project to defend for either my master’s or my Ph.D., what would I do, and what are some of the things I would look at?

And so, I’m going to base most of these around the Kappa Architecture, because that is the future, right, of streaming analytics and IoT. It’s where we’re starting to see the industry trend. Some of the examples we’re looking for are not going to be out there just yet, right? We still have a lot of applications and a lot of users that are on Lambda, and Kappa is still a little bit more on the cutting edge.
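
Before we get to the examples, here’s a minimal sketch of the idea that makes Kappa work: the log is the source of truth, so reprocessing is just replaying the topic from the beginning with a new consumer. This assumes a Kafka broker and the kafka-python package, and the topic name is made up:

```python
# Sketch: "reprocessing" in Kappa is reading the stream again from offset 0.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # replay history to rebuild state
    enable_auto_commit=False,
)

for record in consumer:
    print(record.offset, record.value)  # feed each event to the new model version
```

That replay property is the whole pitch: instead of maintaining separate batch and speed layers like Lambda, you keep one streaming code path and rebuild state by re-reading the log.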

So, there are three main areas where I would look to find those examples. The first one is going to be IoT. In your newer IoT, or Internet of Things, workflows, you’re going to start to see that. One of the reasons is that there are millions and millions of these devices out there.

And so, you can think of any device, whether it be from a manufacturer that has sensors on manufacturing equipment, smart cars, or even smartphones: information from millions of users all streaming back in, with some kind of prediction modeling or some kind of analytics running on that data as it comes in.

And so, on those newer workflows, you’re probably going to start to see the Kappa Architecture being implemented in there. So, I would focus first off looking at IoT workflows.

Second, this is the tried and true one that we’ve seen all throughout Big Data since we started implementing Hadoop: fraud detection, specifically with credit cards and some of the other pieces. Look at that from a security perspective. I mean, we just had the Equifax data breach and so many other ones.

So, I would, for sure, look at some of the fraud detection around, say, some of the major credit card companies and see what they’re doing and what they have published around it. Because just like in our IoT example, we’re talking millions and millions, maybe even billions, of users all having multiple transactions going on at one time. All that data needs to be processed and logged, and we’re looking for fraud detection. That needs to be pretty quick, right? Whether you’re inserting your chip card or swiping your card, you need to know whether that transaction is fraudulent in the moment it’s about to happen, right?

So, it has to be done pretty quickly, and so it’s definitely a streaming architecture. My bet is there are some people out there already using that Kappa Architecture.

And then another one is going to be anomaly detection. I’m going to break that into two different ones. The first is anomaly detection for security, from insider threats. Think of being able to catch insider threats in your organization: people who are maybe trying to leak data or give access to people who shouldn’t have it. Those are things that happen in real time, and the faster you can make that decision, the faster you can predict that somebody is an insider threat or doing something malicious on your network, the less damage is going to be done to your environment.

And then, also, anomaly detection for manufacturers. We talked a little bit about IoT, but also look at manufacturing. There’s a great example, and I would say that for your research, one of the books you want to look into is Introduction to Apache Flink. There’s an example in there from a manufacturer, Ericsson, who implemented the Kappa Architecture. And what they have is, I think, something like 10 to 100 terabytes of data that they’re processing at one time. They’re looking for anomaly detection in that workflow to see: are there sensors, are there certain things happening that are out of the norm, so that maybe they can stop a manufacturing defect or predict something that’s going to go wrong within their manufacturing area? And then also externally, when the users have their devices, be able to predict those, too.

So, those are the three areas that I would check. Definitely check out Introduction to Apache Flink; there’s a lot of talk about the Kappa Architecture in there. Use that as one of your resources, and be able to pull out some of those examples.

But remember, the three areas I would really key in on are IoT; fraud detection, so look at some of the credit card companies or other fraud detection; and then anomaly detection, whether it be insider threats or manufacturing.

So, that’s the end of today’s episode of Big Data, Big Questions. I want to thank everyone for watching. And before you leave, make sure that you subscribe, so you never miss an episode and never miss any of my Big Data tips. So, make sure you subscribe, and I will see you next time. Thank you.

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, IoT, Kappa

Big Data Big Questions: Learning to Become a Data Engineer?

September 22, 2017 by Thomas Henson 2 Comments

Learning to Become a Data Engineer

Data Scientist has been named the sexiest job in IT for the past few years. However, the Data Engineer is a huge part of the Big Data movement. The Data Engineer is one of the top paying jobs in IT. On average a Data Engineer can make anywhere from $90K to $150K a year.

Data Engineers are responsible for moving large amounts of data, administering the Hadoop/Streaming/Analytics Cluster, and writing MapReduce/Spark/Flink/Scala/etc. jobs.

With all this excitement for Data Analytics and Data Engineers, how can you get involved in this community?

Ready to learn tips for becoming a Data Engineer? Check out the video below.


Transcript

Hi Folks, I’m Thomas Henson, with thomashenson.com, and welcome back to another episode of Big Data, Big Questions. Today’s question is: What are some tips for learning to become a better data engineer? Find out more right after this.

So, today’s episode is all about tips for learning to become a better data engineer. So, if you’re watching this, you’re probably concerned with, one, how can I start out becoming a data engineer? What are some ways that I can learn to become better? Or maybe you’re just looking to answer one specific question. But all those are encompassed in what we call the data engineer.

A data engineer is somebody who’s concerned with moving data in and out of the Hadoop ecosystem, and with giving data scientists and data analysts better views into the data. So, we’re involved with the day-to-day interactions of how that data is coming in. How are we ingesting that data? How are we creating those applications and tuning those applications so that the data comes in faster? All to support those business analysts, those business decisions, and data scientists in creating better models and having more data to put their hands on.

And so, a lot of times we're asked to take on a couple terabytes of data here, or implement and do all the configuration for a Hive implementation, or HBase, or anything else in that big data ecosystem. Some of the tips that I've found for just getting started – if you're brand new to this and you don't know where to start – the first thing I would recommend is to go out and just download the sandboxes.

So, download Cloudera's sandbox, or download Hortonworks' sandbox, and just start playing with it. Go through some of the tutorials. Stand it up on your local machine in a VM environment, and just start playing with moving some of the data around. Find some sample data – go to data.world. Also, I have a post and a video on where to find data sets, so take those data sets and start ingesting them. I have a ton of resources and material on simple examples that you can walk through with Pig, and some around Hive. So, go there and find some of those. But basically, what I'm saying is, just get hands-on. Start creating applications. Start trying to do some simple things: ingest some data, put it into Hive, create a table, and pull some of that data back out with maybe some simple Hive queries. Do the same thing with Pig, and just go around to some of those applications you're curious about and start playing with them.
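Here's a hedged sketch of that first ingest-and-query loop in PySpark, in case you'd rather start from Python than from Pig or Hive directly. The file path and table name are hypothetical – use any data set you pulled from data.world:

from pyspark.sql import SparkSession

# Local session inside the sandbox VM; Hive support lets tables persist.
spark = (SparkSession.builder
         .appName("first-ingest")
         .enableHiveSupport()
         .getOrCreate())

# Ingest a sample data set.
df = spark.read.csv("/data/sample.csv", header=True, inferSchema=True)

# Land it as a table, then pull some of it back out with a simple query.
df.write.mode("overwrite").saveAsTable("sample_data")
spark.sql("SELECT COUNT(*) AS row_count FROM sample_data").show()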

Another thing is, once you start playing, and sampling, and testing that data, get involved. By getting involved, I mean just ask some questions, create a blog post, try to find a way that you can contribute back to the community. That's what I did when I was first starting out. I started off with a sandbox, and I made sure that every day, for 30 minutes, I was learning something new in the Hadoop ecosystem. And so, that's another tip for you: try to do this 30 minutes a day, every day. Even Saturdays and Sundays. Don't take a day off. It's only 30 minutes, and if it's something you're passionate about and like doing, that time is just going to fly by. Over time, that's going to give you more and more time in the Hadoop ecosystem. So, whether you're doing this for a project at work, or whether you're already in the ecosystem and just trying to improve, that 30 minutes a day is really going to help. It's something I've continued to do, even though I've been part of the community for three or four years now. It's how I continue to learn, so I make sure I'm always pushing.

Filed Under: Career Tagged With: Big Data, Big Data Big Questions, Data Engineer

Big Data Big Questions: Kappa Architecture for Real-Time

August 7, 2017 by Thomas Henson Leave a Comment


Should I Use Kappa Architecture For Real-Time Analytics?

Analytics architectures are challenging to design. If you follow the latest trends in Big Data, you'll see a lot of different architecture patterns to choose from.

Architects have a fear of choosing the wrong pattern. It’s what keeps them up at night.

What architecture should be used for designing a real-time analytics application? Should I use the Kappa Architecture for real-time analytics? Watch this video and find out!


Transcript

Hi, I’m Thomas Henson, with thomashenson.com. And today is another episode of Big Data, Big Questions. Today’s question is all about the Kappa architecture and real-time analytics. So, our question today came in from a user, and it’s going to be about how we can tackle the Kappa architecture, and is it a good fit for those real-time analytics, for sensor networks, how it all kind of works together. Find out more, right after this.

So, today's question came in from Francisco. And it's Francisco from Chile, and he says, "Best regards from Chile." So, Francisco, thanks for your question, and thanks for watching. His question is, "Hi, I'm building a system for processing sensor network data in near real-time. All this time, I've been studying the Lambda architecture in order to achieve this. But now, I've run into the Kappa architecture, and I'm having trouble deciding between them." What he wants to do is analyze this sensor data in near real-time. So, as the data is coming in from the sensors, he wants to obtain knowledge and then push that out into some kind of UI – some kind of charts and graphs. And he's asking, do I have any suggestions about which of these architectures I would recommend for him? Well, thanks again, Francisco, for your question. And yes, I have some thoughts about how we should set up that network. But first, let's review, real quick, what we've talked about in previous videos: what the Lambda architecture is, what the Kappa architecture is, and then how we'd implement them.

So, if you remember, in the Lambda architecture we have two different streams. We have a batch-level stream, and we have a real-time stream. As your data comes in, it might come in through a queueing system like Kafka – an area where we're just queueing all the data as it arrives. For the real-time side, you follow that real-time stream, so you might use Spark, or Flink, or some other real-time processing engine that does the analytics and pushes results out to your dashboards as the data arrives, right? As soon as that data comes in, you want to analyze it as quickly as you can – that's what we call near real-time. But you also have your batch layer, for your batch processing and for storing the data. Because, at some point, your queueing system – whether it's Kafka or something else – is going to get very, very large, and some of that data is going to be old; you don't need to keep it in an area where you can stream it out and analyze it all the time. So, you want to be able to tier, or move, that data off to, maybe, HDFS or S3 object storage. From there, you can use your distributed search, or keep it in HDFS and use Cassandra, or maybe HBase, or some other NoSQL database that works on top of Hadoop. And you can also run your batch jobs there – your MapReduce jobs, whether that's traditional MapReduce or Spark's batch-level processing. But, you have two layers.
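To make that "two layers" point concrete, here's a minimal Python sketch – the sensor and value field names are hypothetical. The speed layer and the batch layer compute the same totals, but you have to write and maintain both:

# Speed layer: update an incremental view as each event arrives.
def speed_layer(event, realtime_view):
    key = event["sensor"]
    realtime_view[key] = realtime_view.get(key, 0) + event["value"]

# Batch layer: a second codebase that periodically recomputes the same
# answer from everything sitting in long-term storage (HDFS, S3, ...).
def batch_layer(stored_events):
    batch_view = {}
    for event in stored_events:
        key = event["sensor"]
        batch_view[key] = batch_view.get(key, 0) + event["value"]
    return batch_view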

And that's one of the challenges with the Lambda architecture – you have these two different layers, right? You're supporting two levels of code, and for a lot of the data that's coming in, maybe you're just using the real-time path, while the batch processing is used once a month. But you're still having to support those two different levels of code. And that's why we talk about the Kappa architecture. The Kappa architecture simplifies it. As your data comes in, you have one queueing system, or one storage layer, where your data lands. You can do your analytics on it – your real-time processing – and push that data out to your dashboards, your web applications, or however you're trying to consume it. You can do your distributed search there as well: if you're using Elasticsearch, or Solr, or some other distributed search, you can analyze that data and support real-time search, too. You might use Spark or Flink for your real-time analytics, but you also want to do your batch. So, you're going to have some batch processing that needs to be done, but instead of creating a whole other tier, you do it within that same queueing system. And so, whether you're using Kafka, or whether you're using Pravega – a new open-source project recently released by Dell – you want all that data in one spot, so that when you're queueing that data, you know it's going to be there, and you can also do your analytics on it. You can use it for your distributed search and for those streaming analytics jobs, but whenever you go back to do some of your batch, or some of your traditional processing, you know it's in that same location, too. That way, there's not as much redundancy – you're not storing data in multiple locations and taking up more room than you really need.
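By contrast, here's the same computation sketched Kappa-style in Python – one replayable log and one codebase, where batch is just a replay from the start. The log list is a toy stand-in for Kafka or Pravega:

log = []  # stand-in for a replayable log like Kafka or Pravega

def ingest(event):
    log.append(event)  # every event lands in one ordered place

def process(from_offset=0):
    # One codebase for both modes: real-time work tails the log as
    # events arrive, while "batch" is simply a replay from offset 0.
    totals = {}
    for event in log[from_offset:]:
        key = event["sensor"]
        totals[key] = totals.get(key, 0) + event["value"]
    return totals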

And so, this is what we call the Kappa architecture, and this is why it's so popular right now: it simplifies that workstream. So, when we start deciding between the two – back to Francisco's question – Francisco, your application seems to have a real need for real-time, right? There's a lot going on in that network, and a lot of traffic coming in. And this is where we break down a couple of different concepts: bound and unbound data. A bound dataset is one where we know how much data is going to come in, or where we wait and do the processing on the data after it's already come in. When you think of bound data, think of sales orders, think of inventory numbers – largely what we would consider transactional data. We have all the data as it comes in, and then we run the calculation. But your data is unbound. With unbound data, you don't know how much data is coming in, and it's infinite – it's not going to end. With network traffic, you don't know how long it's going to keep going. The traffic is going to continue to come in: you might get one terabyte at one point, you might get ten terabytes, you might scale up all in one second. And as the data comes in, it might come in uneven – you might have some data that's timestamped a little earlier than other data arriving alongside it. That's what we call unbound data.
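One common way to make an unbound, out-of-order stream computable is event-time windowing – the idea engines like Flink and Beam are built around. Here's a minimal, hypothetical Python sketch of tumbling windows:

from collections import defaultdict

def tumbling_windows(events, window_secs=60):
    # Bucket events by their own timestamp, so a late-arriving event
    # still lands in the window it belongs to.
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % window_secs)
        windows[window_start].append(value)
    return windows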

And so, for unbound data, the Kappa architecture works really well. It also works really well for bound data, too. So, looking at your project, my recommendation is to use the Kappa architecture – go ahead and use it, because you're working with real-time data. And then, for those batch workloads – and I'm sure you'll start having some pieces of processing that are batch – you can consume those in the Kappa architecture as well. There are some things you can look into: for streaming analytics, you can choose Spark Streaming, or look at Flink or Beam – those are some of the frameworks you can use. You can also use distributed search, like Solr or Elasticsearch. All of those are going to work well whether you choose the Kappa architecture or the Lambda architecture. My recommendation is, go with the Kappa architecture.

Well, thanks, guys – that's another episode of Big Data, Big Questions. Make sure you subscribe, so that you never miss an episode. If you have a question you'd like answered on Big Data, Big Questions, just go to the website, put your comments below, or reach out to me on Twitter – however you want. Submit those questions, and I'll answer them here on Big Data, Big Questions. Thanks again, guys.

Filed Under: Big Data Tagged With: Big Data Big Questions, Kappa, Streaming Analytics

What is a Data Lake

July 23, 2017 by Thomas Henson Leave a Comment


Explaining the Data Lake

The Enterprise space is notorious for throwing around jargon. Take the term Data Lake, for example. Does it mean there's a real lake in my data center? Because that sounds like a horrible idea. Or is a Data Lake just my Hadoop cluster?

Data Lakes, or Data Hubs, have become mainstream in the past two years because of the explosion in unstructured data. However, one person's Data Lake is another's data silo. In this video, I'll put a definition and strategy around the term data lake. Watch this video to learn how to build your own Data Lake.

Transcript

Thomas Henson: Hi, I’m Thomas Henson, with thomashenson.com. And today is another episode of “Big Data, Big Questions.” So, today’s question we’re going to tackle is, “What is a data lake?” Find out more, right after this.

So, what exactly is a data lake? If you've ever been to a conference, or if you've ever heard anybody talk about big data, you've probably heard them use the term "data lake" – and if they didn't say data lake, they might have said data hub. So, what do we really mean when we talk about a data lake? Is that just what we call our Hadoop cluster? Well, yes and no. Really, when we talk about a data lake, we want an area that holds all of the data we want to analyze. It can be the raw data that comes in off of, maybe, some sensors, or it can be our transactional sales data – quarterly, yearly, historical reporting – all in one area, so that we can analyze and bring that data together.

And when we talk about the data lake – and where that term comes from – it actually comes from a blog post that was published some years ago. I apologize for not remembering the name, to give credit, but what they talked about is this: when we look at unstructured data and data that we're going to analyze in Hadoop, in the real world it's more like a lake. We call it a lake because it's uneven – you don't know how much data is going to be in it, and it's not a perfect circle. If you've ever gone to a lake, it doesn't have a specific shape. And think about the way data comes into it: you might have some underground streams, some above-ground streams, or some water that runs in from rain – runoff – all coming into it. Compare that to what we've traditionally seen with structured data in our enterprise data warehouse: that's more like bottled water, right? It's perfect; we know the exact amount that goes into each bottle. It's all packaged up, it's been filtered, and we know its source – it's really structured. But, in the real world, data exists unstructured, and to get to the point where we have that bottled water – that structured data all packaged tight so we can send it out – you need to take it from unstructured to structured.

And so, this is where the concept of the data lake comes in: being able to keep all this unstructured data in the form it already exists in. Your unstructured data is there, and it's available to all analysts. So, maybe I want to run a science experiment and test out some different models with some historical data and some new data that's coming in. My counterpart, maybe in another division or just on another project, can use the same data. We all have access to that data for experiments, so that when the time comes to support, maybe, our enterprise dashboards or applications, we can push that data back out in a more structured form. Until we get to that point, we really want that data all in one central location, so that we all have access to it. Because if you've ever worked in certain organizations, you'll go in and they all have different data sets that may be the same. I've sat on different enterprise data boards, in different corporations and on different projects, just because we all have the same data but don't call it the same thing. And that really prohibits us from being able to share the data.
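As a hedged illustration of that raw-versus-structured idea, here's a small Python sketch of a two-zone layout. The paths, zone names, and fields are hypothetical conventions, not a standard:

import json
import pathlib

RAW = pathlib.Path("datalake/raw/sensors")          # data exactly as it arrived
CURATED = pathlib.Path("datalake/curated/sensors")  # the "bottled water" zone

def land_raw(name, payload):
    # Keep the original untouched so any analyst can experiment with it.
    RAW.mkdir(parents=True, exist_ok=True)
    (RAW / name).write_bytes(payload)

def curate(name):
    # Push a structured, trimmed-down copy out for dashboards and apps.
    record = json.loads((RAW / name).read_text())
    CURATED.mkdir(parents=True, exist_ok=True)
    slim = {"sensor": record.get("sensor"), "value": record.get("value")}
    (CURATED / name).write_text(json.dumps(slim))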

And so, from a data lake perspective, don't just think of your data lake as your Hadoop cluster, right? You want it to be multi-protocol, you want different ways for data to come in and be accessed, and you don't want it to just become another data silo. That's what we mean when we talk about a data lake or data hub. It's our analytics platform, but it's really where our company's – our corporation's – data exists, and it gives us the ability to share that data as well.

So, that's our big data big question for today. Make sure you subscribe so that you never miss an episode. And if you have any questions you want me to answer, go ahead and submit them: go to thomashenson.com, Big Data Big Questions, and submit them there; find me on Twitter; or submit your questions here on YouTube – however you want. Just ask those questions, and I'll do my best to get back and answer them. Thanks again.


Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, Data Lake, Enterprise

Big Data Big Questions: Do I need to know Java to become a Big Data Developer?

May 31, 2017 by Thomas Henson 1 Comment


Today there are so many applications and frameworks in the Hadoop ecosystem, most of which are written in Java. So does this mean anyone wanting to become a Hadoop developer or Big Data Developer must learn Java? Should you go through hours and weeks of training to learn Java to become an awesome Hadoop Ninja or Big Data Developer? Will not knowing Java hinder your Big Data career? Watch this video and find out.

Transcript Of The Video

Thomas Henson:

Hi, I’m Thomas Henson with thomashenson.com. Today, we’re starting a new series called “Big Data, Big Questions.” This is a series where I’m going to answer questions, all from the community, all about big data. So, feel free to submit your questions, and at the end of this episode, I’ll show you how. So, today, the first question I have is a very common question. A lot of people ask, “Do you need to know Java in order to be a big data developer?” Find out the answer, right after this.

So, do you need to know Java in order to be a big data developer? The simple answer is no. Maybe that was the case in early Hadoop 1.0, but even then, there were a lot of tools being created – like Pig, and Hive, and HBase – that all use different syntax so that you can kind of abstract away Java. Because the key is, if you're a data analyst or a Hadoop administrator, most of those people aren't going to have Java skills. So, for the community to really move forward with big data and Hadoop, we needed to be able to say that it was a tool set that not only Java developers were going to be able to use. That's where Pig, and Hive, and a lot of those other tools came in. Now, as we look at Hadoop 2.0 and Hadoop 3.0, it's really not the case.

Now, Java is not going to hinder you, right? So, it’s going to be beneficial if you do know it, but I don’t think it’s something that you would want to run out and have to learn just to be able to become a big data developer. Then, the question is, too, when you say big data developer, what are we really talking about? So, are we talking about somebody that’s writing MapReduce jobs or writing Spark jobs? That’s where we look at it as a big data developer. Or, are we talking about maybe a data scientist, where a data scientist is probably using more like R, and Python, and some of those skills, to pull their insights back? Then, of course, your Hadoop administrators, they don’t need to know Java. It’s beneficial if they know Linux and some of the other pieces, but Java’s not really necessary.

Now, I will say, in a lot of this technology... So, if you step outside the Hadoop world and look at Spark – Spark has Java, so you can write your Spark jobs in Java, but you can also do it in Python and Scala. So, it's not a requirement to have Java. I would say there are a lot of big data developers out there who don't have any Java skills, and that's quite okay. So, don't let that hinder you. Jump in, join an open-source community project, do something to expand your big data knowledge and become a big data developer.
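To make that concrete, here's a minimal PySpark sketch – a word count, the classic "hello world" of MapReduce – written entirely in Python. The input path is made up:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("no-java-needed").getOrCreate()

# Read text, split lines into words, and count each word -- no Java anywhere.
words = (spark.read.text("input.txt")
         .select(explode(split(col("value"), r"\s+")).alias("word"))
         .groupBy("word")
         .count())
words.show()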

Well, that’s all we have today. Make sure to submit your questions. So, I’ve got a space on my blog where you can submit the questions or just submit them here, in the comments section, and I’ll answer your big data big questions. See you again!


Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, Hadoop, Learn Hadoop
