Thomas Henson


What is a Chief Data Officer?

January 26, 2021 by Thomas Henson

Learning about the Chief Data Officer Role

In the last 10 years a new role has joined the C-suite, and this role is all about making the most of data. The Chief Data Officer now sits as a C-level leader whose role is to ensure the company has the right strategy for data. The average salary for this role is between $118K and $300K+, depending on which company you land at. Here at Big Data Big Questions we have talked at length about the different roles in data, but today we are going to focus on the leadership roles in Big Data.


Chief Data Officer

The Role of the Chief Data Officer with the Dean of Big Data

Make sure to watch the full video interview with Bill Schmarzo on The Role of the Chief Data Officer.

Transcript from Otter.ai

Folks, Thomas Henson here with another episode of Big Data Big Questions, still jumping into that interview section where we're going through careers today. Another amazing guest; folks may know this person as the Dean of Big Data, Bill Schmarzo. If you've been following my channel for a while, I've done a book review of his. And if you've been following me and you were at DataWorks Summit, I think it was
2017, he was on stage, I had a breakout session, a lot of good stuff. So we talk about that, bring up some old memories, talk about when we actually used to work together, and then talk about his newest book. So we'll dive into his newest book. And let me just say this: this episode, make sure you tune in, because he's going to talk about what he thinks about data science versus data engineering and the career outlook for those there at the end. I didn't plan that; it's not that I'm trying to make, you know, a teaser so you have to tune in, which you should, the whole time. And then second off, if you're interested in any kind of career, and maybe you're like, hey, you know what,
I'm ready to move on to the next stage in my career, or I want to get, you know, into the C-levels, or I want to be maybe less technical but more business driven, listen to this episode, where he talks about the career outlook for the chief data officer, his thoughts around that vision, which companies are doing it well, how many companies aren't. So many different nuggets that you can take from this; make sure that you tune in. And then also my request to everybody watching is, one, put in the comment section here below what you think about these interviews. Second off, send me your ideas for who should be on, or maybe you want to come on for the interview series; reach out to me, put it in the comment section here. And if you haven't been tuning in, and this is your first time on this channel, it's all about data analytics, but also about careers in IT and in tech. That's what some of these interviews are, because data is actually touching all of those; I mean, we've had interviews about marketing. So I will stop talking so we can get into this amazing interview.
All right, so 1-2-3. Hi, folks, thanks for joining today. We are excited, super excited. We have Bill Schmarzo on today. Bill, say hi to the Big Data Big Questions audience. Hey, Thomas, and hi, Big Data Big Questions audience. Glad to be here, man. So I think the last time I saw you in person, we were on stage, back when they used to have these things called conferences that you went to, and I think you were coming offstage. You did a high five and a little air guitar. I think it was Hortonworks, I think it was the Hortonworks DataWorks Summit, something like that. So yeah, how have you been? I've been doing well, I've been keeping busy and in trouble, which is what you're supposed to do, right? Right. Yeah, no, it's good, man. So folks in the audience, if you've been following the channel for a long time, I actually did a review on Bill's, I think this was your second book, Bill, the Big Data MBA. So Bill's got a new book out. But Bill, for the folks who haven't watched that amazing video, or haven't heard of you, why don't you give us an intro, tell us who you are, what you do, and a little bit of your background? Sure. So background wise, probably 40 some years in the data and analytics space. Lots of Forrest Gump moments in my life, in the right place at the right time. Not because I'm tall or good looking or from Iowa; sometimes you just get lucky and are in the right place. So in the late 1980s, I was running a project with Procter and Gamble and Walmart that ended up being sort of the very first data warehouse BI project. I spent 25 years in the data warehouse space, and I was then recruited away from Business Objects by Yahoo to head up and be their vice president of advertiser analytics. That was at the time that Yahoo was developing this technology called Hadoop. And so I made the transition from a BI person to a data scientist.
I teach at the University of San Francisco, where I'm an executive fellow, and I also teach at the National University of Ireland in Galway, where I'm an honorary professor.
And currently I'm in between gigs. I just left my last gig, where I was the Chief Innovation Officer, which was a most excellent adventure. It opened up all kinds of new domains and experiences, a lot of which I captured in my new book; things I hadn't thought about before all of a sudden became realized as important. Part of that realization was on the AI/ML side, what we could do from an economics perspective, but also the power of teams, the power of empowering teams, and how these go together.
So Bill, I want to jump right into this because you and I talked a little bit before this and we’ve talked in the past and you know
I think for my audience, we talk a ton about the technical, and those are a lot of the questions that I get. But there's so much power in understanding the business side. So diving right in, you talk about the chief data officer; for my audience, or for anybody, what does a chief data officer do? What is the CDO role? Yeah. So I think the chief data officer for most organizations has become a mini-me CIO. And what I mean by that is, I think the role of chief data officer in most organizations is not very fun, or creative, or provocative. And I'm on a mission to actually change the nature of that role. I want that role to become the chief data monetization officer, because I believe that organizations need to have a senior executive whose single focus is figuring out, how do I get value out of my data? So I'm on this crusade trying to get organizations to realize that there's all kinds of unique economic value associated with data and analytics. And the chief data monetization officer probably doesn't and shouldn't have a technology background; I'm going to argue their background should be economics, because think about what economics is about. Economics is about the creation of wealth and value, and how you use assets to create value. Well, that's what data is all about. And you use analytics to convert raw data into valuable, actionable customer, product, and operational insights that you can use to drive new sources of value.
So, Thomas, my mission has been this: the chief data officer today looks like a CIO kind of person, and I don't want that person. I want a chief data monetization officer who lives and breathes this, who wakes up every morning and says, my job is to figure out how to get value out of that data, and who is charged with integrating across the entire organization, not just within IT, but with sales and marketing and operations and engineering and everybody else, to figure out where and how we can use data to derive new sources of value. No, I like that. So I'm gonna dig into this in a couple different ways. In that role, who does that person report to now, and where do you think that structure should be in the organization, in Bill's most excellent adventure for your organization? In Bill's most excellent adventure, which is great, I think this role reports to the CEO, right away. So why don't you just make them the CEO themselves, right? Well, I would do that, but I don't want to get bogged down with all the crap that goes on with dealing with stockholders. No, this is a role that needs to sit at a level where this person can easily step across the organizational boundaries, and can help organizations to leverage, exploit, reuse, share, and refine these data and analytic assets. If they're buried underneath IT, they'll never get anywhere, because no one takes IT seriously from a business perspective. You stick them in finance or marketing or someplace, and you've automatically put them into a box. And this is the problem: in most organizations, we tend to want to put people in boxes. And once you're in a box, it's like a friggin' cage you can't get out of.
This person needs to have the authority to be able to walk across and show sales, marketing, finance, product management, and engineering how this one data source, for example, can power some of their key use cases, and can drive that collaboration across the different business units, so that they all can share and reuse the same datasets over and over again. So in Schmarzo's Most Excellent Adventure, this person reports high in the organization, and is charged with driving the overall use and monetization, across the organization, of these very valuable and economically unique data and analytic assets. Let's dwell on this for a second, because this is why this is such an important conversation. This is why I think the person who runs this role needs to have more of an economics background than a technology background. Here's the reason why: data as an economic asset never wears out, never depletes, and the same data set can be used across an unlimited number of use cases at a marginal cost equal to zero. Now think about that. Marginal cost equals zero: I have this asset, I can use it over and over and over again, and it never wears out. Data is not the new oil; no, data is like the new sun. It never goes away; it's always provided energy for us. And so first off, from a data perspective, the things that destroy and hinder the economic value of data are data silos. If you can't share data across the organization, you can't take advantage of that economic multiplier effect; you can use that data over
over and over again at a marginal cost equal to zero. So that's number one: data, from an economics perspective, is unlike any other asset we've ever seen in our lives. And we tend to use an accounting mindset to try to put it into a box. No, don't put it in a box, no boxes; we want swirls, and to let this thing swirl across the organization driving value. Now, here's part two. While data is this very unique asset that can be used over and over again, analytics, if engineered correctly, will actually get more valuable the more it is used, right? Think about an asset you could have, maybe a car, maybe a Tesla. I love Elon Musk. He made this provocative statement, which most people still don't understand: he said, I believe that when you buy a Tesla, you are buying an asset that appreciates in value, not depreciates. Now, all the accounting people go, what the hell does that mean? Assets? You depreciate assets, you write them off. Yeah, real estate, 27 and a half years, over time, right? He said, nope, wrong model, wrong frame, you've already lost the game, you're thinking the wrong way about it. He says, I can build an asset, by the use of AI, across a million Tesla cars: these cars are continuously learning, every time they turn a corner, every time they go down the road, they're continuously learning. And every night, the learning from each one of those million cars gets sucked up to the Tesla cloud in the sky, gets aggregated, and then gets propagated back to every car. So anything that one car learns about a particular driving situation, now each of those million cars has learned as well. That's amazing: you can build these autonomous assets that continuously learn and adapt.
And many times the learning and adapting occurs with very minimal human intervention. So this is why your chief data monetization officer doesn't necessarily need to have a technology background, but needs to have an economics background, figuring out: how do I leverage it? How do I use that to reinvent my business models? How do I use it to disintermediate customer relationships? How do I use it to totally redesign not just my business model, but to disrupt the entire industry's business model? I love it. I love the passion; you gave me like five questions that I just wrote down that
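The fleet-learning loop described here, local learning on each car, nightly aggregation in the cloud, and a merged model pushed back to the whole fleet, resembles what the machine-learning literature calls federated averaging. Below is a minimal toy sketch of that loop; the `local_update` drift, the fleet size, and the three weights are invented for illustration, and this is in no way Tesla's actual pipeline.

```python
import random

def local_update(weights, lr=0.1):
    """Stand-in for one day of on-vehicle learning (random drift here)."""
    return [w + lr * random.uniform(-1, 1) for w in weights]

def federated_average(fleet_weights):
    """Average each weight position across the whole fleet."""
    n = len(fleet_weights)
    return [sum(ws) / n for ws in zip(*fleet_weights)]

random.seed(0)
global_model = [0.0, 0.0, 0.0]
fleet = [list(global_model) for _ in range(5)]       # five cars

for night in range(3):                               # three aggregation rounds
    fleet = [local_update(car) for car in fleet]     # each car learns locally
    global_model = federated_average(fleet)          # cloud merges the learning
    fleet = [list(global_model) for _ in fleet]      # merged model pushed back
```

After each round, every car carries the averaged model, so whatever any one car "learned" is shared by all, which is the economic point being made: the asset gets better the more it is used.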
You put a nickel in, I mean, you got me going.
But so today, right, we're not in Bill's excellent adventure, and that's why we're here today to talk, hopefully, to change that culture. So typically the CDO will report to the CIO. And is that where we get challenges, where it comes under IT functions? So marketing may be resistant, because they're like, I don't want to deal with the CIO's organization, or engineering is like, we have our own kind of functionality. So it's more political, or just organizational structure? Bingo. Yeah, you know, the CIO, the IT organization, has always been a cost center, not a profit center. And so the mindset around it has never been, how do I leverage that organization to derive new sources of value?
So when you put the CDO or the chief data monetization officer into that spot, I mean, they don't look like a CIO at all; nothing they do looks like what the CIO does, but yet we've got this mini-me CIO stuck in the CDO role, and they think their job is to manage data. You know, your job is not to friggin' manage data, your job is to friggin' monetize it. So there's a total mind shift that needs to take place. Now, I'll tell you right now, Thomas, there's only a handful of companies out there that get this. Yeah, but, you know, if you look at the stock market, you look at the top five or six most valuable companies out there, and you look at the amount of goodwill in those companies that comes from this monetization, you can very quickly figure out who these companies are, who have cracked the code and are going: this is great, leveraging data and analytics to drive my business use cases is like printing money. If no one else figures it out, that's too bad; I just keep growing and getting more powerful. Yeah, so that was one of my questions, is how many companies are doing it right? So you've got me thinking here now. For those companies that are doing it right, my head goes straight to career path, and before I ask the questions about what we think a good CDO is, the way that you're explaining it to me is, and please, I do not want to be a CEO, anybody, so don't get me wrong, but
the way that you're kind of painting it to me is, as an investor or as a board member, if I were looking at the next CEO, I would want somebody that maybe came from one of those organizations; that could be a natural step for a CDO or chief data monetization officer. Similar to how for so long you've seen CFOs moving into the CEO role, is it fair to make the statement that a CDO could go to CEO? I think, I mean, probably the best chief data monetization officer out there is Elon Musk. Yeah, right. And, you know, places like Google, Amazon, masters of this, Apple, Microsoft in parts of their business, not all parts of their business. These companies realize that they are not just in the data business; they're in the data monetization business, and they've cracked the code. I mean, think for a second about Google and TensorFlow. I was joking earlier about TensorFlow, right? Exactly. The single most important technology that Google has, and they open sourced it. Now 99.99% of the financial analysts out there are going, what the heck? Why would you ever open source your most valuable technology? And here's the reason why, in my humble opinion, Google did it. It's because in knowledge based industries, the economies of learning are more powerful than the economies of scale. So by having everybody out there using TensorFlow across a wide variety of use cases, TensorFlow just gets smarter and smarter and smarter. And who is the best at leveraging TensorFlow to drive data monetization? Google. So all their competitors using their product are just helping Google to print more money. No, I love that, and I totally agree. You've almost stolen my sales pitch, or training pitch, that I give to people. The emphasis I put on it, too, is
you and I know what Google uses, right? They use TensorFlow, they've open sourced it, and it's a great product. But even if you and I were the most proficient TensorFlow people on the planet, and we didn't work for Google, we don't have the data stores that Google has. Right, right. And so they're able to make their product better, kind of like what you were saying with Tesla, but you don't have the data elements to act upon it. They have the data sources, but they also have a different mindset. Their data scientists are not like normal data scientists; they give them a level of training that I don't think the average data scientist would ever appreciate or understand. I mean, most of their senior data scientists at Google are taught design thinking. You're gonna think, design thinking? That has nothing to do with building neural networks. But it's true, right? They understand, in detail, what it is they're trying to do first, and then figure out what data they need; not, oh, here's the data I have, so what problems can I solve? No. How do you distinguish signal from noise if you don't know the problem you're trying to solve? And I think what you see from Google, at least in the folks I've met there, and I've not met a lot, but the ones I've met have been pretty impressive, is a laser focus on trying to figure out, what is it we're trying to do, and then what data do we need to support that? They've reversed the process. Everybody else's approach is, go gather a bunch of data and then say, here, tell me what's valuable in the data. Well, again, how do you distinguish signal from noise in the data if you don't know what you're trying to do? So they've taken it.
And yeah, they've got brilliant tools and they've got great datasets, but they have a different mindset. It's kind of like what Elon Musk did; he's got this whole, I mean, if you want to change the game, change your frame. Look at something different than your competition does, look at it differently than your competition, and you've got a chance of providing some very unique, differentiated, compelling value. No, I think that's important. And one of the things is, you know, I come more from the software side, as a software engineer. And we say this all the time, but we don't act upon it; I don't even act upon it sometimes, right? We think, first and foremost, what is the new framework I can use, versus what is the right framework for the job. Same thing in data, right? Just flipping it: what do we want to solve, and let's go find the data elements to solve that. So I'm going to give a homework assignment to your listeners, and I'll provide the link for it. We developed a design thinking tool called the Hypothesis Development Canvas. And what it does is it articulates the problem we're trying to solve before we ever put science to the data. So: what problem are you trying to solve?
What are the metrics and KPIs against which you're going to measure success and progress? Who are the key stakeholders that need to buy in? What are the decisions you're trying to drive, the predictions to support them, the data sources we might need? And even, probably the most important part: what are the costs of the false positives and false negatives? If you don't know the costs of the false positives and false negatives, you never know if your model is good enough. But yet we sort of let that kind of flutter through. So we built this Hypothesis Development Canvas, and we do it through an envisioning exercise; you probably remember me talking about envisioning exercises back in the day. I still do those; they're still invaluable in driving alignment across the organization to figure out what problem we're trying to solve. We put all that into the Hypothesis Development Canvas. Now the data science team knows what they're trying to do, they know how they're going to measure success and progress, they know what decisions the customer is trying to make, they know what predictions they need to make, and they now have a framework for figuring out which of the thousands of different data sources are probably most important in solving that.
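The canvas elements named here can be captured in a quick sketch. The field names and the dollar figures below are hypothetical (the real Hypothesis Development Canvas is a design-thinking worksheet, not software), but the false-positive/false-negative point is easy to make concrete: two models with the same total error rate can carry very different business cost.

```python
from dataclasses import dataclass, field

@dataclass
class HypothesisCanvas:
    """Illustrative container for the canvas elements named in the interview."""
    problem: str
    metrics_and_kpis: list = field(default_factory=list)
    stakeholders: list = field(default_factory=list)
    decisions: list = field(default_factory=list)
    predictions: list = field(default_factory=list)
    data_sources: list = field(default_factory=list)
    cost_false_positive: float = 0.0   # e.g. dollars per needless inspection
    cost_false_negative: float = 0.0   # e.g. dollars per missed failure

def expected_error_cost(fp_rate, fn_rate, canvas, n_decisions):
    """Expected dollar cost of a model's mistakes over n_decisions decisions."""
    return n_decisions * (fp_rate * canvas.cost_false_positive +
                          fn_rate * canvas.cost_false_negative)

canvas = HypothesisCanvas(
    problem="Reduce unplanned machine downtime",
    cost_false_positive=50.0,
    cost_false_negative=5000.0,
)

# Both models are "wrong" 11% of the time, but the mix of errors differs:
model_a = expected_error_cost(0.10, 0.01, canvas, n_decisions=1000)  # cheap false alarms
model_b = expected_error_cost(0.01, 0.10, canvas, n_decisions=1000)  # costly misses
```

Without the two cost fields filled in, `model_a` and `model_b` would look equally "good enough", which is exactly the gap the canvas is meant to close.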
Yeah, no, I like that. And we'll definitely make sure we link it here in the description and in the show notes, so I'll make sure that we repost that. So I want to go back; I think we've set the stage. The way I'm hearing this conversation today, I'm pretty excited about it: hey, there's a career path for folks that like data that could possibly lead to CDO. Now, this is Thomas and Bill saying that, but I think we have a pretty good handle on it, so it may or may not work out. But what would you say to somebody who's maybe in college watching this, maybe they're just wanting to change careers, or, you know, we even have some folks who are just moving into college,
hey, this sounds interesting, I like the business side, I can be a peacemaker. How do I become a CDO? So
I've done a couple of lectures in the last six months at a number of different universities, to graduating seniors, basically with that same question, maybe even more broad. They're like, well, how do we future proof our careers? We had COVID; now there'll be something new next year, something new following that. There are always going to be changes and challenges in front of us. Some of them may be digitally induced, some of them may be, you know, healthcare induced, or whatever pandemics come; the world is constantly changing.
I believe there are three skills that everybody needs to learn. When I teach my class, with every one of my students, whether they're a data scientist, an MBA student, or a software engineer, we work on these three skills. Skill number one: analytics. You need to know what you can do with analytics. And you don't necessarily need to know how to code TensorFlow or a neural network, but by golly, you'd better understand how it works and what you can do with it, right? So you need to understand: what are the things that I can do using reinforcement learning, using unsupervised or supervised machine learning? What are the things that these things do? So an understanding of the application of advanced analytics is critical for everybody, whether you're a nurse, a lawyer, a barista, a tech, whatever you might be; you need to understand it. That's number one.
Number two, you need to understand economics. You need to understand where and how value is created: how value is created with customers, how it's created with operations, within markets, within products. You need to have a solid foundation, not in finance as much as in economics, understanding a lot of the basic economic concepts, you know, the economic multiplier effect, postponement theory, supply and demand; a lot of basic concepts come to bear in the area of economics. And then the third one mixes in; you don't do these in isolation, you mix these all together. The third one is design thinking, which is learning to speak the language of the customer. And the single most important tool that I think anybody should learn from design thinking is how to create a customer journey map. Think about your customer. Think about the path they take, from the minute they have an epiphany that they want to do something, all the way through to the end. Then identify all the decisions that user has to make to support that journey, and identify in that journey which are the points of high value creation, around which I want to make sure I'm monetizing, and also the points of value destruction, the hindrances, because you might find that those hidden acts of hindrance are also monetization opportunities. And so what you really have to learn is the customer, because at the end of the day, the only person who has ink in their pen is the customer.
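The journey-map idea can be sketched as data. The stages, decisions, and value scores below are invented for illustration; the point is that both the high-value steps and the friction points fall out of the same map, and both, as the interview notes, are monetization opportunities.

```python
# Hypothetical customer journey for an online purchase. Positive "value"
# marks value-creation points; negative marks hindrances (friction).
journey = [
    {"stage": "Realize need",   "decisions": ["search for options"],      "value": 1},
    {"stage": "Compare",        "decisions": ["shortlist", "set budget"], "value": 2},
    {"stage": "Checkout",       "decisions": ["pick payment method"],     "value": -2},
    {"stage": "After the sale", "decisions": ["renew?", "recommend?"],    "value": 3},
]

high_value = [s["stage"] for s in journey if s["value"] >= 2]
hindrances = [s["stage"] for s in journey if s["value"] < 0]

print(high_value)  # ['Compare', 'After the sale']
print(hindrances)  # ['Checkout']
```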
So if I want to learn more about design thinking, do you have a recommendation for a book, or a couple of blog posts, or anything?
Oh gosh, yeah, I've written a lot about it. I think my third book, which is called The Art of Thinking Like a Data Scientist,
goes into a lot on design thinking with respect to data science.
That book, I think, is only available on my personal website, Dean of Big Data.com. I'll make sure to link it here. Yeah, I'll send a link. The reason why I self published it is that the minute you publish traditionally, which people don't realize, you give up distribution and pricing rights. It's not in your control. For my other books, I had no say in the price; it was decided for me, along with how it's distributed, discounts, all that kind of stuff. I wanted a workbook that my students could use, at a price point any student could afford, so it's $9.99. My goal here is, every time somebody buys that book, it buys me two visits to Starbucks. That's my Starbucks fix. But it's got a chapter in there about the Hypothesis Development Canvas, and a chapter about design thinking. Really, the entire workbook is about how you get people to think like a data scientist; again, regardless of your profession, whether you're going to be a nurse, or a physician, or an engineer, or a technician, or whatever you're going to be, tomorrow's world is gonna require everybody to think like a data scientist. No, I think that's so important. And, Bill, just a little information for the audience: before we kicked off recording, I told Bill, hey, I think this is an important topic to talk about, just on the value of the business side.
And, you know, Bill just gave some of the points, and my learning through my career as a software engineer, moving more into the business development and product side, my journey came because I wasn't the best software engineer. And even some of the best software engineers on the team started realizing they had to be able to explain to the customer, and your customer could be internal; I was a contractor then, but you may not have an external customer, you may only do internal products, yet you still have a customer. And so the things that Bill's talking about here are super important, because you're always having to sell and to communicate your vision. How else do you get the projects that you want to work on funded, or sponsored by your leadership or your customers, other than being able to show them your vision and communicate it? So, Bill, I think it's huge for our audience, and I hope they're still tuned in after they found out we were going to talk business, because that's the point: this is important stuff, to be able to understand how to convey these projects. I've had the great fortune of having managed a couple of data science teams, truly outstanding data science teams. And when I watch them, I know what gets them excited. What gets them excited is when they're talking to a customer and they're able to help the customer solve a really wicked problem using data science, and then that solution gets put into operations. It isn't just having a great idea; it's seeing it actually in the works.
And the teams I've worked with, I've been very fortunate: they light up when they know that their ideas are actually helping, you know, helping the company do things more efficiently and better. So again, to be effective, it is not just about having great ideas; it's about being able to put those great ideas to work and provide value to people. At least for me, and for the people I've associated with, that's where it gets fun.
So part of our audience as well, you know, we do have folks that are executives or leaders or directors within their organizations. Exactly. Get off your ass.
So, let's say, well, I mean, that's where we're going with this question: say they're watching this and they're like, hey, I like Bill's excellent adventure, but I don't think my company is one of the handful that are doing it right. What are the steps that I can put in place to start attracting talent? Let's just start with the talent portion: we want to start building a great data team, so how do I attract that talent?
Great talent comes to organizations that empower them.
Great talent wants to be at the front lines providing value; they don't want to sit in a box, but want to be part of a team that's creating swirls. I call it
organizational improvisation. And what I mean by that is that people want to work in a situation where all of their skills are being tested and pushed. They want to be like a great soccer team, right? Think about the US women's Olympic soccer team. Pure poetry, ballet on the field, working in combination. Was there a coach above them yelling, you know, Susie move here, Janet go here? No, no; they had been empowered as a team and as individuals to accomplish their objectives, and that's what they did. And so I think it all starts with empowerment. I have a little thing when I finish my video blogs in the morning: I always end by saying, hashtag culture of empowerment yields hashtag culture of innovation. When you can empower people, when you can allow people to try things, test things, fail, learn, and try again, you will get the best people. And better than that, you will get the best out of your people when you let them do what they can do. Again, I'm very much against the organizational box, where we take somebody who's really brilliant, stick them in a box, put them in that cage, and they're never let out. Bullshit, right? People are brilliant; they can learn, they can synergize, they can blend, they can make one plus one equal three or seven. You can take diverse perspectives and diverse opinions and come up with something even better and more powerful; that friction is how tires move, right? But yet senior management wants to go out and hire a data engineer and put that data engineer in a data engineer box. By golly, no, right? Think about a data science team: you've got data scientists, you've got data engineers, you've got an ML engineer, you've got business subject matter experts, maybe a design thinker amongst them; you have this team. And here's the beauty of a team: everybody, at some point in time, will be forced to lead. Everybody takes turns leading, depending on the task at hand. Everybody has to be prepared to lead, everybody is prepared to work together. You know, it's like playing.
It's like playing a game from childhood. Yeah, it's like playing a Game Boy, right? And when you play Final Fantasy Legend II, you soon realize that the way to win the game is to have a very diverse set of characters, who each will take different turns leading at different points in the game. That's the way teams work. So, senior executives, if you want to get the most out of your people: you're talking about hiring the best people, but you probably have some pretty damn good people already in the company, you've just never empowered them. You've never given them the ability to try things. And by the way, if you don't fail enough, you never learn enough. If we believe the economies of learning are more powerful than the economies of scale, then you have to let people fail. Just know you can't fail stupidly, and you have to learn from those failures. That's how you learn: you try things out, you nurture that natural curiosity to see what happens. I blend this data set with this data set using this sort of ML/AI framework, and it blows up. All right, well, that didn't work very well. Document it, share that, try something different. That's how we learn. So one of my early managers in my career, I love the saying and the way he kind of treated it, because he said, hey, I'll only be mad at you if you make a decision that a first grader could make. I think I just blew the saying, but I knew what he was trying to say, right? As long as it wasn't something you could prevent, or something you didn't put a lot of thought into. Well, here's the interesting part of this talk. So I believe that greatness is in everyone. But every now and then you'll have people who aren't ready to step up towards that. Yeah, I had a really powerful data scientist, really, really smart, who just could not step up. Had to let him go. It wasn't the right fit for him.
He was struggling, he was not happy, he was bringing the team down. He had to be let go. And he went and found another great job somewhere else where he could sit in a box and do what he wanted to do. He wanted to be in the box. He didn't want to be in this world. He wasn't ready. He didn't want to lead. But I tell you, I believe that greatness is in everyone. And it's up to management to basically unleash that greatness, to let the best come up. And for the best managers out there: if all your people are doing great things, suddenly you look like a genius as a manager, when all you've done is basically unleash that greatness in your people. Yeah, no, so I like where we're going here, so I'm gonna continue on this path. So for that director, like I said, maybe we've got some executives watching; I've gotten a few interesting questions and emails. So we know how to attract that talent, and they're on board, and we're doing the right things, we're
empowering, we're making sure that we have diverse thoughts and leadership within our team, everybody's stepping in to lead. How do I manage that up? Right? Maybe this person is, as you say, the mini CIO. If we're stuck there, how do I manage to get that cultural change so that, once again, the goal is to become one of that handful of companies, right? So the key is to find a friendly on the business side. That is, somebody on the business side who is in one of two modes. When I was in sales, I was a big fan of the Miller Heiman selling technique, and they used to say that people are in two modes: people who are in growth mode, and people who are in trouble mode. People who are in growth mode know there's something bigger they can do, they just need help to get there. People in trouble mode are people who know they're in trouble, who know that if they don't switch things around, they're soon going to be out on their tail. I like both those kinds of people, because they're open to suggestion and to trying to do things differently. And so you find a friendly, probably somebody more in the growth mode. I had a situation with a chief marketing officer, brand new to the company, who wanted to put his fingerprint on the company. We were releasing a brand new product. How do you focus your scarce sales and marketing resources around the right customers for that product? Right. So we worked with him and his marketing team and the brilliant CIO we had, we put together a plan, we launched this, and the first year it generated an additional $28 million in revenue. Right? Think about that: $28 million in revenue from one use case. $28 million, that's like found money. It's like walking down the street and stumbling on a bag of money and going, oh, here's $28 million for you. Right?
I’ve never found that on the street.
If I had, I wouldn't be talking to you right now. But that's what it's like, right? So you find a friendly who has a vision, and you help them, you make them a hero, or you make them the champion. And then they're out telling the rest of the C-suite, the CXOs: yeah, I just ran this use case, and it's my first one, I've got 14 more to do, and the first one generated $28 million in additional revenue. And you'll watch all the jaws in the executive suite drop: how did you guys do that? Right? Then what happens? Of course, success breeds success, which is interesting. So first off, find a friendly, make them a hero, and make sure they've got a problem that's big enough, right? I love it when companies say, well, is my data big enough for big data? No, that's the wrong question. Is your problem big enough? Right? So: $28 million.
Once you’ve done that, and you’ve started to convince the organization,
person by person, of its value, the next biggest challenge you're going to have is on governance, to make certain you've got a process put in place so people are running their ideas through you. Because what you don't want is to have somebody go, well, you know, I can't wait for your team to get to me, I'm gonna hire my own team and do my own thing. Well, data silos, right, kill the value of data. Orphaned analytics: I can't continuously learn and adapt if I have orphaned analytics. So you have to have a rigid governance organization, which means this Chief Data Monetization Officer, not only are they collaborative, trying to step across borders and make shit happen, but they've also got teeth. And they're not afraid to be a son of a bitch, to walk across and say, no, we're not going to have these bands of random data scientists in the organization. They're going to sit in our organization, you're going to work it through our governance process, because whatever you learned from the project you just did, we want to make sure that every other part of the organization learns, because the economies of learning are more powerful than the economies of scale. So the Chief Data Monetization Officer needs to be, on one hand, very friendly, with the carrot: I can make you more money, here's $28 million per use case. And on the other hand, they've got a big old stick with a nail in it: if you don't play by the rules, buddy, you're out of here. Yeah. So I hadn't seen it from a governance perspective like that. So, you know, building that value. Essentially, the way that I'm walking through it in my head is: all right, so I build this amazing team, and we find a friendly, and essentially you're building your brand internally, until you just have this rush of so many different opportunities within your organization.
And then, I guess I never thought about it from the perspective that organizations start to fall down when they start building their own. You can't have four centers of excellence, right? I know, right? We saw this problem in data warehousing, right? We started off with data warehousing, but data warehouses were really hard to build, so businesses couldn't afford to wait for the data warehouse to get to them. So they built their own data marts

Filed Under: Career Tagged With: Big Data, Careers

Spark vs. Hadoop 2019

June 12, 2019 by Thomas Henson Leave a Comment

Spark vs. Hadoop 2019

Spark vs. Hadoop 2019

In 2019, which skill is in more demand for Data Engineers: Spark or Hadoop? As career or aspiring Data Engineers, it makes sense to keep up with which skills are in demand in the market. Today Spark is hot and Hadoop seems to be on its way out, but how true is that?

Hadoop, born out of the web search era and part of the open source community since 2006, has defined Big Data. However, Spark's release into the open source Big Data community, boasting 100x faster processing for Big Data, created a lot of confusion about which tool is better and how each one works. Find out what Data Engineers should be focusing on in this episode of Big Data Big Questions: Spark vs. Hadoop 2019.

 

Transcript – Spark vs. Hadoop 2019

Hi folks. Thomas Henson here with thomashenson.com, and today is another episode of Wish That Chair Spun Faster. …Big Data Big Questions!

Today’s question comes in from some of the things that I’ve been seeing in my live sessions, so some of the chats, and then also comments that have been posted on some of the videos that we have out there. If you have any comments or have any ideas for the show, make sure you put them in the comments section here below or just let me know, and I’ll try my best to answer these. This question comes in around, should I learn Spark, or should I learn Hadoop in 2019? What’s your opinion?

A lot of people are just starting out, and they’re like, “Hey, where do I start?” I’ve heard Spark, I’ve heard Hadoop’s dead. What do we do here? How do you tackle it?

If you've been watching this show for a long time, you've probably seen me answer questions similar to this and compare the differences between Spark and Hadoop. This is still a viable question, because I've actually changed a little bit the way I think about it, and I'm going to take a different approach with the way that I answer this question, especially for 2019.

In the past I’ve said that it really just depends on what you want to do. Should you learn Spark? Should you learn Hadoop? Why can’t you learn both? Which, I still think, from the perspective of your overall learning technology and career, you’re probably going to want to learn both of them. If we’re talking about, hey, I’ve only got 30 days, 60 days. “I want the quickest results possible, Thomas.” How can I move into a data engineer role, find a career? Maybe I just graduated college, or maybe I’m in high school, and I want to get an internship that maybe turns into a full-time gig. Help me, in the next 30 to 90 days, get something going.

Instead of saying depends, I'm really going to tell you that I think it's going to be Spark. That's a little bit of a change, and I'll talk about some of the reasons why I think that changed too. Before we jump into that, let's talk a little bit about some of the nomenclature around Hadoop. When we talk about Hadoop, a lot of times we're talking about Hadoop, and MapReduce, and HDFS in this whole piece. From the perspective of writing MapReduce jobs or processing our data, Spark is far and away the leader in that. Even MapReduce is being decoupled, has been decoupled, and more and more jobs are not written in MapReduce. They're more written with Flink, or Spark, or Apache Beam, or even [Inaudible 00:02:32] on the back-end. That war has been won by Spark for the most part. Secondly, when we talk about Hadoop, I like to talk about it from an ecosystem perspective. We're talking about HDFS, we're talking about even Spark included in that, and Flume, all the different pieces that make up what we call the big data ecosystem. We just call that the Hadoop ecosystem.

The way that I'm answering this question today is, hey, I'm looking for something in 2019 that could really move the needle. What do you see that's in demand? I see Spark is very, very much in demand, and I even see Spark being used outside of just HDFS as well, too. That's not saying that if you've learned Hadoop or you've learned HDFS you've gone down the wrong path. I don't think that's the case, and I think that's still viable. But you're asking me, what can you do to move the needle in 30 to 90 days? Digging down and becoming a Spark developer opens up a career option. That's one of the quickest ways you can get there, and one of the big things we've seen out there with the roles, roles for data engineers. Another huge advantage, we've talked about it a little bit on this channel, but the big announcement for where Databricks is going from the perspective of investment and what their valuation is. They're at a $2.5 billion valuation, and they're huge in the Spark community. They're part of the incubators and on a lot of steering committees for Spark. They have some tools and everything that they sell on top of that, but it's just really opened my eyes to what's out there. I knew Spark was pretty big, but the fact of where Databricks is going, I think that says a lot about what we're seeing. Another point, too, you've heard me talk about it a good bit, but where we're going with deep learning frameworks and bringing them into the core big data area. Spark is going to be that big bridge, I believe. People love to develop in Spark. Spark's been out there. It gives you the opportunity now, with Project Hydrogen and some of the other things that are coming, to be able to do ETL over GPUs, but also import data and be able to implement and use TensorFlow or PyTorch, or even Caffe2. If you're looking in 2019 to choose between Spark and Hadoop to find something in the next 30 to 90 days, I would go all in with Spark.
I would learn Spark, whether it be from Java, Scala, or Python, and start doing some tutorials around that, being able to code, being able to build out your own projects. I think that's going to really open your eyes, and that can really get the needle moving. At some point, you want to go back and learn how to navigate data with HDFS, how to find things, and how it all fits together in the Hadoop ecosystem, because it's all one big piece here. But if you're asking me the one thing to do to move the needle in 30 to 90 days: learn Spark.
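To make the "build your own projects" advice concrete, here is a minimal pure-Python sketch of the flatMap-then-reduceByKey word-count pattern that Spark's RDD API generalizes. The sample lines are made up for illustration; this is the shape of the computation, not Spark itself.

```python
from collections import Counter

# Made-up sample input, standing in for lines of a text file in HDFS.
lines = [
    "big data big questions",
    "learn spark in 2019",
    "spark runs across the hadoop ecosystem",
]

# flatMap step: split every line into individual words.
words = [word for line in lines for word in line.split()]

# reduceByKey step: merge counts per word (Counter does the grouping).
counts = Counter(words)

print(counts["big"])    # "big" appears twice in the first line
print(counts["spark"])  # "spark" appears in the second and third lines
```

In actual PySpark the same idea would look roughly like `rdd.flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`; once this pattern clicks in plain Python, the Spark tutorials read much more easily.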

Thanks again. That’s all I have today for Big Data Big Questions. Remember, subscribe and ring that bell, so you never miss an episode of Big Data Big Questions. If you have any questions, put them in the comment section here below. We’ll answer them on Big Data Big Questions.

 

Filed Under: Hadoop Tagged With: Big Data, Data Engineers, Hadoop, Spark

Data Engineer in 2019

May 31, 2019 by Thomas Henson Leave a Comment

Data Engineer in 2019

What’s a Data Engineer Career Like in 2019?

Times change and keeping up with maintaining skills while managing day to day projects can be exhausting.

Should I learn Hive or Tensorflow?

Which is better Flink or Spark?

How as a Data Engineer will I focus on Containers?

Questions like these come up all the time when I'm speaking with aspiring and career-focused Data Engineers. Find out my thoughts around skills and career outlook for Data Engineers in 2019 on this episode of Big Data Big Questions.

Transcript – Data Engineer in 2019

Hi folks, Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today's question comes in around, what does data engineering in 2019 look like? What are some of the trends? What are some of the things that are going on? This question came in from the comments section here on YouTube, so if you have a question, make sure you put it in the comments section here below or reach out to me on thomashenson.com. I'll discuss them in an orderly fashion as they come in, provided I have the time. I've been getting a ton of questions, so I really appreciate it. Thank you for this community here.

Today's question, before we jump into it, I want to give you my three top trends to watch for in 2019. Before we do, I did want to credit  with an article that they did for their 10 trends in big data. I talked about them on my YouTube live session, so if you're ever around Saturday mornings, jump on. Throw me a question in the chat. Let's get cracking. I try to answer as many questions as I can there Saturday mornings. Jump in there.

The 10 here, you can check in the comments section here below, where I have the link to the [Inaudible 00:01:13] trends. I'm going to read some of them real quick. The first one they said for the top 10 trends in 2019: data management and [Inaudible 00:01:22]. They're talking a little bit about ETL and how ETL's not going away. I've said that for a while, but we did read an article not too long ago saying, "Hey, you know, there are some tools out there that are really going to make ETL kind of a thing of the past." We'll see. Hopefully, right?

I'm not pro-ETL. I just... man, I started out there, and it seemed like I was never going to get out of it. Number two, data silos continue to proliferate. This goes into what we saw when Hadoop emerged as this huge, big data lake, where the data's only going to exist there. We've been talking about it, especially on this channel, over the past few years: hey, data has a lot of gravity to it. There's going to be data out on your edge. There's going to be data in the cloud. There's going to be data still in core data centers.

The idea of a single fluid data lake is a little more complicated in practice. You still have those main areas, but you still have to do analytics in place in some area. Number three, streaming analytics has a breakout year. I've talked about streaming analytics on this channel for the last couple of years. I actually did a session about the future architectures of streaming analytics at the 2017... was it Hadoop Summit? They call it DataWorks now.

Data governance builds steam, talked about some of that here. Soft skills start to emerge as tech evolves: just talking about the soft skills of understanding the business, and we talk about that with the book, the Big Data MBA, here. Deep learning gets a little bit deeper. Hm. Have we talked about deep learning on this channel? Special K expands footprint: they're talking about Kubernetes and what's going on with dockerization. Clouds are hard to ignore. New tech will emerge, talking about how in Silicon Valley a lot of open source, and closed source, tools have emerged, and they don't see that stopping anytime soon. Then, smart things everywhere. I've talked about those a good bit here, too.

Without further ado, let's jump into my three trends for 2019, my three trends to watch for in 2019. The first one: deep learning and Hadoop. How are these ecosystems going to interact with each other? A lot of projects out there, we talked about it last year, around Project Hydrogen, Submarine (another project), and NVIDIA's RAPIDS. It's all about being able to use GPUs and also being able to use those deep learning libraries with data that's in your Hadoop ecosystem, or just for some ETL. That's one of the things NVIDIA RAPIDS does. Maybe I should do a video just specifically on that. Watch that trend. Start watching what's going on with TensorFlow and being able to use it integrated with Spark and some of your other tools that are more traditional in the Hadoop ecosystem. That was number one. Number two: containerization of the world overtakes data engineering. Similar to what they were talking about [Inaudible 00:04:11] with their trends, with Special K being special. With containerization, we've seen a lot, a lot of announcements here lately with cloud native applications and cloud native experiences on the Cloudera side, and you even saw in Hadoop 3.0 where they were laying the groundwork to be able to containerize YARN, your scheduling engine, and some of the other components there. We're going to continue to see that, and that's one skill that you're going to be looking for. If you're in data engineering right now and you want to know what's coming down the pipe for you, I would look into getting more familiar with containerization. That's actually on my roadmap for the end of the year, to understand a little more around Docker, and Kubernetes, and that whole ecosystem. That is a big trend we will see for data engineering. It's not going to slow down. It's been picking up steam a lot here lately, but it's going to go full force.
My third trend, the thing that I'm looking for, for data engineers in 2019: streaming analytics. I was doing some research and looking around some IDC numbers about where we're going from a data perspective. One of the interesting tidbits they were talking about is how streaming analytics will take up anywhere from around 30% of all the analytics and things that are going on. Think about all these different devices bringing in data here by 2025. 30% of that's going to have to be streaming analytics. That's a huge number. There are a number of tools out there that are helping to deal with what's going on from a streaming analytics perspective.
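The core job behind those streaming tools, continuously aggregating events as they arrive, can be sketched with a toy tumbling-window count in plain Python. The events and window size below are made up for illustration; real engines like Kafka Streams, Flink, or Spark Structured Streaming add partitioning, state stores, and fault tolerance on top of this basic idea.

```python
from collections import defaultdict

# Made-up event stream: (timestamp_in_seconds, device_id) pairs.
events = [(1, "a"), (3, "b"), (7, "a"), (12, "c"), (14, "a")]

WINDOW = 5  # tumbling window length in seconds

counts = defaultdict(int)
for ts, device in events:
    bucket = ts // WINDOW  # each event falls into exactly one window
    counts[bucket] += 1

# Window 0 covers [0, 5), window 1 covers [5, 10), window 2 covers [10, 15).
print(dict(counts))  # {0: 2, 1: 1, 2: 2}
```

The hard part in production is that events arrive late and out of order, and state has to survive failures, which is exactly what the streaming frameworks mentioned here are built to handle.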

We’ve got , we’ve got Kafka. On the cloud side we’ve got Kinesis. A lot of different tools. We had [Inaudible 00:05:42] on this channel here, but there’s a lot of tools in place, a lot of tools being created, because streaming analytics is a huge beast of data to handle. It’s a different kind of problem than what we’ve seen, and it’s only going to get worse as we start bringing in more data, more devices. Really cool opportunities for you as data engineers. Outside of my goals for 2019, if you’re looking for some things to jump into and some educational paths for yourself as a data engineer in 2019, I would look into those three trends. Deep learning, containerization, and then streaming analytics. That’s all I have for today. Make sure to subscribe and ring that bell so that you never miss an episode of Big Data Big Questions. Throw a comment in the comments section here below if you have any questions. If you like the video, if you hated it, just let me know how you feel about this, and I will see you next time on this episode of Big Data Big Questions.

Filed Under: Data Engineers Tagged With: Big Data, Data Engineers

Big Data Impact of GDPR

May 7, 2018 by Thomas Henson Leave a Comment

Big Data Impact of GDPR

How does GDPR Impact Data Engineers

The General Data Protection Regulation (GDPR) goes into effect in May 2018. Many organizations are scrambling to understand how to implement these regulations. In this video we will be discussing Big Data Impact of GDPR.

Transcript – Big Data Impact of GDPR

Hi, folks! Thomas Henson here, with thomashenson.com, and today is another episode of Big Data Big Questions. Today is a very special episode. We’re going to talk a little bit more about regulation than we’ve probably talked about before.

We’re going to tackle the GDPR and what that means for big data, big data analytics, and why data engineers and even data scientists should understand the regulation and know it at least from a high level. Find out more right after this.

[Sound effects]

Welcome back. Today is a special episode. We’re going to talk about the GDPR, which is the general data protection regulation, and we’re going to talk about what that means for a data engineer, and why you should understand that.

Just to have a high-level overview, this is going to be one of those things where understanding this regulation is really going to help you. You’re going to have meetings about it. This is such a big change for our industry. If we think about it from an IT perspective or a big data perspective, think of changes that have happened in other industries.

Think of what happened in the US with the SEC and accounting, around Enron and some of the other financial accounting problems that happened in the early 2000s, and then also think about healthcare. With healthcare regulation, if you know anything about healthcare, you probably know at least the HIPAA requirements. This GDPR is going to be similar to that. It's taking place in the EU, but the ramifications are going to be felt, I believe, everywhere, because, one, data exists everywhere. Most companies are global companies, and in the way that we handle and capture data, whether it be from a user in the EU or a user anywhere else in the world, we're going to have to have those regulations and have those systems in place so that we can comply with that.

Just from a high level, if you’re a data engineer, and we focus on the technology, and the hardware, but from non-technical careers, remember we’ve talked about this before, so some of these non-technical careers. We talk about data governance in other places. If you’re interested in that, get head first, dive into the general data protection regulation. Find out as much as you can, because that’s really going to make yourself, one, valuable in the meetings, but also if you’re looking to do a career change, maybe you’re already doing some kind of compliance or something like that, and you want to get involved in big data, here’s your opportunity. Become an expert at this, because we’re moving fast to have to comply.

Just to talk a little bit about it. It’s the EU agreement on how data is processed and stored. It’s a replacement for the data protection directive 95/46, so this is a more stringent, more all-encompassing. You’re probably like, “Why are we going down this route? Why is a regulation coming out?” If you think about it, a lot of things have been happening over the last few years.

How often do we hear about a data breach? There was a huge one last year, right? Affecting millions and millions of users: people's credit cards, people's Social Security numbers. Our data is constantly under attack, and from a big data perspective, we hold onto data so that we can analyze it and make better products, make more efficient products, make better websites, better clickthrough rates on your ads. There are so many different things that we do with this data, but also, there's so much danger in having it.

We have to make sure that we’re protecting it, and then also, we want to make sure from a privacy perspective, and this is where this is really going to hit, is allowing users to opt in or opt out. Knowing what’s being collected and how long they’re going to have it, and then also giving you the ability to say, “You know what? Let’s get rid of that data.” I don’t want you to hold onto it.

Those are some of the things that you're going to be tackling with it. Also, just as a note, it was approved on April 14th, 2016, and must be complied with by May 25th, 2018. We've got some time here between now and then, and that's where I'm really encouraging people: even if you're watching this video after the date, if you're wanting to get into big data on the governance side, or maybe you have non-technical career options, learn this. I'm serious. Just learn this. This is going to be huge. If you follow anything from Hortonworks, or Cloudera, or anybody involved in big data or even IT, you're getting bombarded with information about it, because it's such a big deal. The compliance on this, like I said, is industry-shifting, just like HIPAA was, and just like some of the SEC and accounting regulations that came out in the early 2000s. I've got the official site listed here, so you can see where to go from the EU and see it.

Like I said, you’re going to see a ton of blog posts. There’s a ton of resources out there. Some of the tools, if you’re on the technical side, and you’re wondering, okay, I’ve got to go into a meeting. If somebody’s going to ask me what we’re doing about some of the data governance, and some of the other pieces, where can I focus, or where can I say, “Hey, you know what? Give me a week or two. Let me look at some of the things maybe you weren’t doing, and maybe the way that you’re protecting the data is a little bit different.”

Maybe the way that you’re tracking and holding onto the data, so that you can comply by getting rid of users’ data or opting not to track it, or even using a way to mask it, right? Using a way so that you can mask it, so you’re protecting the identities a little bit better. Maybe those are some of the weak points that you are…
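As a toy illustration of that masking idea (not a compliance recipe; the salt and sample record below are made up), one common approach is to pseudonymize a direct identifier with a keyed hash, so records can still be joined for analytics without exposing the raw value:

```python
import hashlib

# Made-up secret; in practice this would live in a key management system.
SECRET_SALT = b"rotate-me-regularly"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable, hard-to-reverse token."""
    digest = hashlib.sha256(SECRET_SALT + value.encode("utf-8"))
    return digest.hexdigest()[:16]

record = {"email": "user@example.com", "clicks": 42}
masked = {**record, "email": pseudonymize(record["email"])}

print(masked)  # same email always maps to the same token
```

One caveat worth knowing: under GDPR, pseudonymized data is generally still considered personal data; masking reduces risk, it does not remove the obligation.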

Look into Apache Atlas, Apache Ranger, and Cloudera Navigator. Depending on the flavor of the Hadoop framework you're using, or the Hadoop package you're using, whether it be Hortonworks or Cloudera, if you're using one of those two main ones, look into these three tools right here. This will give you some kind of framework, so you're starting to see. So, you walk into the meeting, somebody says, "Hey, we've got to look at how we're complying with GDPR. We want to really focus on data governance. What are we doing?" Otherwise you're sitting there saying, "I don't know how to tackle this."

Go. Know these tools. Understand them from a high level. Whether you need to implement them is a whole different story, but you can start getting trained up and start implementing those, too. Hope this was very helpful. This is something that I'm sure we will make some more videos on and be talking about constantly. I predict that this is, like I said, industry-shifting regulation for IT and especially for big data, for all of us. I'm sure there's going to be follow-on. I'm sure other countries and other areas are starting to look at regulations: here in the US, Russia, Japan, everywhere. It's not going to be just for the EU. Even if it were, it's still affecting us. Everything's global. If you have any questions, make sure you put them in the comments section here below. I will answer them here on Big Data Big Questions. You can go to my website, thomashenson.com, look for the Big Questions, and send me a comment. Also, make sure that you're subscribing, so that you never miss an episode, and I will see you next time.

Filed Under: Big Data Tagged With: Big Data, GDPR

Skills Needed for Big Data Administrators

April 30, 2018 by Thomas Henson 1 Comment

Big Data Administrators

Data Engineers & Big Data Administrators

In today's episode of Big Data Big Questions we tackle what skills are needed for Big Data Administrators. Data Engineers wear many hats in data analytics workflows: one part software engineer and one part systems administrator. Big Data Administrators are responsible for keeping Hadoop, Kafka, Ambari, and other frameworks running. Find out what other skills Big Data Administrators need in the video below.

Make sure to subscribe to my YouTube channel to never miss an episode of Big Data Big Questions.

Transcript – Skills Needed for Big Data Administrators

Hi, folks! Thomas Henson here, with thomashenson.com, and today is another episode of Big Data Big Questions. Today, I’m going to answer a user question about data administration, or in big data, what is that big data administrator’s role?

What are some of the tools that they use? How can you get involved? Find out more, right after this.

Welcome back. Today’s question is going to revolve all around the big data administrator, what that role is, what are some of the tools that they use? This question came in from my website. You can do Big Data Big Questions, go to thomashenson.com, click on Big Questions, submit a question there. Put them in the comments section here below, and then always, make sure that you’re subscribing to this YouTube channel, so that you’ll never miss an episode. These are great tips. These are great ways for me to answer any questions that you have. If you have those questions, ask them, but also make sure you’re subscribing to the channel.

Today’s question comes in from Jarvis. He says he has a dilemma on Python for big data. We’ve answered a number of questions around Python and big data, and around whether you have to know Java. But this one is a little bit different. It’s going to cover the data administrator.

Hi Thomas, a big fan of yours.

Thanks for watching. Thanks for sending in the question.

I had a question related to IT careers and skills in big data. I wanted to know if Python is required only by data administrators, or can all things done by Java on big data be implemented using Python as well?

This question is really good. Like I said, we’ve talked a little bit about, do you have to know Java in order to be able to be a big data admin, be involved in big data, be a data engineer?

The answer is no. You can do things in Python, but I want to tackle the question from the perspective of, you’re asking about data administration, and so there are two different roles. We’ve talked about the data engineer versus the data scientist. The data engineer is the one who’s setting up the cluster, maybe doing some of the software development, running your Hive jobs, maybe even just being the software developer, whether you’re writing Java jobs or writing your Spark jobs. But your data administrator, that’s a different role inside of that. We have two pieces of the spectrum. This side over here is more the software development side, and on this side over here, let’s say that this is more of the administrator, or our systems engineer, the person who’s setting up and running the cluster. Maybe not doing the day-to-day coding but doing the administrating and running of the system. Think of that as your full stack developer.

Think about when you split up your systems admin, who’s setting up the stack, making sure the database is running, doing those tasks versus who’s running the… Whether it be PHP code or .NET code. What skills does a data administrator have to have?

I would say that, if we’re talking about being able to be involved in the community, and be involved in big data, you’re going to be keying on HDFS, Ambari, Hive, Flume, and you’re going to have a lot of Linux skills. If you’re asking me, you want to get into data administration, you want to be an awesome data administrator in the big data ecosystem, do you have to know Java? No. Can everything be implemented in Python? Maybe, but you’re probably going to be doing more administrative tasks as far as setting up the cluster and understanding the operating system that Hadoop’s running on.

You’re maintaining more at that Linux level and the Hadoop ecosystem level, so if you’re using Hortonworks or you’re using Cloudera, how all those tools are integrating and talking to each other. I would focus not so much on the coding part, but on being able to set up that cluster. It’s going to vary, too. It’s going to vary with the role.

Some places, especially when you’re just starting out on big data, and you have a small team in your company, you’re going to be the software engineer and the data administrator, right? You might need to have a little more code.

If you’re going to a more seasoned team or a bigger team, you can actually have that role where you’re running the administration. My answer is, I wouldn’t worry so much about Python and Java, if that’s the role that you’re wanting.

The data administrator, I would worry about being able to integrate the tools. Be familiar with the tools, be familiar with how to set up, how to add nodes, how to take nodes down. How to set up secondary name nodes, so that when one name node goes down, you can flip over to the second name node. Being able to back up the data. Making sure that we’re taking snapshots. All the kinds of tasks that go into running the system, versus being able to write a MapReduce job. If you’re really keen on being a big data administrator, which, those are great roles, those are a lot of fun, you’re still hands-on, but you’re not really having to write the jobs.
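
The failover, snapshot, and backup tasks above map to a handful of HDFS commands. Here’s a rough sketch, assuming an HA-enabled cluster with NameNodes registered as nn1 and nn2 and a hypothetical /data/sales directory:

```shell
hdfs dfsadmin -report                         # health and capacity of the DataNodes
hdfs dfsadmin -allowSnapshot /data/sales      # enable snapshots on a directory
hdfs dfs -createSnapshot /data/sales nightly  # take a point-in-time backup of it
hdfs haadmin -getServiceState nn1             # which NameNode is currently active?
hdfs haadmin -failover nn1 nn2                # flip over to the standby NameNode
```

One note on terminology: in an HA setup the failover target is a standby NameNode; the older secondary NameNode only checkpoints metadata and can’t take over for a failed NameNode.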

You’re checking out new tech, checking out new projects, to see, “Hey, am I going to be able to integrate this into our system,” or, “Man, you know, we’ve got two or three more nodes that are going to come online, so let’s make sure that we get those racked and stacked, and then, let’s make sure that we’re adding those to the cluster, too.”

A lot of cool things that you can do in that role. Most of them aren’t going to involve coding, so you’re not really going to have to worry about Java, you’re not going to have to worry about Python, as much as you would in the traditional data engineer, where you’re looking at being more of a software engineer.

I hope I answered your question. If anybody else has any questions, put them in the comments section here below. Make sure to follow me here, so click subscribe, and then I’ll see you next time.

Filed Under: Data Engineers Tagged With: Big Data, Data Engineer

Better Career: Hadoop Developer or Administrator?

January 16, 2018 by Thomas Henson 1 Comment

Hadoop Developer or Administrator

The Hadoop Ecosystem is booming and so is the demand for Hadoop Developers/Administrators.

How do you choose between a Developer or Administrator path?

Is there more demand for Hadoop Developers or Administrators?

Finding the right career path is hard and creates a lot of anxiety about how to specialize in your field. To land your first job or move up in your current role, specializing will help. In this video I will help Data Engineers choose a path between Hadoop Developer and Administrator. Watch the video to get a breakdown of the Hadoop Developer and Administrator roles.

Video – Better Career: Hadoop Developer or Administrator

Transcript

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions

Non-Technical Careers in Big Data

January 13, 2018 by Thomas Henson Leave a Comment

Non-Technical Careers in Big Data

Big Data Career Without Coding?

Do all career options in Big Data demand skills in coding or administration? Big Data projects are in high demand right now, but skill sets for these projects come from different backgrounds. If you want to get involved with Big Data but don’t have a technical background, watch the video to learn your options.

Options discussed in Non-Technical Careers in Big Data:

  1. Data Governance
    1. How Timely is the data?
    2. What is the source of all this data? Garbage in Garbage out
    3. Explain one of my first jobs in IT
  2. Project Management
    1. Agile Development
    2. Scrum Master
    3. DevOps
  3. Compliance & Security
    1. Huge Data Lakes need securing
    2. Huge potential with GDPR (General Data Protection Regulation) – Plug Alan Gates Interview
    3. How many different breaches do we hear about on a daily basis?

Video – Non-Technical Careers in Big Data

Transcript

Hi, folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Welcome back to the new year. Our first thing that we’re going to tackle today in our first episode of Big Data Big Questions for 2018 is going to be non-technical jobs or career options inside of big data.

It’s submitted in from one of our YouTube users. You can find out more right after this.

Today’s question comes in from YouTube. Remember, if you have any questions around big data or anything that you want to ask and you want me to answer, you can submit those in our YouTube comments below on any of the videos, or you can go to my website at thomashenson.com/bigdataquestions. You can submit any questions there, and I’ll answer them as best I can on air, and give you my advice on the Hadoop community, or big data, or data engineers, or any questions that you have.

Today’s question comes in from YouTube, and it’s from Shahzad Khan. He says, “I work as a change manager, and I don’t know anything about Java or Hadoop, but I want to learn this technology. Is it all right for me to learn, since I’m not into coding? Also, I’ve never been involved in a development team, please suggest.”

Great question. Thanks for the comments and thanks for watching. Continue to watch. My first thing when I look at this is, we’ve talked about the ability, and I’ve had a couple other videos that you’ve seen where we’ve talked about, that you don’t have to know Java to be involved in Hadoop. If you have any questions around that, you can check into that. Really, I think this question, I want to frame it a little bit different, and think about, just because you want to be involved in big data, and you want to be involved in the community and all the things that are happening, you don’t necessarily have to have a technical role to be involved in that.

There’s three roles that I want to talk about that are non-technical from the aspect of coding and Hadoop administration that you can do to still be involved in data or even big data. I’m going to put them together. These aren’t just specifically for big data. This can be around data analytics.

The first one is around data governance. When we talk about data governance, we talk about, what’s the flow of data? Where did the data originate? Everybody’s probably heard of the adage or the example of garbage in, garbage out. Where’s your data coming from? Can you trust, and can you automate, the data that’s coming in? Data governance is about where that data comes from, but it’s also about, how timely is that data? You’re really involved with the sourcing of the data. You’re also looking at things around… I remember one of my first jobs in IT. I remember sitting around, and we had a couple different applications, and the heads of each application were together, and we were all there to talk about the different ways that we name things in our own databases. If you think about it, we were trying to merge everything into an enterprise data warehouse. This is a little more old school, but it still happens in big data, when we have these different data sources.

You might have an instance where data from one data set is named differently or has a different key than data in a separate data set, but you want to be able to merge those. With data governance, you can help find, and be a part of tracking, where the data’s coming from, so that’s one option. I would look into data governance if you still want to be involved in big data but don’t have the technical skills, or the desire to build them.

Another one is project management. We always need good project managers. Project managers are the workhorses that really help bring the developers, the data scientists, the front-end developers, everybody together, and really get that project going, making sure that we’re communicating. If you’re interested in project management, you can do that from a non-technical perspective. One thing, though: I’ve got some stuff on my website where I went through and did the scrum master training. Think of agile development. Just like you would in traditional application development, big data needs agile developers or agile project managers as well.

Then, also look at the scrum master training, but also look at DevOps, and see where that is, if there’s any DevOps certifications, or anything that you can provide in that background to be able to help and manage these teams. Project management is a second one, and then the big one, the next one, is compliance and security. We always need compliance and we always need security, especially now with the maturity of the Hadoop community and how much Hadoop is taking over and being used in the enterprise. There’s always compliance around it. You think of HIPAA, you think of some of the SEC compliance here in America. Then, you can also think of GDPR, the General Data Protection Regulation. I would look at that regulation.

That’s something that’s really interesting to me, and if I was somebody non-technical, and I was interested in compliance or security, that is one area I would start to look at, because I think there’s going to be a growing need. Anytime there’s any kind of regulation, and this isn’t a political statement in any way, but anytime there’s any kind of regulation or change in regulation, there’s a lot that goes on behind the scenes as far as interpreting it and making sure that you’re in compliance within your enterprise, or if you’re working for some kind of public institution, you want to make sure you’re doing that. Anytime something like that happens, if you can become an expert and move to that, that would be huge as well.

For securing the data, too. It’s an ongoing, probably overused joke: how many data breaches have you heard about? There’s one every day. Big data is not immune to that. In fact, we’re a larger target. Think about the three Vs.

Volume. How much data do we have in your Data Lake? Big data has big data, right? You need to be able to secure that. Those are the three areas I would look at for non-technical jobs if you still want to be involved in data. Data governance, project management, and compliance and security. That’s all for today. Thanks for tuning in. Make sure you subscribe, so you never miss an episode. I will see you again on Big Data Big Questions.

Filed Under: Career Tagged With: Big Data, Big Data Big Questions, career

Learning Roadmap for Data Engineers?

December 19, 2017 by Thomas Henson Leave a Comment

Learning Roadmap for Data Engineers

Is there a learning Roadmap for Data Engineers?

Data Engineering is a highly sought-after field for Developers and Administrators. One factor driving developers into the space is the average salary of $100K – $150K, which is well above average for IT professionals. How does a developer start to transition into the field of Data Engineering? In this video I will give the four pillars that developers/administrators need to follow to develop skills in Data Engineering. Watch the video to learn how to become a better Data Engineer…

Transcript – Learning Roadmap For Data Engineers

Thomas: Hi, Folks. I’m Thomas Henson with thomashenson.com. Welcome back to another episode of Big Data, Big Questions. Today we’re going to talk about some learning challenges for the data engineer. And so we’re actually going to title this a roadmap learning for data engineers. So, find out more right after this.

Big Data Big Questions

Thomas: So, today’s question comes in from YouTube. And so if you want to ask a question, post it here in the comments, and have these questions answered. So, most of these questions that I’m answering are questions that I’ve gotten from the field, that I’ve met with customers and talked about, or that I get over, and over, and over. And then a lot of the questions that I’m answering are coming in from YouTube, from Twitter. You can go to Big Data, Big Questions on my website, thomashenson.com…Big Data, Big Questions, submit your questions there. Any way that you want to reach out, use the #bigdatabigquestions on Twitter, and I will pull those out and answer those questions. So, today’s question comes in from YouTube. It’s from Chris. And Chris says, “Hi, Thomas. I hold a master’s degree in computer information systems. My question is, is there any roadmap to learn on this course called data engineer? I have intermediate knowledge of Python and SQL. So, if there’s anything else I need to learn, please reply.”

Well, thanks for your question, Chris. It’s a very common question. It’s something that we’re always wanting to understand: how can I learn more, how can I move up in my field, how can I become a specialist? So, data engineering is a sought-after field in IT with high demand, but there’s not really a set roadmap, so you have to look at what some people are learning and what other people say the specialization looks like. So, you’re asking what I see and what I think are the skills that you need based off your Python and your SQL background. Well, I’m going to break it down into four different categories. I think there’s four important things that you need to learn. And there’s different ways to learn them. And I’ll talk a little bit about that and give you some resources for that. And all the resources for this will be posted on my blog. So, I’ll have it on thomashenson.com. Look up Roadmap Learning for Data Engineer. And under that video, you’ll see all these links for the resources.


Ingesting Data

The first thing is you need to be able to move data in and out. And so most likely you’re going to want to know how to move it into HDFS. So, you want to know how to move that data in, how to move it out, and how to use the tools. You can use Flume, or just use some of the HDFS commands. You also want to know how to do that from an object perspective. So, if in your workflow you’re looking to move data from object-based storage and still use that in Hadoop or the Hadoop ecosystem, then you’d want to know that. And then also I’d mix in a little bit of Kafka there, too. So, understanding Kafka. The important point there is being able to move data in and out. So, ingest data into your system.
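
As a concrete sketch of the in-and-out moves, here’s what the command-line version looks like (the file, path, and topic names are made up for illustration):

```shell
hdfs dfs -mkdir -p /data/raw
hdfs dfs -put logs.txt /data/raw/            # ingest a local file into HDFS
hdfs dfs -ls /data/raw                       # confirm it landed
hdfs dfs -get /data/raw/logs.txt ./copy.txt  # move it back out

# On the Kafka side, push the same records onto a topic:
kafka-console-producer.sh --broker-list localhost:9092 --topic raw-logs < logs.txt
```

The broker address and topic are placeholders; newer Kafka releases use --bootstrap-server instead of --broker-list.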

ETL

The next one is to be able to perform ETL. So, being able to transform that data that’s already in place or as it’s coming into your system. Some of the tools there… If you watch any part of my videos, you know that I got my start in Pig, so being able to use Pig, or use MapReduce jobs, or maybe even some Python jobs to be able to do it. Or Spark, just to be able to transform that data. So, we want to be able to take some kind of maybe structured data or semi-structured data and transform it, being able to move that into a MapReduce job, a Spark job, or transform it maybe with Pig and pull some information out. So, being able to do ETL on the data, which rolls into the next point, which is being able to analyze the data.
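
To make the transform step concrete, here’s a minimal ETL sketch in plain Python. The log format and field names are invented for the example; the same transform could equally be written as a Pig script or a Spark job.

```python
def transform(lines):
    """Parse semi-structured 'timestamp level message' lines into
    structured records, keeping only the ERROR events."""
    records = []
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue  # drop malformed rows
        ts, level, message = parts
        if level == "ERROR":
            records.append({"ts": ts, "level": level, "msg": message})
    return records

# invented sample input
raw = [
    "2017-12-19T10:00:01 INFO job started",
    "2017-12-19T10:00:02 ERROR disk full on node 3",
    "badline",
]
print(transform(raw))
```

The filter-and-reshape pattern here is the core of most ETL jobs, whatever engine runs it.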

Analyze & Visualize

So, being able to analyze the data whether you have that data, you’re transforming it, maybe you’re moving it into a data warehouse in the Hadoop ecosystem. So, maybe you move it into Hive. Or maybe you’re just transforming some of that data, and capturing it, and pulling it into HBase, and then you want to be able to analyze that data maybe with Mahout or MLlib. And so there’s a lot of different tutorials out there that you can do, and it’s just kind of getting your feet wet, understanding, “Okay, we’ve got the data. We were able to move the data in, perform some kind of ETL on it, and now we want to analyze that data.” Which brings us to our last point. The last thing that you want to be able to do and be familiar with is be able to visualize the data. And so with visualizing the data, you have some options there. So, you can use Zeppelin or some of the other notebooks out there, or even some custom built… If you’re familiar with front-end development, you can kind of focus in on some of the tools out there for making some really cool charts and really cool different ways to be able to visualize the data that’s coming in.
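
As a tiny illustration of the analyze step, here’s the kind of GROUP BY-and-count you’d run in Hive, sketched in plain Python over made-up event data:

```python
from collections import Counter

# each event is (event_type, user_id) -- invented sample data
events = [("click", "u1"), ("view", "u2"), ("click", "u3"), ("click", "u1")]

# roughly equivalent to: SELECT type, COUNT(*) FROM events GROUP BY type
counts = Counter(event_type for event_type, _ in events)
print(counts.most_common())  # → [('click', 3), ('view', 1)]
```

The result of an aggregation like this is what you’d then hand off to a notebook like Zeppelin to chart.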

Four Pillars – Learning Road Map for Data Engineers

So, the four big pillars there, remember, are moving your data – so being able to load data in and out of HDFS – object-based storage, and then also I’d mix a little Kafka in there, performing some kind of ETL on the data, being able to analyze the data, and then being able to visualize the data. In my career, I’ve put more emphasis around the moving data and the ETL portion. But for whatever you’re trying to do… Or your skill base may be different. Maybe you’re going to focus more on the analyzing of the data and the visualization of the data. But those are the four keys that I would look at for a roadmap to becoming a better data engineer or even just getting into data engineering. All that being said, I will say… I did four. Kind of draw a box around those four pillars and say, as we’re doing those, make sure we’re understanding how to secure that data, as a bonus. So, as you’re doing it, make sure you’re using security best practices and learning some of those pieces, because as we start implementing and putting these into the enterprise, we want to make sure that we’re securing that data. So, that’s all today for the Big Data, Big Questions. Make sure you subscribe, so you never miss an episode, all this awesome greatness. If you have any questions, make sure you use the #bigdatabigquestions on Twitter. Put it in the comment section here on the YouTube video or go to my blog and see Big Data, Big Questions. And I will answer your questions here. Thanks again, folks.


Show Notes

HDFS Command Line Course

Pig Latin Getting Started

Kafka

Spark

Mahout

MADlib

HBase


Filed Under: Data Engineers Tagged With: Big Data, Data Engineer Skills, Data Engineers

Big Data Beard Podcast Announcement

November 7, 2017 by Thomas Henson Leave a Comment

Big Data Beard Podcast Announcement

How do you keep up with all the news going on in the Big Data community?
Announcing the Big Data Beard Podcast, a Podcast devoted to Big Data news, architecture, and the software powering the big data ecosystem. Watch the video below to learn how I feel about the new podcast.


Transcript – Big Data Beard Podcast Announcement

Hi, folks! Thomas Henson here with thomashenson.com. Today, I’m in a different location. Looks like a construction site, right?

That’s because changes are coming. I’m building an office right now, and at some point in the future, I’m going to have a video maybe showing that office off.

With all these changes coming, I wanted to announce another big change. That’s a new podcast for you to listen to. If you follow me on Twitter, you’ve probably heard about the Big Data Beard Podcast. Look at all these tweets.

That’s a good way to keep in touch, but the Big Data Beard Podcast just released a couple weeks ago. Big Data Bears Podcast? This is going to be epic!

Beard? Check. Talks about big data? Check. The Big Data Beard Podcast is a podcast with a group of engineers I’ve been working with, and what we decided was, we should take the conversations that we have over coffee, or beers, or just at conferences, start recording those, and maybe have some guests along the way. This is a great way for you to find out what’s going on in big data and data analytics, and also a great way to get more information.

I’m all about learning, and all about finding ways to get more involved in the community, and find out what’s going on. This is a great way, in under an hour, once a week, to be able to be involved, have some information, and then even interact with us.

If you’d like to appear on the show, or you have any ideas for the show, post them in the comments here. Send them on Twitter. However you can get in touch with me, just send us those ideas, and we’ll be sure to field those questions.

Make sure you subscribe and check out the Big Data Beard Podcast. Thanks, folks!

Filed Under: Big Data Tagged With: Big Data, Big Data Beard Podcast, Data Engineers, Podcast

Kappa Architecture Examples in Real-Time Processing

October 11, 2017 by Thomas Henson Leave a Comment

Kappa Architecture

“Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

Yes, we can build models based on real-time processing, and in fact there are some you use every day….

In today’s episode of Big Data Big Questions, we will explore some real-world Kappa Architecture examples. Watch the video and find out!

Video

Transcription

Hi, folks. Thomas Henson here with thomashenson.com. And today we’re going to have another episode of Big Data, Big Questions. And so, today’s episode, we’re going to focus on some examples of the Kappa Architecture. And so, stay tuned to find out more.

So, today’s question comes in from a user on YouTube, Yaso1977 . They’ve asked: “Is it possible to build a prediction model based on real-time processing data frameworks such as the Kappa Architecture?”

And so, I think this user’s question stems from their defense for either their master’s degree or their Ph.D. So, first off, Yaso1977, congratulations on standing for your defense and creating your research project around this. I’m going to answer this question as best I can and put myself in your situation: if I was starting out and had to come up with a research project to stand for either my master’s or my Ph.D., what would I do, and what are some of the things I would look at?

And so, I’m going to base most of these around the Kappa Architecture because that is the future, right, of streaming analytics and IoT. And it’s kind of where we’re starting to see the industry trend. And so, some of those examples that we’re going to be looking for are not just going to be out there just yet, right? We still have a lot of applications and a lot of users that are on Lambda. And Kappa is still a little bit more on the cutting edge.

So, there are three main areas that I would look at to find those examples. The first one is going to be in IoT. So in your newer IoT, Internet of Things, workflows, you’re going to start to see that. One of the reasons we’re going to see that is because there are millions and millions of these devices out there.

And so, you can think of any device, you know, whether it be from a manufacturer that has sensors on manufacturing equipment, smart cards, or even smartphones, with information from multiple millions of users all streaming back in, and doing some kind of prediction modeling, some kind of analytics, on that data as it comes in.

And so, on those newer workflows, you’re probably going to start to see the Kappa Architecture being implemented in there. So, I would focus first off looking at IoT workflows.

Second, this is the tried and true one that we’ve seen all throughout Big Data since we’ve started implementing Hadoop, but fraud detection, specifically with credit cards and some of the other pieces. So, you know, look at that from a security perspective, and so a lot of security. I mean, we just had the Equifax data breach and so many other ones.

So, I would, for sure, look at some of the fraud detection around, you know, maybe, some of the major credit card companies and see kind of what they’re doing and what they have published around it. Because just like in our IoT example, we’re talking millions and millions, maybe, even billions of users all having, you know, multiple transactions going on at one time. All that data needs to be processed and needs to be logged, and, you know, we’re looking for fraud detection. That needs to be pretty quick, right? Because you need to be able to capture that in the moment that, you know…Whether you’re inserting your chip card or whether you’re swiping your card, you need to know whether that’s about to happen, right?

So, it has to be done pretty quickly. And so, it’s definitely a streaming architecture. My bet is there’s some people out there that are already using that Kappa Architecture.

And then another one is going to be anomaly detection. I’m going to break that into two different ones. So, for anomaly detection, one is about security from insider threats. So, think of being able to capture, you know, insider threats in your organization, people who are maybe trying to leak data or trying to give access to people that don’t need to have it. Those are still things that happen in real-time. And, you know, the faster that you can make that decision, the faster that you can predict that somebody is an insider threat, or that they’re doing something malicious on your network, the quicker and the less damage is going to be done to your environment.

And then, also, anomaly detection from manufacturers. So, we talked a little bit about IoT, but also look at manufacturing. There’s a great example there. And I would say that, you know, for your research, one of the books that you would want to look into is Introduction to Apache Flink. There’s an example in there from the manufacturer Ericsson, who implemented the Kappa Architecture. And what they have is… I think it’s like 10 to 100 terabytes of data that they’re processing at one time. And they’re looking for anomaly detection in that workflow to see, you know, are there sensors, are there certain things happening that are out of the norm, so that maybe they can stop a manufacturing defect or predict something that’s going to go wrong within their manufacturing area, and then also, you know, externally, when the users have their devices, be able to predict those too?
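
To make the anomaly-detection idea concrete, here’s a minimal rolling z-score check in plain Python, the kind of per-event test a Kappa-style streaming job (a Flink operator, say) would apply to sensor readings. The window size, threshold, and readings are all invented for the sketch:

```python
from collections import deque
from math import sqrt

def detect(stream, window=20, threshold=3.0):
    """Flag readings more than `threshold` standard deviations away
    from the rolling mean of the last `window` readings."""
    history = deque(maxlen=window)
    anomalies = []
    for i, x in enumerate(stream):
        if len(history) >= 2:
            mean = sum(history) / len(history)
            std = sqrt(sum((v - mean) ** 2 for v in history) / len(history))
            if std > 0 and abs(x - mean) / std > threshold:
                anomalies.append((i, x))
        history.append(x)
    return anomalies

# invented sensor readings with one obvious spike
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.2, 50.0, 10.1]
print(detect(readings, window=5, threshold=3.0))  # → [(6, 50.0)]
```

A real streaming job would run the same logic per key (per sensor, per card, per user) instead of over a single list, but the decision at each event is the same.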

So, those are the three areas that I would check, definitely check out the Introduction to Apache Flink, a lot of talk about the Kappa Architecture. Use that as some of your resources and be able to, you know, pull out some of those examples.

But remember, those three areas that I would really key on and look at are IoT and fraud detection. So, look at some of the credit companies or other fraud detection. And then also, anomaly detection, whether it be insider threats or manufacturers.

So, that’s the end of today’s episode for Big Data, Big Questions. I want to thank everyone for watching. And before you leave, make sure that you subscribe. So, you never want to miss an episode. You never want to miss any of my Big Data Tips. So, make sure you subscribe, and I will see you next time. Thank you

Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, IoT, Kappa

Big Data Big Questions: Learning to Become a Data Engineer?

September 22, 2017 by Thomas Henson 2 Comments

Learning to Become a Data Engineer

Data Scientist has for the past few years been named the sexiest job in IT. However, the Data Engineer is a huge part of the Big Data movement, and is one of the top paying jobs in IT. On average the Data Engineer can make anywhere from $90K – $150K a year.

Data Engineers are responsible for moving large amounts of data, administering the Hadoop/Streaming/Analytics Cluster, and writing MapReduce/Spark/Flink/Scala/etc. jobs.

With all this excitement for Data Analytics and Data Engineers, how can you get involved in this community?

Ready to learn tips for becoming a Data Engineer? Check out this video for tips on becoming a Data Engineer.


Transcript

Hi Folks, I’m Thomas Henson, with thomashenson.com, and welcome back to another episode of Big Data, Big Questions. Today’s question is: What are some tips for learning to become a better data engineer? Find out more right after this.

So, today’s episode is all about tips for learning to become a better data engineer. So, if you’re watching this, you’re probably concerned with, one, how can I start out becoming a data engineer? What are some ways that I can learn to become better? Or maybe you’re just looking to answer one specific question. But all those are encompassed in what we call the data engineer.

A data engineer is somebody who’s concerned with moving data in and out of the Hadoop ecosystem, and with giving data scientists and data analysts better views into the data. So, we’re involved with the day-to-day interactions of how that data is coming in, how we’re ingesting that data, and how we’re creating and tuning those applications so that the data comes in faster. All to support those business analysts, those business decisions, and data scientists in creating better models and having just more data to put their hands on.

And so, a lot of times what we’re always doing is we’re asked to take on a couple terabytes of data here, maybe implement and do all the configuration for your Hive. You know, your Hive implementation or HBase or anything that’s in that big data ecosystem. Some of the tips that I’ve found for just getting started, so if you’re brand new to this and you don’t know where to start: the first thing I would recommend is, go out and just download the sandboxes.

So, download Cloudera’s sandbox, or download Hortonworks’ sandbox, and just start playing with it. Go through some of the tutorials. Stand it up on your local machine in a VM environment, and just start playing with moving some of the data around. Find some sample data, so go to data.world. Also, I have a post and a video on where to find some data sets, so take those data sets and start ingesting those. I have a ton of resources and a ton of material on just some simple examples that you can walk through with Pig, and some around Hive. So, go there and find some of those. But, basically what I’m saying is, just get hands-on. Start creating applications. Start trying to do some simple things: ingest some data, put it into Hive, create a table and pull some of that data back out with maybe some simple Hive queries. And do the same thing with Pig, and just kind of go around to some of those applications that you’re curious about, and start playing with them.
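If you don’t have a sandbox handy yet, you can still practice that same ingest-then-query loop locally. This sketch uses Python’s built-in sqlite3 as a stand-in for Hive (the table and rows are made up for illustration; HiveQL would look very similar):

```python
import sqlite3

# Stand up a throwaway database -- a local stand-in for a Hive warehouse.
conn = sqlite3.connect(":memory:")

# "Ingest" some sample rows. In Hive this would be CREATE TABLE plus LOAD DATA.
conn.execute("CREATE TABLE sales (model TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Model S", 120), ("Model X", 80), ("Model S", 95)])

# Pull some of that data back out with a simple query.
total = conn.execute(
    "SELECT SUM(units) FROM sales WHERE model = 'Model S'").fetchone()[0]
print(total)  # → 215
```

The point isn’t the tool, it’s building the habit of creating tables, loading data, and querying it back out.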

Another thing is, once you start playing, and sampling, and testing that data, get involved. By getting involved, just ask some questions, create a blog post, try to find a way that you can contribute back to the community. I mean, that’s what I did when I was first starting out. I started off with a sandbox, and I made sure that every day for 30 minutes, I was learning something new in the Hadoop ecosystem. And so, that’s another tip for you too: try to do this 30 minutes a day, every day. Even Saturdays, Sundays. Don’t take a day off. And it’s only 30 minutes. And if it’s something that you’re passionate about, and you like doing, that time is just going to fly by. But over time, that’s just really going to give you more and more time in the Hadoop ecosystem. So, whether you’re doing this for a project at work, or whether you’re already in the ecosystem and you’re just trying to improve, that 30 minutes a day is really going to help. And it’s something that I’ve continued to do, even now, though I’ve been part of the community for three or four years. It’s how I just continue to learn, so I make sure I’m always kind of pushing.

Filed Under: Career Tagged With: Big Data, Big Data Big Questions, Data Engineer

Bound vs. Unbound Data in Real Time Analytics

August 9, 2017 by Thomas Henson Leave a Comment

Bound vs. Unbound Data

Breaking The World of Processing

Streaming and Real-Time analytics are pushing the boundaries of our analytic architecture patterns. In the big data community we now break down analytics processing into batch or streaming. If you glance at the top contributions, most of the excitement is on the streaming side (Apache Beam, Flink, and Spark).

What is causing the break in our architecture patterns?

A huge reason for the break in our existing architecture patterns is the concept of Bound vs. Unbound data. This concept is as fundamental as the Data Lake or Data Hub and we have been dealing with it long before Hadoop. Let’s break down both Bound and Unbound data.

Bound Data

Bound data is finite and unchanging data, where everything is known about the set of data. Typically Bound data has a known ending point and is relatively fixed. An easy example is last year’s sales numbers for the Tesla Model S. Since we are looking into the past we have a perfect timebox with a fixed number of results (number of sales).
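That perfect timebox can be expressed directly in code. A small sketch with made-up sales figures shows why Bound data is easy to reason about: the set is complete, so the answer can never change.

```python
from datetime import date

# Hypothetical, fixed record of last year's sales -- the data set is complete,
# which is the hallmark of Bound data.
sales = [
    (date(2016, 3, 14), 210),
    (date(2016, 7, 2), 340),
    (date(2016, 11, 20), 275),
]

# The timebox is known up front...
start, end = date(2016, 1, 1), date(2016, 12, 31)

# ...so the aggregate is a one-shot computation over a finite set.
total_units = sum(units for day, units in sales if start <= day <= end)
print(total_units)  # → 825
```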

Traditionally we have analyzed data as Bound data sets, looking back into the past and using historic data sets to find patterns or correlations that can be studied to improve future results. The timeline on these future results was measured in months or years.

For example, testing a marketing campaign for the Tesla Model S would take place over a quarter. At the end of the quarter, sales and marketing metrics are measured, deeming the campaign a success or failure. Tweaks for the campaign are implemented for the next quarter and the waiting cycle continues. Why not tweak and measure the campaign from the first onset?

Our architectures and systems were built to handle data in this fashion because we didn’t have the ability to analyze data in real-time. Now with the lower cost for CPU and explosion in Open Source Software for analyzing data, future results can be measured in days, hours, minutes, and seconds.

Unbound Data

Unbound data is unpredictable, infinite, and not always sequential. The data creation is a never ending cycle, similar to Bill Murray in Groundhog Day. It just keeps going and going. For example, data generated on a Web Scale Enterprise Network is Unbound. The network traffic messages and logs are constantly being generated, external traffic can scale up and generate more messages, remote systems with latency could report non-sequential logs, etc. Trying to analyze all this data as Bound data is asking for pain and failure (trust me, I’ve been down this road).
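An Unbound source is easy to model in code with a generator that never terminates on its own. This sketch simulates a never-ending stream of network log messages (the fields and sizes are invented for illustration); notice that you can never "finish" reading it, you can only take a window of it:

```python
import itertools
import random

# An Unbound source: this generator never terminates on its own,
# like logs coming off a busy enterprise network (values are simulated).
def network_logs():
    for msg_id in itertools.count():
        yield {"id": msg_id, "bytes": random.randint(100, 1500)}

# There is no end to wait for -- processing happens over a window.
window = list(itertools.islice(network_logs(), 5))
print(len(window))  # → 5
```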

Bound vs. Unbound

Our world is built on processing unbound data. Think of ourselves as machines and our brains as the processing engine. Yesterday I was walking across a parking lot with my 5-year-old daughter. How much Unbound data (stimuli) did I process and analyze?

  • Watching for cars in the parking lot and calculating where and when to walk
  • Ensuring I was holding my daughter’s hand and that she was still in step with me
  • Knowing the location of my car and path to get to car
  • Puddles, pot holes, and pedestrians to navigate

Did all this data (stimuli) come in concise and finite fashion for me to analyze? Of course not!

All the data points were unpredictable and infinite. At any time during our walk to the car more stimuli could be introduced (cars, weather, people, etc.). In the real world all our data is Unbound and always has been.


How to Manage Bound vs. Unbound Data

What does this mean? It means we need better systems and architectures for analyzing Unbound data, but we also need to support those Bound data sets in the same system. Our systems, architectures, and software have been built to run Bound data sets, ever since the 1970s when relational databases were built to hold collected data. The problem is that in the next 2-4 years we are going to have 20-30 billion connected devices, all sending data that we as consumers will demand instant feedback on!

On the processing side the community has shifted to true streaming analytics projects with Apache Flink, Apache Beam, and Spark Streaming, to name a few. Flink is a project showing strong promise of consolidating our Lambda Architecture into a Kappa Architecture. By switching to a Kappa Architecture, developers and administrators can support one code base for both streaming and batch workloads. Not only does this help with the technical debt of managing two systems, but it eliminates the need for multiple writes for data blocks.
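The Kappa idea in miniature is a single processing function that serves both a Bound (batch) input and an Unbound (streaming) input. This pure-Python sketch is only an illustration of that one-code-base principle, not how Flink itself is written:

```python
import itertools

# One processing function for both worlds: it consumes any iterable of events
# and yields a running total as each event arrives.
def running_totals(events):
    total = 0
    for value in events:
        total += value
        yield total

# Batch: a finite, Bound data set.
batch = [3, 1, 4]
print(list(running_totals(batch)))  # → [3, 4, 8]

# Streaming: an Unbound source, consumed a window at a time.
stream = itertools.count(1)          # 1, 2, 3, ... forever
print(list(itertools.islice(running_totals(stream), 3)))  # → [1, 3, 6]
```

Because the same function handles both inputs, there is one code base to test and maintain, which is exactly the technical-debt win the Kappa Architecture promises.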

Scale-out architectures have provided us the ability to quickly expand to meet demand. Scale-out is not just Hadoop clusters that allow for Web Scale, but the ability to scale compute-intensive workloads separately from storage-intensive ones. Most Hadoop clusters are extremely CPU top-heavy because each time storage is added, CPU is added as well.

Will your architecture support 10 TBs more? How about 4 PBs? Get ready for the explosion in Unbound data…

Filed Under: Streaming Analytics Tagged With: Big Data, Real-Time Analytics, Streaming Analytics, Unstructured Data

