Data Engineers Archives - Thomas Henson

Tips & Tricks for Studying Machine Learning Projects

February 16, 2021 by Thomas Henson

How to Study Machine Learning Through Projects

Studying Machine Learning can seem overwhelming! Over our careers as developers or technologists we are constantly having to learn new skills, whether you are a Database Administrator who needs to learn about Hadoop or a Web Developer looking to learn JavaScript. Change is inevitable and the way to change is through learning. In fact, many developers in the community advocate for making learning a regular habit of 1 – 2 hours every week. In today’s episode of Big Data Big Questions we explore my tips and tricks for learning Machine Learning (ML) or any other new technology.

Tips and Tricks for Studying Machine Learning

Make sure to watch the full video where I break down my tips and tricks for learning Machine Learning.

Want More Data Career Tips?

Sign up for my newsletter so you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Career questions from the community.

Kubernetes vs. Hadoop Career Growth

November 24, 2019 by Thomas Henson

Comparing Growth in Kubernetes & Hadoop

How do you select a career path in Data Engineering? There are so many options for technologies to learn for System Administrators in Data Engineering. While Hadoop has been king for the past decade, we must make way for a higher-level technology. Kubernetes (K8s) is a distributed computing technology but vastly different from Hadoop on the data processing side. K8s is an open-source technology for automating, managing, and scaling containers. When comparing it to Hadoop, think of the Systems Administration side of Data Engineering, not the Software Engineering side.

The question is: how does the K8s ecosystem compare to that of Hadoop?

Should I add K8s to my list of must-learn technologies?

https://youtu.be/FZP-OJyZf0Q

Kubernetes vs. Hadoop Transcript

Hi, folks. Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, in this episode we’re going to be talking and breaking down Kubernetes versus Hadoop and talking about specifically which one I would prefer, if I was starting out today, to learn as a data engineer. Before we even get into it, I’m not saying that these are the same technologies, but I am comparing the popularity and I’m also comparing what some of the innovations are that we want to study as data engineers, or just people in the industry as a whole, where do we see those markets. So, let’s jump right into that right after this.

All right, so, today’s question comes in… We’re going to talk a little bit about the popularity of Kubernetes specifically, and Hadoop, and where we’re at. One of the biggest things right now is the popularity of Kubernetes in the container world has eclipsed the popularity of Hadoop and the Hadoop ecosystem. If we were looking at a chart, we would see the number of contributions and development to those open source platforms, that Hadoop has just been eclipsed by it. Now, Kubernetes is not replacing Hadoop, but it is changing the way… And there are innovations in Hadoop that are taking advantage of containers and specifically Kubernetes. So, let’s break it down and talk a little bit about each one of these and then we’ll do a comparison of where we see it going for data engineers and then also provide recommendations for which one I would choose if I was starting fresh today.

Kubernetes is an open source orchestration system for automating application deployment, scaling, and management. It was originally designed by Google. Hmmm. That sounds familiar, right? [Laughs] So many things in the open source world come from there. But if you think about Kubernetes… If you’ve done anything with containers, and specifically around Docker, Kubernetes is that orchestration, that layer that allows for you to cluster these together. Think of how YARN works in the Hadoop world. We’ll talk a little bit about Hadoop here soon. Think of it as just being able to orchestrate, “I have all these different nodes that are deploying my application or running different portions of my application.” Kubernetes is the secret sauce to say, “Hey, I need three of these, or four of these,” and be able to not only do some of the load balancing and some of the other pieces, but the orchestration to scale those up and make them elastic and scale them down.
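To make the “I need three of these, or four of these” idea concrete, here is a minimal sketch in Python of a desired-state loop. It is only an illustration of the concept, not Kubernetes code: the ToyCluster class, the reconcile function, and the app-N names are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster; this is not a real Kubernetes API."""
    running: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def start_one(self) -> None:
        # Pretend to launch one more copy of the application.
        self.running.append(f"app-{next(self._ids)}")

    def stop_one(self) -> None:
        # Pretend to shut down the most recently started copy.
        self.running.pop()


def reconcile(cluster: ToyCluster, desired: int) -> None:
    """Start or stop copies until the running count matches the desired count."""
    while len(cluster.running) < desired:
        cluster.start_one()
    while len(cluster.running) > desired:
        cluster.stop_one()


cluster = ToyCluster()
reconcile(cluster, 3)  # "I need three of these"
print(cluster.running)
reconcile(cluster, 1)  # scale back down when demand drops
print(cluster.running)
```

The point of the sketch is the shape of the loop: you state how many copies you want, and the orchestrator keeps adjusting what is running until it matches.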

Kubernetes is synonymous with cloud-native. What’s going on with cloud-native? Being able to move applications from Azure to AWS and make it seamless, or on-prem. To be able to develop something on-prem and be able to push it out. Really cool, really popular. Just another open source piece that’s come out of Google. Man, thank goodness for Google and all the things that they’ve done for the open source world. But, just another way to do that orchestration. So, from a data engineer’s perspective, really cool for us because it changes how we can deploy and use our applications. Back to what we were talking about, even the cloud-native piece. Being able to deploy out, start something, from a POC perspective, to be able to start and take advantage of being able to use something on-prem or do it in the cloud and then pull it back down, that’s really cool. And then also changes that application layer, but first let’s talk a little bit more about Hadoop, then we’ll talk about how it all fits together.

So, Hadoop, we’ve talked about it a bit here. Been around for a long time, synonymous with being able to scale out and make large-scale data decisions, like, be able to have that storage layer and also be able to analyze your data too. So, think of it as a node-based architecture similar to what we were talking about with Kubernetes, little bit different, but a node-based architecture that’s going to allow for us to analyze data and be able to make decisions with it over a large cluster, like, you’re building out a huge system here. So, synonymous, originally, in the early days, with MapReduce, but it’s grown into more of a Hadoop ecosystem where we’re talking about not only that storage layer of HDFS, but also that processing layer that actually is going to allow for us to process the data on individual nodes, bring those back to the user, and be able to take advantage of all that under the covers. Also another product built and written off of research that was published by Google. Not specifically open sourced by Google, but some of the research papers that they pushed out there led to the writing of a research paper and the open source portion of Hadoop that became popular. It’s been around… If you follow this channel, you’ve heard me talk about Hadoop a good bit too.
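As a rough illustration of “process the data on individual nodes, bring those back to the user,” here is a small word-count sketch in plain Python. It only mimics the map and reduce steps on one machine; the node_blocks data and the map_block and merge_counts names are made up for this example and are not part of Hadoop.

```python
from collections import Counter
from functools import reduce

# Pretend each inner list is the slice of data stored on one node.
node_blocks = [
    ["hadoop stores data", "hadoop processes data"],
    ["kubernetes runs containers"],
    ["data data data"],
]


def map_block(lines):
    """Map step: count words locally, next to where the data lives."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Each "node" produces a partial count, then the partials are merged.
partials = [map_block(block) for block in node_blocks]
totals = reduce(merge_counts, partials, Counter())
print(totals.most_common(2))
```

In a real cluster the map step would run in parallel on the machines holding each block, and only the small partial results would travel over the network.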

Now, let’s talk a little bit about the architecture. When we’re talking about Hadoop and the architecture, we’re talking about running our data processing maybe with Spark, or Hive, or it could be Pig, or anything like that. Then you’ve got your layer that is your orchestration and that’s where YARN comes in. It’s going to manage the resources we have spread out on all these different nodes. And then your data layer comes in at HDFS. That’s where my data is stored, with an HDFS perspective. Well, what Kubernetes can do, so how it fits in the Big Data world and where we really see it is… Think of Spark, TensorFlow, any of those tools that we were talking about. Now our orchestration layer can be actually with Kubernetes. And then we can do persistent storage in our databases, S3, or there are some innovations out there that let you do it in HDFS, but you’re seeing it used more in a cloud-native perspective, so little bit different change of the architecture. The really cool thing about it is you’re actually abstracting away, so you’re not only just using YARN just for building out this cluster, building out a system, you’ll be using Kubernetes, which can also do your application development too, like, you are… I’m sorry, application deployment. So, you know, being able to build cloud-native applications that are not just for data analytics but maybe serving out your web host and some of the other pieces. So, really a lot of innovation around that and really cool. There’s a lot of stuff and I just can’t go into it. I’m trying to give you a high level from here but there’s a lot of resources, a lot of courses out there, a lot of other things that maybe we should pick up at some point to talk about, around the innovations with Kubernetes. It’s really cool. It’s really changing, it’s in flux. If anybody is saying that, “Hey, it’s always going to be this way,” it’s one of those things that’s continuing to innovate, just like the data analytics area too.
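One way to picture the two architectures side by side is as three layers each. The sketch below writes them down as plain Python dictionaries; the layer names and tools come from the discussion above, while the dictionary layout and the shared_tools helper are just an illustrative assumption, not an official reference architecture.

```python
# Three layers of each stack, as described in the episode.
hadoop_stack = {
    "processing": ["Spark", "Hive", "Pig"],
    "orchestration": "YARN",
    "storage": "HDFS",
}
kubernetes_stack = {
    "processing": ["Spark", "TensorFlow"],
    "orchestration": "Kubernetes",
    "storage": "S3 or databases",
}


def shared_tools(first, second):
    """Return the processing tools that appear in both stacks."""
    return set(first["processing"]) & set(second["processing"])


# Print each layer side by side to see what changes and what carries over.
for layer in hadoop_stack:
    print(f"{layer}: {hadoop_stack[layer]} -> {kubernetes_stack[layer]}")

print("carries over:", shared_tools(hadoop_stack, kubernetes_stack))
```

The orchestration and storage layers are what change, while a processing tool like Spark can sit on top of either one.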

At the end of the day, I guess the question is, if I were starting out today, would I focus solely on Hadoop, would I focus solely on Kubernetes if I can only choose one? Which, you can never only choose one, but I appreciate and I love these types of questions too because they really push me to make a decision. So, today… One of the things, the biggest thing… Cloud is a huge topic and being able to do things cloud-natively, so being able to support it on-prem, being able to support it and push it out to different multi-cloud… There’s so many different topics and buzzwords in that, but if we really look at that decision making, Kubernetes is really one of the big portions around that, and that has a huge impact on what we’re doing from a data analytics perspective.

And frankly, because Hadoop isn’t so easy to use in the cloud, I think that’s one of the reasons that we’ve seen Hadoop wane in its growth. We’ve seen Hortonworks and Cloudera merge together, MapR be picked up and purchased, and then also IBM’s BigInsights, because of the fact that these were systems that would only work on-prem. You had other options in other cloud perspectives, but AWS had their version versus Azure had their version under the covers, but if you wanted to really pull it back into your own on-prem area or push it out, it was a little more complicated, and you couldn’t just move it from AWS to Azure. Kubernetes has really pushed that, not just from the analytics world, but from that perspective.

So, if you made me choose today and you said, “Hey, man, you can only choose one and it’s something that you’ve got to get skilled up on in the next three to six months,” I’d choose Kubernetes. Not saying that I would not learn Hadoop from that perspective, but if I had to choose between the two of those right now, I think there’d be a bigger opportunity for data engineers and specifically systems administrators and those kinds of people that are more hands-on with the administration piece. I think that’s where we’re going to see a lot… And you’ve seen a lot with the open source tools out there in the Hadoop ecosystem, like Spark and some of the things going on with Project Submarine, just being able to support containers.

That’s all I have today for Big Data Big Questions. If you have a question make sure you put it in the comment section here below or reach out to me. I’ll do my best to answer those questions here, on the next episode of Big Data Big Questions.

Want More Data Engineering Tips?

Sign up for my newsletter so you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

O’Reilly AI Conference London 2019

October 9, 2019 by Thomas Henson

The Big Data Big Questions show is heading to London for the O’Reilly AI Conference October 15 – 17, 2019. I’m excited to be a part of the O’Reilly AI Conference series. In fact, this will be my third O’Reilly AI conference in the past year. Let’s look back at those events and forward to London.

San Jose & New York


Late night packing my conference gear for my trip to O’Reilly AI Conference this week. Most important items: 1️⃣ Stickers 2️⃣ 🎧 3️⃣ 💻 4️⃣ Bandages? (I’ll explain later) 5️⃣ 📚 (this week it’s my Neural Networking) What’s your list of must-have gear for tech conferences? #programming #coding #AI #conference #techconference

A post shared by Thomas Henson (@thomas_henson) on Sep 5, 2018 at 5:09am PDT

First, in 2018 I attended the San Jose conference, where I spent a good portion of the time in the Dell EMC booth talking with Data Engineers and Data Scientists. One of the major themes I heard from Data professionals was that they were attending to learn how to incorporate TensorFlow into their workflows. In my opinion TensorFlow was talked about in every aspect of the conference. We had a blast learning from attendees and discussing how to Scale Deep Learning Workloads. Also, this was my first time attending a conference with 14 stitches in my left hand (trouble on the pull-up bar)!

Next was O’Reilly AI New York. Forever this conference will be known in my head as the Sofia the Robot trip. During this conference I worked with Sofia the Robot not only at the conference but also at a Dell EMC event at Times Square Studios (part of the Dell Technologies Magic of AI Series). Before the Magic of AI event, Sofia and I spent the day recording with O’Reilly TV about the current state of AI and what’s driving the widespread adoption. After a day of recording, I had a keynote for day two of the O’Reilly AI Conference where I discussed how AI is already impacting future generations. Then there was a whirlwind of activity as Sofia the Robot took questions at the Dell Technologies booth. The last thing of the day was the Magic of AI event at Times Square Studios, where we had 100 people taking part in a question and answer session with Sofia the Robot.

Keynote O’Reilly AI Conference New York

Coffee with Sofia the Robot

https://youtu.be/KbBvdoUOpmY

On To London

Next up is O’Reilly AI London. To say I’m excited is an understatement. During this trip I will accomplish many firsts.

To begin with, it’s my first international conference along with my first time in London. So many things to see and so little time to do it. Feel free to give me suggestions about places to visit in the comment section below.

Second, at O’Reilly AI London I will give my first breakout session at an O’Reilly Conference. While I’ve been on O’Reilly TV and given a keynote, I’ve yet to have a breakout session. My session is titled AI Growing Pains: Platform Considerations for Moving from POC to Large-Scale Deployments. The world is changing to innovate and incorporate Artificial Intelligence in many applications and services. However, with all this excitement many Data Engineers are still struggling with how to get projects past the Proof-of-Concept (POC) phase and into Production. Production environments present a list of challenges. The 3 biggest challenges I see when moving from POC to Production are the following:

The gravity of data is just as real as the gravity in the physical world. As Deep Learning workloads continue to grow, so does the amount of data stored to train these models. The data has gravity that will attract services and applications to the data. The trouble here is making sure you have the correct data pipeline strategy in place.
Once I had dinner with one of the Co-founders of Hortonworks, during which he said “Everything at scale is exponentially harder.” Have you ever moved around photos on your desktop? For the most part this is an easy task, except when you accidentally move a large set of photos. Instantly after moving these large folders you are endlessly waiting for the hourglass to finish. Imagine doing this with 10 PBs of data. I think you get the picture here.
The talent pool today compared to the early days of “Big Data” is much larger. However, the demand for skills in Deep Learning, Machine Learning, and Data Engineering is stressing the system, which still leaves a skills gap for experienced engineers with Deep Learning and Machine Learning skills. The skills gap is one huge factor in why many projects get stuck in the POC phase instead of moving into production.

If you would like to know more about moving projects from POC to Production, make sure to check out my session if you are attending O’Reilly AI Conference in London. AI Growing Pains: Platform Considerations for Moving from POC to Large-Scale Deployments @ 11:55 on October 16, 2019.


Why Data Engineers Should Blog

July 16, 2019 by Thomas Henson

Blogging For Data Engineers?

How important is it for Data Engineers to have a blog? In this episode of Big Data Big Questions I talk about the importance of building a blog in your career in Data Engineering, Data Analysis, or Data Science. Learn my thoughts on why every Data Engineer should have a blog in the video below.

Transcript – Why Data Engineers Should Blog

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of…

Big Data Big Questions. Today’s question, I thought I would take a topic that I’ve seen and keeps coming up in some of my videos, and really dig down into it. Maybe this is going to be a multi-part series, but we’re going to talk about starting a blog to build your brand as a data engineer, data scientist, or if you’re watching this and you’re just a technologist or somebody that just wants to do book reviews, trust me, there’s going to be some topics in here that are generalized for everybody, but it really shows you how to key in on your field.

Before we jump into that, though, I want to say, if you have any questions, put them in the comment section here below. This is where I find content to make sure I’m interacting with the community and answering the questions that you want. It also gives me an idea. Hey, there’s enough people that ask a question or interested in a certain topic, and I haven’t done any research on it, gives me an opportunity to study and see what’s going on. This is all about being a community here. Reach out to me on thomashenson.com/big-questions if you don’t want to put it in the comment section here below. I’ll do my best to answer those quick as I can.

Today, I want to talk about why you should start a blog as a data engineer, or data scientist, or if you’re a web developer, and you’re watching this, or anything. I think it’s very important. In 2019, should you start a blog? I think so. I don’t think it’s something that is going away. Just because I say start a blog, you don’t have to start a blog and just write. You can start a vlog. I think you definitely should have your own domain. I bought thomashenson.com. It cost me, I think, $12 a month. No, $12 a year, but it’s like, hosting and everything like that can be really, really cheap. I wouldn’t worry about that. It’s really important. I’m going to talk a little bit first about my journey and why I started a blog.

When I got my first job, like I said, I’ve talked about it before, I was a web developer. One of the things where I was working at, we weren’t really embracing open source. We were using it, but we weren’t really contributing, and it was frowned upon for us to actually have any code to be able to show or anything like that. One of the things, I didn’t really think about it at the time, but you get a couple years into your role, and you might get opportunities to interview at other places, to do other things, and one of the things that came up that was a real hole when I was going through the interview process was, I didn’t have any example code or anything like that I could show. I wasn’t involved in the open source community outside of work, and I didn’t have my code. It was my company’s property, and there were some other pretty big reasons I couldn’t, I didn’t have anything I could point to and show. That got me thinking. I don’t have anything that really captures the work and some of the things that I do. Then, at this time, too, I’d already embraced trying to do at least 30 minutes a day, or maybe even four times a week getting 30 minutes in of learning new things. I had all these ideas and all these things that I was going through and learning in the process, but I could only talk about them, on a whiteboard or from a resume perspective, but I didn’t really, couldn’t really show. Couldn’t let it stand on its own. That’s where I started really looking into blogging. I was like, “Man, maybe I should start a blog.” Start a blog, didn’t really know what I was going to do with it. If you go back and look at some of my early posts, it was like, “You know, I’m doing this, and I’m starting a business!” It really wasn’t a business, it was just me writing. As I started writing, I started talking about some of the things I’ve learned. I would go through and look, and be able to create articles around something I’ve learned, maybe even create some test projects.

A lot of that, they weren’t very good when I started. It can be an opinion thing if they’re good now, but I definitely know that I’ve improved, and I feel like that, but I think it’s something that really helped me and really focused me, too. Like I said, I was a web developer. You’ve all heard my story before, about when I became a data engineer, and jumped into the Hadoop area. I had that platform, and I had already been practicing doing some of the blogging and stuff like that. It was really easy for me, as I was going through, and learning, and learning things that other people wanted to see, to be able to start writing Pig Latin tutorials, Hive, and what I’m doing with HBase and HDFS, and just general tips of things that I learned. It was like, strengthening that muscle, and it really helped me just accelerate just in being a part of the community as well, too. That’s my journey. That’s one of the main reasons that I’m so big on it, is because I came from that area, where I didn’t have anything that I could point to and say, “Hey, look. These are all the cool things that I’m doing.”

That’s why I started a blog, but why do I think that you should? What should your story be? Your story, you’re still writing it. You should write it on a blog. I really think it’s something that’ll help you build out your brand, and I think it’s always something good that shows, one, you’re interactive in the community. It keeps you honest and keeps you motivated, too. It’s late at night. I didn’t really want to have to record any videos. I wanted to put it off. I have an audience. I have a schedule, and I try to keep content coming out. This made me come out to the office, and make sure that I got on camera, and was able to create content here, too. The same thing with your blog. If you create a blog, say you create a schedule, and you’re like, hey. I mean, I’ve done this before. I’m going to publish once every month. When you’re first starting out, you feel horrible when you don’t. I missed quite a few months. It took me a long time before I published every month. I just really wasn’t consistent. It’ll keep you honest about learning. It’ll keep you honest about creating content and being a part of that community, too. I really think that it’s good at any stage in your career, but especially if you’re watching this channel, and you’re trying to figure out, “Where do I get started? What are some things that I should be doing?” You’ve probably heard me say it a ton of times. Start creating something to be a part of the community. I’m not saying go out and… We’ll have a longer session about how to start blogging and how to find, how to create your own content. I’m not saying go out and borrow people’s content or anything like that and put it as your own. There’s a definite way that you can do a lot of different things. I’m going to end this video this time, but maybe this is, we’ll just call this part one. I definitely think we should dig into how to start that blog, some content ideas, but I think today just kick around the idea, just think about it, start churning, start kicking those around in your head, and then we’ll talk, and follow up later on with some content ideas. I’ll show you how to set up on, I think, I used DreamHost, but there’s a ton of other places out there. It’s something simple that you can set up in 10 minutes, and if you’re using [Inaudible 00:07:18] you can start publishing some of your own content, having your own audience, heck, you can put it in the comment section here below, to build, and we can use our audience to help everybody push their content out there. We can all support each other as well, too.

That’s all I have for today. Like I said, I’m going to follow up. I really like this idea, here. If you have some comments, or you think it’s a bad idea to start a blog in 2019, which I don’t think it is, but I’d love to hear your opinion. All opinions are welcome, so, thanks again, and I will see you next time on Big Data Big Questions.

[Music]

Data Engineers: Python VS. C#

June 18, 2019 by Thomas Henson

Which Is Better Python Or C#?

Getting into wars over different programming languages is a no-no in the world of programming. However, I recently had a question on Big Data Big Questions about which is better for Data Engineers, Python or C#. So, in the spirit of examining the difference through the lens of Data Engineering, I decided to weigh in.

Python has long been used in Data Analytics for building Machine Learning models. C# is an object-oriented programming language developed by Microsoft and used widely in all ranges of applications. Both have a ton of community support and a large user base, but which one is better? In this episode of Big Data Big Questions I break down both Python and C# for Data Engineers. Make sure to watch the video to find out my thoughts on which is better in Data Engineering.

Transcript – Data Engineers: Python VS. C#

Hi folks! Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question, we’re going to do a comparison between Python and C#. It’s a question that I’ve had coming in, and it’s also something that’s a passion of mine, because I used to be a C# developer back in the day. Then, I’ve currently, I guess the last four or five, maybe six years, I’ve learned Python. I thought it would be good to go through some of that, especially if you’re just starting out. Maybe you’re in high school, or maybe you’re in college, or maybe you’re even looking to make a jump into data engineering or machine learning engineering, and you’re like, “Hey, man, there’s C# out here. There’s Python.” What are some of the differences? What should I learn? Find out right after this.

Today’s episode of Big Data Big Questions, I wanted to do some of the differences between Python and C#. First thing, we’ll start off with C#. C#, heavily developed by Microsoft. I think it was released in 2000. It’s an object-oriented programming language. You see it a lot. I used it, for instance, back when I was doing ASP.NET. There’s a lot of things that you can use it for. It relies on the .NET framework. You have to have the .NET framework to be able to go. They are in version 7.0. Primarily used, I used it a lot for web application development, but you can do a lot of different things with it, build out really complex and awesome applications, whether it be a desktop application, whether it be web, mobile, they’ve just got so much of a community that, there’s a lot of different things that you can do with it. Another thing, too, one of the comparisons to it is, it looks just like Java. Another reason I gravitated to it, because one of my first languages I learned, I think I learned VB first, but I did a lot of stuff around Java, and actually when I graduated out of college, I thought I was going to be a Java developer for a long time. Really got ingrained in that community there. Fast forward to being a web developer, and transitioning to C#, it was a really natural process for me. Like I said, heavy community, heavy packages, and frameworks, and things to be able to use. See it a lot with Microsoft. If you’re doing C#, you’ve probably used Visual Studio or I think it’s VS Code. They’ve got a couple different IDEs for development and everything like that. See it a lot there.

Python. If you’ve been following this channel, you’ve probably seen a ton of videos that we’ve done around Python. Python was developed in 1991. It’s in version three. We talked about C# being in version 7. Python’s in version 3. I wouldn’t put a lot into that, because we talked about C# being in 2000 and Python’s been around since 1991. Heavily evolved, both of them. It’s object-oriented, just like C#. Also, you see it a lot used in, for sure, data analytics, but there’s a lot of different other frameworks that you can use to do web development. Pretty much, you can do anything you want with Python. You do have to install Python and have that running in your environment. Sometimes that can be a little bit clunky, especially maybe in a Windows environment, but it’s something that you can download and start playing with, and have going on your machine. Man, probably in less than five minutes. Maybe I should do a video on that, but you can go ahead, download that, and be up and running, and start running your own code. Huge community support. There’s a ton of things out there for it. Like I said, talked about, I think even in our book review, we talked about some of the books for data engineers. I think there were two to three Python books that I had showed there, too. Heavy use there. Like I said, a lot of involvement from data analytics, whether it be data scientists or machine learning engineers, and just like with TensorFlow, or PyTorch, a lot of the deep learning frameworks that we’ve talked about on this channel have Python APIs.

The question is, you’re a data engineer just starting out, which one should you learn? I’m going to go through three different questions, where we’re going to talk about what you should learn, and which one is better? I hate doing which one is better, because each one is a different tool, and some tools are better at other things, or have more functionality to do certain tasks. Let’s jump right into it.

Which one is easier to learn? Err! I’m having to put myself in there, because I’m biased as far as C# and just having been a part of that community. Like I said, my first language being in VB, which was similar, and then a ton of work in Java. C# on the premise looked a little easier, but the way I’m going to do this criteria is which one do I think is easier to get up and get started from a data engineering perspective or data analytics perspective. I’m going to have to give it to Python. Like I said, can be a little clunky when you’re first installing it, but if you were just able to open up a Linux, build out a Linux machine, you can do, especially if you’re in the red hot, and you can do Yum install Python, and then you can start scripting away on some of the code there. Then, also, too, I’ll give it Python just from the perspective of a lot of things from a data analytics perspective. Number two, I’m a data engineer. I’m a machine learning engineer. Which one should I learn today? Which one would I start off with if I had to choose, only choose between Python and C#? I would probably go with Python, right? Go ahead and learn Python. I would encourage anybody watching this channel, jump into that community. There’s a ton of books out there. We’ve talked about on this channel where you can go, and learn how to do data analytics from that perspective. Python’s going to get the win there. Which one do I enjoy coding in more? Personal preference, man, I think C# will always have that win for me. Like I said, this is a data engineering channel, but like I said. I started off as a web developer. I really like VisualStudio, and I know there’s some plugins you can do with VS Code. You can use that as your IDE for Python and everything like that. There’s something about C# and that language that I really found comfortable and probably will always have a special place in my heart. 
Like I said, just coming from a Java perspective and everything like that, I’ll give that the win. The overall win between the three categories, if you’re a data engineer, a machine learning engineer, you have to start somewhere, I’d say start with Python. Go through some of the tutorials. Got some on this channel. I’ve got some on my blog, but get started there. I hope you enjoyed this. Tell me what you think. Did I miss something on the differences? Would you have chosen C# as something to start off with? Do you like Python better than C#? Like I said, C# has a special place in my heart. Let me know in the comments section below, or if you have any questions you want me to answer on the show, put them in here, and then make sure you subscribe and ring that bell, so you never miss an episode of Big Data Big Questions.

Data Engineer in 2019

May 31, 2019 by Thomas Henson Leave a Comment

What’s a Data Engineer Career Like in 2019?

Times change and keeping up with maintaining skills while managing day to day projects can be exhausting.

Should I learn Hive or TensorFlow?

Which is better, Flink or Spark?

How as a Data Engineer will I focus on Containers?

Questions like these come up all the time when I am speaking with aspiring and career-focused Data Engineers. Find out my thoughts around skills and career outlook for Data Engineers in 2019 on this episode of Big Data Big Questions.

Transcript – Data Engineer in 2019

Hi folks, Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question comes in around, what does data engineering in 2019 look like? What are some of the trends? What are some of the things that are going on? This question came in from the comments section here on YouTube, so if you have a question, make sure you put it in the comments section here below or reach out to me on thomashenson.com. And, I’ll discuss them in an orderly fashion as they come in, provided I have the time. I’ve been getting a ton of questions, so I really appreciate it. Thank you for this community here.

Today’s question, before we jump into it, I want to give you my three top trends to watch for in 2019. Before we did, I did want to credit with an article that they did for their 10 trends in big data. I talked about them on my YouTube live session, so if you’re ever around Saturday mornings, jump on. Throw me a question in the chat. Let’s get to cracking. I try to answer as many questions as I can there, and try to do that Saturday mornings. Jump in there.

The 10 here, you can check in the comments section here below, where I have the link to the [Inaudible 00:01:13] trends here. I’m going to read some of them real quick. The first one they said for the top 10 trends in 2019: data management and [Inaudible 00:01:22]. They’re talking a little bit about ETL and how ETL’s not going away. I’ve said that for a while, but we did read an article not too long ago that’s saying, “Hey, you know, there’s some tools out there that are really going to make ETL kind of a thing of the past.” We’ll see. Hopefully, right?

I’m not for ETL, I just, man. Started out there, and it seemed like I was never going to get out of it. Number two, data siloes continue to proliferate. This goes into what we saw when Hadoop emerged as this huge, big data lake, where the data’s only going to exist there. We’ve been talking about it, especially on this channel, over the past few years where, hey, data has a lot of gravity to it. There’s going to be data out on your edge. There’s going to be data in the cloud. There’s going to be data still in core data centers.

The idea of a fluid data lake is a little bit more consolidated. You still have those main areas, but you still have to do analytics in place in some areas. Number three, streaming analytics has a breakout year. Talked about streaming analytics on this channel for the last couple years. Actually did a session about the future architectures of streaming analytics at the 2017, was it Hadoop Summit? They call it DataWorks now.

Data governance builds steam, talked about some of that here. Soft skills start to emerge as tech evolves. Just talking about the soft skills of understanding the business, talked about that with the book, the Big Data MBA, here. Deep learning gets a little bit deeper. Hm. Have we talked about deep learning on this channel? Special K expands footprint. They’re talking about Kubernetes and what’s going on with Dockerization. Clouds are hard to ignore. New tech will emerge, talking about how Silicon Valley and a lot of open source, and closed source, tools have emerged, and they don’t see that stopping anytime soon. Then, smart things everywhere. I’ve talked about those a good bit here, too.

Without further ado, let’s jump into my three trends for 2019. My three trends to watch for in 2019. The first one, deep learning and Hadoop. How are these ecosystems going to interact with each other? A lot of projects out there have talked about it last year, around Project Hydrogen, Submarine’s another project, and NVIDIA’s RAPIDS. It’s all about being able to use GPUs and also be able to use those deep learning libraries with data that’s in your Hadoop ecosystem or just for some ETL. That’s one of the things that NVIDIA RAPIDS… Maybe I should do a video just specifically on that. Watch that trend. Start watching what’s going on with TensorFlow and being able to use it integrated with Spark and some of your other tools that are more traditional in the Hadoop ecosystem. That was number one.

Number two. Two? Yep. Number two, containerization of the world overtakes data engineering. Similar to what they were talking about at [Inaudible 00:04:11], with their trends, with Special K being special. I think the containerization, we’ve seen it a lot, a lot of announcements here lately with cloud native applications and cloud native experiences on the Cloudera side, and you even saw in Hadoop 3.0 where they were laying the groundwork to be able to containerize YARN, your scheduling engine, and some of the other components there. We’re going to continue to see that, and that’s one skill that you’re going to be looking for. If you’re in data engineering right now, and you want to know what’s coming down the pipe for you, I would look into doing some things and getting more familiar with containerization. That’s actually in my roadmap for the end of the year for me, to understand a little more around Docker, and Kubernetes, and that whole ecosystem. That is a big trend we will see for data engineering. It’s not going to slow down. It’s been picking up steam a lot here lately, but it’s going to go full force.
My third trend, thing that I’m looking for, for data engineers in 2019, streaming analytics. I was doing some research and looking around some IDC numbers around where we’re talking about from a data perspective. We’re gonna be, one of the interesting tidbits that they were talking about is how streaming analytics will take up anywhere from around 30% of all the analytics and things that are going on in Azure. Think about all these different devices bringing in data here by 2025. 30% of that’s going to have to be streaming analytics. That’s a huge number. There’s a number of tools out there that are helping to try to deal with what’s going on from a streaming analytics perspective.

We’ve got , we’ve got Kafka. On the cloud side we’ve got Kinesis. A lot of different tools. We had [Inaudible 00:05:42] on this channel here, but there’s a lot of tools in place, a lot of tools being created, because streaming analytics is a huge beast of data to handle. It’s a different kind of problem than what we’ve seen, and it’s only going to get worse as we start bringing in more data, more devices. Really cool opportunities for you as data engineers. Outside of my goals for 2019, if you’re looking for some things to jump into and some educational paths for yourself as a data engineer in 2019, I would look into those three trends. Deep learning, containerization, and then streaming analytics. That’s all I have for today. Make sure to subscribe and ring that bell so that you never miss an episode of Big Data Big Questions. Throw a comment in the comments section here below if you have any questions. If you like the video, if you hated it, just let me know how you feel about this, and I will see you next time on this episode of Big Data Big Questions.

Data Engineer LinkedIN Profile

May 17, 2019 by Thomas Henson Leave a Comment

How to Connect with the Data Engineering Community on LinkedIn

When it comes to professional networking and social media, LinkedIn is king! However, does it make sense for Data Engineers and Data Scientists to embrace LinkedIn? The simple answer is that if you are not on LinkedIn, you are missing a huge opportunity to network and get involved in the Data Analytics community. In this episode of Big Data Big Questions we explore how to utilize LinkedIn to build a career in Data Engineering. We also dig into tips for optimizing your Data Engineering or Data Scientist profile on LinkedIn. Watch the video below to learn how to amplify your reach in the Data Engineering community.

Transcript Data Engineer LinkedIN Profile

Hi, folks! Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. You’re probably wondering. Where the heck am I? Actually, on the other side is my office. I thought this would be a good opportunity for me just to record in a different location. New year, how about some new places to record? This is my gym that I’m continually building up over the years. New view. Today, what we’re going to talk about, this is a Big Data Big Question that comes in. It’s all around how do you build a LinkedIn profile for a data engineer? Specifically, how do I build it if I’m in that role today, or how do I build it if I don’t really have work experience? Are there some things that I can do? All while staying honest, right? I’m not giving you tips to say, “Hey, use this term even if you haven’t done it.” Those are all going to be some tips for us, and then make sure you stick around to the end. I’ll go through and show you what I’ve done on some of my LinkedIn profile, too. I’ll show you how I’ve stacked some projects, and added some videos, and other kinds of content that can help you stay in the community. We’re going to break everything down, and we’re going to talk about specifically for your LinkedIn profile as a data engineer, the things that you can control. The areas that you control. There are some things you can’t control, and we’ll talk about those a little bit.

The first thing is your title. You can come up with an awesome title. Obviously, try to keep it relevant. Don’t say you’re a data engineer if you’re not a data engineer, but you can be a data enthusiast. I’ve seen people talk about they’re data ninjas or data gurus. I’ve even seen somebody for a while that, actually right out of, I think they graduated like a year before I did, and their background was Excel, and they were an Excel, I think they did Excel Ninja, and that’s how they got their first role outside. It was not a data engineering role, but it was actually a developing role that came into GIS or something like that. You can get creative with that. You can go through, and you can also look at seeking opportunities. I’ve seen people that have put what they’re seeking there. Control that title. The next area, number two, that you can control, is your work experience. To some extent, right? If you don’t have work experience, there’s some things you can’t do there. You can put work experience from, come on? Come on? Open source projects, right? You can become a contributor. You can move your way up in those areas, and that can give you an opportunity to be able to add some things in there. If you do have work experience, put those in there. Make sure you’re putting your daily tasks, especially anything that’s data related, like if you’re doing stuff with SQL, you’re doing stuff with development, whether it be C# or VB. You can go back and look at my profile and see where I was a .NET developer. Put those tasks.

Then, also, find other tasks that maybe you’ve done some research. One of the things that I had to do was, I had to do research on different things. When we were moving, like I said, I was a web developer moving into the data engineering role. At the time, it was somewhat of a conscious effort, but not some, as well, too. I volunteered to get on a project, and so some of the things that I had to do was the research. Looking through Hortonworks, looking through Cloudera, the Sandbox, going through and standing up our own Hadoop platform, and just testing those things out. That’s all valid. That might not be my day to day task. I didn’t do it the whole time I was there, but that was one of the things that I was tasked with doing, and even as small as that sounds, put that on there. That gives you more experience and, if somebody’s looking at your LinkedIn profile, goes, “Hey, man, this person is moving into that role. They have this experience there.” Another thing is, if you attended any conferences, there, too. That was another thing that I was very fortunate in my role, where it was like, “Hey, love to get into this, Mr. Customer.” We signed up for a big project. There’s a couple of conferences that I needed to attend to get skilled up.

Hadoop [Inaudible 00:03:48] some of them. You’ve probably seen me wearing, this one’s not it, but you’ve seen me wearing some of the hoodies from Hadoop conference. We’ll put those conference attendance in there, especially if you’ve spoken at any or anything like that. You can put project stuff in there. You’ll see it on my profile, but if you’ve done anything, even if you’ve made a simple demo or something like that, make sure it’s customer, it’s public-facing. Don’t put any information from a company you’re not supposed to, but you can actually add projects to it. Whether it’d be a link to a blog post that you wrote for your company or for a project that you’re on, or video. I’ve made some videos on my personal site, and you’ll see those here. I’m going to show you my profile.

Number three that you can control, there’s some things we can control about that work experience. Then, here we can control the education. If you have a college degree, if you have anything of that nature, even certifications. There’s a little section for certifications. Those are things that you can control. Controlling that, I’m not saying put that you went to MIT if you didn’t go to MIT, right? This is not going to help you. Short-term it might get you an interview, but that’s not the long game we want to play, and that’s just not the right thing to do. Make sure. I’m talking about, you can control it from the aspect of, “Hey,” if you’re planning on going to college, you have an anticipated graduation date, I would home in on that, and any kind of honors, projects, or denotations you’ve had in there, include that in your education section. Those are longer-term, but I’m saying, you can control it, because you can determine today what themes you’re going to build on, what you’re going to do during your college and your education experience. It’s a long-term thing, right? Most of them are four years, five years, however long it takes you. Took me longer. Maybe I’ll do another video someday on how long it actually took me. Either way, you’re going through your education, the factors that you can control. Make sure you’re putting that on there. Short-term, we’ve talked about this [Inaudible 00:05:48] short-term still in education. You can control the certifications. We know what my goals are for 2019 as far as certifications and the certifications that I’m trying to knock out. Those are short term that you can. They have them with the [Inaudible 00:06:01], they have them with Coursera. Other education sites, and then there’s also the vendor-specific. AWS, Hortonworks, Cloudera, specific certifications that you can go through. You can start adding that, and that scenario where, with work experience it’s a little harder.
With education, traditional four-year college, a little harder. A little long-term to go, but those, if you’re really fighting to take that next role or move into a role, whether it be within your company, whether you’re trying to, you’re a consultant trying to bring in more projects, go through some of those certifications. That’s something that you can tackle, and just depending on your knowledge base, something anywhere from one month to six months, you can knock out some of those certifications that are really going to help you build out that LinkedIn profile as a data engineer.

The last area that you can control. You control title, you can control the work experience, you can control your education and certifications. Activity. You have the most control, and you can pause this video and go post a relevant topic right now, assuming you have a LinkedIn profile, which if you don’t, I think it’s going to be very important to you. You should get one. You can control that activity. You can control what you do from a hashtag perspective. What you want to put out there as far as, hey, if you go and look at my site, you’ll see some of the things that I’m learning and I’m going through. Not only on my YouTube channel does everybody get to see behind the scenes of what I’m looking at, but more importantly, you can start to mold that, and pull that part into your education. From my perspective, you can see, for a good part of last year, I was really working on doubling down into deep learning and understanding what’s going on in that community from a TensorFlow perspective, [Inaudible 00:07:46] perspective, from a PyTorch, or just what the heck does a CNN mean? You can see it slowly evolving my education and sharing that knowledge, and same thing there. You control that activity, but it’s not a one-way street. You’re not trying to just put stuff out here. You want to be [Inaudible 00:08:03] communities, too. You want to like and comment on some of your peers and other people around that are interested in the same things that you’re interested in, too.

About to roll into my last section. That was the mailperson. Talked about how you can put in, how you can add projects, add experience, and really beef up your LinkedIn profile, specifically for data engineers, machine learning engineers, Hadoop developers, that whole ecosystem. Now, let’s take a look real quick at my profile. I promise that, if you stuck around, I’d show you. As we’re going through this, just check out here on the experience. This is what I was talking about. Whenever you’re looking at what you’ve specifically done for a job, and what’s your day to day task card, this is where you get to put in your experience. You can see here, not only do I have my day to day task and even some of what my day to day tasks might be, and what my job description is, but also some other things I’ve been involved with, like conferences spoke at. You can see here where I brought in projects. Whenever I do demos and some of these other things, even on my site, it’s good to be able to link it here and show those as projects, show people, hey, these are some of the things I’ve done. Same thing with conferences. I’ve had some conferences I spoke at, at other places, and this is how I roll.

You can see here too, from a Pluralsight perspective, this is one thing that I got involved with Pluralsight, and just love to be able to have this in my profile. This shows that in the industry, I’m taking this to heart, and not only am I doing this, and furthering my knowledge, but I’m giving back and helping others, too. This gives me that opportunity to be able to do it. Everybody here has that opportunity. As you’re learning things, document it, make videos. Do things to be a part of the community and be able to show on your profile. The next thing, the activity, here. Look at some of the activity. You can see there’s definitely something I’m posting. I’m not posting, maybe I’m shooting a video today [Inaudible 00:10:03], but I’m not posting, not over-rotating too much on topics outside of my interest. My interest is for data engineers, machine learning engineers, and the data science community. That’s what I’m posting. I’m posting things here, and I’m also actively liking and commenting on others’ posts just to have that communication and have, make it not just a one-way conversation. That’s just some tips, and that’s just some ways that I’ve crafted my LinkedIn profile. I hope that you’ll go out and find me on LinkedIn. Let’s connect, and just build out your profile, and this gives you an opportunity to, as you’re looking and building out your profile, you can see some gaps. There’s some holes in areas that you need to shore up, whether it be in work experience, certifications, education, or just activity. If you have any questions, make sure you put them in the comments section here below. Go subscribe and ring that bell so you never miss an episode, and you’re always notified whenever we do an upload here on Big Data Big Questions.

[Sound effects]

Tableau For Data Science?

May 15, 2019 by Thomas Henson Leave a Comment

Big Data Big Questions

Tableau is huge for interacting with data and empowering users to find insight in their data. So does this mean Tableau is the primary tool for Data Scientists? In this episode of Big Data Big Questions we tackle the question of “Is Tableau used for Data Science?”

What is Tableau

Tableau is business intelligence software that allows users to visualize and drill down into data. Data users lean heavily on Tableau for the visualization portion of Data Science projects. The data can come from databases, CSVs, or almost any source with structured data. So if Tableau is for analyzing and visualizing data, is it a tool specific to Data Scientists? Watch the video below to find out Tableau’s role in the world of Data Science.

Transcript – Tableau For Data Science?

Hi folks! Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question comes in from a user, and it’s around data science, and Tableau, and how those go together. But, before we jump into the question, if you have a question that you want to know about data engineering, IT, data science, anything related to IT, or just want to throw a question at me, put it in the comments section here below or reach out to me on Twitter at #BigDataBigQuestions. Or, thomashenson.com/big-questions. Ton of ways to get your questions here, answered right on this show, all you have to do is type away and ask.

Now, let’s jump into today’s question. Today’s question comes in from a YouTube viewer, and it’s about, hey, in data science, do you use Tableau? You can see the question here as it pertains to this, and so this is a question we started up this show doing, around data engineering, but now we’re really jumping towards, hey, what’s going on from a data science and just encompassing all of it? Today’s question, we’re going to talk about where’s Tableau used, right? A lot of people use Tableau. It’s really, really popular. But, is that really a tool that a data scientist is going to use? Should you invest your time as a data engineer or a data scientist aspiring or not aspiring to get into data science? Should you spend time learning about that tool?

My thoughts on Tableau are that it’s really good for giving information out to users that could be not necessarily data scientists. They could be users of it. They could be analysts. They could be somebody who just has a stake in their business. I’ve used it at a lot of different corporations that I’ve worked at, and companies, and organizations, and really what I see is those tools are more for the end user, for visualization. They may fall more in the data visualization bucket. We’ve talked about the three tiers of work. You have your data scientist, you have your data engineer, and your data visualization specialist, the person who’s making sure that, hey, at the end of the day, it’s great that we have all these algorithms that are showing us and being able to predict whatever we’re trying to look at in our data, but if we can’t sell that and can’t convey that to the people that need the data to make a decision on, then it’s just an experiment, it’s just us having fun doing research.

When it comes to an end product or being able to really sell your point, data visualization, I think that’s the bucket that Tableau fits in more than just traditional data science. Could be wrong. Let me know if I am here in the comments section below, but let me talk a little bit about my use case and where I’ve seen it. Like I said, I’ve used it in a lot of different organizations that I’ve worked with or even contracted with. One of the main use cases, I’ll give you an example. Let’s say that you’re a YouTube viewer. I’m not saying YouTube uses Tableau, this is just an example. I don’t want to give away too much information, insider. If you have a YouTube channel, think about if you want to see the videos that are coming in. You’re a user. You’re a publisher, a creator. You want to know. Here is all the videos that you have. Here’s how long they’re watched. Here’s all the demographics from behind the scenes that you can pull. Maybe the times that they were watched. How long they were watched, so on this video here, if people drop out after 30 seconds, I did something wrong there. Versus, how many people go through the end of it. Same thing, too. What you would do is, you would have all this information and aggregate all this data, and you maybe even pull some insights. Like, hey, what’s your average? We can do some real simple things, or you can do some complex things, too. Tableau is where you’re going to give the end user the access.
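The kind of aggregation described in that channel example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the watch records, the `summarize` function, and the 30-second early-exit cutoff are all made up for the example, and a tool like Tableau would sit on top of numbers like these rather than replace the step that produces them.

```python
from statistics import mean

# Hypothetical watch records: (video title, seconds watched, video length in seconds).
# None of this is real channel data; it only mirrors the example in the transcript.
views = [
    ("Intro to Hadoop", 25, 300),
    ("Intro to Hadoop", 300, 300),
    ("Intro to Hadoop", 140, 300),
    ("Kafka Basics", 12, 240),
    ("Kafka Basics", 240, 240),
]


def summarize(records, early_exit_seconds=30):
    """Aggregate per-video stats: view count, average watch time,
    share of viewers who left early, and share who watched to the end."""
    by_video = {}
    for title, watched, length in records:
        by_video.setdefault(title, []).append((watched, length))

    summary = {}
    for title, rows in by_video.items():
        watched_times = [watched for watched, _ in rows]
        summary[title] = {
            "views": len(rows),
            "avg_seconds_watched": mean(watched_times),
            # Viewers who dropped out before the early-exit cutoff.
            "early_exit_rate": sum(w < early_exit_seconds for w in watched_times) / len(rows),
            # Viewers who watched the whole video.
            "completion_rate": sum(w >= length for w, length in rows) / len(rows),
        }
    return summary


if __name__ == "__main__":
    for title, stats in summarize(views).items():
        print(title, stats)
```

A dashboard would then chart these per-video figures so a creator can spot, for instance, a video that loses a large share of viewers in the first 30 seconds.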

At least what I’ve seen a lot. There’s a big need to be able to do that and be able to pull that data. It gives you a way to, I wouldn’t say that a data scientist wouldn’t, per se, use that as their tool. It wouldn’t be their only tool. Maybe that’s the way that they aggregate and look at large amounts of data before they go in and start to pick and choose. I’m sure there’s some modules out there that are incorporating machine learning and deep learning. I will say, if you’re really looking from an AI perspective to jump into, it’s not just going to be about Tableau. I’m not saying that you shouldn’t get up to speed on Tableau, but I wouldn’t say that, hey, I’m a brand-new person graduating high school, graduating college, or somebody that sees it in their career and looking to go into data science, my choice would not be to jump in and learn Tableau. I would start learning a little bit more about Python, and algorithms, and maybe R, or some of the other higher-level languages to talk around machine learning and deep learning, versus saying, “Hey, this is the tool that’s going to kind of take me there.” Now, if you’re a data visualization person, or you want to get into big data from that perspective, there’s a lot of things that you can use Tableau to do. You might add it to your bucket. As far as we talk about on this show, how to accelerate your career or how to break into the big data realm, this is not one of those tools that I’m going to say, hey, this is the only choice you have. Not really going to be the one that’s probably going to make the most sense. It’s not going to be the game changer, like hey, this person’s certified in Tableau or is a Tableau wizard. If you’re applying for a job that’s all around Tableau then, definitely. As far as, I really want to get down into data science, and I really want to get deep in it, Tableau’s one of those things.
Definitely probably going to use or come across tools that are similar to that, but it’s not going to be your mainstay, probably, where you’re writing your algorithms and doing your analytics.

That’s all for today. If you have any questions, make sure you put them in the comments section here below, and then make sure you click subscribe to follow this channel, so that you never miss an episode of Big Data Big Questions.

[Music]

Skills Needed for Big Data Administrators

April 30, 2018 by Thomas Henson 1 Comment

Data Engineers & Big Data Administrators

In today’s episode of Big Data Big Questions we tackle what skills are needed for Big Data Administrators. Data Engineers wear many hats in Data Analytics workflows: one part software engineer and one part systems administrator. Big Data Administrators are responsible for keeping Hadoop, Kafka, Ambari, and other frameworks running. Find out what other skills Big Data Administrators need in the video below.

Make sure to subscribe to my YouTube channel to never miss an episode of Big Data Big Questions.

Transcript – Skills Needed for Big Data Administrators

Hi, folks! Thomas Henson here, with thomashenson.com, and today is another episode of Big Data Big Questions. Today, I’m going to answer a user question about data administration, or in big data, what is that big data administrator’s role?

What are some of the tools that they use? How can you get involved? Find out more, right after this.

Welcome back. Today’s question is going to revolve all around the big data administrator, what that role is, what are some of the tools that they use? This question came in from my website. You can do Big Data Big Questions, go to thomashenson.com, click on Big Questions, submit a question there. Put them in the comments section here below, and then always, make sure that you’re subscribing to this YouTube channel, so that you’ll never miss an episode. These are great tips. These are great ways for me to answer any questions that you have. If you have those questions, ask them, but also make sure you’re subscribing to the channel.

Today’s question comes in from Jarvis. He says he has a dilemma on Python for big data. We answered a number of questions around Python and big data, and then do you have to know Java? But, this one is a little bit different. It’s going to cover the data administrator.

Hi Thomas, a big fan of yours.

Thanks for watching. Thanks for sending in the question.

I had a question related to IT careers and skills in big data. I wanted to know if Python is required only by data administrators, or can all things done by Java on big data be implemented using Python as well?

This question is really good. Like I said, we’ve talked a little bit about, do you have to know Java in order to be able to be a big data admin, be involved in big data, be a data engineer?

The answer is no. You can do things in Python, but I want to tackle the question from the perspective of, you’re asking about data administration, and so there are two different roles. We’ve talked about the data engineer versus the data scientists. The data engineer is the one who’s setting up the cluster, maybe doing some of the software development, running your Hive jobs, maybe even just the software developer, from if you’re writing Java jobs, if you’re writing your Spark jobs, but your data administrator, that’s a different role inside of that. We have two pieces of the spectrum. This side over here, this is more software development side generated, and on this side over here, let’s say that this is more of the administrator, or our systems engineer, the person who’s setting up and running the cluster. Maybe not doing the day-to-day coding but doing the administrating and running of the system. Think of that as your full stack developer.

Think about when you split up your systems admin, who’s setting up the stack, making sure the database is running, doing those tasks versus who’s running the… Whether it be PHP code or .NET code. What skills does a data administrator have to have?

I would say that, if we’re talking about being able to be involved in the community, and be involved in big data, you’re going to be keying on HDFS, Ambari, Hive, Flume, and you’re going to have a lot of Linux skills. If you’re asking me, you want to get into data administration, you want to be an awesome data administrator in the big data ecosystem, do you have to know Java? No. Can everything be implemented in Python? Maybe, but you’re probably going to be doing more administrative tasks as far as setting up the cluster, understanding the operating system that Hadoop’s running on.

You’re maintaining more at that Linux level, and the Hadoop ecosystem level, so if you’re using Hortonworks or you’re using Cloudera, how all those tools are integrating and talking to each other. I would focus more on not even so much the coding part, but as far as being able to set up that cluster. It’s going to vary, too. It’s going to vary in the role.

Some places, especially when you’re just starting out on big data, and you have a small team in your company, you’re going to be the software engineer and the data administrator, right? You might need to have a little more code.

If you’re going to a more seasoned team or a bigger team, you can actually have that role where you’re running the administration. My answer is, I wouldn’t worry so much about Python and Java, if that’s the role that you’re wanting.

The data administrator, I would worry about being able to integrate the tools. Be familiar with the tools, be familiar with how to set up, how to add nodes, how to take nodes down. How to set up secondary name nodes, so, being able to make sure that, when one name node goes down, you can flip over to the second name node. Being able to back up the data. Making sure that we’re taking snapshots. All the kind of tasks that go into running the system, versus being able to write a MapReduce job. If you’re really keen on being a big data administrator, which, those are great roles, those are a lot of fun, you’re still hands on, but you’re not really having to write the jobs.
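To make those administrator tasks a little more concrete, here is a minimal sketch in Python of what scripting a snapshot and a node change might look like. It is only an illustration, not how any particular shop does it: it assumes the standard hdfs command-line tool is installed on the machine, the path /data/warehouse and the snapshot name nightly are hypothetical, and the helper names are made up for this example. By default it only prints the commands instead of running them.

```python
import shlex
import subprocess
from typing import List

Command = List[str]


def snapshot_commands(path: str, snapshot_name: str) -> List[Command]:
    """Build the HDFS CLI calls that enable and then take a snapshot of a directory."""
    return [
        ["hdfs", "dfsadmin", "-allowSnapshot", path],
        ["hdfs", "dfs", "-createSnapshot", path, snapshot_name],
    ]


def node_change_commands() -> List[Command]:
    """Build the calls an admin might run after editing the include/exclude host files.

    -refreshNodes makes the name node re-read those files; -report prints the
    current state of the data nodes so the change can be checked.
    """
    return [
        ["hdfs", "dfsadmin", "-refreshNodes"],
        ["hdfs", "dfsadmin", "-report"],
    ]


def run_all(commands: List[Command], dry_run: bool = True) -> List[str]:
    """Print each command (dry run) or execute it, returning the rendered lines."""
    rendered = []
    for command in commands:
        line = shlex.join(command)
        rendered.append(line)
        if dry_run:
            print("would run:", line)
        else:
            # check=True raises an error if the hdfs command fails.
            subprocess.run(command, check=True)
    return rendered


if __name__ == "__main__":
    # Hypothetical directory and snapshot name, for illustration only.
    run_all(snapshot_commands("/data/warehouse", "nightly"))
    run_all(node_change_commands())
```

Notice that almost none of this is programming in the MapReduce sense. The Python is just glue around the Hadoop tools, which is why knowing the tools and the Linux side matters more in this role than the language.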

You’re checking out new tech, checking out new projects, to see, “Hey, am I going to be able to integrate this into our system,” or, “Man, you know, we’ve got two or three more nodes that are going to come online, so let’s make sure that we get those racked and stacked, and then, let’s make sure that we’re adding those to the cluster, too.”

A lot of cool things that you can do in that role. Most of them aren’t going to involve coding, so you’re not really going to have to worry about Java, you’re not going to have to worry about Python, as much as you would in the traditional data engineer, where you’re looking at being more of a software engineer.

I hope I answered your question. If anybody else has any questions, put them in the comments section here below. Make sure to follow me here, so click subscribe, and then I’ll see you next time.

Rise of the Machine Learning Engineer

April 27, 2018 by Thomas Henson Leave a Comment

What is a Machine Learning Engineer?

Move over, Data Scientist: the Machine Learning Engineer is now the best role in Big Data Analytics. The Machine Learning Engineer is a hybrid mix of half Data Engineer and half Data Scientist, who can implement the data models and even make recommendations for new data sets. Find out why the Machine Learning Engineer is getting a lot of attention in 2018 by watching the video below.

Make sure to subscribe to my YouTube channel to never miss an episode of Big Data Big Questions.

Transcript – Rise of the Machine Learning Engineer

Hi, folks! Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question comes in from a user, and this is all going to be about the machine learning engineer. What is a machine learning engineer? How does it differ from a data engineer or data scientist? We’re going to jump into all that right after this.

Welcome back. Today’s question comes in from a user, so before we jump into the question, make sure that you go and click on the subscribe, so that you never miss an episode. Also, if you have a question and you would like for me to answer it, about data engineering, about books, about business, anything around IT and specifically probably data analytics, make sure you put those in the comments section here below. Go to my website, thomashenson.com/bigquestions or use the hashtag #BigDataBigQuestions on Twitter. I will try my best to answer those as quickly as I can.

I’ve been getting a lot of questions in, and I’m really thankful for all the questions, and I am working through them as well. Today’s question comes in from a user. From the comments section on YouTube, Andrew Wiley [Phonetic]. He says, “Is it possible to learn both data science and data engineering?” This question stems off of the Cloudera certification. I’ve answered some questions around what is a data engineer, what is a data scientist, but this question is specifically, “Okay, is there a blend of the two?” Is there one position that’s a blend of the two?

I’ll say, for a while, there’s been a lot of confusion around, “Okay, if you’re a data scientist, you know how to stand up a Hadoop cluster, or if you know how to stand up a Hadoop cluster, you must be a data scientist. You’re a wizard, right?” This question is about, what about the blending of the two skills? Think about it from a web development perspective. For a long time, we had our web developers, and we had our back-end developers, and then we had the full-stack web developer. Now, we have a full-stack data engineer, and those are called machine learning engineers.

On a recent podcast out there, that O’Reilly did at Strata, they had a couple guests on talking about the rise of the machine learning engineer, and so I would say that if you’re looking to have skills with data science and data engineering, that position is going to be called a machine learning engineer. My view on how the machine learning engineer has come to fruition is in two parts. If you’re working in a small development or small analytics shop, most likely the data engineer, the person who’s putting together the code and running the system, there’s going to be one or two people on that. It’s going to be a really small team, who are going to be filling that role of a data scientist.

There’s a lot. There’s a big skills gap for data engineers and even more so with data scientists, too. You might be able to go through and look at some of the prescribed analytics and machine learning algorithms that you want to use, and you, as the data engineer, will understand how to use those. It’s not just willy-nilly, like, “Hey, I’m just going to pull this one down and have it.” You need to have a background in statistics, and probability, and heavy on math. One of the things, one of my gaps in skills that I’ve been working on is the math part.

You can follow along and watch me learn machine learning with Andrew Ng’s course, and you can see some of the things, especially if you’re a data engineer, that you need to shore up, so that you can fit into that machine learning engineer role.

Think of the machine learning engineer in the small shop as, you’re the full-stack developer, you’re the full-stack engineer. It’s kind of doing everything. Then, in larger corporations, what you’re going to have is, like I said, we’ve got it on both sides of the spectrum. You’ve got your data engineers, who are really good at setting up, administrating an environment, maybe even doing the software development, running Hive, creating the MapReduce jobs or the Spark jobs, but then you have your data scientists, who maybe have some SQL skills, are really good at math, but not really good at the technical side. The machine learning engineer is that person in the middle, to kind of bridge the gap. In bigger shops, you’re going to have your machine learning engineer who’s working with your data scientist, and then starts to be able to pick up on, “Okay, this is the way that we like to do some of the things here,” and you’re really owning that part of the stack, and so, you’re not so much worried about developing and doing what I would call the Hadoop administration, or even the Hadoop development.

When I say Hadoop, remember, we’re just talking about anything in that ecosystem. Your machine learning engineer is your specialization of that. I did a little research, too, just to look at it. Just pulling it up, just some preliminary research, just looking for jobs out there. A lot of times, we’ll say, “Yeah, you’re an Excel guru,” and you say, “Excel guru?” You go look, and there’s nobody with a job title Excel guru. You’re giving it to yourself.

Looking at machine learning engineer, quick search on Google for jobs, there are a lot of different postings from companies all the way from IBM to Facebook, Lyft, a lot of different postings out there, just in my quick search. Also, looking at Glassdoor, and some of the other places, the salary ranges are right there with what a data engineer is, so anywhere from the low 80s, which I would think is probably not really a true machine learning engineer, or maybe it’s in a different part of the country, all the way up to the 160s. That’s salary range per year, in thousands. I thought that was a pretty good mix, there.

Really fit in line with what we see as the data engineer and the data scientist, so those roles are out there. If you’re excited to go out and learn those, remember what I was saying. Want to have a solid background as a data engineer with understanding how the Hadoop administration works. Also, the workflows, and some of the development skills. Want to be able to implement, if you’re using Mahout, if you’re using TensorFlow, any of those frameworks, you want to be able to implement those, but then you also want to have the math portion too, so make sure you understand the algorithms from a math level, and how to tweak, and how to tune those.
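As a small illustration of what understanding an algorithm from a math level and tuning it can look like, here is a minimal sketch in plain Python of fitting a straight line with gradient descent, the kind of exercise covered early in an introductory machine learning course. The data, the function names, and the two learning rates are made up for this example; it is not tied to Mahout, TensorFlow, or any other framework.

```python
def fit_line(xs, ys, learning_rate, steps):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mean_squared_error(xs, ys, w, b):
    """Average squared distance between the line's predictions and the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

if __name__ == "__main__":
    # Same algorithm, same number of steps, two different learning rates.
    for learning_rate in (0.001, 0.05):
        w, b = fit_line(xs, ys, learning_rate, steps=500)
        error = mean_squared_error(xs, ys, w, b)
        print(f"learning_rate={learning_rate}: w={w:.3f}, b={b:.3f}, error={error:.6f}")
```

With the same number of steps, the smaller learning rate is still well short of the true line while the larger one has essentially found it. Knowing why that happens, and how large a learning rate can get before the updates stop converging, is the math portion of the role.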

That’s all for today. Hope I answered your question. If you have any questions, anybody out there, make sure that you first go and subscribe, and then ask your question. I’ll try to answer them here. Have a good day.

Data Engineering Courses Without Java

February 11, 2018 by Thomas Henson Leave a Comment

There are a ton of resources out there for Data Engineers to learn to develop for and administer the Hadoop/Big Data ecosystem. Do all these courses require basic Java knowledge? When you are starting out learning to become a data engineer, it can seem that everywhere you look basic Java skills are required, but I’m telling you that is not the case. In this video I’ll explore the options and thoughts on learning big data without Java skills.

Video – Data Engineer Courses without Java

Landing a Data Engineer Internship

February 10, 2018 by Thomas Henson Leave a Comment

What are the skills required to land a Big Data Engineer Internship? Is it possible to gain those skills in under 3 months?

Today’s question comes in from YouTube: “Please make a video on how much time is required to become big data engineer ?? is it possible in 3 months and also try to mention about minimum skills required for internship.”

In this episode of Big Data Big Questions we will explore what the basic skills are for landing a Data Engineer Internship and tips for excelling after landing that internship. Watch the video below to find out what skills Data Engineers should focus on before applying for a Data Engineer Internship.
