Tips & Tricks for Studying Machine Learning Projects

February 16, 2021 by Thomas Henson

How to Study Machine Learning Through Projects

Studying Machine Learning can seem overwhelming! Over our careers as developers and technologists we constantly have to learn new skills, whether you are a Database Administrator who needs to learn about Hadoop or a Web Developer looking to learn JavaScript. Change is inevitable, and the way to change is through learning. In fact, many developers in the community advocate for making learning a habit of 1 – 2 hours every week. In today’s episode of Big Data Big Questions we explore my tips and tricks for learning Machine Learning (ML) or any other new technology.

Tips and Tricks for Studying Machine Learning

Make sure to watch the full video where I break down my tips and tricks for learning Machine Learning.

Want More Data Career Tips?

Sign up for my newsletter so you never miss a post or YouTube episode of Big Data Big Questions, where I answer questions from the community about data careers.

Data Engineers: Data Science vs. Computer Science Degree

October 2, 2019 by Thomas Henson

How Do you Choose the Right Degree?

College is such a tough time when it comes to choosing education paths. For most folks, college marks the first time they are making a huge decision about their future. So it’s easy to get analysis paralysis because the decision means so much. Or does it? At the end of the day, the decision feels bigger than it really is over the long term.

The difference between a Data Science degree and a Computer Science degree might impact career outlook in the short term. The long-term impacts of which degree you choose are minimal. Look around at the number of positions where degrees aren’t even a requirement. When I was working on my first Big Data project our Data Scientist didn’t have a degree in Data Science, but he was great in that role. Now I will say that Data Science degrees haven’t been around that long, so it kind of makes sense.

Find out my thoughts on the differences between a Data Science degree and a Computer Science degree in the video below.

Video Data Science vs. Computer Science Degree

Transcript – Data Science vs. Computer Science Degree

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question comes in from a user, and it’s all about, what specific Master’s degree should I get? Find out how I answer this question and what Master’s degree you should get or should not get if you’re going into data engineering.

Today’s question. If you have a question, make sure you put it in the comments section here below. Reach out to me on thomashenson.com/bigquestions. Find me on Twitter, whatever you want to do, and I’ll do my best to answer your questions right here.

Today’s question. I’m looking for a career as a data engineer, but I’ve got a Bachelor’s in IT, and I’m looking to get into a Master’s degree. Awesome! Congratulations. It’s a pretty cool thing to go through. I went through a Master’s program as well. Which is better for data engineering career? Thinking about that specifically. A Master’s degree in data science or a Master’s degree in computer science.

This question, for me, really keys. I remember what it was like going through, when I was trying to figure out which kind of Master’s I wanted to go to. I had a similar situation. Specifically wasn’t in the data engineering from that perspective, but I was looking in, to see, what do I want to do to take the next level in my engineering career? I looked at an MBA with an emphasis in information systems versus a Master of Science in Computer Science. I ended up choosing to continue on down the business path and getting my MBA in information systems. Pretty excited to have gone through that, and really happy with my decision. I feel like it’s been fortunate with my career. I understand where you’re coming from. I’m not telling you to get an MBA. That’s not what I’m saying. I understand how much you look, going back and forth, and you’re like, man! What do I do here? I appreciate you asking for my opinion, as well. Which one should you get if you’re going into data engineering? It’s an easy guess for me, here, just to say, “I think computer science and the skills that are involved in computer science are going to help.” If I were in your shoes, I would look, and pivot more towards the computer science. I would look into, though, there are new universities and other programs that are starting to emerge that actually have a data engineering track. Just like you were asking about, should I do the data science? In my opinion, if you’re not trying to go down the data science path, you maybe don’t go into that. If they do have a track specific for data engineers, so data science is a newer program a lot of universities and colleges are having around the globe, so if they have a specific data engineering path, I’d look into that. Specifically, I’d probably stay with the computer science track.
However, like I said, there are some universities that are putting out a specific, “This is not data science,” but a specific data engineering path, where you’re going to go through more systems administration stuff, where you’re going to be building out programs that are going to analyze data, and being able to really focus on distributed systems, whether it be from Kubernetes, and containers, to different clouds. Knowing how to do it in AWS. Building out good data pipelines and really understanding what you’re doing from that perspective. I think I’d look into that, and also make sure you’re looking at some of those degrees.

One more bonus tip as you’re going through that. I would definitely, at the university that you’re looking at, have a conversation with some advisors, and even some of the professors in the data science world or in the computer engineering world, and see if you can cross over. Maybe there’s an opportunity there to do something interdisciplinary. Maybe you can take a couple of the data science courses, because they would be really good for you to get exposure to it, not become a data scientist, but exposure to what goes on, on the data science side, and have those packaged together, and go through some of those courses while you’re going through the computer science course. Maybe they’re not asking you to take a double load. Hopefully there’s a crossover there, where it’s like, “Hey, I can pick and choose some of these.” With data engineering and just the boom that’s going on with that as far as careers and, if you look at just globally, we need more data engineers. The universities will be pretty excited for, especially somebody standing out to do that. Worst case scenario, what are they going to do? Your professors may tell you no, but they see that you’re engaged, and that you’re interested in data engineering, so they’re going to be able to look out for, maybe there’s new classes that are coming up. What about internships, right? Some of these universities have really good relationships with corporations. Your name is already at the top of the list, and it’s shown that you’re showing initiative, that hey, I’m excited about the data engineering world, so any opportunities to learn more or any opportunities for future career growth, might be a good thing. Something as simple as taking an hour to reach out and talk to a professor may be investing in yourself and in your career for further on down the future. Definitely try that out. Should you get a Master’s degree to become a data engineer?
You don’t have to, but like I said, I’ve got a Master’s degree, and I went through that for my own purposes. If you’re watching this video, you’ve made it all the way to the end, which I hope you’ve made it to the end. Everybody that starts watching it, this was a specific question where we were talking about different degree options for your career. We’re not saying that you have to get the Master in Computer Science to become a data engineer. Heck, you can even go through, you can do the Master in Data Science and become a data engineer. This is just my advice for what we’re trying to do. There are other data engineers that don’t have degrees. We’ve covered that quite a bit on this channel, and so I just want to be specific to that. I don’t want people watching this video, especially if you’re in college, or if you’re in high school and you’re starting to think about your data engineering path, like, “Aw, man! I’ve got to go get a Master’s degree to do this.” Be in it for the long haul. That’s not what we’re talking about here. We’re just talking about options. Let me know if you have any questions about degrees, certifications, anything data engineering or technology-specific, and I will answer it on the next episode of Big Data Big Questions.

Want More Data Engineering Tips?

Sign up for my newsletter so you never miss a post or YouTube episode of Big Data Big Questions, where I answer questions from the community about Data Engineering.

Data Engineer LinkedIn Profile

May 17, 2019 by Thomas Henson

How to Connect with the Data Engineering Community on LinkedIn

When it comes to professional networking and social media, LinkedIn is king! However, does it make sense for Data Engineers and Data Scientists to embrace LinkedIn? The simple answer is that if you are not on LinkedIn, you are missing a huge opportunity to network and get involved in the Data Analytics community. In this episode of Big Data Big Questions we explore how to use LinkedIn to build a career in Data Engineering. We also dig into tips for optimizing your Data Engineering or Data Scientist profile on LinkedIn. Watch the video below to learn how to amplify your reach in the Data Engineering community.

Transcript – Data Engineer LinkedIn Profile

Hi, folks! Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. You’re probably wondering. Where the heck am I? Actually, on the other side is my office. I thought this would be a good opportunity for me just to record in a different location. New year, how about some new places to record? This is my gym that I’m continually building up over the years. New view. Today, what we’re going to talk about, this is a Big Data Big Question that comes in. It’s all around how do you build a LinkedIn profile for a data engineer? Specifically, how do I build it if I’m in that role today, or how do I build it if I don’t really have work experience? Are there some things that I can do? All while staying honest, right? I’m not giving you tips to say, “Hey, use this term even if you haven’t done it.” Those are all going to be some tips for us, and then make sure you stick around to the end. I’ll go through and show you what I’ve done on some of my LinkedIn profile, too. I’ll show you how I’ve stacked some projects, and added some videos, and other kinds of content that can help you stay in the community. We’re going to break everything down, and we’re going to talk about specifically for your LinkedIn profile as a data engineer, the things that you can control. The areas that you control. There are some things you can’t control, and we’ll talk about those a little bit.

The first thing is your title. You can come up with an awesome title. Obviously, try to keep it relevant. Don’t say you’re a data engineer if you’re not a data engineer, but you can be a data enthusiast. I’ve seen people talk about they’re data ninjas or data gurus. I’ve even seen somebody for a while that, actually right out of, I think they graduated like a year before I did, and their background was Excel, and they were an Excel, I think they did Excel Ninja, and that’s how they got their first role outside. It was not a data engineering role, but it was actually a developing role that came into GIS or something like that. You can get creative with that. You can go through, and you can also look at seeking opportunities. I’ve seen people that have put what they’re seeking there. Control that title. The next area, number two, that you can control, is your work experience. To some extent, right? If you don’t have work experience, there’s some things you can’t do there. You can put work experience from, come on? Come on? Open source projects, right? You can become a contributor. You can move your way up in those areas, and that can give you an opportunity to be able to add some things in there. If you do have work experience, put those in there. Make sure you’re putting your daily tasks, especially anything that’s data related, like if you’re doing stuff with SQL, you’re doing stuff with development, whether it be C# or VB. You can go back and look at my profile and see where I was a .NET developer. Put those tasks.

Then, also, find other tasks that maybe you’ve done some research. One of the things that I had to do was, I had to do research on different things. When we were moving, like I said, I was a web developer moving into the data engineering role. At the time, it was somewhat of a conscious effort, but not some, as well, too. I volunteered to get on a project, and so some of the things that I had to do was the research. Looking through Hortonworks, looking through Cloudera, the Sandbox, going through and standing up our own Hadoop platform, and just testing those things out. That’s all valid. That might not be my day to day task. I didn’t do it the whole time I was there, but that was one of the things that I was tasked with doing, and even as small as that sounds, put that on there. That gives you more experience and, if somebody’s looking at your LinkedIn profile, goes, “Hey, man, this person is moving into that role. They have this experience there.” Another thing is, if you attended any conferences, there, too. That was another thing that I was very fortunate in my role, where it was like, “Hey, love to get into this, Mr. Customer.” We signed up for a big project. There’s a couple of conferences that I needed to attend to get skilled up.

Hadoop [Inaudible 00:03:48] some of them. You’ve probably seen me wearing, this one’s not it, but you’ve seen me wearing some of the hoodies from Hadoop conference. Well, put that conference attendance in there, especially if you’ve spoken at any or anything like that. You can put project stuff in there. You’ll see it on my profile, but if you’ve done anything, even if you’ve made a simple demo or something like that, make sure it’s customer, it’s public-facing. Don’t put any information from a company you’re not supposed to, but you can actually add projects to it. Whether it be a link to a blog post that you wrote for your company or for a project that you’re on, or video. I’ve made some videos on my personal site, and you’ll see those here. I’m going to show you my profile.

Number three that you can control, there’s some things we can control about that work experience. Then, here we can control the education. If you have a college degree, if you have anything from that nature, even certifications. There’s a little section for certifications. Those are things that you can control. Controlling that, I’m not saying put that you went to MIT if you didn’t go to MIT, right? This is not going to help you. Short-term it might get you an interview, but that’s not the long game we want to play, and that’s just not the right thing to do. Make sure. I’m talking about, you can control it from the aspect of, “Hey,” if you’re planning on going to college, you have an anticipated graduation date, I would home in on that, and any kind of honors, projects, or denotations you’ve had in there, include that in your education section. Those are longer-term, but I’m saying, you can control it, because you can determine today what themes you’re going to build on, what you’re going to do during your college and your education experience. It’s a long-term thing, right? Most of them are four years, five years, however long it takes you. Took me longer. Maybe I’ll do another video someday on how long it actually took me. Either way, you’re going through your education, the factors that you can control. Make sure you’re putting that on there. Short-term, we’ve talked about this [Inaudible 00:05:48] short-term still in education. You can control the certifications. We know what my goals are for 2019 as far as certifications and the certifications that I’m trying to knock out. Those are short term that you can. They have them with the [Inaudible 00:06:01] they have them with Coursera. Other education sites, and then there’s also the vendor-specific. AWS, Hortonworks, Cloudera, specific certifications that you can go through. You can start adding that, and that’s an area where, with work experience it’s a little harder.
With education, traditional four-year college, a little harder. A little long-term to go, but those, if you’re really fighting to take that next role or move into a role, whether it be within your company, whether you’re trying to, you’re a consultant trying to bring in more projects, go through some of those certifications. That’s something that you can tackle, and just depending on your knowledge base, something anywhere from one month to six months, you can knock out some of those certifications that are really going to help you build out that LinkedIn profile as a data engineer.

The last area that you can control. You control title, you can control the work experience, you can control your education and certifications. Activity. You have the most control, and you can pause this video and go post a relevant topic right now, assuming you have a LinkedIn profile, which if you don’t, I think it’s going to be very important to you. You should get one. You can control that activity. You can control what you do from a hashtag perspective. What you want to put out there as far as, hey, if you go and look at my site, you’ll see some of the things that I’m learning and I’m going through. Not only on my YouTube channel does everybody get to see behind the scenes of what I’m looking at, but more importantly, you can start to mold that, and pull that part into your education. From my perspective, you can see, for a good part of last year, I was really working on doubling down into deep learning and understanding what’s going on in that community from a TensorFlow perspective, [Inaudible 00:07:46] perspective, from a PyTorch, or just what the heck does a CNN mean? You can see it slowly evolving my education and sharing that knowledge, and same thing there. You control that activity, but it’s not a one-way street. You’re not trying to just put stuff out here. You want to be [Inaudible 00:08:03] communities, too. You want to like and comment on some of your peers and other people around that are interested in the same things that you’re interested in, too.

About to roll into my last section. That was the mailperson. Talked about how you can put in, how you can add projects, add experience, and really beef up your LinkedIn profile, specifically for data engineers, machine learning engineers, Hadoop developers, that whole ecosystem. Now, let’s take a look real quick at my profile. I promised that, if you stuck around, I’d show you. As we’re going through this, just check out here on the experience. This is what I was talking about. Whenever you’re looking at what you’ve specifically done for a job, and what’s your day to day task card, this is where you get to put in your experience. You can see here, not only do I have my day to day task and even some of what my day to day tasks might be, and what my job description is, but also some other things I’ve been involved with, like conferences I’ve spoken at. You can see here where I brought in projects. Whenever I do demos and some of these other things, even on my site, it’s good to be able to link it here and show those as projects, show people, hey, these are some of the things I’ve done. Same thing with conferences. I’ve had some conferences I spoke at, at other places, and this is how I roll.

You can see here too, from a Pluralsight perspective, this is one thing that I got involved with Pluralsight, and just love to be able to have this in my profile. This shows that in the industry, I’m taking this to heart, and not only am I doing this, and furthering my knowledge, but I’m giving back and helping others, too. This gives me that opportunity to be able to do it. Everybody here has that opportunity. As you’re learning things, document it, make videos. Do things to be a part of the community and be able to show on your profile. The next thing, the activity, here. Look at some of the activity. You can see there’s definitely something I’m posting. I’m not posting, maybe I’m shooting a video today [Inaudible 00:10:03], but I’m not posting, not over-rotating too much on topics outside of my interest. My interest is for data engineers, machine learning engineers, and the data science community. That’s what I’m posting. I’m posting things here, and I’m also actively liking and commenting on others’ posts just to have that communication and have, make it not just a one-way conversation. That’s just some tips, and that’s just some ways that I’ve crafted my LinkedIn profile. I hope that you’ll go out and find me on LinkedIn. Let’s connect, and just build out your profile, and this gives you an opportunity to, as you’re looking and building out your profile, you can see some gaps. There’s some holes in areas that you need to shore up, whether it be in work experience, certifications, education, or just activity. If you have any questions, make sure you put them in the comments section here below. Go subscribe and ring that bell so you never miss an episode, and you’re always notified whenever we do an upload here on Big Data Big Questions.

[Sound effects]

Why Data Engineers Should Care About IoT

June 4, 2018 by Thomas Henson

Why Data Engineers Should Care About IoT

The Internet of Things has been around for a few years but has hit an all-time high for buzzword status. Is IoT important for Data Engineers and Machine Learning Engineers to understand? Gartner predicts there will be over 21 billion connected devices worldwide by 2020. The data from these devices will be included in current and emerging big data workflows. Data Engineers and Machine Learning Engineers will need to understand how to quickly process this data and merge it with existing data sources. Learn why Data Engineers should care about IoT in this episode of Big Data Big Questions.

Transcript

Hi folks, Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today, I want to tackle IoT for data engineers. I’m going to explain why IoT, or the Internet of Things, matters for data engineers, and how it’s going to affect our careers, how it’s going to affect our day-to-day jobs, and honestly, just the data that we’re going to manage. Find out more, right after this.
[Sound effects]

Today’s question is, what is IoT, and how does that affect the data engineers? We’ve probably seen the buzz word, or the concept of the Internet of Things, but what does that really mean? Is it just these little dash buttons that we have? Is this? Wait a minute. Is that ordering something?

Is this what IoT is, or is it the whole ecosystem and concept around it? First things first. IoT, or the Internet of Things, is the concept of all these connected devices, right? It’s not something that is, I will say, brand new. Something that’s been out there for a while, and when we really think about it, getting down to it, it is a sensor. We have these sensors, these cheap sensors.

We’ve had them for a long time, but what we haven’t had is all these devices connected with an IP address to the Internet, that can send the data. That’s the big part of the concept. It’s not just about the sensor, it’s about being able to move the data from the sensor.

This gives us the ability to be able to manage things in the physical world, bring them back, do some analytics on it, and even push data back out to it. The cool thing is, generally with IoT devices these are, I would say, economical or cheap devices that have an IP address, that can just pull in information. Think about a sensor, if you have a smart watch that’s connected to the Internet and can feed up information to you. That’s where some of it all started. These dash buttons. I can have these dash buttons all installed around my house, push a button whenever I need something, or start to look at what we’re talking about with smart refrigerators. Smart refrigerators can take pictures and have images of what all’s, the content that’s in your refrigerator, so if you’re at the store, you look, and you’re like, “Hey, you know, what am I…? Do I need that ranch dressing? Yeah? Let me check in my refrigerator, here.”

Also, a sensor could be inside the refrigerator, and tell you if something’s going wrong. Maybe the ice maker is blocked. Maybe you need a new water filter in your refrigerator, and the refrigerator knows that, has a sensor into it. It can send information to wherever, to be able to order that water filter for you and send it to your home, so you don’t even have to go in, and remember, “Hey, has it been 90 days? Or was it 60 days? Is it time? Is it time to change it?” Then, you’re going to forget. You’re going to let it go over, but now, you can have this sensor that’s going to tell you, and it’s going to order that for you. That’s the concept. It’s not just about the sensor. It’s about that ecosystem.

It’s about being able to move the data. For data engineers, what does this mean? Why do we care?

There are a lot of predictions out there about IoT and where it’s going. One of the big ones is, Gartner has a prediction that by 2020 we will have 20 billion, over 20 billion, of these devices. Not just the dash buttons, but just think of all these sensors, all these things with IP addresses connected to the Internet. What does that mean, from a data perspective? Some numbers that I’ve seen are 44 zettabytes of data are some of the predictions that I’ve seen, that’s going to be contributed to new data that’s coming in and the data that we have that’s already existing. Think about it. What is a zettabyte? It’s not a petabyte. It’s bigger than a petabyte.
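To put "bigger than a petabyte" into numbers, here is a quick back-of-the-envelope sketch. It assumes the decimal (SI) definitions of the units, where a terabyte is 10^12 bytes, a petabyte is 10^15 bytes, and a zettabyte is 10^21 bytes. The 44 zettabyte figure is the prediction quoted above; the variable names are just for illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21

# Prediction quoted in the episode, not a measured value.
predicted_zettabytes = 44

# Convert the prediction into the units data engineers deal with today.
petabytes = predicted_zettabytes * ZETTABYTE // PETABYTE
terabytes = predicted_zettabytes * ZETTABYTE // TERABYTE

print(f"1 ZB = {ZETTABYTE // PETABYTE:,} PB")
print(f"{predicted_zettabytes} ZB = {petabytes:,} PB = {terabytes:,} TB")
```

Under those definitions one zettabyte is a million petabytes, so 44 zettabytes works out to 44 million petabytes.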

How are we going to manage all these data, when right now we’re still managing terabytes and petabytes of data, and being like, “Man! This is a lot of data!” That’s why it’s important for data engineers, is that’s contributing to this deluge of data. How does all that affect us, as far as what are some of the concepts? When we start talking about IoT, and sensors, and having these data on the edge, being able to pull information back, but also being able to push the information out. What does that start to say?

As we’ve talked more and more about real-time analytics, this is where we’re really going to start to see real-time analytics really taking hold. As soon as we can get that data, and be able to analyze it and push information back out, that’s what’s going to help us. Think about it with automated cars, with a lot of the things that are going on outside in the physical world, where we have sensors, and devices talking to devices, streaming analytics is going to be huge in IoT.

The question becomes, if you’re looking to get involved in IoT, what are some of the projects? What are some of the things you can do to contribute and be a part of this IoT revolution? I would look into some of the messaging queues. Look at Pravega, look at Kafka, even look at RabbitMQ, and some of the other messaging queues, because think about it. As 20 billion devices, maybe more, by 2020. As these devices come in, they have to come into a queue. They have to be stored somewhere before they can be processed and before we can analyze them. I would look into the storage aspect of that.
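The idea that device data has to land in a queue before anything processes it can be sketched with nothing but the Python standard library. This is only a toy stand-in, assuming three made-up refrigerator sensors and an in-memory `queue.Queue`; it is not how Kafka, Pravega, or RabbitMQ are actually used, and names like `sensor`, `consumer`, and `fridge-0` are hypothetical.

```python
import queue
import random
import threading

# In-memory stand-in for a real message queue (Kafka, Pravega, RabbitMQ).
readings = queue.Queue()
processed = []


def sensor(device_id, count):
    """Hypothetical device: pushes `count` temperature readings onto the queue."""
    for _ in range(count):
        temp_c = round(random.uniform(1.0, 8.0), 1)
        readings.put({"device": device_id, "temp_c": temp_c})


def consumer():
    """Pulls readings off the queue until it sees the None shutdown marker."""
    while True:
        item = readings.get()
        if item is None:
            break
        processed.append(item)


# Start the consumer first, so readings are picked up as they arrive.
worker = threading.Thread(target=consumer)
worker.start()

# Three devices send five readings each, all at the same time.
producers = [
    threading.Thread(target=sensor, args=(f"fridge-{i}", 5)) for i in range(3)
]
for producer in producers:
    producer.start()
for producer in producers:
    producer.join()

readings.put(None)  # tell the consumer there is nothing more to read
worker.join()

print(f"processed {len(processed)} readings")
```

The point of the queue is that the devices never wait on the consumer. They drop their readings off and move on, and the processing side works through the backlog at its own pace. Real messaging systems add durability, partitioning, and replay on top of that same basic shape.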

Also, know how to do the processing. Look at some of your streaming processing, whether it be Apache Beam, whether it be Flink, or whether it be Spark. I would look into those, if you’re looking to get involved in IoT. If you have any questions, make sure you submit those in the comments section here below, or go to thomashenson.com/big-questions. Submit your questions, and I’ll try to answer them on here.
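For a feel of what stream processing engines do with sensor data, here is a minimal sketch of one common operation, averaging readings over fixed (tumbling) time windows. It is plain Python over a small made-up list of `(timestamp, value)` pairs, not the Beam, Flink, or Spark API, and the function name `tumbling_window_average` is hypothetical.

```python
from collections import defaultdict


def tumbling_window_average(events, window_seconds):
    """Group (timestamp, value) events into fixed windows and average each one."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Every event belongs to exactly one window, keyed by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
    }


# Made-up sensor readings: (seconds since start, temperature in Celsius).
events = [(0, 4.0), (3, 6.0), (10, 5.0), (12, 7.0), (14, 9.0), (25, 3.0)]

averages = tumbling_window_average(events, window_seconds=10)
print(averages)
```

With 10-second windows, the first two readings are averaged together, the next three form the second window, and the last reading sits alone in the third. A real streaming engine does the same grouping continuously as events arrive, and also has to handle late or out-of-order data, which this sketch ignores.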

Until next time, see you again.

Freelance Hadoop Administrative Roles

May 25, 2018 by Thomas Henson

Freelance Hadoop Admin Roles

A lot of the world’s economy is shifting to a freelance or contracting economy, or what Seth Godin terms a “gig” economy. For Data Engineers heavy on the development side of Hadoop projects, that is an easy transition. Software development projects have a natural flow with a start and an end point. How does that work for the data engineers who are Hadoop Admins? Traditionally, Operations or Ops roles are full time with no end in sight. In fact, most have on-call hours where Administrators have to be available 24/7. How can these types of Data Engineers find Freelance Hadoop Admin Roles? Find my thoughts on what a Freelance Hadoop Admin role looks like and where to look in the video below.

Transcript – Freelance Hadoop Admin Roles

Hi, folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. In today’s episode, I’m going to tackle questions about, what are some of the projects freelance Hadoop engineers can do, or Hadoop administrators? What are some projects in the freelance world that are going to translate and be good targets if you’re looking to be able to grab some kind of freelance Hadoop administrative job?
Find out more, right after this.

Welcome back. Before we jump into today’s question, I just want to remind you. If you have any questions, submit them in the comments section here, below, and then also, make sure you subscribe to my channel, so that you never miss an episode. I will answer as many of these questions as I can get to. I just need you to keep coming in, and submitting the questions, and giving me feedback on the types of content that you’d like to see.

Thank you everyone for subscribing. I really appreciate it, and now let’s jump into today’s Big Data Big Question. My question comes in from a YouTube comment. What freelance projects can be done by Hadoop administrators?

This one’s a little bit tougher, I believe, than when we talk about data engineers and we talk about the development side. I feel like those projects are a little bit easier to find as far as being able to bid for them, but also new projects come in on the development side a lot. When you think about the administrative side, so think about continuous development, continuously holding up that operations side for Hadoop. You’ve got a Hadoop cluster. You’re continually adding new clusters. You’re patching it. You’re doing the day-to-day operations, so, those are more permanent roles.

It’s a little bit harder to find a freelance position for a couple months or something like that on the administrative side than on the development side. However, I will say, on the development side I think you’re going to find more of a short-term contract. You’ll have a project that comes in. It may take you two weeks, it may take you two months. I believe the Hadoop administrator roles, they’re going to be a little longer if you’re looking for a contract position. A lot of those are going to be more consultative type, so think of new emerging companies. They’re starting up their Hadoop environment, jumping into the Hadoop ecosystem. They don’t really have a basis for how they’re going to do it, and so they’re looking for people to really come in and be those knowledge experts to help them get off the ground.

That kind of engagement is probably going to be at least three months, probably six, maybe even a little bit longer. These are more long-term contracts in my opinion, that you’re going to be able to find. The cool thing about these roles is, if you have a background or you have a desire to be able to be a trainer, and be able to help lead and teach other people, that’s one of my passions, these are the kind of roles that you’re going to be able to do. Not only do you get to be hands-on and be technical, you get to help a team that is brand new to the Hadoop ecosystem, maybe has a ton of experience in other areas, but you get to draw on that experience and help them build out their first Hadoop cluster, start working on some of their first use cases, and it’s something that, it can be very rewarding.

If you’re looking for these roles, and that’s probably what the question is referring to, I would look for companies that are just starting to dive into the Hadoop ecosystem. This is probably going to be a little more, you’re looking for people that are just going into that role, so look to see who’s hiring Hadoop administrators and some of those other roles, and then just reach out and contact those companies. Let them know. Give them your background, talk to them a little bit about some of the technologies that they’re working on. If you have anything that you’ve been contributing to or working with in the open source community, around HDFS, or Ambari, or any of the administrative things like Zookeeper, those are where you can really shine, and say, “Hey, look. I’m involved in this community, here. I’d love to come in, have a conversation, talk to you about how you’re setting up your Hadoop cluster, where some of the troubleshooting issues are going to come up. What are some of the things that I’ve seen with my experience, that I think you should look out for, and I can kind of help with?”

Those are going to be amazing roles. Like I said, it’s going to be harder, probably, than the data engineer who focuses more on the development side, but it’s going to be, in my opinion, could be a lot more rewarding, because of the fact that, you’re probably going to be more of a consultant, and you’re going to be more running a team. You’re still hands on in the tech, but you’re actually being able to train and communicate to others how they’re going to have this system up and running long time after your engagement ends.

Thanks for the question, and make sure you subscribe to the channel, and then we will see you on the next episode of Big Data Big Questions.

Show Notes Links

Ambari

Ambari Meetup

Ambari Mailing List

HDFS Documentation

Hadoop Mailing List

Want More Data Engineering Tips?

Sign up for my newsletter to be sure and never miss a post or YouTube Episode of Big Data Big Questions where I answer questions from the community about Data Engineering questions.

What is the Difference Between Spark & Hadoop

May 14, 2018 by Thomas Henson 1 Comment

Spark & Hadoop Workloads are Huge

Data Engineers and Big Data Developers spend a lot of time developing their skills in both Hadoop and Spark. For years Hadoop’s MapReduce was King of the processing portion for Big Data Applications. However, for the last few years Spark has emerged as the go-to for processing Big Data sets. Still, it can be unclear what the differences are between Spark & Hadoop. In this video I’ll break down the differences every Data Engineer should know between Spark & Hadoop.

Make sure to watch the video below to find out the differences and subscribe to never miss an episode of Big Data Big Questions.

Transcript – What is the Difference Between Spark & Hadoop

Hi, folks, Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. And so, today’s question is one I’ve been wanting to tackle for a long time, I’m not sure why I haven’t gotten to it, but I’m ready to get to it. So, it’s the ultimate showdown. What’s the difference between Hadoop and Spark, and which one will win in the fight? So, find out how I’ll answer that question right after this.

Welcome back. So, today’s question comes in from a user. It came on through a YouTube comment section. So, post your question down here below. You can actually go to my website and go to Big Questions. So, thomashenson.com/bigquestions. Put it out on Twitter. Use the hashtag Big questions. I’ll look it up, try to answer those questions.

So, today’s question comes in and it says, YouTube comment, “Nowadays, there are predominantly two softwares that are used for dealing with big data, Hadoop Ecosystem and Spark. Could you elaborate on the similarities and differences in those two technologies?”

So, that’s an amazing question. It’s one that we hear all the time. So, Hadoop, very mature technology, it’s been out there. Really, it’s associated with a lot of things that are going on in the Big Data community, and a lot of times when you say big data, it’s almost synonymous that you’re going to say Hadoop as well. But with Hadoop being over 10 years, maybe 13 years old, just depending on how you look at it, a lot of people are calling for its death, and Spark is the one that’s going to do that.

But there’s a little bit of difference. Like I said, we say that Hadoop is this all-encompassing thing. You hear me say it all the time, the Hadoop Ecosystem. So, I call it an ecosystem because a lot of things get pulled into the Hadoop Ecosystem. A lot of people say things, like assuming that Hadoop runs and does all the processing, and has all the functionality for your applications, or if you’re running it. But in a lot of data centers, you can run big data clusters and not be using Hadoop or not be using MapReduce.

And so, let me explain a little bit what I mean really by the true definition for Hadoop and then we’ll talk a little bit about Spark. So, Hadoop is built of two components. So, we separated it out into two different components. And so, the first one we’re going to break down is MapReduce. So, you’ve probably heard of MapReduce that’s what started that being able to process large datasets, and so it’s an indexing, somewhat of an indexing way to do data. So, if you have a cluster, you’re able to run your mapper and your reducer jobs, and be able to process data that way, and that functionality is called MapReduce. That’s one portion of Hadoop.
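To make the mapper and reducer idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (mapper, shuffle, reducer, run_job) and the sample input are assumptions made for illustration. It only mimics the three phases on a single machine: map each line to key/value pairs, group the pairs by key, then reduce each group to one result.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Collapse all the values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["big data big questions", "big data"]  # made-up input
    print(run_job(sample))
```

On a real cluster the mapper and reducer would run on many machines at once, with each mapper working on the blocks of data stored near it. The sketch keeps everything in one process so the flow of data is easy to follow.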

Another part of Hadoop, the really cool, the part that I’ve been involved with a ton is called the Hadoop Distributed File System or HDFS. And so, HDFS is the way that all the data is stored. And so, we have our MapReduce that’s controlling how the data is going to be processed, but HDFS is how we store that data. And, so many applications whether they’re in the Hadoop Ecosystem or new to just data processing or even just scripting, uses that Hadoop or HDFS to be able to pull data and be able to use your data as a file system.

And so, you have those two pieces right there and those two components.

When they talk about Hadoop being old, or Hadoop being slow, or portions of Hadoop that people aren’t interested in, most of the time they’re talking about the MapReduce portion. And so, there’s been a lot of things that have come out. So, there’s been MapReduce 1, and then MapReduce version 2, and Tez, and just different components around to compete with MapReduce, and Spark is one of those technologies as well.

And so, Spark is a framework. It’s called lightning fast, but it’s a framework for processing data. And so, you can still process your data that exist in HDFS, that exist in S3. There’s other places that your data can exist and be processed by Spark, but predominantly, most of the data centers, they still have their data in HDFS. So, things were built upon HDFS. HDFS is where your data is housed and so you process it whether you’re using Spark, whether you’re using Tez, whether you’re using any new way of processing the data, or you still may be using MapReduce, but you can have all that in HDFS. So, when you think about it, the two do compete, they do compete, but primarily from a processing engine.

And so, I’ve got a couple blog posts out there that I’ll link to here in the show notes, but you can go out and see where I break down the difference between batch and streaming, and some of those different workloads. And so, Spark really came on whenever we started talking about being able to stream data and being able to process data faster as it comes in.
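The batch versus streaming difference can be sketched in a few lines of plain Python. This is only an illustration of the two workload styles, not Spark or MapReduce code; the function names and the sample numbers are made up. The batch function waits for the whole dataset and answers once, while the streaming function updates its answer every time a new record arrives.

```python
def batch_average(records):
    """Batch style: wait for the full dataset, then compute the answer once."""
    return sum(records) / len(records)


def streaming_average(records):
    """Streaming style: yield an updated answer as each record arrives."""
    count = 0
    total = 0.0
    for value in records:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    page_load_ms = [120, 80, 100, 140]  # made-up sample values
    print("batch:", batch_average(page_load_ms))
    for running in streaming_average(page_load_ms):
        print("streaming so far:", running)
```

Both styles end up at the same final number here. The difference is when you get an answer: only at the end for batch, or continuously for streaming.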

And so, that’s why you see a lot of people that are talking about Hadoop being the past technology and then Spark’s the newer technology that’s going to take over the world. There’s still going to be components from the traditional Hadoop like we talked about with HDFS. That’s probably still going to be used, I think, for a long time. Like I said, there’s still a ton of people and a ton of developers still using MapReduce. And so, MapReduce has its functionalities for when we talk about batch workloads and there’s still development going on with MapReduce 1, and then Tez and some other platforms that are encompassed in the Hadoop community.

So, I would say, if you’re looking at it from a learning perspective, all right, which one do I want to learn, do I want to learn Hadoop, or do I want to learn Spark, and thinking that it’s all or nothing. I would say it’s not. I would focus mainly depending on what you’re looking to do, but I would definitely focus and learn HDFS, and so understand how the file system works and how you can compress, and how you can make those calls because chances are you’re going to be using HDFS and you’re also going to be using Spark, and Tez, and HBase, and Pig, and Hive, and a lot of different other tools in the ecosystem.

And so, I would say, it’s not an either-or, you’re not going to pick and say, “I’m only going to do Spark,” or, “I’m only going to do Hadoop.” You’re more than likely going to be using a lot, using Spark too for your streaming applications and for your processing of your data, but you’re still using Hadoop, and the things in Hadoop with HDFS and being able to manage your data maybe with the [INAUDIBLE 00:06:28], and some of the other functionalities that are in that ecosystem. So, it’s not an all or nothing thing. And so, learning one is not going to stop you from getting a job, and it’s not going to prevent you from having to learn another one. So, it’s not an either-or thing, but if you’re asking who will win in the future, I would say they both win.

Well, that’s all I have for today. Make sure to subscribe to the channel so you never miss an episode. We’ve got a lot of things that we’re working on, so we got some Isilon quick tips that are still rolling out. We’ve got some book reviews starting to get some interviews, so you can see some interviews that [INAUDIBLE 00:07:03] in the past, and then also these Big Data Big Questions. And so, anything that you want to see, just pop here in the comment section and I’ll try to answer it or try to tackle at the best I can. Thanks again.

Skills Needed for Big Data Administrators

April 30, 2018 by Thomas Henson 1 Comment

Data Engineers & Big Data Administrators

In today’s episode of Big Data Big Questions we tackle what skills are needed for Big Data Administrators. Data Engineers wear many hats in Data Analytic workflows, one part software engineer and one part systems administrator. The Big Data Administrators are responsible for keeping Hadoop, Kafka, Ambari, and other frameworks running. Find out what other skills Big Data Administrators need in the video below.

Make sure to subscribe to my YouTube channel to never miss an episode of Big Data Big Questions.

Transcript – Skills Needed for Big Data Administrators

Hi, folks! Thomas Henson here, with thomashenson.com, and today is another episode of Big Data Big Questions. Today, I’m going to answer a user question about data administration, or in big data, what is that big data administrator’s role?

What are some of the tools that they use? How can you get involved? Find out more, right after this.

Welcome back. Today’s question is going to revolve all around the big data administrator, what that role is, what are some of the tools that they use? This question came in from my website. You can do Big Data Big Questions, go to thomashenson.com, click on Big Questions, submit a question there. Put them in the comments section here below, and then always, make sure that you’re subscribing to this YouTube channel, so that you’ll never miss an episode. These are great tips. These are great ways for me to answer any questions that you have. If you have those questions, ask them, but also make sure you’re subscribing to the channel.

Today’s question comes in from Jarvis. He says he has a dilemma on Python for big data. We answered a number of questions around Python and big data, and then do you have to know Java? But, this one is a little bit different. It’s going to cover the data administrator.

Hi Thomas, a big fan of yours.

Thanks for watching. Thanks for sending in the question.

I had a question related to IT careers and skills in big data. I wanted to know if Python is required only by data administrators, or can all things done by Java on big data be implemented using Python as well?

This question is really good. Like I said, we’ve talked a little bit about, do you have to know Java in order to be able to be a big data admin, be involved in big data, be a data engineer?

The answer is no. You can do things in Python, but I want to tackle the question from the perspective of, you’re asking about data administration, and so there are two different roles. We’ve talked about the data engineer versus the data scientists. The data engineer is the one who’s setting up the cluster, maybe doing some of the software development, running your Hive jobs, maybe even just the software developer, from if you’re writing Java jobs, if you’re writing your Spark jobs, but your data administrator, that’s a different role inside of that. We have two pieces of the spectrum. This side over here, this is more software development side generated, and on this side over here, let’s say that this is more of the administrator, or our systems engineer, the person who’s setting up and running the cluster. Maybe not doing the day-to-day coding but doing the administrating and running of the system. Think of that as your full stack developer.

Think about when you split up your systems admin, who’s setting up the stack, making sure the database is running, doing those tasks versus who’s running the… Whether it be PHP code or .NET code. What skills does a data administrator have to have?

I would say that, if we’re talking about being able to be involved in the community, and be involved in big data, you’re going to be keying on HDFS, Ambari, Hive, Flume, and you’re going to have a lot of Linux skills. If you’re asking me, you want to get into data administration, you want to be an awesome data administrator in the big data ecosystem, do you have to know Java? No. Can everything be implemented in Python? Maybe, but you’re probably going to be doing more administrative tasks as far as setting up the cluster, understanding the operating system that Hadoop’s running on.

You’re maintaining more that Linux level, and the Hadoop ecosystem level, so if you’re using Hortonworks or you’re using Cloudera, how all those tools are integrating and talking to each other. I would focus more on not even so much the coding part, but as far as being able to set up that cluster. It’s going to vary, too. It’s going to vary in the role.

Some places, especially when you’re just starting out on big data, and you have a small team in your company, you’re going to be the software engineer and the data administrator, right? You might need to have a little more code.

If you’re going to a more seasoned team or a bigger team, you can actually have that role where you’re running the administration. My answer is, I wouldn’t worry so much about Python and Java, if that’s the role that you’re wanting.

The data administrator, I would worry about being able to integrate the tools. Be familiar with the tools, be familiar with how to set up, how to add nodes, how to take nodes down. How to set up secondary name nodes, so, being able to make sure that, when one name node goes down, you can flip over to the second name node. Being able to back up the data. Making sure that we’re taking snapshots. All the kind of tasks that go into running the system, versus being able to write a MapReduce job. If you’re really keen on being a big data administrator, which, those are great roles, those are a lot of fun, you’re still hands on, but you’re not really having to write the jobs.
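As a rough sketch of what a few of those tasks look like on the command line, the small Python helper below builds (but does not run) some hdfs commands for snapshots and for checking name node state. The helper names, the /data/warehouse path, the "nightly" snapshot name, and the nn1 and nn2 service IDs are all hypothetical. The haadmin check assumes a cluster configured for HDFS high availability, where the failover partner is usually called the standby NameNode, while the Secondary NameNode handles checkpointing.

```python
def snapshot_commands(path, name):
    """Commands to allow snapshots on a directory and then take one."""
    return [
        ["hdfs", "dfsadmin", "-allowSnapshot", path],
        ["hdfs", "dfs", "-createSnapshot", path, name],
    ]


def health_check_commands(namenode_ids):
    """Commands to report cluster capacity and each NameNode's HA state."""
    commands = [["hdfs", "dfsadmin", "-report"]]
    for service_id in namenode_ids:
        commands.append(["hdfs", "haadmin", "-getServiceState", service_id])
    return commands


if __name__ == "__main__":
    # Hypothetical path, snapshot name, and NameNode service IDs.
    plan = snapshot_commands("/data/warehouse", "nightly")
    plan += health_check_commands(["nn1", "nn2"])
    for command in plan:
        # Only print the commands; nothing is executed against a cluster.
        print(" ".join(command))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere. On a real cluster an administrator would review them and run them with the right user and permissions.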

You’re checking out new tech, checking out new projects, to see, “Hey, am I going to be able to integrate this into our system,” or, “Man, you know, we’ve got two or three more nodes that are going to come online, so let’s make sure that we get those racked and stacked, and then, let’s make sure that we’re adding those to the cluster, too.”

A lot of cool things that you can do in that role. Most of them aren’t going to involve coding, so you’re not really going to have to worry about Java, you’re not going to have to worry about Python, as much as you would in the traditional data engineer, where you’re looking at being more of a software engineer.

I hope I answered your question. If anybody else has any questions, put them in the comments section here below. Make sure to follow me here, so click subscribe, and then I’ll see you next time.

Rise of the Machine Learning Engineer

April 27, 2018 by Thomas Henson Leave a Comment

What is a Machine Learning Engineer?

Move over Data Scientist the Machine Learning Engineer is now the best role in Big Data Analytics. The Machine Learning Engineer is a hybrid mix of half Data Engineer and half Data Scientist, who can implement the data models and even make recommendation for new data sets. Find out why the Machine Learning Engineer is getting a lot of attention in 2018 by watching the video below.

Make sure to subscribe to my YouTube channel to never miss an episode of Big Data Big Questions.

Transcript – Rise of the Machine Learning Engineer

Hi, folks! Thomas Henson here, with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question comes in from a user, and this is all going to be about the machine learning engineer. What is a machine learning engineer? How does it differ from a data engineer or data scientist? We’re going to jump into all that right after this.

Welcome back. Today’s question comes in from a user, so before we jump into the question, make sure that you go and click on the subscribe, so that you never miss an episode. Also, if you have a question and you would like for me to answer it, about data engineering, about books, about business, anything around IT and specifically probably data analytics, make sure you put those in the comments section here below. Go to my website, thomashenson.com/bigquestions or use the hashtag #BigDataBigQuestions on Twitter. I will try my best to answer those as quickly as I can.

I’ve been getting a lot of questions in, and I’m really thankful for all the questions, and I am working through them as well. Today’s question comes in from a user. From the comments section on YouTube, Andrew Wiley [Phonetic]. He says, “Is it possible to learn both data science and data engineering?” This question stems off of the Cloudera certification. I’ve answered some questions around what is a data engineer, what is a data scientist, but this question is specifically, “Okay, is there a blend of the two?” Is there one position that’s a blend of the two?

I’ll say, for a while, there’s been a lot of confusion around, “Okay, if you’re a data scientist, you know how to stand up a Hadoop cluster, or if you know how to stand up a Hadoop cluster, you must be a data scientist. You’re a wizard, right?” This question is about, what about the blending of the two skills? Think about it from a web development perspective. For a long time, we had our web developers, and we had our back-end developers, and then we had the full-stack web developer. Now, we have a full-stack data engineer, and those are called machine learning engineers.

On a recent podcast out there, that O’Reilly did at Strata, they had a couple guests on talking about the rise of the machine learning engineer, and so I would say that if you’re looking to have skills with data science and data engineering, that position is going to be called a machine learning engineer. My view on how the machine learning engineer has come to fruition is in two parts. If you’re working in a small development or small analytics shop, most likely the data engineer, the person who’s putting together the code and running the system, there’s going to be one or two people on that. It’s going to be a really small team, who are going to be filling that role of a data scientist.

There’s a lot. There’s a big skills gap for data engineers and even more so with data scientists, too. You might be able to go through and look at some of the prescribed analytics and machine learning algorithms that you want to use, and you, as the data engineer, will understand how to use those. It’s not just willy-nilly, like, “Hey, I’m just going to pull this one down and have it.” You need to have a background in statistics, and probability, and heavy on math. One of the things, one of my gaps in skills that I’ve been working on is the math part.

You can follow along as, watch me learn how machine learning… The machine learning course, with Andrew Ng’s course, and you can see some of the things, especially if you’re a data engineer, that you need to shore up, so that you can fit into that machine learning engineer.

Think of the machine learning engineer in the small shop as, you’re the full-stack developer, you’re the full-stack engineer. It’s kind of doing everything. Then, in larger corporations, what you’re going to have is, like I said, we’ve got it on both sides of the spectrum. You’ve got your data engineers, who are really good at setting up, administrating an environment, maybe even doing the software development, running Hive, creating the MapReduce jobs or the Spark jobs, but then you have your data scientists who maybe have some SQL skills, really good at math, but not really good at the technical. The machine learning engineer is that person in the middle, to kind of bridge the gap. In bigger shops, you’re going to have your machine learning engineer who’s working with your data scientist, and then starts to be able to pick up on, “Okay, this is the way that we like to do some of the things here,” and you’re really owning that part of the stack, and so, you’re not so much worried about developing and doing what I would call the Hadoop administration, or even the Hadoop development.

When I say Hadoop, remember, we’re just talking about anything in that ecosystem. Your machine learning engineer is your specialization of that. I did a little research, too, just to look at it. Just pulling it up, just some preliminary research, just looking for jobs out there. A lot of times, we’ll say, “Yeah, this is, you’re an Excel guru, and you say, ‘Excel guru?'” You go look, and there’s nobody with a job title excel guru. You’re giving it to yourself.

Looking at machine learning engineer, quick search on Google for jobs, there are a lot of different postings from companies all the way from IBM to Facebook, Lyft, a lot of different postings out there, just in my quick search. Also, looking at Glassdoor, and some of the other places, the salary ranges are right there with what a data engineer is, so anywhere from the low 80s, which I wouldn’t think that, that’s probably not really a true machine learning engineer, or maybe it’s in a different part of the country, all the way up to the 160s. That’s salary range per year. I thought that was pretty good mix, there.

Really fit in line with what we see as the data engineer and the data scientist, so those roles are out there. If you’re excited to go out and learn those, remember what I was saying. Want to have a solid background as a data engineer with understanding how the Hadoop administration works. Also, the workflows, and some of the development skills. Want to be able to implement, if you’re using Mahout, if you’re using TensorFlow, any of those frameworks, you want to be able to implement those, but then you also want to have the math portion too, so make sure you understand the algorithms from a math level, and how to tweak, and how to tune those.

That’s all for today. Hope I answered your question. If you have any questions, anybody out there, make sure that you first go and subscribe, and then ask your question. I’ll try to answer them here. Have a good day.

Data Engineer vs. Data Scientist

December 27, 2017 by Thomas Henson 2 Comments

What’s the Difference Between the Data Engineer & Data Scientist

The Data Scientist has been one of the top trending career choices for the past 3-4 years, but where is the love for the Data Engineers? In reality I think more people are confused about the roles in Big Data. The titles Data Scientist and Data Engineer are often used interchangeably, but the roles require different skill sets. In this video I will break down the differences between the Data Engineer vs. Data Scientist.

Transcript – Data Engineer vs. Data Scientist

Thomas: Hi, folks. I’m Thomas Henson with thomashenson.com, and today is another episode of Big Data, Big Questions. And so today, we’re going to be tackling the differences between a data scientist and a data engineer. Find out more right after this.

Thomas: So, do you want to become a data engineer, or do you want to become a data scientist? So, this is a question…this is something we see a lot about is all about the data scientists, and big data, and then data engineering. But what’s the difference between the two roles, and why is it that since 2012 data scientists have been the sexiest career in IT? And there’s been a lot of publication about it. There’s been a lot of information about it. We’re all talking about machine learning, and smart cars, and the battle between Tesla and Apple for machine learning engineers, data scientists, and how they can battle that out. But what about the data engineer, too? And kind of what are the differences? Well, I’ll tell you. Recently, Information Weekly had a survey out there for 2018 and the highest paying salaries in IT. Data engineer came in at number five. The only other roles that were above the data engineer were all C-level positions. So, think of your CIO, your CSO, your CTO. So, data engineers are getting a lot of love, too. But what are the differences between those two roles? So, we’ll break it down first jumping into what a data scientist does.

So, what does a data scientist do in their day to day work? Well, one of the things that they do is they evaluate data. So, that’s kind of a given. But how do they do that? So, they use different algorithms. They look at different results from data. So, say that we’re trying to find out a way for a web application to have more users that are engaged with it. So, how do I create more engaging content for my users and for my customers and give them something to value? Well, a data scientist would be able to take and look at different variables. So, maybe we get in a room, and maybe everyone kind of comes up with some variables and says, “Okay, how can we improve user retention? So, does this piece of content work? We’ve got some testing on these other pieces. Here’s some of our historical data.”

And so the data scientist, what they’ll do is they’ll evaluate all those data points and tell you which ones are going to be the most relevant. And they’ll do that by using algorithms. So, they’ll look, and maybe they’ll use SVD to find out, okay, which variables are going to make the most sense for us to have more engaging content, have a web application that makes users want to stay and engage with it longer. And so that’s kind of where their role is. Now, they’re not going to be the ones that are writing the MapReduce jobs or doing some of the Spark jobs. We really want them just evaluating the data and helping us build different data models that are going to give us the results we’re looking for.

So, if we can just increase our user retention time or increase our engagement of our content, our web application is going to be more popular. So, in our example, that’s what we want. We want our data scientists that are really evaluating…finding correlations between data and also eliminating correlations. So, this variable that we predicted that we thought was going to be very key to engaging for our web application for our users, it really doesn’t make a difference. And so it gives our developers, and our engineers, and our product and marketing team things for them to look at and say, “Hey, these are the variables that we need to focus on, and this is what’s going to make our web application…give us the desired results we’re looking for and increase that user retention time, increase the engagement for our users in our web application.” So, that’s our data scientist.
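Here is a small sketch of what finding and eliminating correlations can look like. It uses plain Pearson correlation rather than the SVD mentioned earlier, simply because it fits in a few lines of standard Python. The variable names (videos_watched, banner_variant) and all of the numbers are made up for illustration. The idea is to score each candidate variable against retention time and rank the variables by how strongly they move with it.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Assumes neither list is constant (a constant list has zero variance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_variables(variables, target):
    """Return (name, correlation) pairs, strongest relationship first."""
    scores = {name: pearson(values, target) for name, values in variables.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Made-up data for five users.
    retention_minutes = [1, 2, 3, 4, 5]
    candidates = {
        "videos_watched": [2, 4, 6, 8, 10],
        "banner_variant": [1, 0, 1, 0, 1],
    }
    for name, score in rank_variables(candidates, retention_minutes):
        print(name, round(score, 3))
```

In this toy data, videos_watched moves in lockstep with retention and banner_variant does not move with it at all, so the second variable is the kind a data scientist would tell the team to stop worrying about. Real analysis needs far more data, and correlation alone does not prove that one thing causes the other.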

Now, on the flip side, what is our data engineer going to do? So, our data engineer, they’re the ones that are going to say, “We’ve got this data here. We’re moving the data maybe into our Hadoop cluster. So, we’re moving it into our Hadoop cluster or wherever we’re storing it for analytics.” And so they’re the ones that are really moving that data there. They’re also writing those MapReduce jobs or Spark jobs. They’re doing the development portion of big data. So, our data scientists are over here saying this is the data that I need. The data engineer is saying, “Oh, we have the data. What kind of format…? How should we clean the data? How fast do you need the data, too?” So, how much speed is a concern for some of these variables and being able to ferret out some of the details, and being able to give…maybe improve that product a little bit faster to get it to the users.

And so that’s where you’re going to see the data engineer. They’re also going to be the ones that are managing and configuring our Hive and HBase deployments and doing some of the technical debt work that we’ve talked about before with making sure that we have a strategy for backup, making sure we have a strategy for high availability. So, this product that we’ve got here for our web application, we want to make sure that we’re still feeding our data in, and our data models are feeding our data back to our data scientists. But then we’re also pushing out those results from what the data scientists have given us, too. So, you kind of see two distinct roles.

So, our data engineer, they’re going to be involved in the tech. They’re going to be the ones that are really building those systems out. Where our data scientists, they’re involved with the data. They’re involved with the technology as far as how to use the…what tools are going to help them be able to [INAUDIBLE 00:05:20], use different algorithms, and be able to say, “This data point really makes a difference where this other data point may not be making as much of a difference, and so it’s going to…” They’re going to be using those tools for that. But basically what they’re doing is they’re involved in the data. And you see the data engineers involved in the technology, and implementing, and kind of using that strategy.

So, I’m not saying one is better than the other one, but I may be a little bit biased because I’m a data engineer and like data engineering. But two different skillsets, two very important skillsets, two amazingly great career choices right now in IT, two of the probably highest paying individual contributor roles in IT right now. So, you can’t go wrong either way. If you’re looking for more tips and more information about being a data engineer, make sure you subscribe to this channel and find out more information about data engineers. We explore how to do different things. If you have any questions like this, make sure you submit them. Big Data, big questions, using the #bigdatabigquestions on Twitter, go to my website, thomashenson.com, Big Questions, submit your questions there. Put it in the comments below here. I’ll answer it on YouTube the best I can. Any questions like that, just get in touch with me. I’ll answer them on here. Make sure you subscribe. Thanks again for tuning in, and I’ll see you next time.