Archives for May 2019

Data Engineer in 2019

May 31, 2019 by Thomas Henson 3 Comments

What’s a Data Engineer Career Like in 2019?

Times change and keeping up with maintaining skills while managing day to day projects can be exhausting.

Which is better, Flink or Spark?

How as a Data Engineer will I focus on Containers?

Questions like these come up all the time when I’m speaking with aspiring and career-focused Data Engineers. Find out my thoughts on skills and career outlook for Data Engineers in 2019 in this episode of Big Data Big Questions.

Transcript – Data Engineer in 2019

Hi folks, Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question comes in around, what does data engineering in 2019 look like? What are some of the trends? What are some of the things that are going on? This question came in from the comments section here on YouTube, so if you have a question, make sure you put it in the comments section here below or reach out to me on thomashenson.com. And I’ll answer them in order as they come in, provided I have the time. I’ve been getting a ton of questions, so I really appreciate it. Thank you for this community here.

Today’s question, before we jump into it, I want to give you my three top trends to watch for in 2019. Before we do, I did want to give credit to an article on 10 trends in big data. I talked about them on my YouTube live session, so if you’re ever around Saturday mornings, jump on. Throw me a question in the chat. Let’s get to cracking. I try to answer as many questions as I can there, and try to do that Saturday mornings. Jump in there.

The 10 here, you can check in the comments section here below, where I have the link to the [Inaudible 00:01:13] trends. I’m going to read some of them real quick. The first one they listed for the top 10 trends in 2019: data management and [Inaudible 00:01:22]. They’re talking a little bit about ETL and how ETL’s not going away. I’ve said that for a while, but we did read an article not too long ago that’s saying, “Hey, you know, there’s some tools out there that are really going to make ETL kind of a thing of the past.” We’ll see. Hopefully, right?

I’m not for ETL, I just, man. Started out there, and it seemed like I was never going to get out of it. Number two, data siloes continue to proliferate. This goes into what we saw when Hadoop emerged as this huge, big data lake, where the data’s only going to exist there. We’ve been talking about it, especially on this channel, over the past few years where, hey, data has a lot of gravity to it. There’s going to be data out on your edge. There’s going to be data in the cloud. There’s going to be data still in core data centers.

The idea of a fluid data lake is a little bit more consolidated. You still have those main areas, but you still have to do analytics in place in some areas. Number three, streaming analytics has a breakout year. I’ve talked about streaming analytics on this channel for the last couple of years. I actually did a session about the future architectures of streaming analytics at the 2017, was it Hadoop Summit? They call it DataWorks now.

Data governance builds steam, talked about some of that here. Soft skills start to emerge as tech evolves. Just talking about the soft skills of understanding the business, I’ve talked about that with the book, the Big Data MBA, here. Deep learning gets a little bit deeper. Hm. Have we talked about deep learning on this channel? Special K expands footprint. They’re talking about Kubernetes and what’s going on with Dockerization. Clouds are hard to ignore. New tech will emerge, talking about how Silicon Valley and a lot of open source, and closed source, tools have emerged, and they don’t see that stopping anytime soon. Then, smart things everywhere. I’ve talked about those a good bit here, too.

Without further ado, let’s jump into my three trends to watch for in 2019. The first one, deep learning and Hadoop. How are these ecosystems going to interact with each other? A lot of projects out there talked about it last year: Project Hydrogen, Submarine is another project, and NVIDIA’s RAPIDS. It’s all about being able to use GPUs and also being able to use those deep learning libraries with data that’s in your Hadoop ecosystem, or just for some ETL. That’s one of the things that NVIDIA RAPIDS… Maybe I should do a video just specifically on that. Watch that trend. Start watching what’s going on with TensorFlow and being able to use it integrated with Spark and some of your other tools that are more traditional in the Hadoop ecosystem. That was number one.

Number two. Two? Yep. Number two, containerization of the world overtakes data engineering. Similar to what they were talking about it [Inaudible 00:04:11], with their trends, with Special K being special. I think the containerization, we’ve seen it a lot, a lot of announcements here lately with cloud native applications and cloud native experiences on the Cloudera side, and you even saw in Hadoop 3.0 where they were laying the groundwork to be able to containerize your YARN, your scheduling engine, and some of the other components there. We’re going to continue to see that, and that’s one skill that you’re going to be looking for. If you’re in data engineering right now and you want to know what’s coming down the pipe for you, I would look into getting more familiar with containerization. That’s actually on my roadmap for the end of the year, to understand a little more around Docker, and Kubernetes, and that whole ecosystem. That is a big trend we will see for data engineering. It’s not going to slow down. It’s been picking up steam a lot here lately, but it’s going to go full force.

My third trend, the thing that I’m looking for, for data engineers in 2019: streaming analytics. I was doing some research and looking at some IDC numbers around where we’re headed from a data perspective. One of the interesting tidbits that they were talking about is how streaming analytics will take up around 30% of all the analytics and things that are going on in Azure. Think about all these different devices bringing in data here by 2025. 30% of that’s going to have to be streaming analytics. That’s a huge number. There’s a number of tools out there that are helping to try to deal with what’s going on from a streaming analytics perspective.

We’ve got Kafka. On the cloud side we’ve got Kinesis. A lot of different tools. We had [Inaudible 00:05:42] on this channel here, but there’s a lot of tools in place, a lot of tools being created, because streaming analytics is a huge beast of data to handle. It’s a different kind of problem than what we’ve seen, and it’s only going to get worse as we start bringing in more data, more devices. Really cool opportunities for you as data engineers. Outside of my goals for 2019, if you’re looking for some things to jump into and some educational paths for yourself as a data engineer in 2019, I would look into those three trends. Deep learning, containerization, and then streaming analytics. That’s all I have for today. Make sure to subscribe and ring that bell so that you never miss an episode of Big Data Big Questions. Throw a comment in the comments section here below if you have any questions. If you like the video, if you hated it, just let me know how you feel about this, and I will see you next time on Big Data Big Questions.

Review Google Machine Learning Crash Course

May 21, 2019 by Thomas Henson Leave a Comment

Learn Machine Learning From Google

Learning Machine Learning or Deep Learning can be hard! There are a ton of resources out there to help, but which ones are going to help you accelerate your knowledge in Machine Learning? Recently I went through the Google Machine Learning Crash Course and absolutely loved it! The Crash Course is a free, self-paced course from Google that attempts to help close the Machine Learning talent gap. So how does the Google Machine Learning Crash Course stack up against Andrew Ng’s Machine Learning Course? Also, will this course help you in a Data Engineer role? Find out the answers to these questions in this special episode of Big Data Big Questions: Review Google Machine Learning Crash Course.

https://youtu.be/0j-IdnLktY0

Transcript – Review Google Machine Learning Crash Course

Today is another episode of Big Data Big Questions. Today I’m going to talk about a little more education, where we’re talking about what’s going on with the Google Machine Learning Crash Course. It’s a course that I went through, and I’ve been going through it, and I thought I would share with everyone out there and tell them, hey! What’s in this course, why you should take it, how long it should take, and how awesome is it?

The Google Machine Learning Crash Course. What is it? It’s Google’s way, and it’s a free course, so very important. Free course. You can go out there. It’s in the comments section or description here below. Gives you an opportunity to go out and learn. It’s a free course around machine learning, and it really dives into the approach to machine learning, even the basics around what machine learning is, but then it also starts diving into some of the more mathematical concepts. It still keeps it high-level enough where you don’t feel like you’re going deep, but you have an understanding.

Like I said, some math functions and everything, but a lot of it on the application side. It’s broken into 25 different lessons. They were saying, they talked about how it’s 15 hours to complete. I don’t think it takes 15 hours so much. It’s something I was able to knock out doing 30 minutes a day for, I want to say I knocked it out in two and a half, maybe three, weeks. Very interactive, though. It gives you the opportunity. You’re not just watching a video. You’re not just reading. You’re actually getting hands-on with some code. You go through, and you can do it all from your browser, so you’re not having to download, or import, or install anything. You’re able to use Jupyter Notebooks that they host on GCP. As you go through those, you can actually go in and test out some code. It’s all in Python, whether you’re using Pandas or even TensorFlow in some areas. Pretty cool. They give you the data sets to go through it. I think it’s a very valuable lesson. It’s, like I said, free for 30 minutes a day. It’s worth the opportunity to go in and just look. Find out what’s going on and get some level of understanding.

One of the things I like most about it is, it’s actually put on by the people that are involved in various projects in Google. You get to learn from the data engineers and data scientists at Google around their approach to how to implement these and walk through. Like I said, not always just video. Some of it has reading. Then you have code to back it up. It’s definitely something that’s awesome.

What did I learn in this course? Probably the most important thing that was really stressed in helping me understand was how to fight and how to combat bias in your machine learning code. Specifically around bias. Not bias like we have as humans around, hey! I’m biased to loving data engineers. It’s not that kind of bias, but even bias with your data. How hard it really is to have good data sets that are not going to be biased, because everything, you can’t train with every piece of data in the world. One, we don’t have it all documented yet and in a place to do that.

You’re always limited with what you have in your data set. You always want to make sure that you’re not training models specific to this data set so that when you go out and implement them in the real world, and they get different data sets, that they’re more accurate. There’s a lot of steps and a lot of things that they talk about, about preventing bias and really understanding it. Really, for me, that was a huge, important concept to really understand, and one of the reasons that I really like the course. Actually went through it a couple of times, took a ton of notes. My recommendations are go through it, make sure you take notes, and then go through the course. Let me know what you think about the course. If you go through the course, make sure you put in the comments section here below.
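The point above about not training a model that is too specific to one data set can be illustrated with a held-out split. What follows is only a minimal sketch and is not taken from the course: the synthetic data, the `make_data` helper, and the deliberately bad `memorizer` model are all assumptions made up for illustration, and only the Python standard library is used.

```python
import random

def make_data(n, seed):
    """Synthetic, purely illustrative data: y = 2x + noise."""
    rng = random.Random(seed)
    return [(x, 2 * x + rng.gauss(0, 1))
            for x in (rng.uniform(0, 10) for _ in range(n))]

def fit_line(data):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def mse(predict, data):
    """Mean squared error of a prediction function on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

data = make_data(200, seed=0)
random.Random(1).shuffle(data)
train, holdout = data[:150], data[150:]  # keep 50 rows the models never see

a, b = fit_line(train)

def line(x):
    """Simple model: the fitted straight line."""
    return a * x + b

def memorizer(x):
    """Overly specific model: return the label of the nearest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

for name, model in (("line", line), ("memorizer", memorizer)):
    print(f"{name}: train MSE={mse(model, train):.3f}, "
          f"holdout MSE={mse(model, holdout):.3f}")
```

The memorizer scores perfectly on the rows it was trained on, so its training error says nothing about how it will behave on new data; only the held-out rows reveal that. Checking a model on data it has not seen is one basic safeguard against the data-set-specific training described above, though it does not by itself detect a data set that is unrepresentative to begin with.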

If you have any questions for Big Data Big Questions, put them in the comments section here below. I will do my best to answer those, and let me know, like I said, if you take the course, how did you feel about it? Did you like it? Am I totally off-base? I will see you again next time on Big Data Big Questions.

Data Engineer LinkedIn Profile

May 17, 2019 by Thomas Henson Leave a Comment

How to Connect with the Data Engineering Community on LinkedIn

When it comes to professional networking and social media, LinkedIn is king! However, does it make sense for Data Engineers and Data Scientists to embrace LinkedIn? The simple answer is that if you are not on LinkedIn you are missing a huge opportunity to network and get involved in the Data Analytics community. In this episode of Big Data Big Questions we explore how to use LinkedIn to build a career in Data Engineering. We also dig into tips for optimizing your Data Engineering or Data Scientist profile on LinkedIn. Watch the video below to learn how to amplify your reach in the Data Engineering community.

Transcript – Data Engineer LinkedIn Profile

Hi, folks! Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. You’re probably wondering, where the heck am I? Actually, on the other side is my office. I thought this would be a good opportunity for me just to record in a different location. New year, how about some new places to record? This is my gym that I’ve been continually building up over the years. New view. Today, what we’re going to talk about is a Big Data Big Questions question that came in. It’s all around how do you build a LinkedIn profile for a data engineer? Specifically, how do I build it if I’m in that role today, or how do I build it if I don’t really have work experience? Are there some things that I can do? All while staying honest, right? I’m not giving you tips to say, “Hey, use this term even if you haven’t done it.” Those are all going to be some tips for us, and then make sure you stick around to the end. I’ll go through and show you what I’ve done on some of my LinkedIn profile, too. I’ll show you how I’ve stacked some projects, and added some videos, and other kinds of content that can help you stay in the community. We’re going to break everything down, and we’re going to talk about specifically for your LinkedIn profile as a data engineer, the things that you can control. The areas that you control. There are some things you can’t control, and we’ll talk about those a little bit.

The first thing is your title. You can come up with an awesome title. Obviously, try to keep it relevant. Don’t say you’re a data engineer if you’re not a data engineer, but you can be a data enthusiast. I’ve seen people talk about they’re data ninjas or data gurus. I’ve even seen somebody for a while that, actually right out of, I think they graduated like a year before I did, and their background was Excel, and they were an Excel, I think they did Excel Ninja, and that’s how they got their first role outside. It was not a data engineering role, but it was actually a developing role that came into GIS or something like that. You can get creative with that. You can go through, and you can also look at seeking opportunities. I’ve seen people that have put what they’re seeking there. Control that title. The next area, number two, that you can control, is your work experience. To some extent, right? If you don’t have work experience, there’s some things you can’t do there. You can put work experience from, come on? Come on? Open source projects, right? You can become a contributor. You can move your way up in those areas, and that can give you an opportunity to be able to add some things in there. If you do have work experience, put those in there. Make sure you’re putting your daily tasks, especially anything that’s data related, like if you’re doing stuff with SQL, you’re doing stuff with development, whether it be C# or VB. You can go back and look at my profile and see where I was a .NET developer. Put those tasks in.

Then, also, find other tasks that maybe you’ve done some research. One of the things that I had to do was, I had to do research on different things. When we were moving, like I said, I was a web developer moving into the data engineering role. At the time, it was somewhat of a conscious effort, but not some, as well, too. I volunteered to get on a project, and so some of the things that I had to do was the research. Looking through Hortonworks, looking through Cloudera, the Sandbox, going through and standing up our own Hadoop platform, and just testing those things out. That’s all valid. That might not be my day-to-day task. I didn’t do it the whole time I was there, but that was one of the things that I was tasked with doing, and even as small as that sounds, put that on there. That gives you more experience and, if somebody’s looking at your LinkedIn profile, they go, “Hey, man, this person is moving into that role. They have this experience there.” Another thing is, if you attended any conferences, put those there, too. That was another thing that I was very fortunate in my role, where it was like, “Hey, love to get into this, Mr. Customer.” We signed up for a big project. There’s a couple of conferences that I needed to attend to get skilled up.

Hadoop [Inaudible 00:03:48] some of them. You’ve probably seen me wearing, this one’s not it, but you’ve seen me wearing some of the hoodies from Hadoop conferences. Put that conference attendance in there, especially if you’ve spoken at any or anything like that. You can put project stuff in there. You’ll see it on my profile, but if you’ve done anything, even if you’ve made a simple demo or something like that, make sure it’s customer, it’s public-facing. Don’t put any information from a company you’re not supposed to, but you can actually add projects to it. Whether it’d be a link to a blog post that you wrote for your company or for a project that you’re on, or video. I’ve made some videos on my personal site, and you’ll see those here. I’m going to show you my profile.

Number three that you can control, there’s some things we can control about that work experience. Then, here we can control the education. If you have a college degree, if you have anything of that nature, even certifications. There’s a little section for certifications. Those are things that you can control. Controlling that, I’m not saying put that you went to MIT if you didn’t go to MIT, right? This is not going to help you. Short-term it might get you an interview, but that’s not the long game we want to play, and that’s just not the right thing to do. Make sure. I’m talking about, you can control it from the aspect of, “Hey,” if you’re planning on going to college, you have an anticipated graduation date, I would home in on that, and any kind of honors, projects, or denotations you’ve had in there, include that in your education section. Those are longer-term, but I’m saying, you can control it, because you can determine today what themes you’re going to build on, what you’re going to do during your college and your education experience. It’s a long-term thing, right? Most of them are four years, five years, however long it takes you. Took me longer. Maybe I’ll do another video someday on how long it actually took me. Either way, you’re going through your education, the factors that you can control. Make sure you’re putting that on there.

Short-term, we’ve talked about this [Inaudible 00:05:48] short-term still in education. You can control the certifications. We know what my goals are for 2019 as far as certifications and the certifications that I’m trying to knock out. Those are short term that you can. They have them with the [Inaudible 00:06:01] they have them with Coursera. Other education sites, and then there’s also the vendor-specific. AWS, Hortonworks, Cloudera, specific certifications that you can go through. You can start adding that, and that’s an area where, with work experience, it’s a little harder. With education, traditional four-year college, a little harder. A little long-term to go, but those, if you’re really fighting to take that next role or move into a role, whether it be within your company, whether you’re trying to, you’re a consultant trying to bring in more projects, go through some of those certifications. That’s something that you can tackle, and just depending on your knowledge base, something anywhere from one month to six months, you can knock out some of those certifications that are really going to help you build out that LinkedIn profile as a data engineer.

The last area that you can control. You control title, you can control the work experience, you can control your education and certifications. Activity. You have the most control, and you can pause this video and go post a relevant topic right now, assuming you have a LinkedIn profile, which if you don’t, I think it’s going to be very important to you. You should get one. You can control that activity. You can control what you do from a hashtag perspective. What you want to put out there as far as, hey, if you go and look at my site, you’ll see some of the things that I’m learning and I’m going through. Not only on my YouTube channel does everybody get to see behind the scenes of what I’m looking at, but more importantly, you can start to mold that, and pull that part into your education. From my perspective, you can see, for a good part of last year, I was really working on doubling down into deep learning and understanding what’s going on in that community from a TensorFlow perspective, [Inaudible 00:07:46] perspective, from a PyTorch, or just what the heck does a CNN mean? You can see it slowly evolving my education and sharing that knowledge, and same thing there. You control that activity, but it’s not a one-way street. You’re not trying to just put stuff out here. You want to be [Inaudible 00:08:03] communities, too. You want to like and comment on some of your peers and other people around that are interested in the same things that you’re interested in, too.

About to roll into my last section. That was the mailperson. Talked about how you can put in, how you can add projects, add experience, and really beef up your LinkedIn profile, specifically for data engineers, machine learning engineers, Hadoop developers, that whole ecosystem. Now, let’s take a look real quick at my profile. I promised that, if you stuck around, I’d show you. As we’re going through this, just check out here on the experience. This is what I was talking about. Whenever you’re looking at what you’ve specifically done for a job, and what your day-to-day tasks are, this is where you get to put in your experience. You can see here, not only do I have my day-to-day tasks and even some of what my day-to-day tasks might be, and what my job description is, but also some other things I’ve been involved with, like conferences I spoke at. You can see here where I brought in projects. Whenever I do demos and some of these other things, even on my site, it’s good to be able to link it here and show those as projects, show people, hey, these are some of the things I’ve done. Same thing with conferences. I’ve had some conferences I spoke at, at other places, and this is how I roll.

You can see here too, from a Pluralsight perspective, this is one thing that I got involved with Pluralsight, and just love to be able to have this in my profile. This shows that in the industry, I’m taking this to heart, and not only am I doing this, and furthering my knowledge, but I’m giving back and helping others, too. This gives me that opportunity to be able to do it. Everybody here has that opportunity. As you’re learning things, document it, make videos. Do things to be a part of the community and be able to show on your profile. The next thing, the activity, here. Look at some of the activity. You can see there’s definitely something I’m posting. I’m not posting, maybe I’m shooting a video today [Inaudible 00:10:03], but I’m not posting, not over-rotating too much on topics outside of my interest. My interest is for data engineers, machine learning engineers, and the data science community. That’s what I’m posting. I’m posting things here, and I’m also actively liking and commenting on others’ posts just to have that communication and have, make it not just a one-way conversation. That’s just some tips, and that’s just some ways that I’ve crafted my LinkedIn profile. I hope that you’ll go out and find me on LinkedIn. Let’s connect, and just build out your profile, and this gives you an opportunity to, as you’re looking and building out your profile, you can see some gaps. There’s some holes in areas that you need to shore up, whether it be in work experience, certifications, education, or just activity. If you have any questions, make sure you put them in the comments section here below. Go subscribe and ring that bell so you never miss an episode, and you’re always notified whenever we do an upload here on Big Data Big Questions.

[Sound effects]

Tableau For Data Science?

May 15, 2019 by Thomas Henson Leave a Comment

Big Data Big Questions

Tableau is huge for interacting with data and empowering users to find insights in their data. So does this mean Tableau is the primary tool for Data Scientists? In this episode of Big Data Big Questions we tackle the question, “Is Tableau used for Data Science?”

What is Tableau?

Tableau is business intelligence software that allows users to visualize and drill down into data. Data users lean on Tableau heavily for the visualization portion of Data Science projects. The data can come from databases, CSVs, or almost any source with structured data. So if Tableau is for analyzing and visualizing data, is it a tool specific to Data Scientists? Watch the video below to find out Tableau’s role in the world of Data Science.

Transcript – Tableau For Data Science?

Hi folks! Thomas Henson here with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question comes in from a user, and it’s around data science, and Tableau, and how those go together. But, before we jump into the question, if you have a question that you want to know about data engineering, IT, data science, anything related to IT, or just want to throw a question at me, put it in the comments section here below or reach out to me on Twitter at #BigDataBigQuestions. Or, thomashenson.com/big-questions. Ton of ways to get your questions here, answered right on this show, all you have to do is type away and ask.

Now, let’s jump into today’s question. Today’s question comes in from a YouTube viewer, and it’s about, hey, in data science, do you use Tableau? You can see the question here as it pertains to this, and so this is a question we started up this show doing, around data engineering, but now we’re really jumping towards, hey, what’s going on from a data science and just encompassing all of it? Today’s question, we’re going to talk about where’s Tableau used, right? A lot of people use Tableau. It’s really, really popular. But, is that really a tool that a data scientist is going to use? Should you invest your time as a data engineer or a data scientist aspiring or not aspiring to get into data science? Should you spend time learning about that tool?

My thoughts on Tableau are that it’s really good for giving information out to users that could be not necessarily data scientists. They could be users of it. They could be analysts. They could be somebody who just has a stake in their business. I’ve used it at a lot of different corporations that I’ve worked at, and companies, and organizations, and really what I see is those tools are more for the end user, for visualization. They may fall more in the data visualization bucket. We’ve talked about the three tiers of work. You have your data scientist, you have your data engineer, and your data visualization specialist, the person who’s making sure that, hey, at the end of the day, it’s great that we have all these algorithms that are showing us and being able to predict whatever we’re trying to look at in our data, but if we can’t sell that and can’t convey that to the people that need the data to make a decision on, then it’s just an experiment, it’s just us having fun doing research.

When it comes to an end product or being able to really sell your point, data visualization, I think that’s the bucket that Tableau fits in more than just traditional data science. Could be wrong. Let me know if I am here in the comments section below, but let me talk a little bit about my use case and where I’ve seen it. Like I said, I’ve used it in a lot of different organizations that I’ve worked with or even contracted with. One of the main use cases, I’ll give you an example. Let’s say that you’re a YouTube viewer. I’m not saying YouTube uses Tableau, this is just an example. I don’t want to give away too much information, insider. If you have a YouTube channel, think about if you want to see the videos that are coming in. You’re a user. You’re a publisher, a creator. You want to know. Here are all the videos that you have. Here’s how long they’re watched. Here’s all the demographics from behind the scenes that you can pull. Maybe the times that they were watched. How long they were watched, so on this video here, if people drop out after 30 seconds, I did something wrong there. Versus, how many people go through the end of it. Same thing, too. What you would do is, you would have all this information and aggregate all this data, and you maybe even pull some insights. Like, hey, what’s your average? We can do some real simple things, or you can do some complex things, too. Tableau is where you’re going to give the end user the access.

At least, that’s what I’ve seen a lot. There’s a big need to be able to do that and be able to pull that data. It gives you a way to, I wouldn’t say that a data scientist wouldn’t, per se, use that as their tool. It wouldn’t be their only tool. Maybe that’s the way that they aggregate and look at large amounts of data before they go in and start to pick and choose. I’m sure there’s some modules out there that are incorporating machine learning and deep learning. I will say, if you’re really looking from an AI perspective to jump into, it’s not just going to be about Tableau. I’m not saying that you shouldn’t get up to speed on Tableau, but I wouldn’t say that, hey, I’m a brand-new person graduating high school, graduating college, or somebody that sees it in their career and looking to go into data science, my choice would not be to jump in and learn Tableau. I would start learning a little bit more about Python, and algorithms, and maybe R, or some of the other higher-level languages to talk around machine learning and deep learning, versus saying, “Hey, this is the tool that’s going to kind of take me there.”

Now, if you’re a data visualization person, or you want to get into big data from that perspective, there’s a lot of things that you can use Tableau to do. You might add it to your bucket. As far as we talk about on this show, how to accelerate your career or how to break into the big data realm, this is not one of those tools that I’m going to say, hey, this is the only choice you have. It’s probably not going to be the one that makes the most sense. It’s not going to be the game changer, like hey, this person’s certified in Tableau or is a Tableau wizard. If you’re applying for a job that’s all around Tableau then, definitely. As far as, I really want to get down into data science, and I really want to get deep in it, Tableau’s one of those things. You’re probably going to use or come across tools that are similar to it, but it’s not going to be your mainstay, probably, where you’re writing your algorithms and doing your analytics.
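The YouTube channel example earlier in the transcript (average watch time, viewers dropping out after 30 seconds, viewers who reach the end) boils down to a simple aggregation that happens before any dashboard is drawn. Here is a rough sketch of that step in plain Python. The records, field layout, and `summarize` function are hypothetical and invented for illustration; nothing here reflects how YouTube or Tableau actually stores or computes these numbers.

```python
from collections import defaultdict

# Hypothetical watch records: (video_id, seconds_watched, video_length_seconds)
views = [
    ("intro", 25, 300),
    ("intro", 300, 300),
    ("intro", 10, 300),
    ("tableau", 120, 240),
    ("tableau", 240, 240),
]

def summarize(views, early_cutoff=30):
    """Aggregate per-video metrics that a dashboard could then visualize."""
    grouped = defaultdict(list)
    for video_id, watched, length in views:
        grouped[video_id].append((watched, length))

    summary = {}
    for video_id, rows in grouped.items():
        n = len(rows)
        summary[video_id] = {
            "views": n,
            "avg_seconds": sum(w for w, _ in rows) / n,
            # share of viewers who left before the cutoff (30 seconds by default)
            "early_dropoff_rate": sum(1 for w, _ in rows if w < early_cutoff) / n,
            # share of viewers who watched the whole video
            "completion_rate": sum(1 for w, length in rows if w >= length) / n,
        }
    return summary

summary = summarize(views)
for video_id, stats in summary.items():
    print(video_id, stats)
```

A tool in the visualization bucket would typically sit on top of a table like `summary`, giving the end user charts and drill-downs, while the modeling work discussed above happens elsewhere.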

That’s all for today. If you have any questions, make sure you put them in the comments section here below, and then make sure you click subscribe to follow this channel, so that you never miss an episode of Big Data Big Questions.

[Music]