Thomas Henson


Archives for October 2019

Learning to Filter Client Traffic in OneFS

October 10, 2019 by Thomas Henson

So I was hanging out on Instagram one day watching The Rock’s latest workout video…then all of a sudden a question slid into my DMs.

“Hey Thomas, Can you help me troubleshoot a specific client or protocol in OneFS?”

Of course you can! At least I thought so.

BTW my DMs are always open…

What is InsightIQ?

Well, you actually can, using a couple of methods in InsightIQ. InsightIQ is a software feature that allows you to analyze cluster performance and track data within your OneFS file system. InsightIQ runs outside OneFS in a VM, monitoring your OneFS clusters. There are quite a few ways to drill down into the data within InsightIQ, but one of my favorite features is Data Filters. Check out the video below to learn how to filter client traffic in OneFS.

Transcript – Data Filters in InsightIQ

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Unstructured Data Quick Tips. The question that we’re going to talk about today is how to do some advanced searching with data filters in InsightIQ. This question actually came up from a viewer who was going through and trying to pull some special reports out of InsightIQ, so they reached out to me on Instagram and said, “Hey, you know, is there a way that I can actually dig in through the protocol to find a specific client and what’s going on with that client?” One of the ways that we were able to solve that problem was to use what are called data filters. Let’s jump in, and you can follow along as we go through and look at how to use data filters in InsightIQ.

Let’s log back into our InsightIQ environment. You can see here that I’m pulling a report from a specific time range that I know has some information from my cluster. Once again, this is my development cluster, so I’ve only got one node connected, but I wanted to come through and look at some of the information we can see. I’ve got a customized report here that’s just a standard report on the Henson cluster. You can see that I’ve got it broken down into CPU utilization, connected clients, and external throughput. Those are the top three reports that we’re actually looking for, but let’s see how we can use these data filters. One of the things you can do here is add a rule. This will work on any of the report types. I’m using the same report, but you can use any of the standard reports, or the reports that you have, to filter the data. It’s just as easy as adding a specific rule. Let’s come in here and say, hey, we’re looking for a specific client, and we’ll see all the information about that one client. Maybe you have a cluster, and if you come down here, you’re able to break it down specifically using these breakouts by client. Say you have more clients than I have in my environment. You’ve got thousands or hundreds of thousands, a ton of different clients, and you can’t really see what’s going on in the breakouts. A quick way to fix that is to come in here and add a rule. You can see that we can break it down by nodes, clients, paths, services, or protocols. We’re going to do client. Specifically, I’m going to put in an IP address here, and now I can apply that rule. Now, you can see here, we’re showing our connected clients, and we’re showing all our throughput just from this one client right here.
In this report here, you can see it’s pulling out only this specific client and its information. You can add more than just one rule, too. You can add another rule and say, “Hey, you know, we want to look at it from a protocol perspective.” We want to see smb2, or we could’ve done nfs or smb1. If you were looking for smb1, or your SyncIQ traffic, you can apply that rule, and now you can see my data filters that are keyed right here. We can go down and look, and now we have a data filter that’s showing all my smb2 traffic from my clients, and the information that’s pulling in here. You can add multiple rules, or you can delete these rules. You can manage all of those from here. Like I said, you can also apply those to any reports. I’m applying them to my standard report that I pulled, but if you wanted to, you could go in and do it on the built-in reports. Maybe you’ve just got the cluster performance or the client performance report. You can do that and build these into your reports as well. That’s all we have today for Unstructured Data Quick Tips. If you have an idea for a show or have a question, put it in the comments section here below, and I’ll do my best to answer them.
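As an aside, the same rule logic from the walkthrough above (one client IP rule plus one protocol rule, stacked together) is easy to mimic offline if you export a report to CSV. The column names below are made up for illustration; they are not InsightIQ’s actual export schema.

```python
import csv
import io

# Hypothetical CSV export of an InsightIQ performance report.
# Column names are illustrative, not InsightIQ's real export format.
report = """timestamp,client_ip,protocol,throughput_mbps
2019-10-01T10:00:00,10.0.1.25,smb2,120.5
2019-10-01T10:00:00,10.0.1.30,nfs3,88.0
2019-10-01T10:05:00,10.0.1.25,smb2,131.2
2019-10-01T10:05:00,10.0.1.25,nfs3,12.4
"""

def apply_rules(rows, client=None, protocol=None):
    """Keep only rows matching every rule, like stacking 'Add Rule' filters."""
    for row in rows:
        if client is not None and row["client_ip"] != client:
            continue
        if protocol is not None and row["protocol"] != protocol:
            continue
        yield row

rows = list(csv.DictReader(io.StringIO(report)))
filtered = list(apply_rules(rows, client="10.0.1.25", protocol="smb2"))
print(len(filtered))  # 2 rows survive both rules
```

Just like in the InsightIQ UI, each additional rule narrows the result set, and dropping a rule (passing `None`) widens it again.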

Want More Data Engineering Tips?

Sign up for my newsletter to make sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Isilon Tagged With: InsightIQ, Isilon, OneFS

O’Reilly AI Conference London 2019

October 9, 2019 by Thomas Henson

The Big Data Big Questions show is heading to London for the O’Reilly AI Conference, October 15–17, 2019. I’m excited to be a part of the O’Reilly AI Conference series. In fact, this will be my third O’Reilly AI conference in the past year. Let’s look back at those events and ahead to London.

San Jose & New York

 


Late night packing my conference gear for my trip to O’Reilly AI Conference this week. Most important items: 1️⃣ Stickers 2️⃣ 🎧 3️⃣ 💻 4️⃣ Bandages? (I’ll explain later) 5️⃣ 📚 (this week it’s my Neural Networking book) What’s your list of must-have gear for tech conferences? #programming #coding #AI #conference #techconference

A post shared by Thomas Henson (@thomas_henson) on Sep 5, 2018 at 5:09am PDT


First, in 2018 I attended the San Jose conference, where I spent a good portion of the time in the Dell EMC booth talking with Data Engineers and Data Scientists. One of the major themes I heard from data professionals was that they were attending to learn how to incorporate TensorFlow into their workflows. In my opinion, TensorFlow was talked about in every aspect of the conference. We had a blast learning from attendees and discussing how to scale Deep Learning workloads. Also, this was my first time attending a conference with 14 stitches in my left hand (trouble on the pull-up bar)!

O’Reilly AI Conference

Next was O’Reilly AI New York. Forever this conference will be known in my head as the Sofia the Robot trip. During this conference I worked with Sofia the Robot not only at the conference but also at a Dell EMC event at Times Square Studios (part of the Dell Technologies Magic of AI series). Before the Magic of AI event, Sofia and I spent the day recording with O’Reilly TV about the current state of AI and what’s driving its widespread adoption. After a day of recording, I gave a keynote on day two of the O’Reilly AI Conference, where I discussed how AI is already impacting future generations. Then there was a whirlwind of activity as Sofia the Robot took questions at the Dell Technologies booth. The last event of the day was the Magic of AI session at Times Square Studios, where we had 100 people taking part in a question-and-answer session with Sofia the Robot.

Keynote O’Reilly AI Conference New York

Coffee with Sofia the Robot

On To London

Next up is O’Reilly AI London. To say I’m excited is an understatement. This trip will include many first-time moments.

To begin with, it’s my first international conference, along with my first time in London. So many things to see and so little time to do it. Feel free to give me suggestions for places to visit in the comments section below.

Second, at O’Reilly AI London I will give my first breakout session at an O’Reilly conference. While I’ve been on O’Reilly TV and given a keynote, I’ve yet to have a breakout session. My session is titled AI Growing Pains: Platform Considerations for Moving from POC to Large-Scale Deployments. The world is innovating and incorporating Artificial Intelligence into many applications and services. However, with all this excitement, many Data Engineers are still struggling with how to get projects past the Proof-of-Concept (POC) phase and into production. Production environments present a list of challenges. The three biggest challenges I see when moving from POC to production are the following:

  • The gravity of data is just as real as gravity in the physical world. As Deep Learning workloads continue to grow, so does the amount of data stored to train these models. That data has a gravity that will attract services and applications to it. The trouble here is making sure you have the correct data pipeline strategy in place.
  • Once I had dinner with one of the co-founders of Hortonworks, during which he said, “Everything at scale is exponentially harder.” Have you ever moved photos around on your desktop? For the most part this is an easy task, except when you accidentally move a large set of photos. Instantly after moving these large folders, you are endlessly waiting for the hourglass to finish. Imagine doing this with 10 PBs of data. I think you get the picture here.
  • The talent pool today compared to the early days of “Big Data” is much larger. However, the demand for skills in Deep Learning, Machine Learning, and Data Engineering is stressing the system, which still leaves a skills gap for experienced engineers with Deep Learning and Machine Learning skills. That skills gap is one huge factor in why many projects get stuck in the POC phase instead of making it into production.
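To put a rough number on the scale point above, here is a back-of-the-envelope sketch. It assumes a sustained 10 Gb/s link with zero protocol overhead, which is wildly optimistic; the point is the order of magnitude, not the exact figure.

```python
# Back-of-the-envelope: how long does it take to move 10 PB?
# Assumes a sustained 10 Gb/s link with zero overhead -- optimistic.
petabytes = 10
bits_to_move = petabytes * 1e15 * 8   # decimal petabytes -> bits
link_bps = 10e9                       # 10 gigabits per second
seconds = bits_to_move / link_bps
days = seconds / 86_400
print(f"{days:.0f} days")             # roughly 93 days of nonstop transfer
```

Three months of saturating a 10 Gb/s link just to relocate the data is why “move the compute to the data” keeps winning at scale.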

If you would like to know more about moving projects from POC to production, make sure to check out my session if you are attending the O’Reilly AI Conference in London: AI Growing Pains: Platform Considerations for Moving from POC to Large-Scale Deployments @ 11:55 on October 16, 2019.

Want More Data Engineering Tips?

Sign up for my newsletter to make sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Data Engineers, Data Science Tagged With: AI, Conference, Data Engineers, Data Science, Deep Learning

Deep Learning Python vs. Java

October 8, 2019 by Thomas Henson

What About Java in Deep Learning?

Years ago when I left Java in the rear view of my career, I never imagined someone would ask me if they could use Java over Python. Just kidding Java you know it’s only a joke and you will always have a special place in my heart. A place in my heart that probably won’t run because I have the wrong version of the JDK installed. 

Python is king of the Machine Learning (ML) and Deep Learning (DL) workflow. Most of the popular ML libraries are in Python, but are there Java offerings? What about Deep Learning? Can you use Java? The answer is yes, you can! Find the differences between Machine Learning and Deep Learning libraries in Java and Python in the video.

Transcript

Hi, folks. Thomas Henson here, with thomashenson.com, and today is another episode of Big Data Big Questions. Today’s question comes in around deep learning frameworks in Java, not Python. So, find out about how you can use Java instead of Python for deep learning frameworks. We’ve talked about it here on this channel, around using neural networks and being able to train models, but let’s find out what we can do with Java in deep learning.

Today’s episode comes in and we’re talking about deep learning frameworks that use Java, not Python. So, today the question is, “Are there specific deep learning frameworks that use Java, not Python?” First off, let’s talk a little bit about deep learning, do a recap. Deep learning, if you remember, is the use of neural networks whenever we’re trying to solve a problem. We see it a lot in multimedia, right, like, we see image detection. Does this image contain a cat or not contain a cat?

The deep learning approach is to take those images [Inaudible 00:01:10], you know, if we’re talking about supervised learning, so take those labeled images, labeled cat or not cat, feed those into your neural network, and let it decide what those features are. At the end you get a model that’s going to tell you, is this a cat or is this not a cat? Within some confidence. Hopefully not 50%, maybe closer to 99 or 97. But that’s the deep learning approach, versus the machine learning approach that we’ve seen a good bit.

When we talk about Hadoop and traditional analytics from that perspective, in machine learning we’re probably going to use some kind of algorithm like singular value decomposition or PCA, and we’re going to take these images, look at each one, and define each feature, from the cat’s ears to the cat’s nose, and we’re going to feed that through the model and it’s going to give us some kind of confidence. With the deep learning approach, we get to use a neural network; it defines some of those features for us, which helps us out a lot. It’s not magic, but it is a little bit, so it’s a really, really innovative approach.
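As a toy illustration of the “feed in labeled examples, get back a confidence” idea from the transcript, here is a single-neuron sketch in pure Python. The two numbers per example are fabricated stand-ins for image features, and a single logistic neuron is nowhere near a real deep network; this is just the shape of the idea.

```python
import math

# A single logistic neuron trained by gradient descent on fabricated data.
data = [([0.9, 0.8], 1), ([0.8, 0.9], 1),   # label 1 = "cat"
        ([0.1, 0.2], 0), ([0.2, 0.1], 0)]   # label 0 = "not cat"

w = [0.0, 0.0]   # weights
b = 0.0          # bias
lr = 1.0         # learning rate

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))   # confidence the example is a "cat"

for _ in range(500):                # gradient descent on the log loss
    for x, y in data:
        err = predict(x) - y
        for i in range(2):
            w[i] -= lr * err * x[i]
        b -= lr * err

print(round(predict([0.85, 0.9]), 2))   # a cat-like input: confidence near 1
```

A deep network stacks many layers of these neurons so the features themselves get learned, but the training loop and the “confidence out the end” are the same story.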

So, the popular languages, and what we’ve talked most about on this channel and probably other channels, and most of the examples you’ve seen, are all around Python, right? I did do a video before where I was wrong on C++. There was more C++ in deep learning than I originally thought. You can check that video out, where we go through and talk about that and I come in and say, “Hey, sorry. I missed the boat on that one.” But the most popular one… I mean, I did a Pluralsight video on it, Take CTRL of Your Career, around TensorFlow and using TFLearn. TensorFlow is probably far and away the most popular one. You’ve seen it with the stats that are out there. Also PyTorch, Caffe2, MXNet, and then some higher-level libraries, where Keras is able to sit on top of TensorFlow as a higher-level abstraction, but most of those are going to use Python, and then some of them have C++. Most examples that you’re going to see out there, just from my experience working in the community, are Python. Most people are looking for those Python examples.

But, on this channel, we’ve talked a lot about options in Hadoop for non-Java developers, and this is an opportunity for all you Java developers out there. You’re asking, “Hey, we want to get into a deep learning framework. We don’t want to have to code everything ourselves. Are there some things that we can attach onto?” And the answer is yes, there are. It’s not as popular as Python right now, or R and C++, in the deep learning frameworks, but there is a framework called Deeplearning4j that is a Java-based framework. This Java-based framework is going to allow you to use Java. You could still use Python, though. Even with the framework, you can abstract away and do Python, but maybe you’re specifically a Java developer and you want to get in and contribute to the Deeplearning4j community, or you’re just wanting to implement it in some projects. Maybe you’re like, “Hey, you know what? I’m a Java developer. I want to continue doing Java.” Java’s been around since ’95, right? So, you want to jump into that? Then Deeplearning4j is the one for you.

So, really, think about why you would want to use a Java-based deep learning framework, for people that maybe aren’t familiar with Java or don’t have it. One of the things is it claims to be a little bit more efficient, so it’s going to be more efficient than using an abstraction layer in Python from that perspective. But also, there’s a ton of Java developers out there, you know, there’s a community. We talked about how it’s been around since ’95, so there’s an opportunity to tap into a lot of developers that have the skills to be able to use it, and so there’s a growing need, right? There are communities all around the globe and different little subsets and subareas. Java’s one of those.

I mean, if you look at what we did from a Hadoop perspective, so many people that were Java developers moved to that community, and also a lot of people that didn’t really do Java. Like I said, at the point I was at in my career, I was more of a .NET C# developer. Fast forward to getting into the Hadoop community, and I went back to my roots in Java, since I’d done some Java in the past, and went through that phase. And so, for somebody like me, maybe I would want to go back that way. I don’t know. I’ve kind of gone more toward Python, but there are a lot of different options out there. It’s about giving Java developers a platform to get involved in deep learning, because deep learning is very popular.

So, those are some of the reasons that you might want to go that way, but the question is, if I’m not a Java developer, what would you recommend? Would you recommend maybe not learning TensorFlow and going into Deeplearning4j? You know, I think that one’s going to depend… I mean, we say it a lot in here. It’s going to depend on what you’re using in your organization and what your skill set is. If you’re mostly a Python person, my recommendation would be to continue on, or jump into the TensorFlow area. But if you’re working on a project that is using Deeplearning4j, then by all means go down that path and learn more about it. If you’re a Java developer and you want to get into it, you don’t want to transition skills, or you’re just looking to test something out and play with it, and you don’t want to have to write it in Python, you want to be able to do it in Java, yeah, use that.

These are all just tools. We’re not going to get transfixed on any tool. We’re not going to go all in and say, “You know what? I’m only going to be a Java developer,” or, “I’m only going to be this.” We’re going to be able to transition our skills and there’s always going to be options out there to do it. And in these frameworks too, right? Deeplearning4j is awesome, but maybe there’s another one that’s coming up that people would want to jump into, so like I said, don’t get so transfixed with certain frameworks. Like, Hadoop was awesome. We broke it apart. A lot of people navigated to Spark and still use HDFS as a base. There’s always kind of skills that you can go to, but if you go in and say, “Hey, I’m only going to ever do MapReduce and it’s always going to be in Java,” then you’re going to have some challenges throughout your career. That’s not just in data engineering, that’s throughout all IT. Heck, probably throughout all careers. Just be able to be flexible for it.

So, if you’re a Java developer, if you’re looking to test some things out, definitely jump into it. If you don’t have any Java skills and it’s not something that you’re particularly wanting to do, then I don’t recommend running in and trying to learn Java just for this. If you’re doing Python, steady on with TensorFlow, or PyTorch, or Caffe, whatever you’re using.

So, until next time. See you again on Big Data Big Questions. Make sure you subscribe and ring that bell so you never miss an episode. If you have any questions, put them in the comment section here below. Thanks again.

Want More Data Engineering Tips?

Sign up for my newsletter to make sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Deep Learning Tagged With: Deep Learning, Java, Python, Tensorflow

5 Types of Buckets in Splunk

October 7, 2019 by Thomas Henson

Types of Buckets in Splunk

Where does data go once ingested into Splunk? 

Does Splunk use files and folders?

How Splunk Stores Data

In Splunk, data is stored in buckets. Not real buckets filled with water, but buckets filled with data. A bucket in Splunk is basically a directory for data and index files. In a Splunk deployment there are going to be many buckets, arranged by time. In this video, learn the five types of buckets in Splunk every administrator should understand.

Transcript – 5 Types of Buckets in Splunk

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today we’re going to be talking about the five different kinds of buckets in Splunk. We’re going to go through and talk about how Splunk uses buckets, how they’re used to store your data, and how to know which bucket your data is in. Find out more about the different buckets in Splunk right after this.

Today, we’re going to be going through the five different buckets in Splunk. If you have a question, remember, throw it in the comments section here below, or find me on Twitter or Instagram, and I’ll do my best to answer those here on the show. Today, I wanted to go through the different buckets and how those are used in Splunk. Before we jump in and talk about those five different buckets, let’s just get a quick definition of how Splunk stores our data. Think about data coming in to our Splunk environment. The first thing that’s going to happen, whether we’re uploading it or whether it’s live streaming data, is that it’s going to be indexed. That index is going to help us search it a little bit better. Splunk’s going to put a timestamp on it, and it’s going to do some other things to give us some metadata, so that we can search through that data a lot quicker in our Splunk environment.

The other thing it’s going to do is store that data so we can find it. It’s going to store it in different buckets. Think of buckets as the Splunk file system. Just like you have a file system, think of it in a Windows environment: I’ve got directories and subdirectories. In the Splunk environment, you have different buckets. I have buckets for different portions of my data, and those are all going to be organized by timestamp. As it indexes, that’s how Splunk decides which bucket the data is going to be in. There are also some other things you can do to decide how long data’s going to sit in each one of your buckets, but before we jump in and talk about that, let’s make sure we understand what those buckets are.

The first bucket, or really the first two buckets, are going to be your hot and your warm buckets. Your hot and your warm buckets are where your most recent data is going to be. They’re both going to sit in the same specific area, and they’re put there so that you have your most current data, right? This is where Splunk really puts a lot of performance characteristics around where this data should live. Specifically, if you think about it, your warm bucket is going to contain some of your most recent data, but your hot bucket is going to be the one that new data is being written to. As you set up your policies for how long your data is going to exist in your Splunk buckets, let’s say, arbitrarily, that you can get 10 events in a bucket. It’s a little bit more complicated than that, but let’s just make it simple. Every time you get to 10 events, your hot bucket is going to become a warm bucket, and you get a brand-new hot bucket. Think of your hot bucket as where the newest events go, and your warm bucket as where your more recent data is. It gives you a life cycle policy.

The third kind of bucket is a cold bucket. This is where our older data goes, where our data is aging off. This data doesn’t have to live with the hot and warm buckets. It can go to maybe a NAS device or some kind of object store where you can still search on it. It still needs to meet some requirements for how it’s being searched. If we’re saying that we have 10 events in each one of our warm buckets, let’s say that after a week, those 10 events age off to our cold bucket. Our cold bucket could hold data from a week old to maybe three months old, per our policy. That’s where that data is going to exist, in the cold bucket. No new data’s being written to it, and there aren’t as many performance requirements, because we’re probably not searching on it as frequently as we are on the newer events. The events we’re pulling out for our dashboards are stored in our hot and warm buckets.

Then, we have our fourth bucket, which is going to be our frozen bucket. Think of this as really old, frozen data, hence the frozen bucket. This is data that we’re holding onto for compliance reasons, or that we just want to be able to go back and search at some point, but this data is actually going off to some kind of long-term retention. The thing about it is, if we want to search on it, and I’ll talk about this in just a second, this data is not searchable in its current form in a frozen bucket. There’s another process to do that, and it involves another bucket, but think of this as where you’re aging off your data. This gives you an opportunity to get a better cost per terabyte for how you’re storing the data, and to get it out of your Splunk search, so you get better performance on your Splunk search as well, while still being able to hold on to that data. If we’re saying three months is what we’re going to hold in our cold bucket, think of anything older than three months as existing in that frozen bucket.
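For reference, the stages described above map onto settings in a Splunk index definition. The fragment below is only a sketch: the attribute names are Splunk’s standard indexes.conf settings, but the index name, paths, and values are illustrative, not tuning recommendations.

```ini
# indexes.conf -- illustrative sketch only; index name, paths, and
# values are made up for the example, not recommendations.
[my_index]
# Hot and warm buckets live under homePath
homePath = $SPLUNK_DB/my_index/db
# Cold buckets can sit on cheaper storage, e.g. a NAS mount
coldPath = $SPLUNK_DB/my_index/colddb
# Thawed (restored) buckets land here
thawedPath = $SPLUNK_DB/my_index/thaweddb
# Roll buckets to frozen after roughly 90 days (value is in seconds)
frozenTimePeriodInSecs = 7776000
# Archive frozen buckets here instead of deleting them
coldToFrozenDir = /mnt/archive/my_index/frozen
```

Note that if no frozen archive location is configured, Splunk’s default behavior is to delete data when it freezes, which is why the retention settings deserve a careful look.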

Our last bucket, number five, I just talked about it: the thawed bucket. Our thawed bucket is how we get that frozen data back into a searchable state. You can go through and thaw that bucket out. Think of it as taking some of the compression out of it, but also putting it in a better place to store it, with the performance and other characteristics that you need to be able to search your data. From those thawed buckets is where you can start that process again. It’s a full life cycle: go from hot, to warm, to cold, to frozen, and then, when we want to see that data again, put it in a thawed bucket. That’s all we have today. I hope you enjoyed this episode, where we talked about the five different kinds of buckets in Splunk. If you have any questions, make sure you put them in the comments section here below. Reach out to me on Big Data Big Questions, and I’ll do my best to answer your questions right here.
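The full life cycle described above can be sketched as a toy model. The roll-after-N-events rule and the fixed bucket counts are simplifications of Splunk’s real size- and time-based policies; the point is only to show the hot → warm → cold → frozen flow, plus thawing.

```python
from collections import deque

# Toy model of the bucket life cycle: hot -> warm -> cold -> frozen,
# plus thawing. Real Splunk rolls buckets on size/time policies; here a
# hot bucket simply rolls after a fixed number of events.
class Index:
    def __init__(self, events_per_bucket=10, max_warm=3, max_cold=5):
        self.events_per_bucket = events_per_bucket
        self.max_warm, self.max_cold = max_warm, max_cold
        self.hot = []
        self.warm = deque()    # newest warm bucket at the left
        self.cold = deque()
        self.frozen = []       # archived, not searchable
        self.thawed = []       # restored, searchable again

    def ingest(self, event):
        self.hot.append(event)
        if len(self.hot) >= self.events_per_bucket:
            self.warm.appendleft(self.hot)             # hot rolls to warm
            self.hot = []
            if len(self.warm) > self.max_warm:
                self.cold.appendleft(self.warm.pop())  # oldest warm -> cold
            if len(self.cold) > self.max_cold:
                self.frozen.append(self.cold.pop())    # oldest cold -> frozen

    def thaw(self, i):
        # Bring a frozen bucket back into a searchable state.
        self.thawed.append(self.frozen.pop(i))

idx = Index()
for n in range(100):
    idx.ingest(n)
print(len(idx.warm), len(idx.cold), len(idx.frozen))  # 3 warm, 5 cold, 2 frozen
idx.thaw(0)
```

Running 100 events through with 10 events per bucket produces 10 rolls, and the caps push the oldest buckets down the chain, which is exactly the aging behavior the transcript walks through.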

Want More Data Engineering Tips?

Sign up for my newsletter to make sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

 

Filed Under: Splunk Tagged With: Splunk, Splunk Buckets

Book Review: Living with a SEAL

October 6, 2019 by Thomas Henson

31 Days Training with the Toughest Man on the Planet

Yet another great book review! Living with a SEAL was a fun read about a 31-day period during which Jesse Itzler hires a Navy SEAL to live with him. During those 31 days, Jesse is put to the test both mentally and physically. Jesse wanted to take on this challenge to whip himself back into shape. The events throughout this book make for great entertainment and inspiration. After reading it, the book definitely gave me some ideas for how to push myself in different areas of my life. Watch the video to hear my thoughts on Living with a SEAL.

Transcript – Living With A Seal Book Review

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of… Book Club? I don’t know. Still don’t have a name for this, but today I’m going to be reviewing Living with a SEAL. Awesome book. Let’s find out all about that right after this.

Today, in this episode where I’m reviewing a book, I am on a mission. If you’ve watched my 2019 goals video, you’ve heard me talking about how many books I want to try to read. I missed my goal last year, but we’re chugging along this year, trying to get to the halfway point. The book that I’m reviewing today, which was really good, was actually recommended to me by Erin Banks. You’ve seen me do some videos with her and the Big Data Beard team, where we went through a machine learning course. She and I have also talked about some certifications.

The book that she recommended was Living with a SEAL: 31 Days Training with the Toughest Man on the Planet. I will say, throughout the whole book, you never find out who that person is, but I know who it is. If you Google around, you can find out. The real premise of the story is, it was not written as a book. I think it started off as a blog. Jesse Itzler is really famous and entrepreneurial. He’s been involved with Zico water and NetJets. When he wrote this book was when he was really going through the big push for Zico water, so you get to see a little bit behind the scenes of what was going on from a business perspective. It’s not a business book per se. He’s an ultra-runner, he’s an adventurer. He was actually on MTV, so I think he was a musician at some point, I guess when I was young, or maybe a little bit older. Either way, the premise of the book is, he’s in a little bit of a funk. He’s an ultra-runner, and he really wants to push the limits. He’s doing his business, running his day job. He’s also got a son and a wife. He’s in a rut, like we all seem to get as we go through phases in our careers, and he just wants to jump-start himself, to push himself really hard. He meets the SEAL at an ultra-running event, they talk a couple of times, and he convinces him to come live with him for 30 days. The caveat is, he must do everything the SEAL says, when the SEAL says it. He’s still able to work, still able to do everything, but the SEAL follows him around for 30 days, and they come up with some crazy workouts.

Going through the book, you hear about the crazy workouts they do, and it’s really awesome. I’m actually trying to take some of those into my own day-to-day activities and really trying to push the limit. It was really cool to see how Jesse, who was probably already more fit than I was at that time, and even now, felt like he was in a rut. It gives you that perspective that we all feel at times like we’re not doing as much as we could, whether it be in the gym, or in learning, or anything like that. You’ve got to try some crazy things to really get yourself out of your rut. Also, I followed SEAL after the fact, and one of the things you really see through this is that you have to do something that sucks every day. [Laughs] If you do something that sucks every day, by volunteering for some kind of crazy workout, maybe you’re going to go do 20 miles, 10 miles, whatever your craziness is for that day. If you volunteer to do those things, it makes it a lot easier to do the things that you have to do in life, whether it be around the house, for family, or in your career.

That’s the really cool portion that I took away from the book. Let’s go through a couple of things I really liked. They go through the backgrounds of both Jesse and the SEAL. Jesse’s background was pretty cool, where he talks about some of the things that he kind of hacked his way through. Think of hiring a SEAL to come live with you for 30 days. That’s like a life hack, right? It’s a different way of experimenting and getting things done. He brought that into his business life as well, so he talks about how he took chances on getting contacts and meeting people throughout his career. Then SEAL also goes into some of his background and all the things he’s gone through, and it’s pretty cool. He’s living with a Navy SEAL who follows him around and does workouts with him, so it’s really cool. It’s the workouts and the mindset that really challenge Jesse and shift his focus. When he goes into it, SEAL wants to see how many pull-ups he can do, how many push-ups he can do, and he kind of gauges a test. They also do some running tests, because running was a big focus of it. When he goes in, Jesse can’t do 100 pull-ups, but they test out however many pull-ups he can do. Then, SEAL is like, “Hey, do 100. We’re not leaving this gym until you do 100.”

It really refocuses the mind, where it's like, hey, you have something that you have to do? You're going to do it, right? No matter how long it takes. Jesse was able to break through and do that. Back to the reader, if you're reading it, it gives you that mindset, too. To really push, and so, I've done a lot of really cool things since reading the book, maybe even setting a timer and trying to get some workouts in at a hotel, or while I'm on the road, or even here in the Big Data Big Questions office, too. It really gives you an opportunity to really push and do those things, and you feel good afterwards, too. Some of the coolest workouts I wanted to pick out: The burpee test. The goal was, in 10 to 12 minutes, you do a hundred burpees. Jesse did that between meetings. SEAL made him do it between meetings. I think he was in his full get-up. I don't know if you wear a suit to work, but he had some kind of button-down, it seems like, when he was talking about it.

The four miles every four hours for 48 hours. They scaled up to this. I think they started off with two miles every four hours for 24 hours, and at the end what they did was four miles every four hours for 48 hours. I think that essentially turns into a marathon or two marathons. I'm really bad at math. Anyways, that was a really cool workout. Also, it was working their way up on some of the push-ups, too. It was pretty cool. Jesse, I think, at the end, got 200 push-ups in a day. Just being able to test yourself doing those things. A lot of this stuff you can hear on Joe Rogan's podcast, too. Jesse actually appeared on there and told some behind-the-scenes stories around some of the things that he and SEAL did. Towards the end of the book, you still don't find out who SEAL is, but if you watch Joe Rogan, or if you subscribe to this channel here, we're going to review a book, and I'll tell you who SEAL was, if you haven't already Googled it and found out as well.

Would I recommend this book? Hell yeah! It was pretty awesome. You get to go through and see what normal life for all of us is, as far as work, and family, and doing things like that, and then see what happens when you insert a SEAL. It's going to really kick you. Kick you in the rear, and get you rolling through doing things that suck, and see what it does to you. Maybe that's why Zico water was so big. I don't know, probably not. I think Jesse would have been successful either way, but it was really cool to see it all go down around that same time. Second off, it's really going to push you to do things outside your comfort zone. We talk about it a lot here. One of the things that I'd really been pushing and working on the last couple of years is speaking. I'll tell you, it's a challenge to get on stage in front of 100, 200, 300, whatever your limits are, and keep pushing those limits. But I really can say, had I not been gaining confidence by doing things that get me uncomfortable in the gym, or running, and doing those other things, I don't think it would translate into what I'm trying to do from a, hey, you don't want to get up and do 20 minutes of learning? Too bad. Just do it. It gives you that mindset, where you're like, I'm already doing these other things from a health perspective in my life, so what can I do to feed my mind? Or, what can I do to challenge myself within my career? It doesn't have to be speaking, that's just for me. Definitely check out this book. Then, the book that I read right after this one, I'm going to follow it up here, called Can't Hurt Me. Find out more about that one next, but I definitely recommend this book and recommend pushing yourself outside your limits. Until next time, see you again on Big Data Book Club. We still don't have a name.

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Book Review Tagged With: Book Review

My Journey Why I Chose MBA Over Masters in Science

October 5, 2019 by Thomas Henson Leave a Comment

MBA Over Masters in Science

My Master Degree Journey

Once again here at Big Data Big Questions we tackle a College related question. Today is a little different because I discuss MY JOURNEY in choosing a MBA over a Masters of Science. Less than 6 months into my first Software Engineering role, I decided to pursue a Masters degree. One of the biggest reasons I acted so fast was advice from peers. The advice was simple: knock out the Masters before you get too busy with life in general.

Wow was that good advice for me!

Once the decision to go back to school was made, I had to select a Masters program. In reality I'm sure I made the decision a lot harder than it should have been. Looking back after all these years I'm confident I made the right decision. Watch this episode of Big Data Big Questions to find out my process for choosing a MBA over a Masters in Science.

 

Transcript – MBA Over Masters in Science

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today's question is another in the "how do I choose a degree" series, I guess, that I've been getting, and this one is more just around my personal journey. How did I decide to go with an MBA versus a Master of Science? It was a pretty big decision for me, so I thought I'd share my journey, because I know a lot of folks are looking at, even from an undergrad perspective, am I going to go more information systems, or am I going to go more computer science, or engineering, from that perspective? How does it all go through, and what's the thought process? I'm just going to provide my thought process of how I went through it, and maybe that can help you. Maybe you can give me some advice, tell me if you think I made the right decision.
I want to talk a little bit about my journey into choosing my MBA over choosing a Master of Science, just to give some thought process around that. I was not a traditional student, in the fact that I graduated a little bit older. I had a career outside of tech a little bit before I really focused down, and buckled down, and went back and got my information systems degree, or CIS, Computer Information Systems. It's named differently at other places. Whenever I went through that, I made it a real focus that as soon as I graduated, I was really going to try to get back in. I think I only took six months off. During that six months, and even before then, I knew I wanted to go and get my Master's, and so I was really struggling with, hey, what do I need to do? Should I go for a computer science or some kind of science Master's degree, since I had what was essentially a business degree in information systems, or should I continue on the path and go down the road of getting my MBA?

I sought out a lot of information from mentors, people I worked with. I was very fortunate in my first job, where they would pay for my college tuition for my Master's degree. I was really excited about that, and that's also another one of the reasons that it probably really drove me, within six months of graduating and getting a job, to go back in and be like, "I want to work full-time and get my Master's degree." For me, I sought out some mentors. My manager at the time, he had an MBA, and that was one of the things I asked him. I was like, "How did you decide?" He was like, it didn't really matter as much in his eyes from what he'd seen, though with him having an MBA, he was a little bit biased. For him, he was like, "I liked it." The thought process around having an MBA, being able to say MBA in your title, is a little bit different and has more of a pop, I guess, from his perspective. Another one of the mentors I talked to actually had a Master's of Science, and he was an upper-level director at the organization I worked for, and talking with him, he was along the same path of not necessarily saying that an MBA was going to matter. He was more that it matters that you finish. After I had some advice from that perspective, it really let me go in, and the way that I actually chose is like, all right, it's not going to hurt my career path either way.

I really went through, and I looked at the programs. I compared the programs, compared how long it would take. It would take me a little bit longer to go through the Master's of Science, and really, it was more about some of the classes and some of the cool things. I've always had a knack for business. I really liked accounting when I had accounting classes previously in my undergrad. It really gave me an opportunity to dial in and look at some of the things across business from an economics perspective. There were also some of the computer information systems classes, because you still have a focus. You have an MBA, but you have a focus, and I focused on that. I was able to do some things with more Java classes, because at the time, I liked Java. [Laughs] Some of the things with databases and information systems. Really, there were some cool things that were going on around healthcare that piqued my interest as well, too. I chose that path, so I know people have sought out advice, and looked around and asked, should we do this? Should I go for a data engineering degree or a data science degree? It really depends on what you want to do, and I don't think, just like with my journey, picking one or the other is going to hurt you down the road. Going back to the blunt advice I got from a senior director: it matters more whether you finish it.

When you start out on that journey, make sure that you complete it and keep going. That's not to say, for anybody that's watching this, that you have to get a Master's degree to be able to succeed within your role. We've proven that so much, especially in tech. There's folks without Bachelor's degrees, without Master's degrees, and even without high school diplomas. It's more about how creative you can be and how much you can focus, and really pull yourself into your craft, whether it be development, whether it be analytics, or wherever you want to go. Just for me, for that journey, being a later student, I really wanted to go back and prove to myself that I could finish and stick it out. Being one of the first in my generation, in my family, to go and have that Master's degree also was really awesome. It's a personal decision, but I'm sharing it with everybody here. Everybody's situation's different. Happy to give advice, happy to talk through it all, but that's my story, my journey. If you have any questions, put them in the comments section here below or reach out to me on bigdatabigquestions.com. I'll do my best to answer those, and I'll see you again next time on Big Data Big Questions.

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Career Tagged With: College, Data Engineer Degree, Masters of Science, MBA

Ultimate Battle Tensorflow vs. Hadoop

October 4, 2019 by Thomas Henson 1 Comment

Tensorflow vs. Hadoop

The Battle for #BigData 

This post has been a long time coming!

Today I talk about the difference between Tensorflow and Hadoop. While Hadoop was built for processing data in a distributed fashion, there are some comparisons with Tensorflow. One is that both trace their roots to Google: TensorFlow was developed and open sourced by Google, while Hadoop grew out of Google's MapReduce and GFS papers. Another one is that both were created to bring insight to data, although they have different approaches to that mission.

Who now is the king of #Bigdata? To be fair the comparison is not like for like, but most of the time the two are pitted against each other as if it has to be one or the other. Find my thoughts on Tensorflow vs. Hadoop in the latest episode of Big Data Big Questions.

Transcript – Ultimate Battle Tensorflow vs. Hadoop

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question is really a conversation that I heard from, actually, my little brother when he was talking about something that he heard at a conference. He brought it to my attention. “Hey, Thomas, you’re involved in big data. I was talking to some folks at a GIS conference around Hadoop and TensorFlow.” He’s like, “One person came up to me and said, ‘Ah! Hadoop’s dead. It’s all TensorFlow now.” I really wanted to take today to really talk about the differences between Hadoop and TensorFlow, and just do a level set for all data engineers out there, all big data developers, or people that are just interested in finding out. “Okay, what’s happening in the marketplace?” Today’s question is going to come in around TensorFlow versus Hadoop and find out all the things that we need to know from a data engineering perspective. Even in the end, we’ll talk about which one’s going to be around in five years. Find out more right after this.

Welcome back. Today, as promised, what we're going to do is, we're going to tackle the question around which is better, what are the differences between TensorFlow and Hadoop, where does each fit in data analytics, the marketplace, and solving the world's problems? If you're watching this channel, and you're invested in the data analytics community, you know how we feel about it, and we're passionate about being able to solve problems using data. First thing we're going to do is break them down, and then at the end, we're going to talk about some of the differences, where we see the market going, and which one is going to make it in five years. Or, will both? Who knows. First, what is TensorFlow? We've talked about it a good bit on this channel, but TensorFlow is a framework to do deep learning. Deep learning is a subset, a branch of machine learning, and it's all about processing data. The really cool thing about TensorFlow, and the reason TensorFlow and frameworks similar to TensorFlow in the deep learning realm are so awesome, is because it gives you the portability to run and analyze your data on your local machine or even spread it out in a distributed environment. It comes with a lot of different algorithms and neural networks that you can use and incorporate into solving problems. One of the cool things about deep learning is the ability to actually look at and analyze more video data or voice recognition, right? Or, if you're going on Instagram or you're going on YouTube, and you're looking for examples on deep learning, chances are somebody's going to build some kind of video or some kind of photo identification that will help you identify a cat. That's the classic example that you'll see, is, "Hey, can we detect a cat by feeding in data, and looking at, and analyzing this?" TensorFlow doesn't use Hadoop, but TensorFlow uses big data. You use these large data sets to train your models that can then be used on edge devices.
If you've ever used a drone, or if you've ever used a remote control that uses natural language processing to change the channel, then you've used some portion of deep learning or natural language processing. Not saying it's TensorFlow, but that's what TensorFlow really does. It's very popular, developed by Google, open sourced, and housed by Google. There are a lot of free resources out there, and for data scientists and machine learning engineers, it's a very, very exciting product to be able to build with and start analyzing your data quicker. Couple together the excitement for deep learning, couple together the ease of use of TensorFlow, and that's why the market has just been hot for TensorFlow and those other frameworks.
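The layer-upon-layer processing described above can be sketched in a few lines. This is a toy, framework-free illustration (plain Python with hand-picked, hypothetical weights), not actual TensorFlow code: a deep learning framework automates exactly this kind of weighted-sum-plus-activation math, at massive scale and on GPUs.

```python
def relu(x):
    # Rectified linear unit: the most common activation in deep networks
    return max(0.0, x)

def dense_layer(inputs, weights, biases):
    # One fully connected layer: each neuron takes a weighted sum of all
    # inputs, adds a bias, then applies the non-linear activation.
    return [
        relu(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hand-picked weights for a 2-input, 2-neuron layer (purely illustrative)
inputs = [1.0, -2.0]
weights = [[0.5, 0.25], [-1.0, 0.75]]
biases = [0.1, 0.2]

print(dense_layer(inputs, weights, biases))  # → [0.1, 0.0]
```

Stack many of these layers and learn the weights from data instead of hand-picking them, and you have the core of a neural network like the cat detector mentioned above.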

What is Hadoop? Hadoop, it's all about elephants, right? Hadoop has really been around since, I don't know, we're probably in 12 to 13 years of it being open source, but if we think back to what we did for analyzing data that was coming in from the web, think about being able to index the entire web, Google's papers kind of helped inspire that, and Yahoo, and a lot of the other teams from Cloudera and Hortonworks, really helped to push Hadoop into the open source arena. Hadoop is synonymous with saying big data. You can't say big data without thinking about Hadoop. Hadoop's been around for a long time. There are a lot of different components to Hadoop, and even on this channel, whenever we talk about Hadoop, we're really talking specifically about the ecosystem. The ability to process data, but also the ability to store large amounts of data with HDFS, the Hadoop Distributed File System; there are a lot of components in there. There are APIs, and there are other tools that help you do it, but one of the things that I really like to think about when we talk about Hadoop, and why it was so record-breaking and really opened up the market for big data, was just the ability to set up distributed systems and be able to analyze large amounts of data. These large amounts of data would be more on the unstructured side, so think of it not being in a database, but a lot of it would still be text-based. A very popular example is going out there, setting up an API to pull in Twitter data, and being able to do sentiment analysis over that. Not so much the deep learning. They're trying to get into the deep learning area right now, but more machine learning, using algorithms like singular value decomposition or k-nearest neighbors, but being able to do that over large sets of data. Large sets of data with multiple machines. Hadoop, been around for a while, more seen as replacing the enterprise data warehouse.
With TensorFlow now on the scene, where does Hadoop fit in, and what’s going on, and what are some of the differences?
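The Twitter example above is the classic "word lists tell you if a tweet is happy or unhappy" job. Here's a toy, single-machine sketch of that idea (the word lists are made up, and a real cluster would fan this per-record scoring out across many machines with Spark or MapReduce):

```python
# Hypothetical word lists; a real sentiment model would be far richer
HAPPY_WORDS = {"love", "great", "awesome", "happy"}
UNHAPPY_WORDS = {"hate", "terrible", "awful", "unhappy"}

def sentiment(tweet):
    # Score one record: count happy words minus unhappy words
    words = tweet.lower().split()
    score = sum(w in HAPPY_WORDS for w in words) - sum(w in UNHAPPY_WORDS for w in words)
    if score > 0:
        return "happy"
    if score < 0:
        return "unhappy"
    return "neutral"

tweets = ["I love this awesome channel", "terrible traffic today", "just data"]
print([sentiment(t) for t in tweets])  # → ['happy', 'unhappy', 'neutral']
```

Because each tweet is scored independently, the work parallelizes naturally, which is exactly why this kind of analysis was an early flagship use case for Hadoop-style clusters.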

Hadoop was written in Java. TensorFlow was written in C++. Both of them have APIs. They give you the ability, whenever we're talking about the processing of data, to do it in Java, in Python, in Scala. There are a lot of different options there from a Hadoop perspective. TensorFlow, too. You can see C++. You can also see it in Python. Python's one of the more popular ones; I actually did a course using TFLearn and TensorFlow to show that. When we think about the tools, it's a little bit different. When we think about Hadoop, we're actually building out a distributed system. Then, we're using things like maybe Spark. Think of using Spark to be able to analyze that data. We're going to pull insight from that data, back to our sentiment analysis that's going to say, "Hey, these specific words in here, when we see them, this tweet is unhappy," or, "This tweet is happy." Versus TensorFlow, same thing. More of a processing engine, a framework, to be able to pull in, analyze the data, and give you insights on whether that image contained a cat or not a cat. You're starting to see some of the differences. We talked about Python versus Java. For both of them, there are different APIs that you can start to use. I'm probably speaking out of turn right now in saying that I haven't seen a lot of Java and TensorFlow, but I'm sure somebody has an API or some kind of framework out there that works on it. Another big difference, too, is the way that the processing is done. The Hadoop ecosystem's really trying to get into it right now, but from a TensorFlow perspective, we're really seeing it on GPUs, right? Think of being able to use GPUs to process data 10-20x faster than what we see on a CPU. Where Hadoop is more CPU-based, the way that we're solving problems with Hadoop is we're throwing a lot of CPUs in a distributed model to process the data and then pull it back in. TensorFlow, same thing, distributed networks.
As you start to scale out your data, you really need to distribute those systems, but we're doing it with GPUs. That's speeding up the process. A little bit of a difference there, just in the approach, but that's one of the big key differences. If we're a data engineer, and we're evaluating these, where do they come in? Ease of use: with Hadoop, you're building out your distributed system. It's really Java-based, so if you have a Java background, it's really good, but you can get by without it in some areas. It's really not so much of a comparison with ease of use, but if we're talking about just being able to stand something up and start messing around with it, it's going to be a little bit more complicated and harder to do from a Hadoop perspective. With TensorFlow, you can actually look at an NFS file system. You can feed in data from different file systems, where with Hadoop, you're building that system out, and also building out a file system. You're building out distributed systems, and you're building out disaster recovery and some of the other components. It's harder to do from a Hadoop perspective, but there's more expertise in it, because you're actually building out a whole solution set, versus TensorFlow being the processing system that you're using. The comparison from that perspective, if somebody tries to talk to you about it, is to kind of explain that these are two different systems, right? When we're talking about which one we're using, that's really what it comes down to. If you're looking at a project, and somebody says, "Hey! Should we use TensorFlow here, or Hadoop?" It's going to be pretty easy to spot, I think, because when you start to look at them, if you think of Hadoop, think of something that's replacing or falling in line with the enterprise data warehouse. What are we doing? Do we have massive amounts of data? It could be structured, semi-structured, but you're wanting to offload, and you're wanting to run huge analytics over that processing.
Then, that's probably going to be a Hadoop perspective. We're probably building out that system when we think of the traditional enterprise data warehouse. That's the bucket that we're going to fall in. If we're talking about doing some sort of artificial intelligence or doing some things with deep learning, maybe not so much in the machine learning area, you're going to want to look at TensorFlow. Especially, listen for keywords like, hey, what are we doing from the perspective of images, or video, or voice? Any of those media-rich types of data, then you're probably going to use TensorFlow, too. If you have machine learning engineers, a data scientist, and you're trying to do rich media, TensorFlow's going to be your really popular one. If you have more data analysts, and even your data scientists, but from the perspective of, we're looking at large amounts of data and wanting to marry it, but we have it in some kind of structure and some kind of standardized system, then Hadoop may be your bucket.
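That divide-and-conquer idea, split the data, process the chunks independently, then combine the results, is the same whether the workers are CPUs in a Hadoop cluster or GPUs training TensorFlow models. Here's a minimal single-machine sketch of the pattern (hypothetical function names, not a real Hadoop or TensorFlow API):

```python
def split_into_chunks(data, n_workers):
    # Divide records among workers, as a cluster scheduler would
    return [data[i::n_workers] for i in range(n_workers)]

def map_chunk(chunk):
    # Each "worker" computes a partial result independently
    return sum(x * x for x in chunk)

def reduce_results(partials):
    # Combine the partial results into the final answer
    return sum(partials)

data = list(range(10))
partials = [map_chunk(c) for c in split_into_chunks(data, 3)]
print(reduce_results(partials))  # → 285, the sum of squares 0^2 + 1^2 + ... + 9^2
```

On a real cluster the chunks would live on different machines and the map step would run in parallel; the logic, though, stays this simple.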

Which one of these is going to be around in five years? I think they'll both be around, but I will say that the popularity of Hadoop will continue to some degree, but it's more continuing to replace that enterprise data warehouse. Think of what you do from a traditional perspective in holding all your company's information. Where we're seeing more product development, more media-rich things being done from an artificial intelligence perspective, we'll see more TensorFlow there. Will TensorFlow still be the number one deep learning framework in five years? Will deep learning? I can't answer that here. Would I learn it if I were just starting out as a data engineer? Yeah, definitely. Definitely from the perspective of, I want to learn how to implement it and how to use it. You don't have to become an expert. We're not trying to become a data scientist from that perspective, but start looking at some of the frameworks, and building out, going through some of the simple examples that they have, and then heavy use of Docker, containers, and that whole world of being able to build those out. That'll help you if you're really trying to look into, hey, what could be next for data engineers? Or, what's going on now? What's cutting edge from that perspective? I hope you enjoyed this video. Please, if you have any comments on it, if I missed something, put it in the comments section here below. I'm always happy to carry on the discussion. Until next time, see you again on Big Data Big Questions.

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Tensorflow Tagged With: Data Engineering, Hadoop, Tensorflow

What I’m Learning Report #1 (Docker Deep Dive, K8s, & More)

October 3, 2019 by Thomas Henson Leave a Comment

One question I get a lot on Big Data Big Questions is "Thomas, what are you learning?" Honestly, not as much as I should. It's true, I believe the key to being successful in any part of life is to continually learn.

Looking to change careers from Web Developer to DevOps?

Do you want to be a better partner or spouse?

Trouble with public speaking?

All the answers to the above questions start with learning and end with consistency. If you make it a habit to learn and are consistent with it, there isn't anything you can't accomplish. Alright, enough with selling you on learning! I wanted to share with you what I'm learning and to help motivate myself TO KEEP LEARNING. The way I plan to accomplish this is with monthly learning reports.

30 Minutes of Learning Everydayish…

For a long time I've advocated for the idea of taking 30 minutes every day to learn something new. I'll go through time periods where I'm hitting that every day, and then sometimes I fall behind. While it's only a target and nothing to beat myself up about, I find it a useful technique when learning any concept. The way I do my 30 minutes of learning is to set a timer on my phone for 30 minutes and focus only on that topic for 30 minutes. Recently, in order to track this habit I've been using the Super Habit App. Below you can see how I've done over the last month. Honestly not my best work, but let's see how it improves over time.

What I'm Learning 30 Minutes

Pluralsight

Mainly my 30 Minutes of Learning comes from Pluralsight courses. Not only am I an Author but I'm also a student. Pluralsight has been a part of my personal learning path since long before I was an Author. Back when I was a fresh new Web developer I used Pluralsight to learn C# and the ASP.NET framework. Of course, I also dove into the world of JavaScript, jQuery, and other JS frameworks. Nowadays I still learn with Pluralsight but the content is more Data Engineering and IT Ops focused.

Docker Big Picture Course

Docker seems to be taking over the world. In fact, contributions to and adoption of Docker and Kubernetes have outpaced Hadoop exponentially. So many new applications and services offer a containerized version. In the Hadoop 3.0 release multiple features were added to support containers. All this container talk has pushed me to learn this amazing platform for OS-level virtualization. My guide on this journey is the great Nigel Poulton. Checkout my notes from the Docker Big Picture Course:

  • Kubernetes originated out of Google (shocking I know)
  • Kubernetes is Greek for helmsman or Captain
  • Web Playground for K8s
  • Web Playground for Docker
  • Docker Engine – daemon –> containerd –> OCI
  • Docker has both community and enterprise versions
  • Kubernetes = K8s

Docker Deep Dive Course

After working my way through the Docker Big Picture Course I decided to stay in the Docker world by watching the Docker Deep Dive Course. I loved this course where I was able to get hands on with Docker and learn a good bit beyond the basics. Here are a few notes I jotted down throughout this course:

  • Docker Commands like
    • List containers – docker ps
    • Pull Image – docker image pull
    • Build Image – docker image build
    • Run Image – docker container run (docker exec runs a command inside an already running container)
  • Building Docker images is as easy as writing a Dockerfile (YAML comes in with Docker Compose and Kubernetes)
  • YAML = YAML Ain't Markup Language (originally Yet Another Markup Language)
  • Docker networking with the bridge driver on Linux or the NAT driver on Windows
  • Stacks and Services – code –> container –> services
  • Docker Universal Control Plane (UCP) is installed on top of Docker Engine
  • Docker Trusted Registry can be set up for storing all Docker Images.
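Since the notes above touch on image builds, here's a minimal, hypothetical Dockerfile sketch (the app and file names are made up). A Dockerfile is its own instruction format; YAML shows up with Docker Compose stacks and Kubernetes manifests rather than image builds:

```dockerfile
# Hypothetical minimal Dockerfile for a small Python app

# Start from an official base image
FROM python:3.8-slim
# Set the working directory inside the image
WORKDIR /app
# Copy application code into the image
COPY app.py .
# Default command when a container starts from this image
CMD ["python", "app.py"]
```

You'd build it with `docker image build -t my-app .` and start it with `docker container run my-app`, matching the commands from the notes.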

What I'm Learning

Data Related Podcast & Blogs

Data Engineer learning doesn't only take place in courses. I also wanted to track some of the Podcasts and articles I consumed throughout the month. The great thing about Podcasts is you can listen to them while commuting, working out, doing house chores, or just about anywhere. Here are a few Podcasts and Articles I've consumed over the past month.

  • Conversational AI Best Practices with Cathy Pearl and Jessica Dene Earley-Cha – GCP podcast that digs into the aspects of conversational AI. I loved this podcast to explore where conversational AI is going and where to get started with NLP in GCP. Actually gave me some ideas for my October learning goals.
  • Microservices.io – Uh I can’t even begin to summarize how much content is on this site. If you are looking to learn more around Microservices (which you should!!) then bookmark this site and read this content over time.
  • Doctor AI – Dell Tech (full disclosure: #iworkforDell) podcast diving into different topics around AI. In this episode host Jessica explores the possibilities of AI augmentation in the medical field. One of the areas I've spent a good bit of research in and spoken about. Earlier this year I spoke with a group of Medical Doctors and Researchers at NYU about advances in AI.
  • Exploring AWS Lake Formation – AWS podcast with guests from around the AWS world. A lot of great content on this podcast. I listened to this particular episode while walking my son, so my attention wasn't what it should have been. Mostly I remember that AWS Lake Formation is an AWS service that helps with cataloging and labeling data to support multiple services (MySQL, Redshift, S3).

On To Next Month

Thanks for supporting this new series and I'm excited to see how it matures over time. I would also love to get more consistent with my learning as well. If you have ideas for things I should be learning or would like to share what you are learning, put it in the comments below. Right now my thoughts are to wrap up the Kubernetes Deep Dive course then move on to Natural Language Processing (NLP). I've got ideas for some really cool projects in NLP so it should be fun.

Filed Under: Article Tagged With: AI, AWS, Docker, GCP, K8s, Kubernetes, Learning

Data Engineers: Data Science vs. Computer Science Degree

October 2, 2019 by Thomas Henson Leave a Comment

Data Science vs. Computer Science Degree

How Do you Choose the Right Degree?

College is such a tough time when it comes to choosing education paths. For most folks College marks the first time they are making huge decisions about their futures. So it's easy to get analysis paralysis because the decision means so much. Or does it? At the end of the day it feels bigger than the decision really is over the long term.

The difference between a Data Science Degree and Computer Science degree might impact career outlook in the short term. The long term impact of which degree you choose is minimal. Look around at the number of positions where degrees aren't even a requirement. When I was working on my first Big Data project our Data Scientist didn't have a degree in Data Science, but he was great in that role. Now I will say that Data Science degrees haven't been around that long, so it kind of makes sense.

Find out my thoughts on the differences between a Data Science Degree and Computer Science Degree in the video below.

Video Data Science vs. Computer Science Degree

Transcript – Data Science vs. Computer Science Degree

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today’s question comes in from a user, and it’s all about, what specific Master’s degree should I get? Find out how I answer this question and what Master’s degree you should get or should not get if you’re going into data engineering.

Today’s question. If you have a question, make sure you put it in the comments section here below. Reach out to me on thomashenson.com/bigquestions. Find me on Twitter, whatever you want to do, and I’ll do my best to answer your questions right here.

Today’s question: I’m looking for a career as a data engineer, but I’ve got a Bachelor’s in IT, and I’m looking to get into a Master’s degree. Awesome! Congratulations. It’s a pretty cool thing to go through. I went through a Master’s program as well. Which is better for a data engineering career? Thinking about that specifically: a Master’s degree in data science or a Master’s degree in computer science?

This question really hits home for me. I remember what it was like trying to figure out which kind of Master’s I wanted to pursue. I had a similar situation. It wasn’t specifically data engineering from that perspective, but I was looking to see, what do I want to do to take my engineering career to the next level? I looked at an MBA with an emphasis in information systems versus a Master of Science in Computer Science. I ended up choosing to continue down the business path and getting my MBA in information systems. I’m pretty excited to have gone through that, and really happy with my decision. I feel like it’s been good for my career. I understand where you’re coming from. I’m not telling you to get an MBA. That’s not what I’m saying. I understand how much you go back and forth, and you’re like, man! What do I do here? I appreciate you asking for my opinion as well.

Which one should you get if you’re going into data engineering? It’s an easy guess for me here, just to say, “I think computer science and the skills that are involved in computer science are going to help.” If I were in your shoes, I would pivot more towards computer science. I would also look into this, though: there are universities and other programs starting to emerge that actually have a data engineering track. Just like you were asking about, should I do the data science? In my opinion, if you’re not trying to go down the data science path, maybe don’t go into that. But if they do have a track specific for data engineers, which is a newer kind of program a lot of universities and colleges around the globe are starting to offer, I’d look into that. Otherwise, I’d probably stay with the computer science track.
However, like I said, there are some universities that are putting out a specific path, not data science, but a specific data engineering path, where you’re going to go through more systems administration material, where you’re going to be building out programs that analyze data, and really focusing on distributed systems, whether it be Kubernetes and containers or different clouds like AWS. Building out good data pipelines and really understanding what you’re doing from that perspective. I think I’d look into that, and make sure you’re looking at some of those degrees.

One more bonus tip as you’re going through that. At the university you’re looking at, I would definitely have a conversation with some advisors, and even some of the professors in the data science world or the computer engineering world, and see if you can cross over. Maybe there’s an opportunity to do something interdisciplinary. Maybe you can take a couple of the data science courses, because they would be really good for you to get exposure to, not to become a data scientist, but to see what goes on, on the data science side, and have those packaged together with your computer science courses. I’m not asking you to take a double load. Hopefully there’s a crossover there, where it’s like, “Hey, I can pick and choose some of these.” With data engineering and the boom that’s going on in those careers, if you look globally, we need more data engineers. The universities will be pretty excited, especially for somebody standing out like that. Worst case scenario, what are they going to do? Your professors may tell you no, but they’ll see that you’re engaged and interested in data engineering, so they’re going to look out for you. Maybe there are new classes coming up. What about internships, right? Some of these universities have really good relationships with corporations. Your name is already at the top of the list, and you’re showing initiative: hey, I’m excited about the data engineering world. So any opportunities to learn more, or any opportunities for future career growth, might come your way. Something as simple as taking an hour to reach out and talk to a professor may be investing in yourself and in your career further down the road. Definitely try that out. Should you get a Master’s degree to become a data engineer?
You don’t have to, but like I said, I’ve got a Master’s degree, and I went through that for my own purposes. If you’re watching this video, I hope you’ve made it all the way to the end. This was a specific question about different degree options for your career. We’re not saying that you have to get a Master’s in Computer Science to become a data engineer. Heck, you can even do the Master’s in Data Science and become a data engineer. This is just my advice for this particular question. There are other data engineers that don’t have degrees. We’ve covered that quite a bit on this channel, and I just want to be clear about that. I don’t want people watching this, especially if you’re in college, or if you’re in high school and starting to think about your data engineering path, thinking, “Aw, man! I’ve got to go get a Master’s degree to do this and be in it for the long haul.” That’s not what we’re talking about here. We’re just talking about options. Let me know if you have any questions about degrees, certifications, anything data engineering or technology-specific, and I will answer it on the next episode of Big Data Big Questions.

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer Data Engineering questions from the community.

Filed Under: Career Tagged With: Computer Science, Data Engineer, Data Science

Copyright © 2023 · eleven40 Pro Theme on Genesis Framework · WordPress · Log in