Data Scientist Archives - Thomas Henson

What does a Data Scientist do?

Data Scientists are changing the world, but what do they really do? Basically, a Data Scientist’s job is to find correlations in data that might be able to predict outcomes. Most of their time is spent cleansing data and building models using their heavy math skills. The development, architecture, and cluster management are run by the Data Engineer. If you like math and love data, then Data Scientist might be the right career path for you. Recently Deep Learning has emerged as a hot field within the Data Science community. Let’s explore some of the basic terms around Deep Learning.

What is Deep Learning?

Deep Learning is a form of Artificial Intelligence where Data Scientists use Neural Networks to train models. Neural Networks are composed of three types of layers (input, hidden, and output) that allow models to be trained by mimicking the way our brains learn. In Deep Learning the features and the weights of those features are not explicitly programmed, but learned by the Neural Network. If you are looking to compare Machine Learning to Deep Learning, just remember: Machine Learning is where we define the features, but Deep Learning is where the Neural Network decides the features. For example, a Machine Learning dog breed detection model would require us to program features like ear length, nose size, color, height, weight, fur, etc. In Deep Learning we would allow the Neural Network to decide the features and the weights for those features. Most Deep Learning environments use GPUs to take advantage of their ability to run many computations in parallel much faster than CPUs.

Must Know Deep Learning Terms

So are you ready to take on the challenge of Deep Learning? Let’s start out with learning the basic Deep Learning terms before we build our first model.

#1 Training

The easiest way to understand training in Deep Learning is to think of it as testing. In software development we talk about test environments and how you never code in production, right? In Deep Learning we refer to test as our training environment, where we allow our models to learn. If you were creating a model to identify breeds of dogs, the training phase is where you would feed the input layer millions and millions of images. During this phase both forward and backward propagation allow the model to be developed. Then it’s on to #2…

#2 Inference

If the training environment is basically test/development, then inference is production. After building our model and throwing terabytes or petabytes of data at it to get it as accurate as we can, it’s time to put it in production. Now typically in software development our production environment is larger than test/development. However, in Deep Learning it’s the inverse, because these models are typically deployed on edge devices. One of the largest markets for Deep Learning has been autonomous vehicles, with the goal of deploying these models in vehicles around the planet. While most Data Engineers would love to ride around in a mobile data center, it’s not going to be practical.

#3 Overfitting

Data Scientists and Machine Learning Engineers can get so involved in solving a particular problem that the model they create will only solve that particular data set. When a model follows a particular data set too closely, the model is overfitted to the data. The problem is common because as Engineers we know what we are looking for when we train the models. Typically overfitting can be attributed to making models more complex than necessary. One way to guard against overfitting is to never train on the test set. No, seriously, never train on the test set.
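To make the "never train on the test set" rule concrete, here is a minimal sketch in plain Python. The toy data and the `memorizer` model are made up for illustration: the model simply memorizes its training examples, which is an extreme case of overfitting. It scores perfectly on the data it has seen and worse on the held-out test set.

```python
import random

random.seed(0)

# Toy data (hypothetical): y is roughly 2 * x plus some noise.
data = [(x, 2.0 * x + random.gauss(0.0, 1.0)) for x in range(40)]
random.shuffle(data)

# Hold out the last 25% as a test set that training never sees.
split = int(len(data) * 0.75)
train_set, test_set = data[:split], data[split:]


def memorizer(x):
    """Overfitted 'model': return the label of the closest training example."""
    nearest = min(train_set, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def mean_squared_error(model, dataset):
    """Average squared difference between predictions and true labels."""
    return sum((model(x) - y) ** 2 for x, y in dataset) / len(dataset)


train_error = mean_squared_error(memorizer, train_set)
test_error = mean_squared_error(memorizer, test_set)
print(f"train error: {train_error:.3f}, test error: {test_error:.3f}")
```

If the test examples had leaked into `train_set`, both errors would be zero and the overfitting would go unnoticed.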

#4 Supervised Learning

Data is the most valuable resource, behind people, for building amazing Deep Learning models. We can train on the data in two ways in Deep Learning. The first way is Supervised Learning. In Supervised Learning we have labeled data sets where we understand the outcome. Back to our dog breed detector: we have millions of labeled images of different dog breeds to feed into our input layer. Most Deep Learning training is done with Supervised Learning. Labeled data sets are hard to gather and take a lot of time from the Data Science team. Right now Data Wrangling is something we still have to spend a majority of our time doing.
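As a rough illustration of what "labeled" means, the sketch below trains a very simple classifier on a handful of made-up (height, weight) measurements that each carry a breed label. This is a nearest-centroid model, not a Neural Network, and every number in it is hypothetical. The point is only that the labels are what the model learns from.

```python
from collections import defaultdict

# Hypothetical labeled examples: (height_cm, weight_kg) -> breed label.
labeled_data = [
    ((25.0, 3.0), "chihuahua"),
    ((22.0, 2.5), "chihuahua"),
    ((60.0, 30.0), "labrador"),
    ((58.0, 32.0), "labrador"),
]


def train(examples):
    """Average the features of each label to get one centroid per breed."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=squared_distance)


centroids = train(labeled_data)
print(predict(centroids, (24.0, 2.8)))
```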

#5 Unsupervised Learning

The second form of learning in Deep Learning is Unsupervised Learning. In Unsupervised Learning we don’t have answers or labeled data sets. In our dog breed application we would feed in the images without labels identifying the breeds. If Supervised Learning is costly because labeled data is hard to find, then Unsupervised Learning is the easier form. So why not only use Unsupervised Learning? The reason is simple…we just aren’t quite there from a technology perspective. Back in July I spoke with Wayne Thompson, SAS Chief Data Scientist, about when we will achieve Unsupervised Learning. He believes we are still 5 years out from a significant breakthrough in Unsupervised Learning.
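For contrast, here is a small sketch of learning without labels. It uses k-means clustering (a classic unsupervised technique, swapped in here because it fits in a few lines) on made-up dog heights. Nobody tells the algorithm which dog is which breed; it just discovers that the values fall into two groups.

```python
def k_means_1d(values, iterations=10):
    """Split unlabeled numbers into two clusters by repeatedly re-centering."""
    centers = [min(values), max(values)]  # simple starting guess
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for value in values:
            # Assign each value to whichever center is closer.
            nearest = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            clusters[nearest].append(value)
        # Move each center to the mean of the values assigned to it.
        centers = [sum(cluster) / len(cluster) for cluster in clusters]
    return centers, clusters


# Hypothetical heights in cm, with no breed labels attached.
heights = [22, 25, 24, 58, 60, 61]
centers, clusters = k_means_1d(heights)
print(clusters)
```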

#6 Tensorflow

Tensorflow is the most popular Deep Learning framework right now. The Google Brain team released Tensorflow to the open source community in 2015. Tensorflow is a Deep Learning framework that packages together the execution engines and libraries required to run Neural Networks. Tensorflow can run on both CPUs and GPUs, but the GPU is the chip of choice in Deep Learning.

#7 Caffe

Caffe is an open source, highly scalable Deep Learning framework. Both Python and C++ are supported as first-class languages in Caffe. Caffe was originally developed at UC Berkeley, and its successor Caffe2 was developed and heavily supported by Facebook. In fact a huge announcement was released in May 2018 about the merging of Pytorch and Caffe2 into the same codebase. Although Caffe is widely popular in Deep Learning, it still lags behind Tensorflow in adoption, users, and documentation. Still, Machine Learning Engineers should follow this framework closely.

#8 Learning Rate

Learning Rate is a parameter that controls the size of each step taken while searching for the minimum of the loss function. In Deep Learning the learning rate is one of the most important settings for calculating the weights of the features in your models. Using a lower learning rate in general provides more accurate results but takes a lot more time, because it takes smaller steps toward the minimal loss. If you were walking on a balance beam, you could take smaller steps to ensure foot placement, but it would also increase your time on the balance beam. It’s the same concept with Learning Rate, except we are just taking a longer time to find our minimal loss.
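The balance beam idea can be sketched with plain gradient descent on a toy loss function. Everything here is an assumption for illustration: the loss is (w - 3)^2, whose minimum sits at w = 3, and `steps_to_minimum` is a hypothetical helper that counts how many steps it takes to get close to that minimum.

```python
def steps_to_minimum(learning_rate, start=0.0, tolerance=1e-3, max_steps=10_000):
    """Run gradient descent on loss(w) = (w - 3)^2 and count the steps
    needed to land within `tolerance` of the minimum at w = 3."""
    w = start
    for step in range(max_steps):
        if abs(w - 3.0) < tolerance:
            return step
        gradient = 2.0 * (w - 3.0)      # derivative of (w - 3)^2
        w -= learning_rate * gradient   # step size is set by the learning rate
    return max_steps


large_rate_steps = steps_to_minimum(0.1)
small_rate_steps = steps_to_minimum(0.01)
print(f"learning rate 0.1:  {large_rate_steps} steps")
print(f"learning rate 0.01: {small_rate_steps} steps")
```

The smaller learning rate takes roughly ten times as many steps to reach the same spot. On this toy loss both rates get there; on real models a rate that is too large can overshoot the minimum, which is why the careful small steps are often worth the extra time.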

#9 Embarrassingly Parallel

Embarrassingly Parallel is a commonly used term in High Performance Computing for problems that can easily be parallelized. Basically, Embarrassingly Parallel means that a problem can be split into many, many independent parts that are computed at the same time. An example of Embarrassingly Parallel would be how each image in our dog breed application could be processed independently.
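Here is a minimal sketch of that idea using Python's standard `concurrent.futures` module. The tiny 2x2 "images" and the `average_brightness` function are made up for illustration. Because no image depends on any other image, the work can be handed to a pool of workers without any coordination.

```python
from concurrent.futures import ThreadPoolExecutor


def average_brightness(image):
    """Process one image on its own: average all of its pixel values."""
    pixels = [pixel for row in image for pixel in row]
    return sum(pixels) / len(pixels)


# Hypothetical 2x2 grayscale images (0 = black, 255 = white).
images = [
    [[0, 0], [0, 0]],
    [[255, 255], [255, 255]],
    [[0, 255], [255, 0]],
]

# Each image is an independent task, so the pool can run them side by side.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_results = list(pool.map(average_brightness, images))

# Same work done one image at a time, for comparison.
sequential_results = [average_brightness(image) for image in images]
print(parallel_results)
```

Threads keep the sketch simple. For heavy number crunching in Python you would more likely reach for separate processes or a GPU, but the shape of the problem is the same.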

#10 Neural Network

A Neural Network is a computing process that attempts to mimic the way our brains work. Neural Networks, often referred to as Artificial Neural Networks, are key to Deep Learning. When I first heard about Neural Networks I imagined multiple servers all connected together in the data center. I was wrong! Neural Networks live at the software and mathematical layer. It’s how the data is processed and guided through the layers in the Neural Network. See #17 Layers.

#11 Pytorch

Pytorch is an open source Machine Learning & Deep Learning framework (sound familiar?). Facebook’s AI group originally developed and released Pytorch for GPU accelerated workloads. Recently it was announced that Pytorch and Caffe2 would merge their two code bases. It’s still a popular framework to follow closely. Both Caffe & Pytorch were heavily used at Facebook.

#12 CNN

Convolutional Neural Network (CNN) is a type of Neural Network typically used for image recognition. CNNs use feed-forward processing that mimics the human brain, which makes them a good fit for recognizing images like in our dog breed application. The most popular Neural Network is the CNN because of its ease of use. Images are broken down pixel by pixel for processing with a CNN.
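The pixel by pixel processing can be sketched with the core operation inside a CNN: sliding a small grid of weights (a kernel) across the image. The image and kernel values below are made up. In a real CNN the kernel weights are learned during training rather than typed in by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the
    weighted sum at each position. Deep Learning frameworks call this a
    convolution, though strictly speaking it is a cross-correlation."""
    kernel_height, kernel_width = len(kernel), len(kernel[0])
    out_height = len(image) - kernel_height + 1
    out_width = len(image[0]) - kernel_width + 1
    output = []
    for row in range(out_height):
        output_row = []
        for col in range(out_width):
            total = 0
            for i in range(kernel_height):
                for j in range(kernel_width):
                    total += image[row + i][col + j] * kernel[i][j]
            output_row.append(total)
        output.append(output_row)
    return output


# Hypothetical 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds where brightness jumps from left to right.
vertical_edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge_kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The middle column lights up because that is where the dark-to-bright edge sits, which is the kind of low-level feature early CNN layers tend to pick up.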

#13 RNN

Recurrent Neural Networks (RNN) differ from Convolutional Neural Networks in that they contain a recurring loop. The key for RNNs is the feedback loop, which carries information from previous steps forward into the current one. During training the feedback loop helps train the model based on previous runs and the desired outcome. RNNs are primarily used with time series data because of this ability to loop through a sequence.
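A bare-bones sketch of that loop is below. It is a single recurrent unit with hand-picked, hypothetical weights (a real RNN learns them). The thing to notice is that the hidden state from the previous time step is fed back in alongside the new input.

```python
import math


def run_rnn(sequence, input_weight=0.5, recurrent_weight=0.8, bias=0.0):
    """Feed a sequence through one recurrent unit and return the hidden
    state after each time step."""
    hidden = 0.0
    states = []
    for value in sequence:
        # The previous hidden state loops back in next to the new input.
        hidden = math.tanh(input_weight * value + recurrent_weight * hidden + bias)
        states.append(hidden)
    return states


# A single spike followed by silence (hypothetical time series).
states = run_rnn([1.0, 0.0, 0.0])
print([round(state, 3) for state in states])
```

Even though the second and third inputs are zero, the hidden state stays above zero and fades gradually. That leftover signal is the network "remembering" the earlier spike.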

#14 Forward Propagation

In Deep Learning forward propagation is the process of weighting each feature to test the output. Data moves forward through the neural network in the forward propagation phase. In our dog breed example, assume a feature of tail length and assign it a certain weight for how much it matters in determining dog breed. After assigning a weight to the feature, we then calculate whether the assumption was correct.
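As a small sketch of one forward pass, the code below pushes two made-up features (tail length and ear length, scaled to the 0 to 1 range) through a single neuron. The weights and bias are hypothetical starting guesses, and the sigmoid function is one common choice for squashing the result into a score between 0 and 1.

```python
import math


def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(features, weights, bias):
    """One neuron: weight each feature, add them up, then squash."""
    weighted_sum = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(weighted_sum)


features = [0.8, 0.3]   # hypothetical tail length and ear length
weights = [0.5, -0.2]   # how much each feature matters (initial guess)
bias = 0.1

prediction = forward(features, weights, bias)
print(f"score that this dog is the target breed: {prediction:.3f}")
```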

#15 Back Propagation

Back propagation, or backward propagation, is the part of training where we move backward through the neural network. This allows us to review how badly we missed our target. Essentially, we used the wrong weight values going in and the output was wrong. Remember, the forward propagation phase is about assigning weights and testing them; the back propagation phase tests why we missed so the weights can be adjusted.
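Here is a rough sketch of a single backward step for one neuron, under a few assumptions: the features, weights, target, and learning rate are all made up, the loss is the squared error, and the neuron uses a sigmoid. The chain rule tells us how much each weight contributed to the miss, and each weight is nudged in the direction that shrinks the error.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias):
    """Forward pass for a single neuron."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)


features = [0.8, 0.3]   # hypothetical inputs
weights = [0.5, -0.2]   # current (wrong) weights
bias = 0.1
target = 1.0            # the correct answer for this example
learning_rate = 0.5

# Forward: see how far off we are.
prediction = predict(features, weights, bias)
loss_before = (prediction - target) ** 2

# Backward: chain rule through the squared error and the sigmoid.
error_signal = 2.0 * (prediction - target) * prediction * (1.0 - prediction)
weights = [w - learning_rate * error_signal * f for w, f in zip(weights, features)]
bias = bias - learning_rate * error_signal

# Forward again with the adjusted weights.
loss_after = (predict(features, weights, bias) - target) ** 2
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

Training repeats this forward-then-backward cycle many times across many examples.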

#16 Convergence

Convergence is the process of moving closer to the correct target or output. Essentially, convergence is where we find the best solution for our problem. As Neural Networks continue to run through multiple iterations, the results will begin to converge as they reach the target. However, when results take a long time to converge, it’s often called poor convergence.

#17 Layers

Neural Networks are composed of three distinct types of layers: input, hidden, and output. The first layer is the input, which is our data. In our dog breed application all the images, both with and without a dog, are our input layer. Next is the hidden layer, where features of the data are given weights. For example, features of our dogs like ears, fur, color, etc. are tried with different weights in the hidden layer. The hidden layer is also where Deep Learning received its name, because the hidden layers can go deep. The final layer is the output layer, where we find out if the model was correct or not. Back to our dog breed application: did the model predict the dog breed or not? Understanding these 3 layers of a Neural Network is essential for Data Scientists using Neural Networks.
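To see the three layers side by side, here is a tiny sketch of a network with 3 inputs, 2 hidden neurons, and 1 output. All of the numbers are made up, and `dense_layer` is a hypothetical helper name. A real Deep Learning model would stack many more hidden layers and learn the weights during training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases):
    """One fully connected layer: each row of weights feeds one neuron."""
    return [
        sigmoid(sum(i * w for i, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]


# Input layer: the raw data (3 hypothetical features for one dog).
input_layer = [0.8, 0.3, 0.5]

# Hidden layer: 2 neurons, each with one weight per input feature.
hidden_weights = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.6]]
hidden_biases = [0.0, 0.1]

# Output layer: 1 neuron that reads the 2 hidden values.
output_weights = [[0.5, -0.3]]
output_biases = [0.05]

hidden_layer = dense_layer(input_layer, hidden_weights, hidden_biases)
output_layer = dense_layer(hidden_layer, output_weights, output_biases)
print(f"hidden: {hidden_layer}, output: {output_layer}")
```

Going "deep" just means adding more `dense_layer` calls between the input and the output.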

Want More Data Engineering Tips?

Sign up for my newsletter to be sure you never miss a post or YouTube episode of Big Data Big Questions, where I answer questions from the community about Data Engineering.

Will AWS Certification Help Data Scientist?

Every discipline in IT has different certifications, and the debate about the worth of those certifications will go on forever. Data Scientists need crossover skills in coding, operations, and math. However, the Data Scientist isn’t the only person on the Big Data Team. The Data Engineer tends to build and maintain the application, leaving the data modeling to the Data Scientist. With that division of labor, should a Data Scientist get an AWS Certification?

In this video I will explore the requirements for Data Scientists and even break down a job posting from AWS for a Data Scientist. Watch now to find out about AWS certifications for Data Scientists.

Transcript – AWS Certification Required for Data Scientist?

Hey, how are you doing today? My name’s Thomas Henson, and welcome to another episode of Big Data Big Questions. And so, today I’ve got a very special question that came in from a user that we’re going to tackle. It’s about a certification in AWS. So, do you need an AWS Certification to be a data scientist? And so, I’m going to tackle that question. And then, we’re also, actually, going to look and try to see some job descriptions out there that are posted and see what those job descriptions are, and how I would approach it, and where my thoughts are on the AWS Certification for data scientists, and looking all into the job description, too. So, find out all about this, right after this.

So, today’s question comes in from YouTube. So, I’ve got the question here. Before we start and jump into the question, I do want to remind you, if you have any Big Data Big Questions and you would like them answered, put them in the comments section here below, throw it out on Twitter using the hashtag Big Data Big Questions, or go to our website and you can send me any kind of question that you want and I’ll try my best to answer them here for you. Also, make sure you subscribe and hit the notification button so you never miss an episode and never miss when your question gets answered all here on YouTube.

So, today’s question comes in from… It’s on my Cloudera Data Engineering Certification question. So, we’re following along with certification questions here. So, he says, “Hi, will AWS Certified Solutions Architects Certification help in my data science career path?” So, this question is a large, large topic, right? So, I’m going to have to make some assumptions here and think, “Okay, so this person is looking for a career path into data science.” So, I’m thinking that they want to become a data scientist. And so, what they’re saying is, “Hey, to become a data scientist, do I need to have the base level AWS Certification?” The quick and dirty answer is no, but that’s all going to depend, too. So, it’s going to depend on what the job description is. And towards the end of this video, we’ll actually go through and look at a job description specifically from Amazon and see, does even Amazon require AWS Certification for their data scientists?

So, jumping back into the question though, let’s assume that we’re not talking specifically just about getting an AWS Certification for a data science career path. Let’s say that it’s a more broad topic. Maybe, it’s going to be a data engineer or a machine learning engineer. So, with those topics, remember, those are more hands-on as far as the technology and implementing different packages, whether it be HDFS, Yarn, Kafka, Pig, Hive, doing some of the systems administration work, but also doing some of the hands-on machine learning work where you get to maybe implement some of the algorithms and do the tuning and coding there.

So, in those career paths, do you have to have the AWS Solutions Architect Certification? It’s going to depend there. So, the first thing I would do is find out what the basis of wanting to get that certification is. So, if you’re a data engineer or a machine learning engineer and you know that you’re… Say, within your company, you guys are using AWS, you’re using AWS for your big data projects, then it’s probably going to make sense for you to have some level of understanding of the AWS platform. And specifically, if you’re at a company where you’re required to get the AWS Big Data Certification, then yes, getting this lower level certification for the Solutions Architect Associate, that’s going to benefit you greatly because AWS now requires, for the AWS Big Data specialty, that you have a baseline certification. The Solutions Architect is one that’ll get you covered.

I will say that I have the AWS Solutions Architect Associate Certification. I was looking into the Big Data Certification there for AWS and doing some of the tactical things there. It was a great certification to give you those baseline skills because with my skill set, I came in, didn’t really have an understanding of all the offerings for AWS. Most of the stuff that I’ve always worked with is On-Prem. Outside of working at a company that’s using AWS or knowing that you’re applying for a position that requires that AWS Certification, I’m going to say that most of the time you’re not really going to have to have that AWS Certification. A lot of deployments… And you can look with… Hortonworks and Cloudera, a lot of their different deployments if you look, they’re overwhelmingly On-Prem. So, not saying that it’s going to hurt you for AWS, but if you’re trying really quickly, and this is a tactical decision to, within the next six months be able to roll into a position, odds are in your favor that you’re not going to be deploying it in AWS, or Azure, or even Google at this point.

So, I would look into maybe getting the Cloudera Certification or getting Hortonworks and just having that baseline information for the machine learning engineer, for your data engineer. Especially if you’re a data scientist, I don’t think that you’re going to need that.

So, let’s go back in and actually look at the question about data scientists and see what the job description is there. So, for this, let’s look… I’ve just pulled in here an Amazon data scientist position. This looks like it’s an Alexa position. So, you can see what they’re looking for is somebody that’s probably going to be able to do some kind of machine learning, maybe some deep learning on voice recognition as it comes in from Alexa and be able to provide some kind of prescriptive, or maybe even predictive analytics on it. But you can see, the majority of what they’re asking for is, they’re looking for… Yeah, they’re looking for some scripting languages here, so, maybe, some Perl or Python, or just some familiarity with those.

But it’s real heavy in the high-level techniques, right? Like what are we doing with machine learning, like building up those models and specifically really having more math-based skills? If you even look at the description here, and when we talk about the technical degree, they’re not really looking for a technical degree like we would think about the computer science, information system, management information system, computer information systems. They’re saying, “Hey, it’s okay, if you have a statistics background, some kind of applied math, or even an economics background.” And so, this right here, just looking at this one, so this is Amazon, right?

So, at Amazon, not saying that they’re not using AWS platform, but I’m saying, for a data scientist, and if that’s specifically the role that you’re looking at, necessarily, you’re not going to have to have that AWS Certification. You probably want to be somewhat familiar if you’re applying at Amazon. But outside of that, I wouldn’t think you’d need the certification. Even Amazon’s not asking for it here.

So, that’s my two cents on the AWS Certification in data science. If you have any questions, any follow-up questions, go ahead and put them in the comments section below. Make sure you subscribe so you never miss an episode, and I will see you next time.

17 Deep Learning Terms Every Data Scientist Must Know


Is an AWS Certification Required for Data Scientist?
