
What is a Data Lake

July 23, 2017 by Thomas Henson


Explaining the Data Lake

The enterprise space is notorious for throwing around jargon. Take the term Data Lake, for example. Does it mean there is a real lake in my data center? Because that sounds like a horrible idea. Or is a Data Lake just my Hadoop cluster?

Data Lakes, or Data Hubs, have become mainstream in the past two years because of the explosion in unstructured data. However, one person's Data Lake is another's data silo. In this video I'll put a definition and strategy around the term Data Lake. Watch to learn how to build your own Data Lake.

Transcript

Thomas Henson: Hi, I’m Thomas Henson, with thomashenson.com. And today is another episode of “Big Data, Big Questions.” So, today’s question we’re going to tackle is, “What is a data lake?” Find out more, right after this.

So, what exactly is a data lake? If you’ve ever been to a conference, or if you’ve ever heard anybody talk about big data, you’ve probably heard them use the term “data lake”. And so, if you haven’t heard them say data lake, they might have said data hub. So, what do we really mean when we talk about a data lake? Is that just what we call our Hadoop cluster? Well, yes and no, so really, what we look for when we talk about a data lake is we want an area that has all of our data that we want to analyze. And so, it can be the raw data that comes in, off of, maybe, some sensors, or it can be our transactional sales quarterly, yearly – historical reporting for our data – that we can have all in one area so that we can analyze and bring that data together.

And so, when we talk about data lake – and we really want to look for where that term data lake comes from – it comes from, actually, a blog post that was published some years ago, and… Apologize for not remembering the name, to give credit, but what they talked about is, they said that when we talk about unstructured data and data that we’re going to analyze in Hadoop, it’s really… If we look at it in the real world, it’s more like a lake. And we call it more like a lake because it’s uneven. You don’t know how much data is going to be in it, it’s not a perfect circle. If you’ve ever gone to a lake, it doesn’t have a specific shape. And in the way that data comes into it… So, you might have some underground streams, you might have some above-ground streams, you might just have some water that runs in from a rain – runoff, there, that’s all coming into it. And you compare that to what we’ve traditionally seen when we talk about our structured data that’s in our enterprise data warehouse is, if you look at that, that’s more like bottled water, right? It’s perfect, we know the exact amount that’s going to go into each bottle. It’s all packaged up, it’s been filtered, we know the source from it, and so that’s really structured. But, in the real world, data exists unstructured. And to get to that point where we can have that bottled water, we can have that structured data all put tight into one package that we can send out, you need to take it from the unstructured to the structured.

And so, this is where the concept of data lake comes in, is we talk about being able to have all this data that’s unstructured, in the form that it already exists. And so, your unstructured data is there, and it’s available for all analysts, so maybe I want to have a science experiment and test out some different models with maybe some historical data and some new data that’s coming in. And my counterpart, maybe in another division, or just from another project, can use the same data. And so, we all have access to that data, to have experiments, so that when the time comes for it to support, maybe, our enterprise dashboards or applications, we can push that data back out in a more structured form. But, until we get to that point, we really want that data to all be in one central location, so that we all have access to it. Because if you’ve ever worked in some organizations, you’ll go in, and they all have different data sets that may be the same. I mean, I’ve sat on different enterprise data boards in different corporations and different projects, just because of the fact that we all have the same data, but we may not call it all the same thing. And so, it really prohibits us being able to share the data.

And so, from a data lake perspective, don’t just think of your data lake as your Hadoop cluster, right? You want it to be multi-protocol, you want it to have different ways for data to come in and be accessed, and you don’t want it to just be another data silo, too. And so, that’s what we look at, and that’s what we mean when we talk about data lake or data hub. That’s our analytics platform, but it’s really where our company, where our corporation data exists, and it gives us the ability to share that data as well.

So, that’s our big data big question for today. Make sure you subscribe so that you never miss an episode, and also, if you have any questions that you want me to answer, go ahead and submit them. Go to thomashenson.com, big data big questions, you can submit them there, you can find me on Twitter, you can submit your questions here on YouTube – however you want, just ask those questions and I’ll do my best to get back and answer those. Thanks again.
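To make the lake-versus-bottled-water idea from the video concrete, here is a minimal curation sketch in PySpark (my own illustration, not from the video; the paths, field names, and raw/curated zone layout are example assumptions). Raw sensor data stays in the lake exactly as it arrived, and a structured copy is published for dashboards:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lake-curation").getOrCreate()

    # Raw zone: data in the form it arrived, available to every analyst.
    raw = spark.read.json("hdfs:///datalake/raw/sensors/2017/07/")

    # Light structuring: pick fields, fix types, drop bad records.
    curated = (raw.select("sensor_id", "ts", "reading")
                  .withColumn("ts", F.to_timestamp("ts"))
                  .where(F.col("reading").isNotNull()))

    # Curated zone: the "bottled water" copy that feeds dashboards.
    (curated.write.mode("overwrite")
            .partitionBy("sensor_id")
            .parquet("hdfs:///datalake/curated/sensors/"))

The point is that the raw files are never thrown away or reshaped in place; the structured form is just another copy in the same lake that everyone can reach.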


Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, Data Lake, Enterprise

Big Data Big Questions: Big Data Kappa Architecture Explained

July 9, 2017 by Thomas Henson

Learning how to develop streaming architectures can be tricky. In Big Data, the Kappa Architecture has become a powerful streaming architecture because of the growing need to analyze streaming data. For the past few years the Lambda Architecture has been king, but in the past year the Big Data community has seen a shift to the Kappa Architecture.

What is the Kappa Architecture? How can you implement the Kappa Architecture in your environment? Watch this video and find out!

Transcript

(forgive any errors; the video was transcribed by a machine)

Hi folks, Thomas Henson here with thomashenson.com, and this is another episode of Big Data Big Questions. Today we're going to tackle the Kappa Architecture, explain how we can use it in Big Data, and why it's so popular right now. Find out more right after this.

[Music]

So, in a previous episode we talked about the Lambda Architecture and how it was kind of the standard in Big Data before we had Spark, Flink, and all those processing engines that do streaming. You can find that video right here. Check it out, we're in the same shirt, pretty cool. After you watch that video, we need to talk about the Kappa Architecture, and the reason is that it's based on, and really morphed from, the Lambda Architecture.

When we talked about the Lambda Architecture, we talked about having a dualistic framework: your speed layer and your batch or MapReduce layer, which is more transactional. So you have two layers; you're still moving your data into HDFS and you're still pushing your data into a queue. With the Kappa Architecture, what we're trying to do, and where the industry is going, is to not have to support two different frameworks. Any time you're supporting two versions of code, or two different layers of code, it's just more complicated: you need more developers and you carry more risk. Look at the 80/20 rule: roughly 20% of your bugs cause 80% of your problems. So why manage two different layers?

What we're starting to see is all our data moving into one system where we can interact with it through our APIs, whether we're running a Flink job or some kind of distributed search, maybe using Solr or Elasticsearch, and collapsing all of that down into one framework. Okay, that sounds pretty simple, but it's not always implemented the way we think. One of the big tips, and one thing I want you to pay attention to, is this: with the Kappa Architecture you're saying you'll have one layer that everything interacts with, and you'll run all your jobs, whether through Spark or through Flink, against it. What you want to make sure is that you're not just putting Kafka or some kind of message queue in front, running your streaming jobs from its APIs, but still taking that data, moving it into HDFS, and running separate processing there.

What we really want to see with the Kappa Architecture is that your source data comes in and lives in some kind of queuing system (you can check out Pravega.io for some information about that architecture layer), but you don't want your applications writing directly to HDFS, because then you're just writing to two different systems again. You want something to abstract away all that storage. Whether your data is archival and sitting in HDFS or some kind of object-based storage, or it's a streaming application and you're trying to pull that data off as fast as you can, you only want to interact with that one system.

That's what Kappa really is intended to be. Remember: abstract away the storage layer behind your queuing system so you're only dealing with APIs, and pull your Spark jobs, your Flink jobs, and your distributed search through one pipeline, not two different pipelines where you're breaking up your speed layer and your batch or transactional layer. So that's the Kappa Architecture explained. Make sure you subscribe so you never miss an episode; you definitely want to keep up with what's going on in Big Data. Any questions you have, submit those to Big Data Big Questions in the comments below, send me an email, or go to the Big Data Big Questions section on my blog. Thanks again, and I'll see you next time.
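For illustration, here is a minimal Kappa-style sketch using Spark Structured Streaming reading from Kafka (my choice of engine and queue for the example; the broker address and topic are placeholders, and the job needs the spark-sql-kafka package on its classpath). The point is the shape: one pipeline against the log, with replay handled by offsets rather than a separate batch layer.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

    # One pipeline reads the log; "batch" is just replaying earlier offsets.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
              .option("subscribe", "events")                     # placeholder topic
              .option("startingOffsets", "earliest")             # replay from the start
              .load())

    # Simple running aggregation over the stream.
    counts = (events
              .select(F.col("value").cast("string").alias("body"))
              .groupBy("body")
              .count())

    # Console sink for the sketch; a real job writes to a serving store.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()

Notice there is no second code path writing to HDFS: if you need historical results, you rerun the same job from earlier offsets instead of maintaining a separate batch layer.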

Filed Under: Streaming Analytics Tagged With: Big Data, Hadoop Architecture, Streaming Analytics, Unstructured Data

Big Data Big Questions: Do I need to know Java to become a Big Data Developer?

May 31, 2017 by Thomas Henson


Today there are so many applications and frameworks in the Hadoop ecosystem, most of which are written in Java. So does this mean anyone wanting to become a Hadoop developer or Big Data Developer must learn Java? Should you go through hours and weeks of training to learn Java to become an awesome Hadoop Ninja or Big Data Developer? Will not knowing Java hinder your Big Data career? Watch this video and find out.

Transcript Of The Video

Thomas Henson:

Hi, I’m Thomas Henson with thomashenson.com. Today, we’re starting a new series called “Big Data, Big Questions.” This is a series where I’m going to answer questions, all from the community, all about big data. So, feel free to submit your questions, and at the end of this episode, I’ll show you how. So, today, the first question I have is a very common question. A lot of people ask, “Do you need to know Java in order to be a big data developer?” Find out the answer, right after this.

So, do you need to know Java in order to be a big data developer? The simple answer is no. Maybe that was the case in early Hadoop 1.0, but even then, there were a lot of tools that were being created like Pig, and Hive, and HBase, that are all using different syntax so that you can extrapolate and kind of abstract away Java. Because the key is, if you’re a data analyst or a Hadoop administrator, most of those people aren’t going to have Java skills. So, for the community to really move forward with this big data and Hadoop, we needed to be able to say that it was a tool that not only Java developers were going to be able to use. So, that’s where Pig, and Hive, and a lot of those other tools came. Now, as we start to look into Hadoop 2.0 and Hadoop 3.0, it’s really not the case.

Now, Java is not going to hinder you, right? So, it’s going to be beneficial if you do know it, but I don’t think it’s something that you would want to run out and have to learn just to be able to become a big data developer. Then, the question is, too, when you say big data developer, what are we really talking about? So, are we talking about somebody that’s writing MapReduce jobs or writing Spark jobs? That’s where we look at it as a big data developer. Or, are we talking about maybe a data scientist, where a data scientist is probably using more like R, and Python, and some of those skills, to pull their insights back? Then, of course, your Hadoop administrators, they don’t need to know Java. It’s beneficial if they know Linux and some of the other pieces, but Java’s not really necessary.

Now, I will say, in a lot of this technology… So, if you look at getting out of the Hadoop world but start looking at Spark – Spark has Java, so you can write your Spark jobs in Java, but you can also do it in Python and Scala. So, it’s not a requirement for people to have Java. I would say that there’s a lot of developers out there that are big data developers that don’t have any Java skills, and that’s quite okay. So, don’t let that hinder you. Jump in, join an open-source community project, do something to expand your big data knowledge and become a big data developer.

Well, that’s all we have today. Make sure to submit your questions. So, I’ve got a space on my blog where you can submit the questions or just submit them here, in the comments section, and I’ll answer your big data big questions. See you again!
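To underline the point about not needing Java, here is a complete Spark word count written in PySpark (a sketch of my own, not from the video; the input path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("no-java-needed").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")  # placeholder path
    counts = (lines.flatMap(lambda line: line.split())   # split lines into words
                   .map(lambda word: (word, 1))          # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))     # sum counts per word

    for word, n in counts.take(10):
        print(word, n)

    spark.stop()

Not a line of Java in sight, and the same job could just as easily be written in Scala if that's your preference.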


Filed Under: Big Data Tagged With: Big Data, Big Data Big Questions, Hadoop, Learn Hadoop

Big Data MBA Book Review

May 1, 2017 by Thomas Henson


Big Data MBA Book Review Video

Today's book review of the Big Data MBA holds a special place in my library. I had read this book before meeting Bill Schmarzo, and after sharing a steak with him I reread it. It was already an amazing book to me because, as a developer, it opened my eyes to many of the problems I've had on projects. Hadoop and Big Data projects are especially bad about missing the business objective. Many times the process for adopting a new framework goes down like this…

Manager/Developer 1:  “We have to start using Hadoop”

Questions the team should ask:

  • What is the business problem this project tackles?
  • How will this help us solve a problem we are having?
  • Will this project generate more revenue? How much more?

What the team really does:

  • Quick search on StackOverflow for Hadoop related questions and answers
  • Research on best tools for using Hadoop
  • Find a Hadoop conference to attend

Boom! Now we are doing Hadoop. Forget the fact we don’t have a business case identified yet.

One of the things stressed in the Big Data MBA is connecting a single business problem to Big Data Analytics. Just like how User Stories in Scrum describe the work developers will do, our business problem will describe the data used to solve the problem.

The Big Data MBA is a book about setting up the business objectives to tackle. Once those objectives are ferreted out and the data is identified, developers can work their magic. Understanding how to map business objectives to your technology is key for any developer or engineer. In fact, the more you understand this, the further you can go in your career. For this reason, I highly recommend this book for anyone working with Big Data.


Transcript

(forgive any errors it was transcribed by a machine)
Hi folks, and welcome back to thomashenson.com. Today's episode is all about Big Data: the book review that I've been wanting to do for quite some time, so stay tuned.

[Music]

So, today's book review is the Big Data MBA by Bill Schmarzo, a fellow Dell EMC employee and a person who worked at Yahoo during the early days of data analytics, ad buying, and the Hadoop era. This book focuses on the business objectives of Big Data, a lot of things that we as developers and technologists tend to overlook. I know I have in the past: it's all about wanting to take Hadoop and implement it, but it comes back to what the business objectives are.

One of the things that I really like about this book is that Bill says anything over six months is really just a science experiment, and that's really an agile principle. If you're in the DevOps and agile software development world, you'll understand the concept: find one or two small objectives where we can make a quick impact, anywhere from 6 to 12 weeks, and then build on those use cases. A couple of his examples use just single products. Instead of trying to increase revenue across all your products by 10 or 20%, pick one or two. I really like that approach, because you get everybody together: not just your developers, business analysts, and product owners, but people from marketing and your executives. Everybody gets in a room with a lot of whiteboards up, and you actually sit down and talk about those objectives. If we want to increase sales of one product in two months, what are we going to do? You look at what you have from a data perspective and start data mapping: this is the data we currently have; what outside data sources can we bring in? What questions would we love to be able to answer about our customer, and is there data already out there about that?

I really like this book, and I think anybody involved in big data or data analytics should read it. It's definitely high-level business objectives, but even for developers, your big data developers, I think it's really important to understand those objectives. One of the big reasons a lot of projects in software development and Big Data fail is that we don't tie them to a business objective. We have a tool or widget or framework that we want to use, and it's great, but we have a hard time bringing it to the CFO or upper-level management and explaining what objective and what benefit we're going to get out of using it. For people involved in Big Data, even on the development side, I think this will help you champion those initiatives and have more successful projects, too. So make sure you check out the Big Data MBA by Bill Schmarzo, and to keep in tune with more Big Data tips, make sure to subscribe to my YouTube channel or check out thomashenson.com. Thanks.

[Music]

Filed Under: Book Review Tagged With: Big Data, Book Review, Books

Isilon Quick Tips: Enabling FTP in OneFS

March 10, 2017 by Thomas Henson

In this episode of Isilon Quick Tips, I'll demo enabling FTP in OneFS. Isilon supports FTP, but to take advantage of it you have to enable it on your cluster. Learn how to set up FTP on your cluster in under 3 minutes with the video below.

FTP In OneFS

Transcript

(forgive any errors it was transcribed by a machine)

Hi, welcome back to another episode of Isilon Quick Tips. In today's episode I'm going to show you how to enable FTP on your Isilon cluster, so get ready to follow along. To enable FTP access, the first thing we're going to do is go to our protocols and then to our FTP settings. When that page loads, you can see that we only really have four options here, and the first option is just to enable the FTP service. It doesn't come enabled by default, but you can see that I already have it checked here, so I know the FTP service is enabled and I can move data back and forth. There are a couple of different options here in the settings. One I want to point out is enable anonymous access. Ninety-five percent of the time you're not going to want to set that up, but if you did, this is where you would do it.

After we have that set up, let's go back and look at our members and roles. I just want to show you the account I'm going to be using: my file system account settings and this admin user here. In your environment you might have Active Directory, which you can pull your FTP users from; you just have to make sure you use the full domain name. But I'm going to use this admin account.

Now we just need to pull up an FTP client. I'm going to use WinSCP, but you're able to use anything you want. Put in the host name; I'm going to use an IP address because I don't have my SmartConnect zones or my DNS server set up on my local machine, but in most cases you're going to use that SmartConnect name for the host. Once we're logged in, we'll move over our PowerPoint slides. I put them in the IFS directory, and now we'll verify: in our /ifs share here we can see that, yes, we have our slides, so our data moved over using the FTP service.

So that's how to enable FTP on your Isilon cluster. Just remember, all you have to do is enable FTP, and then those users can log in using their own credentials. In a future episode I'll cover some more options around the FTP service and some things you can customize. Thanks again for tuning in to Isilon Quick Tips, and make sure to subscribe so you never miss an episode.
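Once FTP is enabled, any standard client can move data, not just WinSCP. As a quick sketch, here is the same upload-and-verify flow with Python's built-in ftplib; the host name and credentials are placeholders for your SmartConnect name (or node IP) and your own user.

    from ftplib import FTP

    ftp = FTP("isilon.example.com")             # SmartConnect zone or node IP
    ftp.login(user="admin", passwd="password")  # AD users: DOMAIN\\username

    ftp.cwd("/ifs")                             # OneFS file system root
    with open("slides.pptx", "rb") as f:
        ftp.storbinary("STOR slides.pptx", f)   # upload the PowerPoint

    print(ftp.nlst())                           # verify the file landed
    ftp.quit()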

Filed Under: Isilon Tagged With: Big Data, Isilon

Everything You’ve Wanted to Know About HDFS Federation

March 6, 2017 by Thomas Henson

2017 might have just started, but I’ve already noticed a trend that I believe will be huge this year. Many of the people I talk with who are using Hadoop & friends are curious about HDFS Federation.

Here are a few of the questions I hear:

How can we use HDFS Federation to extend our current Hadoop environment?

Is there any way to offload some of the workloads from our NameNode to help speed it up?

Or my favorite…

We originally thought we were boxed in with our Hadoop architecture but now with HDFS Federation our cluster has more flexibility.

So what is HDFS Federation? First, we need to level-set on how the NameNode and namespace work in HDFS.

How the NameNode Works in Hadoop

The HDFS architecture is a master/slave architecture. The NameNode is the leader, with the DataNodes being the followers in HDFS. Before data is ingested or moved into HDFS, it must first be registered with the NameNode, which indexes it. The DataNodes in HDFS are responsible for storing the data blocks, but have no clue about the other DataNodes or data blocks. So if the NameNode falls off the face of the earth, you're in trouble, because what good are the data blocks without the indexing?

HDFS Federation

HDFS not only stores the data but provides the file system for users and clients to access the data inside HDFS. For example, in my Hadoop environment I have Sales and Marketing data I want to logically separate, so I would set up two different directories and populate subdirectories in each depending on the data, just like on your own workstation, where Pictures and Documents live in different directories or folders. The key is that this structure is stored as metadata, and the NameNode in HDFS retains all of it.
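As a sketch of that Sales/Marketing layout, here is how it might look using the third-party hdfs Python package talking to WebHDFS (my own illustration; the NameNode URL, user, and paths are all example values):

    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:50070", user="hdfs")  # placeholders

    # Two top-level directories to logically separate the data sets.
    client.makedirs("/data/sales/2017")
    client.makedirs("/data/marketing/campaigns")

    # The directory tree itself is metadata the NameNode retains.
    print(client.list("/data"))  # e.g. ['marketing', 'sales']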

HDFS Namespace

The NameNode is also responsible for the HDFS namespace in the Hadoop environment. The namespace is set at the file level, meaning all files are hierarchical and follow a tree structure. The namespace gives users the structure they need to traverse the file system. Imagine an organized toolbox with all the tools laid out in a structured way: once the tools are used, they are put back in the same place.

Back to our Windows example: the "C" drive is the top level, and everything else on the computer resides under it. Try to create another "Program Files" directory at that level and you will get an error stating that the name already exists. However, you can drop down one level into another directory and create a "Program Files" there, because it would be C:/Program Files/Program Files.

Windows NameSpace Example

As data accelerates into HDFS, the NameNode begins to outgrow its compute and storage. Just like a hermit crab moving into a new shell, the NameNode must keep migrating to bigger hardware (a vicious and expensive cycle). What if we could use a scale-out architecture without having to re-architect the entire Hadoop environment? This is where HDFS Federation helps big time.

Hadoop Federation to the Rescue

A little-known change in HDFS 2.x was the addition of HDFS Federation. It is often confused with NameNode high availability (HA) or secondary NameNodes. However, HDFS Federation allows a Hadoop cluster to add another NameNode and namespace. This federated NameNode has access to the DataNodes and indexes data moved to those nodes, but only the data that flows through that NameNode.

For example, say I have two NameNodes in my cluster, NN1 and NN2. NN1 will support all data in hdfs/data/… and NN2 will handle the hdfs/users directory. As data from users and applications comes into my hdfs/data namespace, NN1 will index it and place it on the DataNodes. However, if an application connects to NN1 and tries to query data in the hdfs/users directory, it will get an error saying there is no such directory; querying data in that namespace requires a connection to NN2. Think of HDFS Federation as adding a new cluster, in the form of a NameNode, while still sharing the same DataNodes for storage.
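One common way to make federated namespaces look like a single file system to clients (not covered in the post itself) is a client-side ViewFS mount table in core-site.xml. Below is a sketch of the relevant properties for the NN1/NN2 example, shown as property = value pairs rather than the full XML; the cluster name, host names, and ports are placeholders:

    fs.defaultFS = viewfs://myCluster
    fs.viewfs.mounttable.myCluster.link./data = hdfs://nn1:8020/data
    fs.viewfs.mounttable.myCluster.link./user = hdfs://nn2:8020/user

With this in place, a client asking for /data is routed to NN1 and a client asking for /user is routed to NN2, without the application having to know two NameNodes exist.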

Benefits of HDFS Federation

Here are a few of the immediate benefits I see being played out with HDFS Federation in the Hadoop world.

  • NameNode Dockerization – The ability to set up multiple NameNodes allows for a modular Hadoop architecture. As we move into a microservices world, we will see architectures that contain multiple NameNodes, and Hadoop environments will be able to tear down and spin up NameNodes on the fly.
  • Logically Separate Namespaces – For chargeback in the IT enterprise, HDFS Federation gives Hadoop administrators another tool to set up multiple environments, while keeping the cost savings of a single Hadoop environment.
  • Ease NameNode Bottlenecks – The pain of having all data indexed through a single NameNode can be eliminated by creating multiple NameNodes.
  • Options for Tiering Performance – Segmenting NameNodes and namespaces by customer requirements, instead of setting up multiple complicated performance quotas, is now an option. Simply provision the NameNode specs and assign customers to a NameNode based on their initial requirements.

One of the big reasons for HDFS Federation's uptick this year is the growing adoption of Hadoop and the sheer amount of data being analyzed. More data, more problems, and in particular those problems are at scale.

Final Thoughts on HDFS Federation

HDFS Federation is helping solve problems at scale with the NameNode. Since Hadoop's 1.x release, the NameNode has been the soft underbelly of the architecture, continuing to struggle with high availability, bottlenecks, and replication. The community is continually working on improving the NameNode, and HDFS Federation and the movement toward virtualized/Dockerized Hadoop are helping mitigate these issues. As the Hadoop community continues to innovate with projects like Kudu and others, look for HDFS Federation to play a bigger role.

Filed Under: Hadoop Tagged With: Big Data, Data Analytics, Hadoop, HDFS

Ultimate Big Data Podcast List

December 13, 2016 by Thomas Henson

My Ultimate Agile Podcast blog post was such a hit I thought it only appropriate to do one for Big Data. Who doesn't need to geek out on data in the car, plane, train, or on the treadmill? Listening to podcasts is one of the easiest ways to keep up or skill up. However, finding a curated list of podcasts on just Big Data is not easy.

The list is intended to be a resource for the Big Data/Hadoop/Data Analytics community, so I will continue to update it with new Big Data podcasts and episodes.

If you host one of the big data podcasts below, or a new podcast, and would like to interview me on your show, reach out on Twitter, in the comments, etc.


Let me know if you notice a missing podcast or broken links; just add a comment or contact me and I will make the changes.

Since I created this list, I'm putting the podcast episodes I appeared on first.

Big Data Podcast List by Category

Hadoop/Spark/MapReduce

Big Data Beard Podcast – Newly released podcast exploring the trends, technology, and talented people making Big Data a big deal. Hosts are Brett Roberts, Cory Minton, Kyle Prins, Robert Hout, Keith Quebodeaux, and myself. Join us as we talk about our Big Data journeys with others in the community.

Get Up And Code 093: All About Running With Thomas Henson – In this podcast episode my friend John Sonmez and I talk about how I ran my first 1/2 marathon and the release of my Pig Latin Getting Started course. Pig Latin was one of the first languages I learned in the Hadoop ecosystem, and I was excited to give back to the community with this course.
My Life for the Code 02: Big Data Niche, Pluralsight, Family, and more with Thomas Henson – Another podcast I appeared on, talking more about Pig Latin and where I see big data going in the next 10 years. Shawn and I also jump into pursuing your passion (spoiler: mine is data analytics) while raising a family. We even threw in a couple of my book recommendations and teased my 2nd Pluralsight course, HDFS Getting Started.

LinkedIn's Kafka, Digital Ocean gets deep about cloud and Red Sox data! – LinkedIn's Kafka processing 1 trillion messages…

All Things Hadoop – Favorite episode: Hadoop and Pig with Alan Gates from Yahoo. The title alone gives you an indication of how old it is, but it's still an awesome listen.

Puppet Podcast Provisioning Hadoop Clusters with Puppet – Learn how to use Puppet to automate your CDH environment. Mike Arnold, the creator of the Puppet module, talks about deploying CDH at large scale with Puppet. If you're virtualizing Hadoop (and you should be), you'll want to take notes on how to speed up your deployment process. My prediction is that in the next year we will see more automation tools in the Hadoop ecosystem.

Roaring Elephant Podcast – Awesome insight from two guys working out in the field in Europe. They talk through hot topics in the Hadoop ecosystem and also share some real-world stories from the customers they speak with. Great podcast if you are just starting out on your Hadoop journey.

  • Episode 49: Thomas Henson on IoT Architectures 

TechTarget Talking Data – Quick, short, digestible episodes all about data: build vs. rent, Kafka, and Spark Streaming.

Data Engineering Podcast – Podcast dedicated to those who run the data pipelines in Big Data and analytics workflows. Host Tobias does an amazing job keeping data engineers up to date with data workflows and the tools to create them.

Business of Big Data

Hot Aisle with Bill Schmarzo – One of my favorite podcast episodes (full disclosure: I work with both the hosts of the Hot Aisle and Bill Schmarzo) on the topic of the business of big data. Bill's insight into what Big Data can mean for a business is something a lot of us as developers and admins lack when talking outside the walls of IT. One of the biggest reasons Hadoop projects fail is that they aren't tied to a business objective. In this episode, learn how to tie your Hadoop project to a business objective to generate more revenue for the company, which brings in more money to expand your Hadoop cluster (win-win-win).

Cloud of Data – Wow, talk about an all-star cast of interviews; it looks like a who's who of data CEOs. The first episode was with InfoChimps' CEO; I actually worked at CSC during the InfoChimps acquisition. Those were some really bright data scientists.

Data Analytics/ Machine Learning

Data Skeptic – Usually short-format episodes on specific topics in data analytics. The podcast is great; it's about data analytics broadly, not just big data (the two are often confused). My favorite episodes are the algorithm explanations, because as someone who mostly stays on the software side, keeping up with how these algorithms are used helps when working with the data science team.

Partially Derivative – Another great podcast on data analytics. My favorite episode was done live from Stitch Fix, my wife's favorite product and mine too, but for a different reason. Stitch Fix is a monthly subscription company that matches a customer with their own personal stylist, but behind the dressing room curtain Stitch Fix is really a data company. Listen in to hear about all the experimentation that takes place on a daily basis at Stitch Fix, and how they are using machine learning to pick out clothes you'd like.

Linear Digressions – Another short, quick hit on data analytics: machine learning on genomics, how the polls got Brexit wrong, and election forecasting.

Data Crunch – Podcast devoted to highlighting how data analytics is changing the world. Released 1-2 times a month, with episodes coming in under 30 minutes.

Internet of Things (IoT)

Inquiring Minds Understanding Heart Disease with Big Data – Not a podcast dedicated to IT or Big Data, but in this episode Greg Marcus talks about analyzing the heart with IoT. Think that smartwatch is just for tracking steps and sending text messages? It could help advance the science behind heart disease by giving your doctor access. A really great episode on how IoT offers lower-cost research in healthcare and provides more data than traditional studies.

Oh, and if you are looking for quick tips on Hadoop, data analytics, or Big Data, subscribe to my YouTube channel, which is all about getting started in Big Data. Make sure to bookmark this page to check for frequent updates to the list. As Big Data gets more popular, this list is sure to grow.


Filed Under: Big Data Tagged With: Big Data, Data Analytics, Hadoop, Hadoop Pig, Podcast

Top 4 Places to Find Big Data

December 9, 2016 by Thomas Henson


Finding data for testing your own Hadoop projects doesn't have to be hard!

There are many places to find free data sets to run in your development environment. Check out this video for my top 4 places to find Big Data. Spoiler alert: you can also find small data in these places…

YouTube Video


Transcript

Hi, and welcome back to thomashenson.com. Have you ever been working in your big data environment and thought it would be great to have more data to test out a new open source tool, or just a new function you want to run? Today I'm going to talk about my four favorite places to find big data.

Number four on the list is Yahoo, specifically the Yahoo Finance section. You can go in, look up your favorite stock or even your favorite mutual fund, and find historical information. What I like to do is pull the historical data, which gives you daily values on the stock. You can take that data and insert it into HDFS or a database, however you want. This data exports to CSV and it's really accurate, but it is a limited set because you're only looking at stock values. Still, if you need a quick fix to get some data, this is where I come first.

Coming in at number three is weather data from NOAA. This data is very accurate, but one of the drawbacks, and the reason it's only number three on the list, is that you have to open an account and request the weather data for a specific geographic area. If you're looking for accurate data, this is a very good site to use, but it's not for something quick. Typically you'll receive the data in less than 24 hours, but just know it could take a lot longer, and that's why weather data is number three.

Coming in at number two, and a really close favorite to number one, is Tableau's public website and their sample data sets. This is relatively new to me, but they have a lot of different data sets in a lot of different categories, like government, lifestyle, health, and one of my favorites, sports. The formats come back in Excel or CSV, so it's really easy. Another cool thing is you don't have to log in: you can just download these data sets, upload them into HDFS, and start playing away. That's why Tableau's public data sets are number two on my list.

And number one on my list of favorite places to find data is Kaggle's website. Kaggle started off as a contest site for data scientists, and amateur data scientists, to test out and solve problems. One of the famous examples was Netflix: there was a contest to see if you could beat Netflix's data scientists at recommending better videos for people. It was really cool; I think they gave out a million dollars for the contest. But now the website is more than just a contest site. It has data sets, and it's really a one-stop shop for data scientists. You do have to log in to access the data, but there are vast numbers of data sets, small and large, and if I were stuck on an island and could only have one of these sites, it would be Kaggle, because they're always updating. It's community-driven, so there are always new data sets available. You can search, see the latest data that's been updated, and filter by different features. That's why it's number one on my list.

So, just to recap my top four favorite places: number four was Yahoo's finance section, number three was the weather data from NOAA, number two, and a close favorite, was Tableau's public website with their sample data sets, and number one, the best place, was Kaggle's data sets. Thanks for tuning in, and be sure to subscribe so you never miss an episode.
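As a quick example of the Yahoo Finance idea, here is a sketch using the third-party pandas-datareader Python package to pull daily historical prices and save a CSV ready to load into HDFS or a database (my own illustration; the ticker and date range are examples):

    import pandas_datareader.data as web

    # Daily open/high/low/close/volume for one ticker over one year.
    daily = web.DataReader("AAPL", "yahoo", "2016-01-01", "2016-12-31")

    daily.to_csv("aapl_2016.csv")  # ready for an HDFS put or FTP upload
    print(daily.head())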

Filed Under: Big Data Tagged With: Big Data, Data Analytics, Database

Schema On Read vs. Schema On Write Explained

November 14, 2016 by Thomas Henson