Complete Guide to Splunk Add-Ons

May 24, 2017 by Thomas Henson

Splunk is a popular application for analyzing machine data in the data center. What happens when Splunk Administrators want to add new data sources to their Splunk environment outside the default list?

Administrators have two options:

  • First, they can import the data source using the regular expression option. Only fun if you like regular expressions.
  • Second, they can use a Splunk Add-On or Application.

Let’s learn how Splunk Add-Ons are developed and how to install them.

Splunk Add-Ons

How to Create Splunk Plugins

Developers have a couple of options for creating Splunk Applications or Add-Ons. Let’s step through the options for creating Splunk Add-Ons, going from easiest to hardest.

The first option for creating a Splunk Add-On is the dashboard editor inside the Splunk app. Using the dashboard editor you can create custom visualizations of your Splunk data. Simply click to add custom searches, tables, and fields, then save the dashboard and test out the Splunk Application.

The second option developers have is to use XML or HTML markup inside the Splunk dashboard. Using either markup language gives developers more flexibility in the look and feel of their dashboards. Most developers with basic HTML, CSS, and XML skills will choose this option over the standard dashboard editor.

The last option inside the local Splunk environment is SplunkJS. Of all the options for creating applications in the local Splunk environment, SplunkJS allows the greatest control. Developers with intermediate JavaScript skills will find SplunkJS fairly easy to use, while those without JavaScript skills will have a more difficult time.

Finally, for developers who want the most control and flexibility for their Splunk Add-Ons, Splunk offers application SDKs. These applications leverage the Splunk API and allow developers to write the application in their favorite language. By far the SDK route is the most difficult, but it also creates the ultimate Splunk Application (a rough sketch of what the SDKs wrap appears after the list below).

Splunk Application SDK options:

  • JavaScript
  • C#
  • Python
  • Java
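
All four SDKs wrap Splunk’s REST API on the management port (8089 by default), so you can see the kind of call an SDK makes with plain curl. This is only a sketch; the hostname, credentials, and search string below are placeholders, not from the original post:

    # Create a search job over the REST API that the language SDKs wrap:
    curl -k -u admin:changeme https://localhost:8089/services/search/jobs \
         -d search="search index=_internal | head 5"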

What is Splunkbase

After developers create their applications, they can upload them to Splunkbase. Splunkbase is the de facto marketplace for Splunk Add-Ons and Applications. It’s a community-driven marketplace for both licensed (paid) and free Splunk Add-Ons and Applications, and Splunk-certified applications have been vetted for security and maturity.

Think of Splunkbase as Apple’s App Store. Users download applications that run on top of iOS to extend the functionality of the iPhone, and both community and corporate developers build those iOS Apps. Just like the iOS App Store, Splunkbase offers both paid and free Applications and Add-Ons.

How to install from Splunkbase

The local Splunk environment integrates with Splunkbase, which makes Splunkbase installs seamless. Let’s walk through a scenario of installing Splunk Analytics for Hadoop in my local Splunk environment.

Steps for Installing App from Splunkbase:

  1. First, log in to the local Splunk environment
  2. Second, click Splunk Apps
  3. Next, browse for “Splunk Analytics for Hadoop”
  4. Click Install and enter your Splunk.com login information
  5. Finally, open the App to begin using it

Another option is to install directly into the local Splunk environment: download the application package and upload it to your Splunk instance. Make sure to practice good Splunk hygiene by only downloading trusted Splunk Apps.
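
The Splunk CLI can also install an app from a downloaded package. As a minimal sketch (the file path and credentials here are placeholders):

    # Install a downloaded app package into the local Splunk instance:
    ./splunk install app /tmp/splunk-analytics-for-hadoop.tgz -auth admin:changeme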

Closing thoughts on Splunk Apps & Add-Ons

In addition to extending Splunk, Add-Ons increase the Splunk environment’s use cases. As users begin using Splunk, they inevitably want to add new data sources. Often the new data sources are supported by default, and when they aren’t, Splunk’s community of App developers fills that gap. Much of Splunk’s hockey-stick adoption comes from this ability to add new data sources, as new insights constantly push new data sources into Splunk.

Looking to learn more about Splunk? Check out my Pluralsight course Analyzing Machine Data with Splunk.

 

Filed Under: Splunk Tagged With: Data Analytics, IT Operations, Splunk

Everything You’ve Wanted to Know About HDFS Federation

March 6, 2017 by Thomas Henson

2017 might have just started, but I’ve already noticed a trend that I believe will be huge this year. Many of the people I talk with who are using Hadoop & friends are curious about HDFS Federation.

Here are a few of the questions I hear:

How can we use HDFS Federation to extend our current Hadoop environment?

Is there any way to offload some of the workload from our NameNode to help speed it up?

Or my favorite…

We originally thought we were boxed in with our Hadoop architecture but now with HDFS Federation our cluster has more flexibility.

So what is HDFS Federation? First, we need to level-set on how the NameNode and namespace work in HDFS.

How the NameNode Works in Hadoop

HDFS uses a master/slave architecture: the NameNode is the leader and the DataNodes are the followers. Before data is ingested or moved into HDFS, it must first pass through the NameNode to be indexed. The DataNodes in HDFS are responsible for storing the data blocks, but they have no knowledge of the other DataNodes or data blocks. So if the NameNode falls off the face of the earth, you’re in trouble, because what good are data blocks without the indexing?
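
You can actually see the index the NameNode maintains: fsck prints the block-to-DataNode mapping for a given path. A small sketch (the file path is just an example):

    # Show which blocks make up a file and which DataNodes hold them:
    hdfs fsck /data/sales.csv -files -blocks -locations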

The HDFS File System

HDFS not only stores the data, but also provides the file system users and clients use to access that data. For example, in my Hadoop environment I have Sales and Marketing data I want to logically separate. So I would set up two different directories and populate subdirectories in each depending on the data, just like on your own workstation, where Pictures and Documents live in different directories or file folders. The key is that this structure is stored as metadata, and the NameNode in HDFS retains all of it.
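
A quick sketch of that Sales/Marketing layout from the HDFS command line (the directory names are just illustrations):

    # Create logically separate directory trees for each department:
    hdfs dfs -mkdir -p /sales/2017 /marketing/2017
    # List the top-level structure the NameNode is tracking:
    hdfs dfs -ls /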

HDFS Namespace

The NameNode is also responsible for the HDFS namespace in the Hadoop environment. The namespace is set at the file level, meaning all files are hierarchical and follow a tree structure, and it gives users the structure they need to traverse the file system. Imagine an organized toolbox with all the tools laid out in a structured way: once the tools are used, they are put back in the same place.

Take a Windows example: the “C” drive is the top level and everything else on the computer resides under it. Try to create another “Program Files” directory at the top level and you will get an error stating that the file name already exists. However, if you drop down one level into another folder, you can create a “Program Files” directory there, because its full path would be C:/Program Files/Program Files.

 

[Image: Windows namespace example]

As data accelerates into HDFS, the NameNode begins to outgrow its compute and storage. Just like a hermit crab outgrowing its shell, the NameNode has to keep moving to bigger hardware (a vicious and expensive cycle). What if we could use a scale-out architecture without having to re-architect the entire Hadoop environment? This is where HDFS Federation helps big time.

Hadoop Federation to the Rescue

A little-known change in HDFS 2.x was the addition of HDFS Federation. It is often confused with high availability (HA) for Hadoop clusters or with secondary NameNodes, but HDFS Federation actually allows Hadoop clusters to add additional NameNodes and namespaces. A federated NameNode has access to the DataNodes and indexes data moved to those nodes, but only the data that flows through that NameNode.

For example, say I have two NameNodes in my cluster, NN1 and NN2. NN1 will support all data in hdfs/data/… and NN2 will handle the hdfs/users directory. As data from users and applications comes into my hdfs/data namespace, NN1 will index it and move it to the DataNodes. However, if an application connects to NN1 and tries to query data in the hdfs/users directory, it will get an error saying there is no known directory; to query data in that namespace, the application must connect to NN2. Think of HDFS Federation as adding a new cluster, in the form of a NameNode, while still using the same DataNodes for storage.
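
In practice, each NameNode serves only its own namespace, so clients address the NameNodes explicitly. A small sketch (the hostnames and port below are placeholders for NN1 and NN2):

    # Served by NN1, which owns the /data namespace:
    hdfs dfs -ls hdfs://nn1.example.com:8020/data

    # Served by NN2, which owns the /users namespace:
    hdfs dfs -ls hdfs://nn2.example.com:8020/users

    # Asking NN1 for the other namespace fails with "No such file or directory":
    hdfs dfs -ls hdfs://nn1.example.com:8020/users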

Benefits of HDFS Federation

Here are a few of the immediate benefits I see being played out with HDFS Federation in the Hadoop world.

  • NameNode Dockerization – The ability to set up multiple NameNodes allows for a modular Hadoop architecture. As we move into a microservices world, we will see architectures that contain multiple NameNodes, and Hadoop environments will have the ability to tear down and spin up NameNodes on the fly.
  • Logically Separate Namespaces – For IT enterprises doing chargeback, HDFS Federation gives Hadoop administrators another tool for setting up multiple environments. These environments still have the cost savings of a single Hadoop environment.
  • Ease NameNode Bottlenecks – The pain of having all data indexed through a single NameNode can be eliminated by creating multiple NameNodes.
  • Options for Tiering Performance – Segmenting different NameNodes and namespaces by customer requirements, instead of setting up multiple complicated performance quotas, is now an option. Simply provision the NameNode specs and move customers to a NameNode based on their initial requirements.

One of the big reasons for HDFS Federation’s uptick this year is the growing adoption of Hadoop and the sheer amount of data being analyzed. More data, more problems, and those problems are particularly problems of scale.

Final Thoughts on HDFS Federation

HDFS Federation is helping solve problems at scale with the NameNode. Since Hadoop’s 1.x code release, the NameNode has been the soft underbelly of the architecture, continually struggling with high availability, bottlenecks, and replication. The community is always working on improving the NameNode, and HDFS Federation along with the movement toward virtualized/Dockerized Hadoop is helping mitigate these issues. As the Hadoop community continues to innovate with projects like Kudu and others, look for HDFS Federation to play a bigger role.

Filed Under: Hadoop Tagged With: Big Data, Data Analytics, Hadoop, HDFS

How to Use a Splunk Universal Forwarder

January 23, 2017 by Thomas Henson

Imagine you’re a Systems Administrator responsible for keeping your company’s custom-developed application up and running. It is a critical application responsible for all the ordering and payments for your company, and it is the sole interface for your customers to buy your products.

Today that application went down for 4 hours. During that time your company lost 10 million dollars in sales. You have been called into the CIO’s office for a debriefing.

You walk into her office, and she quickly asks you, “How could this have been prevented?”

By using Splunk, and specifically Splunk with Universal Forwarders, to proactively monitor those critical applications.


How are you going to stay out in front of issues that may happen?

What about preventive fixes and DevOps?

Splunk is the answer to keeping you from having system crashes and pulling all-nighters. Analyzing applications with Splunk allows developers and administrators to test scenarios before going to production. How did Nissan test their website before their first ever Super Bowl commercial? Nissan used Splunk to thoroughly test the site before it went live.

What is Splunk?

Splunk is huge in the data center when it comes to analyzing log files and IT security. Splunk is an application that allows machine data to be stored, indexed, and visualized quickly. In the past, log files were parsed and stored by writing custom scripts with regular expressions to make the files human readable. Splunk simplifies all that by providing default parsers for many common and uncommon log files and letting users start visualizing their data within the Splunk application.
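
That workflow is simple enough to sketch from the Splunk CLI; the log path and credentials below are placeholders:

    # Index a log file once, then search it without writing any regex scripts:
    ./splunk add oneshot /var/log/syslog -auth admin:changeme
    ./splunk search 'source="/var/log/syslog" | head 5' -auth admin:changeme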

Since Splunk is so easy to set up, its popularity has gone through enormous growth. Recently I attended a Big Data conference where they said 70% of companies are using Splunk in some fashion, and Gartner placed Splunk in the Leaders quadrant of its Magic Quadrant for 2016.

What is a Splunk Universal Forwarder

So how can you analyze application server log files while running the application in production?

Splunk has forwarders for sending data between different instances of Splunk. Using a forwarder allows you to move log files from one machine to another without having to write custom batch scripts or clog up bandwidth. Let’s talk about how the Splunk forwarder is used in the data center.

FedEx as a Splunk Forwarder?

FedEx is amazing at moving packages. Recently my cousin graduated from college and I wanted to send her the book The Obstacle Is the Way (seriously, check out the book). Think of the book as the data, my cousin and myself as machines, and FedEx as the forwarder. I was able to package up the book (data) and send it off to my cousin. The package was wrapped (encrypted) and the correct address (URL) was placed on it. The Splunk Universal Forwarder is like FedEx: it delivers machine data to other instances of Splunk.

When installing forwarders, Splunk has two options to choose from depending on the use case.

What is a Splunk Light Forwarder

The first type of Splunk forwarder is the light, or universal, forwarder. Think of it as a lightweight or minimal installation. The light forwarder has minimal features, and its main objective is to move data from one machine to another: no analysis or indexing. It’s even limited in the data it will parse, because its goal is simply to move data to a Splunk Indexer. Another thing missing from the light forwarder is the Splunk Web interface, so it’s managed strictly from the command line. Since the goal of the light forwarder is low impact, not having a web interface isn’t a deal breaker. Why add features that aren’t needed if we are going to analyze the data elsewhere?

What is a Splunk Heavy Forwarder

The second type of forwarder is called a heavy forwarder. Think of it as a full-blown instance of Splunk; it’s similar to what we have running in the local development environment. The only difference is what we choose to disable. Remember, depending on the scenario we want the option of having the lowest impact on the CPU of the host machine, so the heavy forwarder allows us to disable features we aren’t going to use. Management of the heavy forwarder can be done through the Splunk Web interface, which we have been using, or from the command line as with the universal forwarder.

All the Splunk forwarders have built-in enterprise features like encryption and compression. Encryption protects data in flight and prevents unwanted reads of log files from packet capture. The compression ratio will vary with the amount of duplicated data and whitespace in the log file, so if you’re looking to calculate the compression, just know it’s going to depend. Both encryption and compression are opt-in features and are not enabled by default.
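
As a sketch of how the compression opt-in might look on a forwarder (the group name, indexer address, and file path are placeholders, and setting names can vary by Splunk version):

    # Append a tcpout group with compression enabled to the forwarder's outputs.conf:
    cat >> /opt/splunkforwarder/etc/system/local/outputs.conf <<'EOF'
    [tcpout:primary_indexers]
    server = indexer.example.com:9997
    compressed = true
    EOF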

Learn more about Splunk in my Analyzing Machine Data with Splunk course

Where are Universal Forwarders used?

Anywhere you don’t want to install a full-blown instance of Splunk, or remote offices where you want to use Splunk for data analysis but also forward the data on to another instance of Splunk. Think of multiple smaller Splunk hubs that forward data to a larger Splunk instance for a system-wide view.

Use Cases

  • Application Servers
  • Database Servers
  • Networking Infrastructure
  • Web Servers
  • Internet of Things
  • Continuous Integration and Testing
  • Detecting Insider Threats
  • Securing Networks

How to Install Splunk Universal Forwarder

Let’s look at how to set up a Splunk Universal Forwarder. Just like a full-blown Splunk instance, you have to pick the flavor of OS for the host machine. After getting the correct Splunk version you will run the default install, unless you are using the light version (which I recommend), in which case it will all be done from the command line.

For example, below are the steps for installing the light forwarder on an Ubuntu server (a consolidated command sketch follows the list):

  1. Download the specific version of Splunk for Windows, Mac, or a Linux distribution – Download directly or use wget to download from the command line.
  2. Install on the Ubuntu machine – Move the downloaded package to the Ubuntu /tmp directory. Once the .deb package is in /tmp, run dpkg -i splunkforwarder-version-xxx.deb. The command kicks off the installation of the Splunk Universal Forwarder.
  3. Start up the Splunk Forwarder – After running the dpkg command, move to the Splunk directory with cd /opt/splunkforwarder/bin. Next, start up the forwarder with ./splunk start.
  4. Set up forwarding on the Ubuntu machine – The last configuration change is to ensure log files will be forwarded through port 9997. Port 9997 is the default, but it won’t hurt to run the following command: ./splunk add forward-server hostmachineIP:9997 (where hostmachineIP is the receiving Splunk instance).
  5. Configure receiving on the Splunk instance – Finally, now that the install is complete on the host machine, configure Splunk to receive the log files from the Ubuntu server. On the Splunk instance, enable receiving from the UI under Settings –> Forwarding and receiving, and ensure that Splunk is listening on the default port of 9997.
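
A minimal consolidated sketch of steps 1–4 on the Ubuntu host; the download URL, indexer address, and monitored path below are placeholders:

    # Download and install the universal forwarder package:
    cd /tmp
    wget -O splunkforwarder.deb 'https://download.splunk.com/...'  # exact URL elided
    sudo dpkg -i splunkforwarder.deb

    # Start the forwarder and point it at the receiving indexer:
    cd /opt/splunkforwarder/bin
    sudo ./splunk start --accept-license
    sudo ./splunk add forward-server indexer.example.com:9997

    # Tell the forwarder which logs to ship:
    sudo ./splunk add monitor /var/log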

Final Thoughts on Splunk Universal Forwarder

Splunk forwarding is the secret sauce of Splunking. It allows data to be streamed in real time to the main Splunk instance with little performance impact on the host machine. Installation of Splunk Universal Forwarders is a little tricky at first, but once you get one installed the next ones are simple.


Filed Under: Splunk Tagged With: Data Analytics, Splunk, Unstructured Data

Top 9 SPL Commands in Splunk For Splunk Ninjas

December 19, 2016 by Thomas Henson