Build Data Insights and Business Metrics with ELK Stack

With the world around us getting more and more connected, we are seeing the advent of many different types of computing devices: a heavy-duty server, a laptop, a desktop, a mobile phone or even your refrigerator. One thread that connects all these devices is that they log system information. These logs are nothing but a stream of messages in time sequence. Systems can now log any piece of structured or unstructured data: application logs, transactions, audit logs, alarms, statistics or even tweets. Add to this the sheer scale of logs. The earlier methodology of human analysis would not work in this kind of scenario; there has to be some automated mechanism for analyzing logs and deciphering useful information from them.

The trio of Logstash, Kibana and Elasticsearch is one of the most popular open source solutions for log management. Together the three products are known as the ELK stack and provide an elegant solution for log management. At the heart of the ELK stack is Elasticsearch, a distributed, open source search and analytics engine. It is based on Apache Lucene and is designed for horizontal scalability, reliability and easy management. Logstash is a data collection, enrichment and transportation pipeline. The ELK stack is completed by Kibana, a data visualization platform enabling interaction with the data through stunning, powerful graphics.

To start your discovery of the ELK stack, check out my book, Applied ELK Stack: Data Insights and Business Metrics with Collective Capability of ElasticSearch, Logstash and Kibana. With this book you will discover:

  • The need for log analytics, and the current challenges
  • How to perform real-time data analytics on streaming data, and turn it into actionable insights
  • How to create indices and delete data
  • The different components of the ELK stack (Elasticsearch, Logstash and Kibana)
  • Shipping, filtering and parsing events with Logstash
  • How to build amazing visualizations and dashboards using data discovery, visualization and dashboarding with Kibana

I hope this book helps you with log management as well as with deriving business insights. Do let me know your valuable feedback on the book.

Mining Mailboxes with Elasticsearch and Kibana

In a previous post I mentioned that the trio of Logstash, Kibana and Elasticsearch (the ELK stack) is one of the most popular open source solutions not only for log management but also for data analysis. In this post I will demonstrate how ELK can be used to effectively and efficiently perform big data analysis. As a reference, let's take a huge mailbox dataset. Mail archives are arguably one of the most interesting kinds of social web data: they are omnipresent, and each message throws light on the communication people are having. As a CXO of an organization, you may want to analyze corporate mail for trends and patterns.

As a reference, I will take the well-known Enron corpus, as it is a huge collection of mails and there is no risk of any legal or privacy concerns. This data will be standardized into the Unix mailbox (mbox) format, and from the mbox format it will then be transformed into a single JSON file.

Getting the Enron corpus data

The full Enron dataset is available for download in raw form in various formats. I will start with the original raw form of the dataset, which is essentially a set of folders organizing a collection of mailboxes by person and folder. The following snippet illustrates the basic structure of the corpus after you have downloaded and unarchived it. Go ahead and play with it a little bit so that you become familiar with it.


C:\> cd enron_mail_20110402\maildir # Go into the mail directory

C:\enron_mail_20110402\maildir> dir # Show folders/files in the current directory
allen-p        crandell-s     gay-r          horton-s
lokey-t        nemec-g        rogers-b       slinger-r
tycholiz-b     arnold-j       cuilla-m       geaccone-t
               ...directory listing truncated...
neal-s         rodrique-r     skilling-j     townsend-j

C:\enron_mail_20110402\maildir> cd allen-p # Go into the allen-p folder

C:\enron_mail_20110402\maildir\allen-p> dir # Show files in the current directory
_sent_mail     contacts       discussion_threads  notes_inbox
sent_items     all_documents  deleted_items       inbox
sent           straw

C:\enron_mail_20110402\maildir\allen-p> cd inbox # Go into the inbox for allen-p

C:\enron_mail_20110402\maildir\allen-p\inbox> dir # Show the files in the inbox for allen-p
1. 11. 13. 15. 17. 19. 20. 22. 24. 26. 28. 3. 31. 33. 35. 37. 39. 40.
2. 44. 5. 62. 64. 66. 68. 7. 71. 73. 75. 79. 83. 85. 87. 10. 12. 14.
3. 18. 2. 21. 23. 25. 27. 29. 30. 32. 34. 36. 38. 4. 41. 43. 45. 6.
63. 65. 67. 69. 70. 72. 74. 78. 8. 84. 86. 9.

C:\enron_mail_20110402\maildir\allen-p\inbox> cat 1. # Show contents of the file named "1."

Message-ID: <16159836.1075855377439.JavaMail.evans@thyme>
Date: Fri, 7 Dec 2001 10:06:42 -0800 (PST)
From: heather.dunton@enron.com
To: k..allen@enron.com
Subject: RE: West Position
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Dunton, Heather </O=ENRON/OU=NA/CN=RECIPIENTS/CN=HDUNTON>
X-To: Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>
X-cc:
X-bcc:
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\Inbox
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Please let me know if you still need Curve Shift.

Thanks,

 

The next step is to convert the mail data into the Unix mbox format. An mbox is in fact just one large text file of concatenated mail messages that is easily accessible by text-based tools. I used a Python script to convert the maildir tree into mbox format. Thereafter, the mbox file is converted into an ELK-compatible JSON format. The JSON file can be found here, and a snippet of it is shown below.
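As a rough illustration, here is a minimal sketch of such a conversion using Python's standard mailbox and json modules. Treat it as an assumption-laden stand-in for the actual script: the field names follow the snippet further below, multipart messages are not handled, and the real file stores dates as epoch milliseconds rather than raw header strings.

# mbox_to_bulk_json.py: a minimal sketch, not the exact script used for this post
import json
import mailbox

ACTION = json.dumps({"index": {"_index": "enron", "_type": "inbox"}})

with open("enron.json", "w") as out:
    for msg in mailbox.mbox("enron.mbox"):
        doc = {
            "Message-ID": msg["Message-ID"],
            "From": msg["From"],
            "To": (msg["To"] or "").replace("\n", " ").split(", "),
            "Subject": msg["Subject"],
            "Date": msg["Date"],  # simplified; the real file uses epoch milliseconds
            # Multipart messages would need extra handling here
            "parts": [{"content": msg.get_payload(),
                       "contentType": msg.get_content_type()}],
        }
        out.write(ACTION + "\n")           # bulk action line
        out.write(json.dumps(doc) + "\n")  # document on its own line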


{"index":{"_index":"enron","_type":"inbox"}}


[{"X-cc": "", "From": "r-3-728402-1640008-2-359-us2-982d4478@xmr3.com", "X-Folder": "\jskillin\Inbox", "Content-Transfer-Encoding": "7bit", "X-bcc": "", "X-Origin": "SKILLING-J", "To": ["jeff.skilling@enron.com"], "parts": [{"content": "n[IMAGE]n[IMAGE]nJoin us June 26th for an on-line seminar featuring Steven J. Kafka, Senior Analyst at Forrester Research, as he discusses how technology can create more effective collaboration in today's virtualized enterprise. Also featuring Mike Hager, VP, OppenheimerFunds, offering insights into implementing these technologies through real-world experiences. Brian Anderson, CMO, Access360 will share techniques and provide tips on how to successfully deploy resources across the virtualized enterprise. nDon't miss this important event. Register now at http://www.access360.com/webinar/ . For a sneak preview, check out our one-minute animation that illustrates the challenges of provisioning access rights across the "virtualized" enterprise.nAbout Access360nAccess360 provides the software and services needed for deploying policy-based provisioning solutions. Our solutions help companies automate the process of provisioning employees, contractors and business partners with access rights to the applications they need. With Access360, companies can react instantaneously to changing business environments and relationships and operate with confidence, whether in a closed enterprise environment or across a virtual or extended enterprise.n nAccess360 nnIf you would prefer not to receive further messages from this sender:n1. Click on the Reply button.n2. Replace the Subject field with the word REMOVE.n3. Click the Send button.nYou will receive one additional e-mail message confirming your removal.nn", "contentType": "text/plain"}], "X-FileName": "jskillin.pst", "Mime-Version": "1.0", "X-From": "Access360 <R-3-728402-1640008-2-359-US2-982D4478@xmr3.com>@ENRON", "Date": {"$date": 991326029000}, "X-To": "Skilling, Jeff </o=ENRON/ou=NA/cn=Recipients/cn=JSKILLIN>", "Message-ID": "<14649554.1075840159275.JavaMail.evans@thyme>", "Content-Type": "text/plain; charset=us-ascii", "Subject": "Forrester Research on Best Practices for the "Virtualized" Enterprise"}

 

When you have a huge amount of data to push into Elasticsearch, it is better to do a bulk import by specifying the data file. Each mail message sits on a line of its own, preceded by an action line specifying the index (enron) and document type (inbox). There is no need to specify an id, as Elasticsearch will generate one automatically.
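Concretely, the bulk file alternates an action line with the document itself on the following line. A minimal, purely hypothetical two-message example (addresses invented for illustration):

{"index":{"_index":"enron","_type":"inbox"}}
{"From":"alice@example.com","To":["bob@example.com"],"Subject":"hello"}
{"index":{"_index":"enron","_type":"inbox"}}
{"From":"bob@example.com","To":["alice@example.com"],"Subject":"RE: hello"}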

Data in Elasticsearch can be broadly divided into two types: exact values and full text. Exact values are exactly what they sound like. Examples are a date or a user ID, but they can also include exact strings such as a username or an email address. The exact value Foo is not the same as the exact value foo, and the exact value 2014 is not the same as the exact value 2014-09-15. Full text, on the other hand, refers to textual data, usually written in some human language, like the text of a tweet or the body of an email. For the purpose of this exercise, it is better to treat email addresses (From, To, CC, BCC) as exact values. Hence, we first need to specify the mapping, which can be done in the following manner.

curl -XPUT "http://localhost:9200/enron" -d '{
"settings":
{
    "number_of_shards": 5,
    "number_of_replicas": 1
},
"mappings":
{
    "inbox":
    {
        "_all":
        {
            "enabled": false
        },
        "properties":
        {
            "To":
            {
                "type": "string",
                "index": "not_analyzed"
            },
            "From":
            {
                "type": "string",
                "index": "not_analyzed"
            },
            "CC":
            {
                "type": "string",
                "index": "not_analyzed"
            },
            "BCC":
            {
                "type": "string",
                "index": "not_analyzed"
            }
        }
    }
}
}'

 

You can verify that the mapping has indeed been set.

curl -XGET "http://localhost:9200/_mapping?pretty"
{
    "enron" :
    {
        "mappings" :
        {
            "inbox" :
            {
                "_all" :
                {
                    "enabled" : false
                },
                "properties" :
                {
                    "BCC" :
                    {
                        "type" : "string",
                        "index" : "not_analyzed"
                    },
                    "CC" :
                    {
                        "type" : "string",
                        "index" : "not_analyzed"
                    },
                    "From" :
                    {
                        "type" : "string",
                        "index" : "not_analyzed"
                    },
                    "To" :
                    {
                        "type" : "string",
                        "index" : "not_analyzed"
                    }
                }
            }
        }
    }
}

 

Now let's load all the mailbox data using the JSON file, in the following manner:


curl -XPOST "http://localhost:9200/_bulk" --data-binary @enron.json

 

We can check if all the data has been uploaded successfully.

curl "localhost:9200/enron/inbox/_count?pretty"
{
    "count" : 41299,
    "_shards" :
    {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    }
}

 

You can see that 41,299 records, each corresponding to a different message, have been uploaded. Now let's start the fun part by doing some analysis on this data. Kibana provides awesome analytic capabilities and associated charts. Let's try to see how many messages circulated on a weekly basis.
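The chart below was built in Kibana, but the same weekly buckets can be pulled straight from Elasticsearch with a date histogram aggregation. A minimal sketch, assuming the Date field is mapped as a date type:

curl -XPOST "http://localhost:9200/enron/inbox/_search?pretty" -d '{
    "size": 0,
    "aggs":
    {
        "messages_per_week":
        {
            "date_histogram":
            {
                "field": "Date",
                "interval": "week"
            }
        }
    }
}'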

[Image: enron-date, histogram of weekly message counts]

The above histogram shows the message spread on a weekly basis. The date values are in milliseconds since the epoch. You can see that one particular week has a peak of 3,546 messages; something interesting must have been happening that week. Now let's see who the top recipients of messages are.
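In Kibana this is a simple terms panel on the To field; the equivalent raw aggregation, again as a sketch, would be:

curl -XPOST "http://localhost:9200/enron/inbox/_search?pretty" -d '{
    "size": 0,
    "aggs":
    {
        "top_recipients":
        {
            "terms":
            {
                "field": "To",
                "size": 10
            }
        }
    }
}'

Because To, From, CC and BCC were mapped as not_analyzed, the buckets are whole email addresses rather than word tokens. Pointing the same aggregation at the From field reproduces the senders chart shown further below.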

[Image: enron-to, top message recipients]

You can see that Gerald, Sara and Kenneth are some of the top recipients of messages. How about checking out the top senders?

[Image: enron-from, top message senders]

You can see that Pete, Jae and Ken are the top senders of messages. In case you are wondering what exactly Enron employees used to discuss, let's check out the top keywords from message subjects.

[Image: enron-subject, top keywords from message subjects]

It seems the most interesting discussions centered on enron, gas, energy and power. A lot more interesting analysis can be done with the Enron mail data. I would recommend you try the following:

  • Counting sent/received messages for particular email addresses (a sketch for this one appears after the list)
  • What was the maximum number of recipients on a message?
  • Which two people exchanged the most messages amongst one another?
  • How many messages were person-to-person messages?
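For the first of these, a filtered count is enough. For instance, counting messages sent from a particular address (the address below is just an illustrative placeholder) could look like:

curl -XPOST "http://localhost:9200/enron/inbox/_count?pretty" -d '{
    "query":
    {
        "term": { "From": "kenneth.lay@enron.com" }
    }
}'

Since From is not_analyzed, the term query matches the whole address exactly; swapping the field to To counts received messages instead.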

 

 

Log Management in the Cloud Age

In traditional systems, logs are lines of text intended for offline human consumption. With the advent of Cloud and Big Data, there is a paradigm shift in what can be logged. Systems can now log any piece of structured or unstructured data, application logs, transactions, audit logs, alarms, statistics or even tweets. Add to this the scale of logs. The earlier methodology of human analysis would not work in this kind of scenario. There has to be some automated mechanism for log analysis and deciphering useful information from them.

The trio of Logstash, Kibana and Elasticsearch is one of the most popular open source solutions for logs management. The three products together are known as the ELK stack and provide an elegant solution for log management.

Elasticsearch is a distributed, flexible, powerful RESTful search and analytics engine based on Apache Lucene. It gives you the ability to move beyond simple full-text search. It organizes data using indices, which can easily be divided into shards (roughly equivalent to partitions in an RDBMS), and each shard can have zero or more replicas. This helps in providing near real-time search. Elasticsearch provides a robust set of APIs and a query DSL, in addition to clients for most of the popular programming languages.
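For instance, once an index exists, you can inspect how its shards and replicas are laid out across the cluster with the _cat API (available from Elasticsearch 1.0 onwards); using the enron index from the earlier post:

curl "localhost:9200/_cat/shards/enron?v"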

Elasticsearch was built from the ground up to handle any kind of data. It can slice and aggregate data on the fly, based on any field in the logs, creating valuable insights from raw logs.

Kibana is a data visualization engine used along with Elasticsearch. It lets you natively interact with all the data in Elasticsearch via custom dashboards, which are dynamic, shareable and exportable. Data analysis becomes a breeze with Kibana's elegant user interface, using pre-designed or custom dashboards for on-the-fly, real-time analysis. Kibana is easy to set up and integrates seamlessly with different log aggregators like Logstash, Apache Flume, etc. See below for a sample Kibana dashboard:

[Image: sample Kibana dashboard]

Logstash is one of the most popular open source log and event shippers/processors. It takes logs, processes and other time-based events from any system as input, and stores the data in a single place for additional processing. It scrubs logs and parses all data sources into an easy-to-read JSON format. This means that your logging data can now be analyzed in real time, and you can then use Kibana to explore and monitor the analytics. The file-to-Logstash-to-Elasticsearch-to-Kibana flow is illustrated below:

[Image: file to Logstash to Elasticsearch to Kibana pipeline]
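As a concrete illustration, a minimal Logstash configuration for this pipeline, tailing an application log, parsing each line with grok and shipping the result to Elasticsearch, might look like the following sketch. The file path is a placeholder and option names vary slightly between Logstash versions:

input {
  file {
    path => "/var/log/myapp/*.log"    # hypothetical application log location
  }
}

filter {
  grok {
    # COMBINEDAPACHELOG is a stock pattern; replace it with one matching your logs
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}

output {
  elasticsearch {
    host => "localhost"               # older Logstash option; newer releases use hosts => [...]
  }
}

The grok filter is what turns each raw line into structured JSON fields, which in turn makes the on-the-fly slicing in Elasticsearch and Kibana possible.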

The ELK stack is a very powerful tool for the monitoring and analytics of cloud-scale logs.