Mining Mailboxes with Elasticsearch and Kibana

In a previous post I mentioned that the trio of Logstash, Kibana and Elasticsearch (the ELK stack) is one of the most popular open source solutions not only for log management but also for data analysis. In this post I will demonstrate how ELK can be used to perform analysis on a large dataset effectively and efficiently, taking mailbox data as the example. Mail archives are arguably one of the most interesting kinds of social web data: they are everywhere, and each message throws light on how people communicate. As a CXO of an organization you may want to analyze corporate mail for trends and patterns.

As a reference, I will take the well-known Enron corpus: it is a huge collection of mail, and there is no risk of legal or privacy concerns. This data will first be standardized into the Unix mailbox (mbox) format and then transformed into a single JSON file.

Getting the Enron corpus data

The full Enron dataset in raw form is available for download in various formats. I will start with the original raw form of the dataset, which is essentially a set of folders organizing a collection of mailboxes by person and folder. The following snippet illustrates the basic structure of the corpus after you have downloaded and unarchived it. Go ahead and play with it a little bit to become familiar with it.


C:\> cd enron_mail_20110402\maildir                 # Go into the mail directory

C:\enron_mail_20110402\maildir> dir                 # Show folders/files in the current directory
allen-p        crandell-s     gay-r          horton-s
lokey-t        nemec-g        rogers-b       slinger-r
tycholiz-b     arnold-j       cuilla-m       geaccone-t
               …directory listing truncated…
neal-s         rodrique-r     skilling-j     townsend-j

C:\enron_mail_20110402\maildir> cd allen-p/         # Go into the allen-p folder

C:\enron_mail_20110402\maildir\allen-p> dir         # Show files in the current directory
_sent_mail     contacts       discussion_threads    notes_inbox
sent_items     all_documents  deleted_items         inbox
sent           straw

C:\enron_mail_20110402\maildir\allen-p> cd inbox/   # Go into the inbox for allen-p

C:\enron_mail_20110402\maildir\allen-p\inbox> dir   # Show the files in the inbox for allen-p
1.  11.  13.  15.  17.  19.  20.  22.  24.  26.  28.  3.  31.  33.  35.  37.  39.  40.
42.  44.  5.  62.  64.  66.  68.  7.  71.  73.  75.  79.  83.  85.  87.  10.  12.  14.
16.  18.  2.  21.  23.  25.  27.  29.  30.  32.  34.  36.  38.  4.  41.  43.  45.  6.
63.  65.  67.  69.  70.  72.  74.  78.  8.  84.  86.  9.

C:\enron_mail_20110402\maildir\allen-p\inbox> cat 1.   # Show contents of the file named "1."

Message-ID: <16159836.1075855377439.JavaMail.evans@thyme>
Date: Fri, 7 Dec 2001 10:06:42 -0800 (PST)
From: heather.dunton@enron.com
To: k..allen@enron.com
Subject: RE: West Position
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Dunton, Heather </O=ENRON/OU=NA/CN=RECIPIENTS/CN=HDUNTON>
X-To: Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>
X-cc:
X-bcc:
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\Inbox
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Please let me know if you still need Curve Shift.

Thanks,

 

Now the next step is to convert the mail data into the Unix mbox format. An mbox is in fact just a large text file of concatenated mail messages that is easily accessible by text-based tools. I used a Python script to convert the data into mbox format. Thereafter, the mbox file is transformed into an ELK-compatible JSON format. The JSON file can be found here. A snippet of the JSON file is shown below:


{"index":{"_index":"enron","_type":"inbox"}}


[{"X-cc": "", "From": "r-3-728402-1640008-2-359-us2-982d4478@xmr3.com", "X-Folder": "\jskillin\Inbox", "Content-Transfer-Encoding": "7bit", "X-bcc": "", "X-Origin": "SKILLING-J", "To": ["jeff.skilling@enron.com"], "parts": [{"content": "n[IMAGE]n[IMAGE]nJoin us June 26th for an on-line seminar featuring Steven J. Kafka, Senior Analyst at Forrester Research, as he discusses how technology can create more effective collaboration in today's virtualized enterprise. Also featuring Mike Hager, VP, OppenheimerFunds, offering insights into implementing these technologies through real-world experiences. Brian Anderson, CMO, Access360 will share techniques and provide tips on how to successfully deploy resources across the virtualized enterprise. nDon't miss this important event. Register now at http://www.access360.com/webinar/ . For a sneak preview, check out our one-minute animation that illustrates the challenges of provisioning access rights across the "virtualized" enterprise.nAbout Access360nAccess360 provides the software and services needed for deploying policy-based provisioning solutions. Our solutions help companies automate the process of provisioning employees, contractors and business partners with access rights to the applications they need. With Access360, companies can react instantaneously to changing business environments and relationships and operate with confidence, whether in a closed enterprise environment or across a virtual or extended enterprise.n nAccess360 nnIf you would prefer not to receive further messages from this sender:n1. Click on the Reply button.n2. Replace the Subject field with the word REMOVE.n3. Click the Send button.nYou will receive one additional e-mail message confirming your removal.nn", "contentType": "text/plain"}], "X-FileName": "jskillin.pst", "Mime-Version": "1.0", "X-From": "Access360 <R-3-728402-1640008-2-359-US2-982D4478@xmr3.com>@ENRON", "Date": {"$date": 991326029000}, "X-To": "Skilling, Jeff </o=ENRON/ou=NA/cn=Recipients/cn=JSKILLIN>", "Message-ID": "<14649554.1075840159275.JavaMail.evans@thyme>", "Content-Type": "text/plain; charset=us-ascii", "Subject": "Forrester Research on Best Practices for the "Virtualized" Enterprise"}

 

When you have a huge amount of data to push into Elasticsearch, it is better to do a bulk import by specifying the data file. Each mail message sits on a line of its own, paired with an action line that specifies the index (enron) and document type (inbox). There is no need to specify an id, as Elasticsearch will generate one automatically.
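To make the conversion concrete, here is a rough sketch of the kind of Python script that can walk the unarchived corpus and emit bulk-format lines like the ones above. The exact script I used differs; the path, field choices and helper name are illustrative:

# Sketch: walk the Enron maildir and emit Elasticsearch bulk-format lines
# (an action line followed by the message document) into enron.json.
import json
import os
from email import message_from_file
from email.utils import getaddresses, mktime_tz, parsedate_tz

MAILDIR = "enron_mail_20110402/maildir"  # assumes the unarchived corpus root

def to_document(msg):
    """Flatten an email.message.Message into a dict suitable for indexing."""
    doc = {
        "From": msg.get("From", ""),
        "To": [addr for _, addr in getaddresses(msg.get_all("To", []))],
        "CC": [addr for _, addr in getaddresses(msg.get_all("Cc", []))],
        "BCC": [addr for _, addr in getaddresses(msg.get_all("Bcc", []))],
        "Subject": msg.get("Subject", ""),
        # Multipart handling is omitted in this sketch.
        "parts": [{"content": msg.get_payload() if not msg.is_multipart() else "",
                   "contentType": msg.get_content_type()}],
    }
    date = parsedate_tz(msg.get("Date", ""))
    if date:
        doc["Date"] = {"$date": int(mktime_tz(date)) * 1000}  # ms past the epoch
    return doc

with open("enron.json", "w") as out:
    for root, _, files in os.walk(MAILDIR):
        for name in files:
            with open(os.path.join(root, name), errors="ignore") as f:
                msg = message_from_file(f)
            out.write(json.dumps({"index": {"_index": "enron", "_type": "inbox"}}) + "\n")
            out.write(json.dumps(to_document(msg)) + "\n")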

Data in Elasticsearch can be broadly divided into two types: exact values and full text. Exact values are exactly what they sound like. Examples are a date or a user ID, but they can also include exact strings such as a username or an email address. For example, the exact value Foo is not the same as the exact value foo, and the exact value 2014 is not the same as the exact value 2014-09-15. Full text, on the other hand, refers to textual data, usually written in some human language, like the text of a tweet or the body of an email. For the purpose of this exercise, it is better to treat the email address fields (To, CC, BCC) as exact values. Hence, we first need to specify the mapping, which can be done in the following manner.

curl -XPUT "localhost:9200/enron" -d '{
"settings":
{
    "number_of_shards": 5,
    "number_of_replicas": 1
},
"mappings":
{
    "inbox":
    {
        "_all":
        {
            "enabled": false
        },
        "properties":
        {
            "To":
            {
                "type": "string",
                "index": "not_analyzed"
            },
            "From":
            {
                "type": "string",
                "index": "not_analyzed"
            },
            "CC":
            {
                "type": "string",
                "index": "not_analyzed"
            },
            "BCC":
            {
                "type": "string",
                "index": "not_analyzed"
            }
        }
    }
}
}"

 

You can verify that the mapping has indeed been set.

curl -XGET "http://localhost:9200/_mapping?pretty"
{
    "enron" :
    {
        "mappings" :
        {
            "inbox" :
            {
                "_all" :
                {
                    "enabled" : false
                },
                "properties" :
                {
                    "BCC" :
                    {
                        "type" : "string",
                        "index" : "not_analyzed"
                    },
                    "CC" :
                    {
                        "type" : "string",
                        "index" : "not_analyzed"
                    },
                    "From" :
                    {
                        "type" : "string",
                        "index" : "not_analyzed"
                    },
                    "To" :
                    {
                        "type" : "string",
                        "index" : "not_analyzed"
                    }
                }
            }
        }
    }
}

 

Now let's load all the mailbox data using the JSON file, in the following manner:


curl -XPOST "http://localhost:9200/_bulk" --data-binary @enron.json

 

We can check if all the data has been uploaded successfully.

curl "localhost:9200/enron/inbox/_count?pretty"
{
    "count" : 41299,
    "_shards" :
    {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    }
}

 

You can see that 41,299 records, each corresponding to a different message, have been uploaded. Now let's start the fun part by doing some analysis on this data. Kibana provides awesome analytics capabilities and associated charts. Let's try to see how many messages circulated on a weekly basis.
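In Kibana this is just a date histogram panel; under the hood it runs a date_histogram aggregation. If you want the raw numbers outside Kibana, a query along these lines should work (a sketch using the Python requests library; depending on how the Date field ended up mapped, the field name may need to be Date.$date):

# Weekly message counts via a date_histogram aggregation (rough sketch).
import requests

query = {
    "size": 0,  # only the aggregation buckets are needed, not the hits
    "aggs": {
        "messages_per_week": {
            "date_histogram": {"field": "Date", "interval": "week"}
        }
    },
}
resp = requests.post("http://localhost:9200/enron/inbox/_search", json=query).json()
for bucket in resp["aggregations"]["messages_per_week"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])  # key is milliseconds past the epoch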

[Figure: histogram of messages per week]

The above histogram shows the message spread on a weekly basis. The date value is in terms of milliseconds past the epoch. You can see that one particular week has a peak of 3546 messages; something interesting must have been happening that week. Now let's see who the top recipients of messages are.

[Figure: top recipients of messages]

You can see that Gerald, Sara, Kenneth are some of the top recipients of messages. How about checking out the top senders?

[Figure: top senders of messages]

You can see that Pete, Jae and Ken are the top senders of messages. In case you are wondering what exactly Enron employees used to discuss, let’s check out top keywords from message subjects.

[Figure: top keywords in message subjects]

It seems the most interesting discussions centered on enron, gas, energy and power. A lot more interesting analysis can be done with the Enron mail data. I would recommend you try the following (a query sketch to get you started follows the list):

  • Counting sent/received messages for particular email addresses
  • What was the maximum number of recipients on a message?
  • Which two people exchanged the most messages amongst one another?
  • How many messages were person-to-person messages?
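As a rough starting point for the first exercise (and for the sender/recipient charts above), the underlying queries are straightforward. This is a sketch using the Python requests library; the email address is only an example:

# Sketch: top senders via a terms aggregation, and a per-address message count.
import requests

ES = "http://localhost:9200/enron/inbox"

# Top senders: a terms aggregation over the not_analyzed From field.
top_senders = {
    "size": 0,
    "aggs": {"top_from": {"terms": {"field": "From", "size": 10}}},
}
resp = requests.post(ES + "/_search", json=top_senders).json()
for bucket in resp["aggregations"]["top_from"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])

# Messages received by one particular address (example address).
received = {"query": {"term": {"To": "jeff.skilling@enron.com"}}}
print(requests.post(ES + "/_count", json=received).json()["count"])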

 

 

Microservices Orchestration

In my earlier post, I introduced the concept of microservices. As a quick recap, the microservices model promises ease of development and maintenance, flexibility for developers/teams to work on different things, building blocks for a scalable system and a true polyglot model of development. However, this model is not without challenges, and the biggest is the complexity of a distributed system. Since we now have to deal with multiple services spread across multiple hosts, it becomes difficult to keep track of all the hosts and services. To scale further, the number of service instances increases, which in turn increases the number of hosts.

Container technology can help in reducing the complexity of microservice orchestration. The idea of operating system containers is not new and has been present in one form or another for decades. Operating systems in general provide access to various resources in some standard way and share them amongst the running processes. Container technology takes this idea further by not only sharing but also segregating access to networks, filesystems, CPU and memory. Each container can have its own restricted amount of CPU, memory and network, independent of the other containers.

Let's see how container technology helps with microservice deployment.

  • The isolation between containers running on the same host makes it easy to deploy microservices developed using different languages and tools.
  • The portability of containers makes deployment of microservices seamless. Whenever a new version of a service is to be rolled out, the running container can be stopped and a new container started using the latest version of the service code. All the other containers keep running unaffected.
  • Resource utilization is efficient, as containers comprise just the application and its dependencies.
  • Resource requirements can be strictly controlled without using virtualization.

There are two popular container technologies.

Docker – Google was amongst the biggest contributors to the underlying container technology in Linux, which evolved into LXC. Docker was initially developed as an LXC client and allows the easy creation of containers. The container images are essentially filesystems with a configuration file that gives a startup profile for the process that runs inside the container. The biggest USP of Docker is its ease of use: a single command can download an existing image and boot it into a running container. This has helped its popularity and widespread use. The latest versions of Docker have no dependency on LXC and, through the libcontainer project, have a full stack of container technology all the way down to the kernel. Docker is all set to shake up the virtualization market by consuming it from the bottom.

There are some open source additions on top of Docker – Kubernetes can deploy and manage multiple Docker containers of the same type, and TOSCA can help with more complex orchestration involving multiple tiers.

Amazon ECS – The Amazon EC2 Container Service (Amazon ECS) defines a pool of compute resources called a “cluster.” Each cluster consists of one or more Amazon EC2 instances. Amazon ECS manages the state of all container-based applications running in the cluster, provides telemetry and logging, and manages capacity utilization of the cluster, allowing for efficient scheduling of work. A group of containers can be defined using a construct called a “task definition.” Each container in the task definition specifies the resources required by that container, and Amazon ECS will schedule that task for execution based on the available resources in the cluster. Amazon ECS simplifies “service discovery” by integrating with third-party tools like ZooKeeper.

I would say use of containers to deploy microservices is an evolutionary step driven mostly by the need to make better use of compute resources and to maintain increasingly complex applications. The use of microservices architectural philosophy with containers addresses these needs. Containers offer the right abstraction to encapsulate Microservices.

Microservices in a Nutshell

At the outset, “Microservices” sounds like yet another term on the long list of software architecture styles. Many experts consider it a lightweight or fine-grained SOA. Though it is not an entirely new idea, microservices have peaked in popularity in recent years, with numerous blogs, articles and conferences evangelizing the benefits of building software systems in this style. The coming together of trends like Cloud, DevOps and Continuous Delivery has contributed to this popularity. Netflix, in particular, has been the poster boy of the microservices architectural model.

Nevertheless, microservices deserve their place as a separate architectural style which, while leveraging earlier styles, has a certain novelty. The fundamental idea behind microservices is to develop systems as a set of fine-grained, independent and collaborating services which run in their own process space and communicate with lightweight mechanisms like HTTP. The stress is on services being small, and they can evolve over time.

In order to better understand the concept of microservices, let's start by comparing them with traditional monolithic systems. Let's consider an E-commerce store which sells products online. Customers can browse the catalog for the products of their choice and place orders. The store verifies the inventory, takes the payment and then ships the order. The figure below demonstrates a sample E-commerce store.

[Figure: a monolithic E-commerce application]

The online store consists of many components, like the UI which helps customers with the interaction. Besides the UI, there are other components for organizing the product catalog, order processing, taking care of any deals or discounts and managing the customer's account. The business domain shares resources like Product, Order and Deals, which are persisted in a relational database. The customers may access the store using a desktop browser or a mobile device.

Despite being a modularly designed system, the application is still monolithic. In the Java world the application would be deployed as a WAR file on a Web Server like Tomcat. The monolithic architecture has its own benefits:

  • Simple to develop as most of the IDEs and development tools are tailored for developing monolithic applications.
  • Ease of testing as only one application has to be launched.
  • Deployment friendly as in most cases a single file or directory has to be deployed on a server.

As long as the monolithic application stays simple, everything seems normal from the outside. However, when the application starts getting complex in terms of scale or functionality, the handicaps become apparent. Some of the undesired shortcomings of a monolithic application are as follows:

  • A large monolithic system becomes unmanageable when it comes to bug fixes or new features.
  • Even a single line of code change leads to a significant cycle time for deploying on production environment. The entire application has to be rebuilt and redeployed even if one module gets changed.
  • The entry barrier is high for trial and adoption of new technologies. It is difficult for different modules to use different technologies.
  • It is not possible to scale different modules differently. This can be a huge headache, as monolithic architectures don't scale to support large, long-lived applications.

This is where the book “The Art of Scalability” comes in as a big help. It describes other architectural styles that do scale, on the basis of a three-dimensional scale cube.

[Figure: the three-dimensional scale cube]

The X-axis scaling is the easiest way of scaling an application. It requires running multiple identical copies of the application behind a load balancer. In many cases it can provide good improvement in capacity and availability of an application without any refactoring.

In the case of Z-axis scaling, each server runs an identical copy of the application. It sounds similar to X-axis scaling, but the crucial difference is that each server is responsible for only a subset of the data. Requests are routed to the appropriate server based on some intelligent mechanism. A common way is to route on the basis of the primary key of the entity being accessed, i.e. sharding. This technique is similar to X-axis scaling in that it improves the application's capacity and availability. However, neither of these approaches does a good job of easing development and application complexity. This is where Y-axis scaling helps.

Y-axis scaling, or functional decomposition, is the third dimension of scaling. We have seen that Z-axis scaling splits things that are similar. Y-axis scaling, on the other hand, splits things that are different. It divides a monolithic application into a set of services. Each service corresponds to a related piece of functionality like the catalog, order management, etc. How do you split a system into services? It is not as easy as it appears to be. One way could be to split by verb or use case. For example, the UI for the compare-products use case could be a separate service.

Another approach could be to split on the basis of nouns or system resources, where a service takes ownership of all operations on an entity/resource. For example, there could be a separate catalog service which manages the catalog of products.
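To make the noun-based split a bit more tangible, here is a minimal sketch of what such a catalog service could look like as an independent HTTP process. Flask is assumed purely for illustration, and the endpoints and data are made up, not taken from the example store above:

# Minimal sketch of a noun-based "catalog" microservice (assumes Flask is installed).
from flask import Flask, jsonify

app = Flask(__name__)

# In a real service this would live in the service's own datastore.
PRODUCTS = {
    1: {"id": 1, "name": "Widget", "price": 9.99},
    2: {"id": 2, "name": "Gadget", "price": 24.99},
}

@app.route("/products")
def list_products():
    """The catalog service owns all read operations on the Product resource."""
    return jsonify(list(PRODUCTS.values()))

@app.route("/products/<int:product_id>")
def get_product(product_id):
    product = PRODUCTS.get(product_id)
    if product is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(product)

if __name__ == "__main__":
    # Runs in its own process; other services talk to it over HTTP.
    app.run(port=5001)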

This leads to another interesting question: how big should a service really be? One school of thought recommends that each service should have only a small set of responsibilities. As per Uncle Bob (Robert C. Martin), services should be designed using the Single Responsibility Principle (SRP). The SRP suggests that a class should have only one reason to change, and it makes great sense to apply it to service design as well.

Another way of addressing this is to design services just like Unix utilities. Unix provides a large number of utilities such as grep, cat, ls and find. Each utility does exactly one thing, often exceptionally well, and can be combined with other utilities using a shell script to perform complex tasks. While designing services, try to model them on Unix utilities so that they perform a single function.

Some people argue that the number of lines of code (10 – 100 LOC) could be a benchmark for service size. I am not sure it is a good yardstick, as some languages are more verbose than others. For example, Java is quite verbose compared to Scala. An interesting take is that a microservice should be sized so that an entire microservice can be developed from scratch by a scrum team in one sprint.

In all these interesting and diverse viewpoints, it should be remembered that the goal is to create a scalable system without any problems associated with monolithic architecture. It is perfectly fine to have some services which are very tiny while others are substantially larger.

On applying the Y-axis decomposition to the example monolithic application, we would get the following microservices based system:

[Figure: the store decomposed into microservices]

As you can see, after splitting the monolithic architecture, there are different frontend services addressing specific UI needs and interacting with multiple backend services. One such example is the Catalog UI service, which communicates with both the Catalog service and the Product service. The backend services encompass the same logical modules which we saw earlier in the monolithic system.

Benefits of Microservices

The microservices based architecture has numerous benefits and some of them can be found below:

  • The small size of a microservice leads to ease of development and maintenance.
  • Multiple developers and teams can deliver relatively independently of each other.
  • Each microservice can be deployed independently of other services. If there is a change in a particular service then only that service need be deployed. Other services are unaffected. This in turn leads to easy transition to continuous deployment.
  • This architectural pattern leads to truly scalable systems. Each service can be scaled independently of other services.
  • The microservice architecture helps in troubleshooting. For example, a memory leak in one service only affects that service. Other services will continue to run normally.
  • Microservice architecture leads to a true polyglot model of development. Different services can deploy different technology stacks as per their needs. This gives a lot of flexibility to experiment with various tools and technologies. The small nature of these services makes the task of rewriting the service possible without much effort.

Drawbacks of Microservices

No technology is a silver bullet and microservice architecture is no exception. Let’s look at some challenges with this model.

  • There is added complexity of dealing with a distributed system. Each service would be a separate process and inter-process communication mechanism is required for message exchange amongst services. There are lots of remote procedure calls, REST APIs or messaging to glue components together across different processes and servers. We have to consider various challenges like network latency, fault tolerance, message serialization, unreliable networks, asynchronicity, versioning, varying loads within our application tiers etc.
  • Microservices are more likely to communicate with each other in an asynchronous manner. This helps decompose work into genuinely separate, independent tasks which can happen out of order at different times. Things start getting complex with the need for managing correlation IDs and distributed transactions to tie various actions together.
  • Different test strategies are required, as asynchronous communication and dynamic message loads make it much harder to test systems. Services have to be tested in isolation, and their interactions have to be tested as well.
  • Since there are many moving parts, the complexity is shifted towards orchestration. To manage this a PaaS-like technology such as Pivotal Cloud Foundry is required.
  • Careful planning is required for features spanning multiple services.
  • Substantial DevOps skills are required as different services may use different tools. The developers need to be involved in operations also.

Summary

I am sure with all the aspects mentioned above you must be wondering whether you should start splitting your legacy monolithic architectures into microservices or in fact start any fresh system with this architectural style. The answer as in most cases is – “It Depends!” There are some organizations like Netflix, Amazon, The Guardian, etc. who have pioneered this architectural style and demonstrated that it can give tremendous benefits.

Success with microservices depends a lot on how well you componentize your system. If the components do not compose cleanly, then all you are doing is shifting complexity from inside a component to the connections between components. Not only that, it also moves complexity to a place that’s less explicit and harder to control.

The success of using microservices depends on the complexity of the system you're contemplating. This approach introduces its own set of complexities like continuous deployment, monitoring, dealing with failure, eventual consistency, and other factors that a distributed system introduces. These complexities can be handled, but it's extra effort. So, I would say that if you can manage to keep your system simple, then you may not need microservices. If your system runs the risk of becoming a complex one, then you can certainly think about microservices.

Log Management in the Cloud Age

In traditional systems, logs are lines of text intended for offline human consumption. With the advent of Cloud and Big Data, there is a paradigm shift in what can be logged. Systems can now log any piece of structured or unstructured data, application logs, transactions, audit logs, alarms, statistics or even tweets. Add to this the scale of logs. The earlier methodology of human analysis would not work in this kind of scenario. There has to be some automated mechanism for log analysis and deciphering useful information from them.

The trio of Logstash, Kibana and Elasticsearch is one of the most popular open source solutions for logs management. The three products together are known as the ELK stack and provide an elegant solution for log management.

Elasticsearch is a distributed, flexible, powerful and RESTful search and analytics engine based on Apache Lucene. It gives you the ability to move beyond simple full-text search. It categorizes data using indices, which can be easily divided into shards (equivalent to partitions in an RDBMS), and each shard can have zero or more replicas. This helps in providing near real-time search. Elasticsearch provides a robust set of APIs and a query DSL, in addition to clients for most of the popular programming languages.

Elasticsearch was built from the ground up to handle any kind of data. It can slice and aggregate data on the fly, based on any field in the logs, creating valuable insights from raw logs.
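As a small illustration of the query DSL, a request like the following matches error events and slices them per host. The index and field names (logs, level, host) are assumptions made for the sake of the example, and the sketch uses the Python requests library:

# Sketch: match ERROR events and aggregate them per host with the query DSL.
import requests

query = {
    "query": {"match": {"level": "ERROR"}},              # full-text match on a field
    "aggs": {"by_host": {"terms": {"field": "host"}}},   # slice matching docs by host
    "size": 0,
}
resp = requests.post("http://localhost:9200/logs/_search", json=query).json()
for bucket in resp["aggregations"]["by_host"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])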

Kibana is a data visualization engine used along with Elasticsearch. It helps in natively interacting with all the data in Elasticsearch via custom dashboards. You can make dynamic, shareable and exportable dashboards. Data analysis becomes a breeze with Kibana's elegant user interface, using pre-designed or custom dashboards in real time for on-the-fly data analysis. Kibana is easy to set up and integrates seamlessly with different log aggregators like Logstash, Apache Flume, etc. See below for a sample Kibana dashboard:

[Figure: a sample Kibana dashboard]

Logstash is one of the most popular open source logs and events shipper/processor. It takes as input logs, processes and other time-based events from any system and stores the data in a single place for additional processing. It scrubs logs and parses all data sources into an easy-to-read JSON format. This means that your logging data can now be analyzed in real time. You can then use Kibana to explore and monitor the analytics. The Logstash – Elasticsearch – Kibana pipeline is illustrated below:

[Figure: the Logstash – Elasticsearch – Kibana pipeline]
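A full Logstash configuration is beyond the scope of this post, but conceptually the pipeline is simple: read events, turn each one into a JSON document, and index it into Elasticsearch so Kibana can visualize it. A bare-bones Python stand-in (not Logstash itself; the file path, index and field names are illustrative) would look something like this:

# Bare-bones stand-in for a file -> Elasticsearch shipper (illustrative only).
import time
import requests

LOGFILE = "/var/log/app.log"                  # illustrative log file
ES_URL = "http://localhost:9200/logs/event"   # index "logs", document type "event"

with open(LOGFILE) as f:
    for line in f:
        doc = {
            "@timestamp": int(time.time() * 1000),  # ms past the epoch
            "message": line.rstrip("\n"),
            "source": LOGFILE,
        }
        # Each event becomes a searchable document, ready for Kibana dashboards.
        requests.post(ES_URL, json=doc)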

The ELK stack is a very powerful tool for monitoring and analytics of cloud-scale logs.