The Brain of an IoT System: Analytics Engines and Databases
This is the fourth article in our series on building an IoT platform using open source components. In the earlier articles, we looked at the basic components of an IoT system and discussed a messaging-based architecture that serves as the nervous system for exchanging information between the various IoT components. Now let us look at what could be considered the brain of an IoT system: the analytics engine, which decides what to make of the information, and the database(s) that store it.
There are many ways of building this ‘nervous system and brain’, and the right approach depends on the use case. The choice usually involves trade-offs between speed, size, cost, features, support, and so on. In this article, we develop a configuration that is suitable for both real-time and non-real-time scenarios.
Choice of Analytics Engines and Databases
We looked at the messaging architecture in the previous article. Now let us expand the scope to add the details of the analytics engine, storage, and connectivity beyond the MQTT message broker.
IoT Platform: MQTT and Apache Storm
The data received from the MQTT broker may need further processing, such as adding a timestamp (if not already present), identifying missing readings, and filtering. Tools like Apache Pig can be used for this. Another good alternative is Apache Storm, which provides preprocessing on a limited scale but is also a very good analytics system. A single instance of Apache Storm can handle both of these tasks; for a large system, deploying separate instances for preprocessing and analytics can be considered. Apache Storm also supports MQTT (by means of an MQTT spout), which makes integration easy.
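The three preprocessing steps named above can be sketched in a few lines of plain Python. This is only an illustration, not Pig or Storm code: the record layout (`sensor_id`, `value`, `ts`), the function name `preprocess`, the valid range, and the rule that a gap longer than 1.5 times the expected interval counts as a missing reading are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def preprocess(
    readings: Iterable[dict],
    expected_interval_s: float = 10.0,
    valid_range: Tuple[float, float] = (-40.0, 85.0),
    clock: Callable[[], float] = time.time,
) -> Tuple[List[dict], List[dict]]:
    """Return (clean_readings, gaps) for a stream of sensor readings.

    All field names and thresholds here are hypothetical.
    """
    clean: List[dict] = []
    gaps: List[dict] = []
    last_seen: Dict[str, float] = {}  # sensor_id -> timestamp of last valid reading
    low, high = valid_range

    for reading in readings:
        record = dict(reading)  # copy so the caller's data is not mutated

        # 1. Add a timestamp if the device did not supply one.
        if record.get("ts") is None:
            record["ts"] = clock()

        # 2. Filter: drop values outside the plausible range.
        if not (low <= record["value"] <= high):
            continue

        # 3. Identify missing readings: a gap since the last valid reading
        #    that is much longer than the expected reporting interval.
        sensor = record["sensor_id"]
        previous = last_seen.get(sensor)
        if previous is not None and record["ts"] - previous > 1.5 * expected_interval_s:
            gaps.append({"sensor_id": sensor, "from_ts": previous, "to_ts": record["ts"]})
        last_seen[sensor] = record["ts"]

        clean.append(record)

    return clean, gaps
```

Because a filtered reading does not update `last_seen`, a run of implausible values shows up as a gap, which is usually what a downstream report wants to know. In a real deployment the same logic would live inside a preprocessing step of the chosen tool rather than in a standalone function.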
As mentioned elsewhere, the data in an IoT system should be stored at various points, both for historical archival and to avoid unnecessary repeat computations. Instead of writing data directly to a database through an MQTT client, an alternative is to write it through Storm. Storm provides a mechanism called a bolt, which can be used to interface with various databases. For storing incoming raw data and the preprocessed data, Cassandra or CouchDB (both from Apache) are good alternatives. The same databases could be used to store the results of the analytics and the reports. It is also a good idea to provide the ability to write to HDFS via a bolt, so that Hadoop can be used later for offline batch processing of really large data sets. This combination of databases increases complexity but gives more flexibility in real-life use cases.
The bolts in Apache Storm are the processing units and can be chained together to perform stepwise, complex calculations. Bolts can be stateless (to monitor a single event) or can maintain state for calculating rolling metrics using sliding windows, event correlations, etc. Apache Storm provides real-time analytics capabilities and hence is very suitable for IoT systems. It has a distributed architecture and manages the distribution of messages (data) itself, without requiring an external component to do so.
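The difference between a stateless and a stateful processing step, and the idea of chaining them, can be illustrated with a small plain-Python sketch. It deliberately does not use the Storm API; the class names `ThresholdBolt` and `RollingMeanBolt` and the helper `run_chain` are hypothetical and only mimic the concept.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


class ThresholdBolt:
    """Stateless step: each event is judged on its own."""

    def __init__(self, limit: float) -> None:
        self.limit = limit

    def process(self, value: float) -> bool:
        return value > self.limit


class RollingMeanBolt:
    """Stateful step: keeps the last `size` values (a sliding window)."""

    def __init__(self, size: int) -> None:
        # deque with maxlen discards the oldest value automatically
        self.window: Deque[float] = deque(maxlen=size)

    def process(self, value: float) -> float:
        self.window.append(value)
        return sum(self.window) / len(self.window)


def run_chain(values: Iterable[float], window_size: int, limit: float) -> List[Tuple[float, bool]]:
    """Chain the two steps: rolling mean first, then a threshold check on the mean."""
    rolling = RollingMeanBolt(window_size)
    alarm = ThresholdBolt(limit)
    results = []
    for value in values:
        mean = rolling.process(value)
        results.append((mean, alarm.process(mean)))
    return results
```

For the readings 10, 20, 30, 40 with a window of two and a limit of 22, the rolling means are 10, 15, 25, and 35, and only the last two exceed the limit. Checking the threshold on the smoothed value rather than on the raw reading is one way to keep a single noisy sample from raising an alert.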
Another powerful alternative is Apache Spark, which primarily supports batch processing and provides only near-real-time analytics capabilities. A comparison of the two is beyond the scope of this article.
The visualization tools in the above diagram could be external custom tools or open-source tools like JasperReports. Similarly, the notification and alerting mechanisms could be third-party email clients and SMS services.
The architecture proposed above is suitable for smaller systems, but it does not scale very well for large systems. In addition, MQTT does not provide any buffering mechanism. Both scalability and buffering are necessary when a large amount of data is coming in from a multitude of sources. An intermediate messaging system like RabbitMQ or Apache Kafka can fill this gap. Placing such an intermediate broker between the MQTT broker and the analytics system helps improve overall system performance and makes scaling easier.
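A rough way to see why buffering matters is to track how many messages are waiting when data arrives in bursts but the analytics side drains at a steady rate. The following sketch is a toy model, not a description of how RabbitMQ or Kafka work internally; the function name `simulate_backlog` and the per-tick numbers are assumptions chosen for illustration.

```python
from typing import Iterable, List


def simulate_backlog(arrivals_per_tick: Iterable[int], drain_per_tick: int) -> List[int]:
    """Return the number of messages left waiting after each tick.

    Each tick, new messages arrive and the consumer removes at most
    `drain_per_tick` of the waiting messages.
    """
    backlog = 0
    history: List[int] = []
    for arrived in arrivals_per_tick:
        backlog += arrived                       # burst lands in the buffer
        backlog -= min(backlog, drain_per_tick)  # consumer drains what it can
        history.append(backlog)
    return history
```

With arrivals of 5, 50, 5, 0, 0 messages per tick and a consumer that handles 20 per tick, the backlog after each tick is 0, 30, 15, 0, 0. Without something to hold those 30 messages during the burst, they would have to be dropped or the sender slowed down, which is the gap the intermediate broker fills.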
IoT Platform: MQTT, Apache Kafka, and Storm
Apache Kafka already has MQTT support, which makes integration straightforward. Kafka delivers the data received from an MQTT broker to different Kafka consumers. For example, one Kafka consumer could be used to send raw data to a database, and another could be used to send data to Storm for analytics.
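The pattern of several consumers reading the same stream independently can be modelled with an append-only log and one read position per consumer. The sketch below is a simplified in-memory stand-in, not the Kafka client API; the class `TopicLog`, its methods, and the consumer names `db-writer` and `storm-feed` are hypothetical.

```python
from typing import Dict, List


class TopicLog:
    """Append-only record log with an independent read offset per consumer."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}  # consumer name -> next index to read

    def publish(self, record: str) -> None:
        self._records.append(record)

    def poll(self, consumer: str, max_records: int = 10) -> List[str]:
        """Return the next unread records for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


# Example: one consumer archives raw data, another feeds analytics.
topic = TopicLog()
for reading in ["r1", "r2", "r3"]:
    topic.publish(reading)

to_database = topic.poll("db-writer")                    # reads all three records
to_analytics = topic.poll("storm-feed", max_records=2)   # reads the first two only
```

Because each consumer keeps its own offset, a slow database writer does not hold back the analytics path, and either one can be restarted and resume from where it left off.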
In this series of articles on building an IoT platform using open source components, we looked at the common logical building blocks in an end-to-end pipeline, how they communicate with each other, and how they work together to provide much greater value. We also explored one of the many possible ways to create such a platform from specific open-source components, with standards and modularity in mind. This modular architecture provides interoperability, scalability, performance, and fast time to market because it can be used as a fundamental building block for a variety of IoT use cases. The beauty of this open-source approach is that the integration of the individual components yields something much greater than the sum of its parts. After all, that is what IoT is all about!
Image Credits: zliving.com
Dr. Siddhartha Chatterjee is Chief Technology Officer at Persistent Systems. Umesh Puranik is a Principal Architect at Persistent Systems. We thank Sachin Kurlekar for his insightful comments and Ken Montgomery for his editorial assistance.