Realtime Temperature Monitoring System using Raspberry Pi

Realtime temperature sensing is a common requirement. There are plenty of digital thermometers and temperature monitoring devices available on online shopping sites, but most of them only monitor and display the realtime values; they do not have any intelligence. The one we are going to build is a smart temperature monitoring system. It can be used for monitoring atmospheric temperature as well as liquid temperature.

The following blog post explains the setup of a digital temperature monitoring system.

Digital Temperature Monitoring System

We will enhance the above system by adding analytical capability so that we can analyse and show the temperature trends. The block diagram below shows the high-level architecture of the system.

[Figure: temperature_monitoring]

As shown in the above diagram, the system has three blocks.

  • Edge Device & Sensor (Raspberry Pi & Sensor)
  • Data Storage Server
  • Dashboards for the end user

The following are the software components required for this project:

  • MQTT for sending the data from the edge to the server.
  • PostgreSQL for storing the data in the server.
  • Python-based backend
  • HTML Web UI

I am not going to explain the working of MQTT in this blog post. It was already explained in one of my earlier posts.

Before we start implementing the solution, let's summarize the storyline.

  • The requirement is to perform realtime temperature monitoring and analyse the trends & patterns using the historic data.
  • A temperature sensor is attached to a Raspberry Pi, which acts as the edge device.
  • There should be provision to support multiple edge devices.
  • The temperature should be monitorable from anywhere.

Bird’s eye view of the system

Here we have considered multiple edge devices and also made provision for web and mobile applications.

[Figure: temperature_monitoring_full]

Data Model Design

In the PostgreSQL database, we need two base tables for storing the data. With this data model we will be able to store data from multiple edge devices located at different locations. This is a very basic data model, and we can enhance it based on our requirements. A sketch of the corresponding DDL is shown after the list below.

  • device_info – This holds the metadata of the edge devices, including the location details of each device. The column names are given below.
    • device_id, device_name, location
  • temperature_data – We store the temperature data from each of the edge devices in this table. The column names are given below.
    • device_id, timestamp, value
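
Here is a minimal sketch of creating these two tables from Python with psycopg2. The column types and the connection details (host, database, user, password) are my assumptions, so adjust them to your setup.

import psycopg2

# Assumed connection details -- replace with your own server credentials.
conn = psycopg2.connect(host="localhost", dbname="iot", user="iot_user", password="secret")
cur = conn.cursor()

# device_info: metadata of each edge device, including its location.
cur.execute("""
    CREATE TABLE IF NOT EXISTS device_info (
        device_id   TEXT PRIMARY KEY,
        device_name TEXT,
        location    TEXT
    )
""")

# temperature_data: one row per reading; timestamp is epoch seconds, value is degrees Celsius.
cur.execute("""
    CREATE TABLE IF NOT EXISTS temperature_data (
        device_id TEXT REFERENCES device_info(device_id),
        timestamp BIGINT,
        value     DOUBLE PRECISION
    )
""")

conn.commit()
cur.close()
conn.close()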

Now let us start developing the application from the edge device. We will modify the program to send the messages to an MQTT topic along with the timestamp. The temperature readings will be sent to the server once every minute. The message format will be as follows; we will use the epoch timestamp in seconds and the temperature in degrees Celsius.

{"device_id":"xxx", "timestamp":1584284353, "value": 27.01}

Now let's develop a small Python program that sends these values to the MQTT topic. For this, we need an MQTT broker to be up and accessible from the Raspberry Pi.

Here my central server is a CentOS 7 server and I will be using the Mosquitto MQTT broker. The installation steps are explained in detail in this blog post.
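
With the broker in place, here is a minimal sketch of the edge-side publisher, assuming the paho-mqtt client library. read_temperature() is a placeholder for whatever sensor-reading code your setup uses, and the broker address and topic name are my assumptions.

import json
import time
import paho.mqtt.client as mqtt

BROKER_HOST = "192.168.1.10"      # assumed address of the MQTT broker
TOPIC = "sensors/temperature"     # assumed topic name
DEVICE_ID = "device_01"

def read_temperature():
    # Placeholder: return the current reading from your sensor in degrees Celsius.
    return 27.01

client = mqtt.Client()
client.connect(BROKER_HOST, 1883)
client.loop_start()

while True:
    payload = json.dumps({
        "device_id": DEVICE_ID,
        "timestamp": int(time.time()),
        "value": read_temperature(),
    })
    client.publish(TOPIC, payload)
    time.sleep(60)  # one reading per minute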

On the central server, these messages will be collected and stored in the database tables.
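
A minimal sketch of this collector is given below, assuming paho-mqtt for the subscription and psycopg2 for the database insert; the topic name, credentials and table layout from above are my assumptions.

import json
import paho.mqtt.client as mqtt
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="iot", user="iot_user", password="secret")
conn.autocommit = True

def on_message(client, userdata, msg):
    # Each MQTT message is a JSON document like {"device_id": ..., "timestamp": ..., "value": ...}
    reading = json.loads(msg.payload)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO temperature_data (device_id, timestamp, value) VALUES (%s, %s, %s)",
            (reading["device_id"], reading["timestamp"], reading["value"]),
        )

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)   # the broker runs on the central server itself
client.subscribe("sensors/temperature")
client.loop_forever()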

A sample view of the temperature_data table is shown below.

device_id   timestamp (epoch seconds)   value (°C)
device_01 1587825234 27
device_02 1587825234 28
device_03 1587825234 23
device_04 1587825234 28
device_05 1587825234 30
device_06 1587825234 26
device_07 1587825234 22
device_08 1587825234 28
device_09 1587825234 32
device_10 1587825234 29
device_11 1587825234 31

Now from this table, we can query and get the required information based on the user requirement. We can either develop custom visualizations using JavaScript, query the DB using workbenches, or connect and visualize the data using visualization tools like Apache Superset, Power BI etc.
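
As an example of the kind of trend query this enables, the sketch below computes the hourly average temperature per device for the last 24 hours. It assumes the same psycopg2 connection and table layout as above.

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="iot", user="iot_user", password="secret")
with conn.cursor() as cur:
    cur.execute("""
        SELECT device_id,
               (timestamp / 3600) * 3600 AS hour_bucket,   -- truncate epoch seconds to the hour
               AVG(value)                AS avg_temp
        FROM temperature_data
        WHERE timestamp > EXTRACT(EPOCH FROM NOW()) - 24 * 3600
        GROUP BY device_id, hour_bucket
        ORDER BY device_id, hour_bucket
    """)
    for device_id, hour_bucket, avg_temp in cur.fetchall():
        print(device_id, hour_bucket, round(avg_temp, 2))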

[Chart: timeseries_chart01]

With this, I have explained the high-level architecture and implementation of a sample IoT system. This system can be scaled further by using a proper time series database instead of PostgreSQL.

How to set Kafka Heap Size?

Setting the Kafka heap size is simple. By default, Kafka runs with 512 MB as the heap size. To increase the heap size, set the following environment variable and restart Kafka.

export KAFKA_HEAP_OPTS="-Xmx2G -Xms2G"

Kafka checks for KAFKA_HEAP_OPTS before it starts; if there is no value set for this variable, it assigns the default value mentioned above, otherwise it picks up the configured value.

Delta Lake – The New Generation Data Lake

Delta Lake is the need of the present era. For the past several years, we have been hearing about Data Lakes, and I have worked on several Data Lake implementations myself. But the previous generation Data Lake was not a complete solution. One of the main drawbacks of the old generation data lake is the difficulty in handling ACID transactions.

Delta Lake brings ACID transactions to the storage layer and thus makes the system more robust and efficient. One of my recent projects was to build a Data Lake for one of India's largest ride sharing companies. Their requirements included handling CDC (Change Data Capture) in the lake. Their customers make several rides per day, and there are a lot of transactions and changes happening in various entities associated with the platform, such as debiting money from a wallet, crediting money to a wallet, creating a ride, deleting a ride, updating a ride, updating a user profile etc.

The initial version of the Lake that I designed was capable of recording only the latest values of each of these entities. But that was not a proper solution, as it did not bring the complete analytics capability. So I then came up with a design using Delta that has the capability to handle the CDC. In this way we are able to track all the changes happening to the data, and instead of updating the records, we also keep the historic data in the system.

[Image: Delta-Lake-Architecture]

Image Credits: Delta Lake 

The Delta format is the main magic behind the Delta Lake. The Delta format was open sourced by Databricks and is available with Apache Spark.

Some of the key features of Delta Lake are listed below.

  1. Support for ACID transactions. It is very tedious to bring data integrity to a conventional Data Lake; the transaction handling capability was missing in the old generation Data Lakes. With the support for transactions, Delta Lake becomes more efficient and reduces the workload of Data Engineers.
  2. Data Versioning: Delta Lake supports time travel. This helps us with rollback, audit control, version control etc. In this way, old records are not deleted; instead they are versioned.
  3. Support for Merge, Update and Delete operations.
  4. No major change is required in the existing system to implement Delta Lake. Delta Lake is 100% open source and 100% compatible with the Spark APIs. Delta Lake uses the Apache Parquet format to store the data. The following snippet shows how to save data in Delta format. It is very simple: just use "delta" instead of "parquet".
// Writing a DataFrame in plain Parquet format
dataframe
   .write
   .format("parquet")
   .save("/dataset")

// Writing the same DataFrame in Delta format: only the format string changes
dataframe
   .write
   .format("delta")
   .save("/dataset")

For trying out Delta in detail, use the community edition of Databricks.

A sample code snippet for trying out the same is given below.
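
This is a minimal PySpark sketch, assuming a Spark session with the Delta Lake package available (on the Databricks community edition, spark already exists); the path /tmp/delta/events is just an example.

from pyspark.sql import SparkSession

# On Databricks the `spark` session already exists; elsewhere, create one with the Delta package on the classpath.
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Write a small DataFrame in Delta format.
df = spark.range(0, 5)
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read it back.
spark.read.format("delta").load("/tmp/delta/events").show()

# Time travel: read an older version of the same table.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events").show()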

What is Big Data? Why Big Data?

Big data has become a very hot topic in the industry, and every company is trying to make its mark by providing solutions for big data.

Why has big data become so popular?

Technology has become very advanced, and we have now reached a situation where data is able to speak. Previously we used data to find the results of events that happened in the past; now we are using data to predict events that are going to happen.

Consider a doctor who learned everything through several years of education and practice; he does his consulting based on his experience. If we build a system that learns from historic data and predicts events based on its learning, it will be very helpful in a lot of areas. As the size of useful data (data that contains details) increases, the accuracy of the prediction also increases. Once the data size exceeds certain limits, it becomes very difficult to process it using conventional data processing technologies, and this is where the big data problem arises. Nowadays the demand for insight generation and prediction using big data is very high, because almost every system that interacts with people now gives recommendations. These recommendations are based on analysis of previous data, and their accuracy increases with the size of the data. For example, we get item recommendations from Amazon, Flipkart and eBay, friend suggestions from Facebook, Google advertisements etc. All of these work by processing large amounts of data, which increases sales in the case of businesses and usability in the case of other systems.

For solving these big data problems, frameworks such as Hadoop, Storm and Spark evolved.

Most big data solutions are implementations of distributed storage, distributed processing, in-memory processing etc.
Nowadays a lot of analytics is happening on social media data. Remember that you can't do a Ctrl+Z on things you have uploaded to social media.

Facebook Opensourced Presto

Facebook has open sourced its data processing engine Presto to the world. Presto is a distributed query engine based on ANSI SQL. It is highly optimized and currently runs against more than 300 petabytes of data, which may make it one of the largest big data processing systems. Presto is totally different from MapReduce: it is an in-memory data processing engine and is very much optimized. According to the details given in the Facebook newsletter and on the Presto website, it is 10 times faster than Hive/MapReduce. Hive also came from Facebook, so Presto is well placed to beat Hive; Hive queries ultimately run as multiple MapReduce jobs and therefore take more time.

From my point of view, the real competition may be between Cloudera Impala and Presto. Impala's performance with huge datasets is not yet known from any production environment, because it is a budding technology from the Cloudera family, while Presto is already tested and running against huge datasets in production. Another interesting fact about Presto is that we can use the already existing infrastructure and Hadoop cluster for deploying it, because Presto supports HDFS as its underlying data storage; it supports other storage systems as well, so it is flexible. Leading internet companies including Airbnb and Dropbox are using Presto. Presto code and further details are available in this link.
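
For a quick feel of how clients talk to Presto, here is a minimal sketch using the presto-python-client package (prestodb); the coordinator address, user, catalog and schema are my assumptions and need to match your cluster.

import prestodb

# Assumed coordinator address and catalog/schema; adjust to your cluster.
conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT * FROM system.runtime.nodes")  # list the worker nodes of the cluster
for row in cur.fetchall():
    print(row)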

I have deployed Presto and Impala on a small cluster of 8 nodes. I haven't got enough time to explore Presto further yet, but I am planning to explore more in the coming days. 🙂

An Introduction to Apache Hive

Hive is an important member of the Hadoop ecosystem. It runs on top of Hadoop. Hive uses a SQL-like query language to process the data in HDFS. Hive is very simple compared to writing many lines of MapReduce code in a programming language such as Java. Hive was developed by Facebook with the vision of supporting their SQL experts in handling big data without much difficulty. Hive queries are easy to learn for people who don't know any programming language, and people with SQL experience can go straight to Hive queries. The queries fired into Hive ultimately run as MapReduce.

Hive runs in two execution modes: local and distributed.

In local mode, the Hive queries run as a single process and use the local file system. In distributed mode, the mappers and reducers run as different processes and use the Hadoop Distributed File System.

The installation of Hive was explained in my previous post, Hive Installation.

Hive stores its contents in HDFS and its table details (metadata) in a database. By default, the metadata is stored in a Derby database, but this is only for play-around setups and cannot be used for multiuser environments. For multiuser environments, we can use databases such as MySQL, PostgreSQL or Oracle for storing the Hive metadata. The data itself is stored in HDFS, inside a location called the Hive warehouse directory, which is defined by the property hive.metastore.warehouse.dir. By default this is /user/hive/warehouse.

We can fire queries into Hive using a command line interface or using clients written in different programming languages. The Hive server exposes a Thrift service, making Hive accessible from various programming languages.
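
As a sketch of such a programmatic client, the snippet below uses the PyHive library to connect to a HiveServer2 instance; the host, port, username and query are my assumptions.

from pyhive import hive

# Assumed HiveServer2 host/port; the default Thrift port is usually 10000.
conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="default")
cur = conn.cursor()
cur.execute("SHOW TABLES")
for table in cur.fetchall():
    print(table)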

The simplicity and power of Hive can be illustrated by comparing the word count program written in Java with the equivalent Hive query.

The word count program written in Java is explained well in my previous post, A Simple Mapreduce Program – Wordcount. For that, we have to write a lot of lines of code, it takes time, and it needs some good programming knowledge as well.

The same word count can be done in a few lines of Hive query.

-- Load the raw text into a single-column table.
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;

-- Split each line into words, then count the occurrences of each word.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Hadoop Mapreduce- A Good Presentation…

This is a video I found on YouTube that explains Hadoop MapReduce very clearly using the wordcount example.

Deployment and Management of Hadoop Clusters