Stream Processing Framework in Python – Faust

I was looking for a highly scalable streaming framework in Python. Until now I was using Spark Streaming for reading data from streams with heavy throughput, but Spark felt a little heavy, as its minimum system requirements are high.

Recently I was researching this and found a framework called Faust. I started exploring it, and my initial impression is very good.

This framework is capable of running in a distributed way, so we can run the same program on multiple machines, which enhances performance.

I tried executing the sample program from their website and it worked properly. The same program is pasted below. I used CDH Kafka 4.1.0, and the program worked seamlessly.
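
sample_faust.py (this is the hello-world example from the Faust documentation; the broker address assumes a Kafka listener on localhost):

import faust

# Structure of the messages in the topic
class Greeting(faust.Record):
    from_name: str
    to_name: str

# The app is the entry point; the broker URL assumes Kafka on localhost
app = faust.App('hello-app', broker='kafka://localhost')
topic = app.topic('hello-topic', value_type=Greeting)

# An agent is an async stream processor consuming from the topic
@app.agent(topic)
async def hello(greetings):
    async for greeting in greetings:
        print(f'Hello from {greeting.from_name} to {greeting.to_name}')

# A timer that produces a test message to the topic every second
@app.timer(interval=1.0)
async def example_sender():
    await hello.send(value=Greeting(from_name='Faust', to_name='you'))

if __name__ == '__main__':
    app.main()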

To execute the program, I used the following command.

python sample_faust.py worker -l info
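
To scale out, the same worker can be started on additional machines pointing at the same Kafka cluster, and the topic partitions are rebalanced across the workers. On a single machine, each extra worker needs its own web port, for example:

python sample_faust.py worker -l info --web-port 6067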

The above program reads data from Kafka and prints the messages. This framework is not just about reading messages in parallel from streaming sources; it also integrates with RocksDB, an embedded key-value store that was open-sourced by Facebook and is written in C++.
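
For example, a Faust table (a distributed key-value abstraction backed by a Kafka changelog topic) can be persisted to RocksDB by selecting it as the store when creating the app (the faust[rocksdb] extra must be installed). A minimal sketch, with illustrative app and topic names:

import faust

# store='rocksdb://' persists table state to a local RocksDB database
app = faust.App('wordcount-app', broker='kafka://localhost', store='rocksdb://')
topic = app.topic('words', value_type=str)

# A table behaves like a dictionary, with changes logged to Kafka
word_counts = app.Table('word_counts', default=int)

@app.agent(topic)
async def count_words(words):
    async for word in words:
        word_counts[word] += 1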

VPN Installation on Raspberry Pi

What is a VPN?

VPN stands for Virtual Private Network. A VPN extends a private network across external networks so that users can securely interact with the systems within the private network. I will write another post with the complete details of VPNs; in this post we will concentrate on installing a VPN on a Raspberry Pi.

A VPN is a very important requirement for every enterprise, and nowadays even individuals have started using VPNs. It is very easy to configure a VPN. Most large enterprises use paid VPN services, and there are many VPN service providers in the market.

This post is about setting up a free VPN service. It can be used in small or medium scale businesses, or for personal purposes as well. I have been using this VPN service for the past several years and it has worked very well without any issues.

Installation of VPN on Raspberry Pi


I have used a Raspberry Pi for the installation of OpenVPN. The simplest way to install and configure a VPN on a Raspberry Pi is PiVPN; the install command is shown after the list below. PiVPN supports two VPN backends:

  • OpenVPN
  • WireGuard
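
PiVPN itself can be installed with the one-line command from the project's site (it pipes a remote script to bash, so review the script first if you prefer):

curl -L https://install.pivpn.io | bash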

During the installation, it asks the user to select a preference and installs accordingly. OpenVPN can operate over TCP or UDP; I have used both, and from my personal experience UDP is the best performing and most stable.

The only advantage of TCP is that we can run OpenVPN on TCP port 443, which bypasses almost all firewalls on external networks. TCP port 443 is globally open for HTTPS, so we can easily access the VPN using the same port. This way we do not have to request additional firewall exceptions to enable VPN access.
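
As a sketch, moving OpenVPN to TCP 443 only requires two directives in the server configuration (typically /etc/openvpn/server.conf; the exact path may vary with how PiVPN set things up), plus a matching remote line in the client profiles:

# /etc/openvpn/server.conf (excerpt)
proto tcp
port 443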

WireGuard is a new VPN protocol, designed from scratch rather than building on the protocols OpenVPN uses. It is fast and secure, but still under active development. If you look at current installations, the majority share goes to OpenVPN, mainly because it has been in the industry for several years and has already proved its capability. WireGuard is expected to catch up in the market soon.

More details about the configuration of PiVPN are described in the following URLs.

  1. PiVPN installation
  2. Additional Reference

Integration with the Network

The integration is very easy; we can integrate the VPN in two steps:

  • Connect the Raspberry Pi to your network using an ethernet cable.
  • Create a rule in your firewall or router to allow traffic from outside to the Raspberry Pi through a NAT rule. (Create a port forwarding rule to route requests from outside to the Raspberry Pi on the internal network; an example rule is shown below.)
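
For example, on a Linux-based router the NAT rule could look like the following. The Pi's internal address (192.168.1.50) and the OpenVPN default port (1194/UDP) are illustrative assumptions; substitute your own values.

# Forward incoming VPN traffic to the Raspberry Pi (assumed at 192.168.1.50)
iptables -t nat -A PREROUTING -p udp --dport 1194 -j DNAT --to-destination 192.168.1.50:1194
iptables -A FORWARD -p udp -d 192.168.1.50 --dport 1194 -j ACCEPT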

How to configure Delta Lake on EMR?

EMR versions 5.24.x and higher ship with Apache Spark 2.4.2 or higher, so Delta Lake can be enabled on EMR 5.24.x and above. By default Delta Lake is not enabled in EMR, but enabling it is easy.

We just need to add the Delta jar to the Spark jars. We can either add it manually or do it easily using a custom bootstrap script; a sample script is given below. Upload the delta-core jar to an S3 bucket and download it to the Spark jars folder using the shell script below. The delta-core jar can be downloaded from the Maven repository, or you can build it yourself; the source code is available on GitHub.

Adding this as a bootstrap action will perform this step automatically while provisioning the cluster. Keep the script below in an S3 location and pass it as a bootstrap script.

copydeltajar.sh

#!/bin/bash

aws s3 cp s3://mybucket/delta/delta-core_2.11-0.4.0.jar /usr/lib/spark/jars/

You can launch the cluster either by using the AWS web console or the AWS CLI.

aws emr create-cluster --name "Test cluster" --release-label emr-5.25.0 \
--use-default-roles --ec2-attributes KeyName=myDeltaKey \
--applications Name=Hive Name=Spark \
--instance-count 3 --instance-type m5.xlarge \
--bootstrap-actions Path="s3://mybucket/bootstrap/copydeltajar.sh"
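
Once the cluster is up, you can verify that the Delta data source is picked up from a PySpark session. A minimal sketch; the S3 path is illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-check").getOrCreate()

# Write a small dataset in Delta format
spark.range(0, 5).write.format("delta").mode("overwrite").save("s3://mybucket/delta/test-table")

# Read it back to confirm Delta Lake is working
spark.read.format("delta").load("s3://mybucket/delta/test-table").show()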