This post walks through a few aggregations on streaming data using Spark Streaming and Kafka. We will set up a local environment for the purposes of the tutorial. If you already have Spark and Kafka running on a cluster, you can skip the setup steps.
The Challenge of Stream Computations
Computations on streams can be challenging for several reasons, one of them being the size of the dataset. Standard formulae and practices assume access to the full dataset and may not be the best-suited approach for a stream; quantiles, for example, require iterating over the entire dataset in sorted order. Even mean = sum of values / count is not fully scalable if computed naively. Instead, we store a running sum and count: each new item is added to the sum and the count is incremented, and whenever we need the average, we divide the sum by the count to get the mean at that instant.
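As a minimal sketch of this idea (the class and variable names below are ours, not from the original post), a running mean needs only two numbers of state:

```python
class RunningMean(object):
    """Keep a running sum and count so the mean is available at any instant."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value):
        # Each new item only updates two numbers; the stream itself is never stored.
        self.total += value
        self.count += 1

    def mean(self):
        return self.total / self.count if self.count else 0.0

rm = RunningMean()
for x in [4, 8, 15, 16, 23, 42]:   # pretend these arrive one at a time
    rm.add(x)
print(rm.mean())  # 18.0
```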
Calculating Percentile
Computing a percentile requires finding the location of an item in a large dataset; the 90th percentile, for example, is the value that is greater than 90 percent of the values in the sorted dataset. To illustrate, in [9, 1, 8, 7, 6, 5, 2, 4, 3, 0], the 80th percentile would be 8. This means we need to sort the dataset and then find an item by its position, which clearly does not scale. Scaling this operation involves an algorithm called t-digest, a way of approximating percentiles at scale. t-digest builds digests whose centroids are placed at positions approximating the appropriate quantiles, and these digests can be added together to produce a complete digest that estimates the quantiles of the whole dataset. Spark allows us to run computations on partitions of data, unlike traditional MapReduce, so we calculate a digest for every partition and add the digests in the reduce phase to get a complete digest. This is the only time we need to converge the data at one point (the reduce operation). We then use Spark's broadcast feature to broadcast the resulting value, which is used to filter the dataset, leaving us with an RDD matching our criterion (the top 5 percentile). Finally, we use mapPartitions to send the values of each partition to Kafka (this could be any message handler, an HTTP post, and so on).
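Here is a sketch of that per-partition pattern, assuming the third-party tdigest Python package (whose exact API may vary between implementations) and an existing RDD of numeric scores named scores_rdd; both names are ours:

```python
from tdigest import TDigest

def digest_partition(values):
    # Build one digest per partition; only the small digests leave the executors.
    d = TDigest()
    for v in values:
        d.update(v)
    yield d

# Add the per-partition digests in the reduce phase to get a complete digest.
full_digest = scores_rdd.mapPartitions(digest_partition).reduce(lambda a, b: a + b)

# Broadcast the approximate cutoff for the top 5 percentile to all workers,
# then filter the RDD down to the records matching our criterion.
cutoff = sc.broadcast(full_digest.percentile(95))
top_5_percentile = scores_rdd.filter(lambda score: score >= cutoff.value)
```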
Nature of the Data
We are using fictitious data with two columns: user_id and activity_type. We are going to compute popular users. The activity can be of the following types: profile.picture.like, profile.view, and message.private. Each of these activities carries a different score.
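For concreteness, a record and its scoring might look like the following; the JSON layout and the score values here are illustrative assumptions, not prescribed by the data:

```python
# One incoming activity event, encoded as JSON.
sample_event = '{"user_id": "user_42", "activity_type": "profile.view"}'

# Illustrative weights per activity type; the actual scores are up to you.
ACTIVITY_SCORES = {
    "profile.picture.like": 3,
    "profile.view": 1,
    "message.private": 5,
}
```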
Metrics We Would Like to Compute
We would like to compute the most popular users, that is, the top 5 percentile of users (their scores and the list of users).
Prerequisites
You must have Docker, Python 2.7, and JRE 1.7 installed, as well as Scala. Basic familiarity with Spark and the concept of RDDs is assumed.
Getting Set Up with Kafka
For the purpose of this tutorial, we will run Kafka as a Docker container. Download the container image and run it; the run command differs slightly between Mac and Linux (where Docker is installed directly on the machine), and more information can be found in the container's documentation. This should get you a running Kafka instance that we will use throughout the tutorial. We also download the Kafka binaries locally so that we can test the Kafka consumer, create topics, and so on. Download and extract the latest release from the Kafka downloads page; the directory containing the Kafka binaries will be referred to as $KAFKA_HOME.
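To sanity-check the broker from Python, you can round-trip a single message; this sketch assumes the kafka-python package and a broker reachable at localhost:9092 (adjust the address to your Docker host), and the topic name is ours:

```python
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"  # replace with your Docker host address if needed

# Publish one test message to a throwaway topic.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send("smoke-test", b"hello kafka")
producer.flush()

# Read it back, giving up after five seconds if nothing arrives.
consumer = KafkaConsumer("smoke-test",
                         bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
    break
```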
Getting Set Up with Spark
The next step is to install Spark. We have two options for running Spark:
- Run it on a Docker container
- Run it locally
Running Spark Locally
Download the Spark binaries from the Spark downloads page and extract the archive. If you have IPython installed, you can also run pyspark under IPython by setting the appropriate environment variable before launching the shell (see the Spark documentation for your version).
Running Spark in a Docker Container
Run the Spark image with a volume mount so that a directory named my_code on your local system is mounted to the /app directory in the Docker container. The Spark shell starts with the SparkContext available as sc and a HiveContext available as sqlContext. A simple Spark job, like the sketch below, can be used to test the installation.
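This minimal job (ours, run inside the pyspark shell where sc already exists) is enough to confirm the installation works:

```python
# Count the even numbers in 0..999 as a quick end-to-end check.
rdd = sc.parallelize(range(1000))
even_count = rdd.filter(lambda x: x % 2 == 0).count()
print(even_count)  # 500
```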
Spark Streaming Basics
Spark Streaming is an extension of the core Spark API. It can be used to process high-throughput, fault-tolerant data streams, which can be ingested from various sources such as ZeroMQ, Flume, Twitter, Kafka, and so on. Spark Streaming breaks the data into small batches, and these batches are then processed by Spark to generate the stream of results, again in batches. The core abstraction is the DStream, which represents a continuous stream of data as a sequence of RDDs loaded incrementally. More information can be found in the Spark Streaming Programming Guide.
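The overall shape of a Spark Streaming program looks roughly like the sketch below; the two-second batch interval and the plain socket source are ours, chosen only to show the DStream API:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-basics")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches

# A DStream is a sequence of RDDs; transformations look just like RDD ones.
lines = ssc.socketTextStream("localhost", 9999)
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
word_counts.pprint()

ssc.start()
ssc.awaitTermination()
```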
Kafka Basics
Kafka is a distributed, partitioned, and replicated publish-subscribe messaging system. Some terminology:
- A category of feeds is called a topic; for example, weather data from two different stations could be two different topics.
- The publishers are called Producers.
- The subscribers of these topics are called Consumers.
- The Kafka cluster has one or more servers, each of which is called a broker.
- More details can be found in the Kafka documentation.
Generating Mock Data
We can generate data in two ways (a sketch of a simple generator follows this list):
- Statically generated data
- Continuous data generation
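As an illustration of the continuous option, the following sketch keeps publishing random activity events to a Kafka topic; the topic name, broker address, event rate, and use of kafka-python are our assumptions:

```python
import json
import random
import time

from kafka import KafkaProducer

ACTIVITY_TYPES = ["profile.picture.like", "profile.view", "message.private"]

producer = KafkaProducer(bootstrap_servers="localhost:9092")

while True:
    event = {
        "user_id": "user_%d" % random.randint(1, 1000),
        "activity_type": random.choice(ACTIVITY_TYPES),
    }
    # Each message is a JSON-encoded activity record.
    producer.send("user_activity", json.dumps(event).encode("utf-8"))
    time.sleep(0.1)  # roughly ten events per second
```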
Aggregation and Processing Using Spark Streaming
This process can be broken down into the following steps (a sketch tying them together appears after the list):
- Reading the message from the Kafka queue.
- Decoding the message.
- Converting the message type text to its numeric score.
- Updating the score counts for incoming data.
- Filtering for the most popular users.
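The sketch below strings these steps together using the receiver-based Kafka stream for PySpark; the topic name, ZooKeeper address, scoring weights, and the fixed popularity cutoff are our assumptions (in practice, the broadcast t-digest percentile from earlier would supply the threshold):

```python
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ACTIVITY_SCORES = {"profile.picture.like": 3, "profile.view": 1, "message.private": 5}

def update_score(new_values, running_total):
    # Step 4: keep a running score per user across batches.
    return sum(new_values) + (running_total or 0)

sc = SparkContext(appName="popular-users")
ssc = StreamingContext(sc, 10)                   # 10-second batches
ssc.checkpoint("/tmp/popular-users-checkpoint")  # required for stateful updates

# Step 1: read messages from the Kafka queue.
stream = KafkaUtils.createStream(ssc, "localhost:2181", "popular-users-group",
                                 {"user_activity": 1})

# Steps 2 and 3: decode each message and convert the activity type to its score.
scores = (stream.map(lambda kv: json.loads(kv[1]))
                .map(lambda e: (e["user_id"], ACTIVITY_SCORES.get(e["activity_type"], 0))))

# Step 4: running totals per user.
totals = scores.updateStateByKey(update_score)

# Step 5: filter for the most popular users.
popular = totals.filter(lambda kv: kv[1] >= 50)
popular.pprint()

ssc.start()
ssc.awaitTermination()
```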