Stream Processing: A Basic Overview
Stream processing is a data-processing paradigm in which data is processed immediately after it is received or generated at the source end of the stream processing platform. It is a big data technology in which streams of bounded or unbounded data arrive at the processing engine at large scale. What are bounded and unbounded data?
Bounded data - Finite, static data whose start, end, length, and overall shape are known in advance (e.g., the issues filed against a specific git repository in the last hour).
Unbounded data - Unpredictable, infinite, and not always sequential data. Its creation is a never-ending cycle that goes on forever (e.g., a live feed of clickstream or sensor events). The contrast is sketched below.
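To make the distinction concrete, here is a minimal Python sketch (the source names and event fields are invented for illustration): a bounded source can be read in full, while an unbounded one can only ever be consumed incrementally.

```python
import itertools
import random
import time

def bounded_source():
    # Bounded data: a finite, fully known collection, e.g. last hour's issues.
    return ["issue-101", "issue-102", "issue-103"]

def unbounded_source():
    # Unbounded data: a never-ending generator of events.
    while True:
        yield {"sensor": "s1", "value": random.random(), "ts": time.time()}

# A batch job can simply iterate over the whole bounded collection...
for issue in bounded_source():
    print(issue)

# ...but an unbounded stream never terminates, so it can only be consumed
# incrementally; here we take just the first five events and move on.
for event in itertools.islice(unbounded_source(), 5):
    print(event)
```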
Sources receive events over multiple transport protocols such as HTTP, TCP, Kafka, and JMS, and in different data formats such as XML, JSON, text, and binary. The events received via sources are mapped into streams for processing. The stream processor must therefore handle these large-scale incoming data, process them in real time as accurately as possible, and pass the results to the appropriate sink. Sinks publish the events arriving on streams over multiple transport protocols such as HTTP, email, TCP, Kafka, and JMS, mapping the events back into data formats such as XML, JSON, text, and binary.
The source operator directs the input stream into the stream processing engine, and the sink operator receives the processed data, as shown in figure 1.1.
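The source-to-sink flow can be expressed as a pipeline of small steps. The following is a minimal sketch of that pattern, not any particular product's API; the JSON payloads and the stdout sink are assumptions for illustration.

```python
import json

def source(raw_payloads):
    # Source: receive raw payloads (JSON strings here) and map them
    # onto a stream of events.
    for payload in raw_payloads:
        yield json.loads(payload)

def process(events):
    # Processor: transform each event in real time as it flows through.
    for event in events:
        event["value_doubled"] = event["value"] * 2
        yield event

def sink(events):
    # Sink: map processed events back to an output format and publish
    # them (stdout here).
    for event in events:
        print(json.dumps(event))

raw = ['{"id": 1, "value": 10}', '{"id": 2, "value": 25}']
sink(process(source(raw)))
```

Because each stage is a generator, events flow through one at a time rather than being collected into a batch first.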
Consider a cosmetics manufacturer with an inventory management system. Production should not exceed a specific level, since surplus production beyond demand results in heavy losses. Each production department produces cosmetics, the amount produced in each department is received by the stream processor in real time, and when the total exceeds the threshold an alert such as "The production amount has been breached" should be sent to the inventory manager. This is an example of a basic stream processing scenario.
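A sketch of that scenario, with the threshold and the department figures invented for illustration:

```python
THRESHOLD = 10_000  # hypothetical maximum units before an alert is raised

def production_alerts(events):
    # Keep a running total of units produced across all departments and
    # raise an alert the moment the total breaches the threshold.
    total = 0
    alerted = False
    for event in events:  # each event: {"department": ..., "units": ...}
        total += event["units"]
        if total > THRESHOLD and not alerted:
            alerted = True
            yield f"ALERT: The production amount has been breached ({total} units)"

production_stream = [
    {"department": "lipstick", "units": 4000},
    {"department": "mascara", "units": 3500},
    {"department": "foundation", "units": 3000},  # pushes the total past 10,000
]
for alert in production_alerts(production_stream):
    print(alert)
```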
There are many contenders, such as Apache Flink and WSO2 SP, that provide this stream processing mechanism, and each has its own way of processing and publishing.
Why a Stream Processor?
Stream processing is highly beneficial when the events you wish to track happen frequently and close together in time. It is also the best choice when an event needs to be detected right away and responded to quickly. Stream processing is therefore useful for tasks such as fraud detection and cybersecurity: if transaction data is stream-processed, fraudulent transactions can be identified and stopped before they even complete.
Consider processing unbounded data. With batch processing, you have to stop receiving data at some point, store it, process it, collect the next batch, and so on. This creates the awkward problem of aggregating results across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can detect patterns, inspect results, look at multiple levels of focus, and easily examine data from multiple streams simultaneously.
For example, suppose you are trying to detect the length of a web session in a never-ending stream (this is an example of detecting a sequence). It is very hard to do with batches, as some sessions will fall across two batches. Stream processing handles this easily, as the sketch below shows.
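Here is a minimal sketch of gap-based sessionization over a continuous stream; the 30-minute inactivity gap and the event shape are assumptions for illustration.

```python
SESSION_GAP = 30 * 60  # assumed inactivity gap (seconds) that closes a session

def session_lengths(events):
    # Track per-user sessions in a single pass over a continuous stream.
    # A session ends when a user is inactive for longer than SESSION_GAP,
    # at which point its length is emitted. Because this state lives across
    # events, a session can never be split the way it would be at a batch
    # boundary.
    open_sessions = {}  # user -> (session_start, last_seen)
    for event in events:  # each event: {"user": ..., "ts": ...}, ts ascending
        user, ts = event["user"], event["ts"]
        if user in open_sessions:
            start, last_seen = open_sessions[user]
            if ts - last_seen > SESSION_GAP:
                yield user, last_seen - start   # previous session is complete
                open_sessions[user] = (ts, ts)  # this event starts a new one
            else:
                open_sessions[user] = (start, ts)
        else:
            open_sessions[user] = (ts, ts)

events = [
    {"user": "alice", "ts": 0},
    {"user": "alice", "ts": 600},
    {"user": "alice", "ts": 600 + SESSION_GAP + 1},  # gap closes the session
]
for user, length in session_lengths(events):
    print(user, length)  # alice 600
```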
In fast-scaling business domains, batch processing is impractical because it consumes more and more memory over time; streaming is the better choice there. Daily intervals are often not sufficient for producing outputs, and speed is becoming a major requirement in most use cases. Results are expected promptly, not minutes or even hours later. This is where stream processing comes into play, since data is processed as soon as it is known to the system.
Challenges in Stream Processing
One of the biggest challenges organizations face with stream processing is that the system's long-term data output rate must be at least as fast as its long-term data input rate; otherwise the system begins to run into storage, memory, and speed issues. Data should be retained only in contexts that need past data for processing. In all other contexts, data should be purged periodically to avoid memory problems, as in the sketch below.
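A minimal sketch of the periodic-purging idea; the retention window is an assumption, and real stream processors typically express this through window and state-retention settings rather than hand-rolled stores.

```python
import time

RETENTION = 60 * 60  # assumed retention window (seconds) for past events

class PurgingStore:
    # Keep only the recent events a computation still needs, and drop the
    # rest. Without periodic purging, state for an unbounded stream grows
    # without bound and the processor eventually runs out of memory.
    def __init__(self):
        self.events = []  # (timestamp, event) pairs, oldest first

    def add(self, event, ts=None):
        self.events.append((time.time() if ts is None else ts, event))

    def purge(self, now=None):
        cutoff = (time.time() if now is None else now) - RETENTION
        self.events = [(ts, e) for ts, e in self.events if ts >= cutoff]

store = PurgingStore()
store.add({"reading": 1}, ts=0)
store.add({"reading": 2}, ts=3599)
store.add({"reading": 3}, ts=4000)
store.purge(now=4000)     # drops the first event: 0 < 4000 - 3600
print(len(store.events))  # 2
```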