Real-time data is unusual among data sources in that it expires quickly. To deal with this you’ll need some specialised real-time data technology. The value of real-time data evaporates rapidly if it’s not used within a certain timeframe. It needs to be processed continuously, with alerts and insights surfaced appropriately – it needs to be streamed.
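To make the idea of expiry concrete, here is a minimal sketch in plain Python of filtering out events that have outlived their usefulness. The `Event` class, the `fresh_events` function and the five-second limit are all hypothetical, chosen purely for illustration; a real system would pick a limit to suit its own use case.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Event:
    name: str
    created_at: float  # seconds on some shared clock (hypothetical)


def fresh_events(
    events: Iterable[Event], now: float, max_age_seconds: float
) -> Iterator[Event]:
    """Yield only the events that are still within their useful lifetime."""
    for event in events:
        if now - event.created_at <= max_age_seconds:
            yield event


# Illustrative data: one event is 60 seconds old, the other 2 seconds old.
events = [Event("page_view", 100.0), Event("checkout", 158.0)]
recent = list(fresh_events(events, now=160.0, max_age_seconds=5.0))
```

With these made-up values only the `checkout` event survives the filter, because the `page_view` event is well past the five-second limit.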
The tools you’ll need
You’ll require tools for real-time data ingestion, real-time processing and even edge analytics. Edge analytics allows quicker decision making by running at the edges of a system instead of within a central hub. A till, for example, could instantly make certain promotional decisions for a transaction instead of having to relay data to a central system and then wait for a response. Self-driving cars, which have to make snap decisions on whether to stop, are another example of edge analytics.
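The till example might look something like the sketch below: the decision is a local function call, with no round trip to a central system. The function name, the thresholds and the promotions themselves are invented for illustration and don’t come from any real retail system.

```python
def promotion_for_basket(basket_total: float, is_loyalty_member: bool) -> str:
    """Pick a promotion on the till itself, using only local data.

    The rules below are hypothetical; the point is that nothing here
    waits on a network call to a central hub.
    """
    if is_loyalty_member and basket_total >= 50.0:
        return "10% off"
    if basket_total >= 100.0:
        return "free delivery"
    return "none"


# Example decisions made instantly at the edge.
member_offer = promotion_for_basket(60.0, is_loyalty_member=True)
guest_offer = promotion_for_basket(120.0, is_loyalty_member=False)
```

In practice the rules would probably be pushed out to each till periodically, so that the central system still controls the policy without sitting in the path of every transaction.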
Popular real-time data technology
There are a few options on the market, both open source and proprietary. The list will likely grow as real-time analytics gains popularity. Here’s a run-down of the most widely used:
Apache Flink provides capabilities for distributed computation across streams of data. It’s effective for both batch and real-time data processing but prioritises streaming. It has a number of APIs, including the DataStream API, the DataSet API for Java, Scala and Python, and a SQL-like query API. Flink is very stream-oriented, offering flexible windowing over a continuous streaming model, so that batch and real-time processing are integrated into a single system. It also has FlinkML, its own machine learning library, as well as graph processing libraries.
Kinesis is an out-of-the-box solution offered by Amazon for data streaming. It’s often used for real-time data aggregation, followed by loading that data into a data warehouse. Kinesis provides durability and elasticity, and it’s a completely cloud-based, managed and scalable service. It can handle large real-time data streams as well.
Apache Storm is a distributed real-time computation system which can be used with any programming language. It’s highly scalable. Storm is often used for distributed machine learning and real-time analytics. It can integrate with Hadoop and runs on YARN. Hadoop isn’t natively designed for stream processing, so when looking to begin real-time analytics in a Hadoop ecosystem, Storm is a good bet. Storm doesn’t have batch support, making it a purely real-time processing framework.
Kafka is a distributed messaging system that’s used to integrate data streams. It was developed at LinkedIn before becoming an Apache project. Kafka is highly scalable and can therefore handle a lot of data, terabytes upon terabytes.
Apache Samza is a distributed stream processing framework that’s closely associated with Apache Kafka. It’s specifically designed to take advantage of Kafka’s architecture. Samza offers fault tolerance, buffering and state storage. One downside to using Samza is that it only supports JVM languages, so it isn’t as flexible as some of the other solutions.
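The windowing idea mentioned for Flink can be illustrated without any framework at all. The sketch below is plain Python, not the Flink API: it groups timestamped events into fixed, non-overlapping ("tumbling") windows and counts them per key. The function name, the event shape and the ten-second window are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[float, str]], window_seconds: float
) -> Dict[Tuple[int, str], int]:
    """Count events per (window, key) using fixed, non-overlapping windows.

    Each event is a (timestamp_in_seconds, key) pair. Window 0 covers
    [0, window_seconds), window 1 covers the next interval, and so on.
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        window_index = int(timestamp // window_seconds)
        counts[(window_index, key)] += 1
    return dict(counts)


# Illustrative click events: (seconds since start, page name).
clicks = [(1.0, "home"), (4.0, "home"), (11.0, "home"), (12.0, "cart")]
result = tumbling_window_counts(clicks, window_seconds=10.0)
```

A real stream processor does the same grouping incrementally and emits each window’s result as the window closes, rather than holding everything in memory as this toy version does.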
Real-time data technology: considerations
Whatever you invest in, make sure your system is capable of supporting both your current and future plans. If your volume of real-time data is going to increase over the next few years, then you’ll want a system that is scalable.
The different tools you choose need to integrate with one another. Depending on what you pick, you’ll end up with different combinations of functionality.
You should also assess the skills of your team and whether they’ll need further training or have to learn a new programming language. All of this can bring additional expenses.
Then there’s the choice between proprietary and open-source. If you go down the open-source route, then you’ll need a data team who can adapt it to your organisation’s needs. Some companies find it more cost-effective to purchase out-of-the-box solutions or to use a mix of free and paid-for solutions.
Support is another key consideration. Failures matter much more in a real-time system. Because of its instant nature, people are bound to notice any faults. Decide whether you need 24/7 support or if your data team is experienced enough to fix any issues.
Real-time data technology: now and for the future
Real-time data technology is becoming increasingly popular and will continue to grow as real-time IoT devices gain mainstream adoption. But real-time can be costly, in terms of time and resources. Therefore, only use real-time when your use case really calls for it. Some common use cases include mapping and real-time updates, real-time chats, autonomous vehicles and remote functionality via the Internet of Things.
Many organisations would benefit from a hybrid solution that does both streaming and batch processing. This allows for a degree of flexibility whilst also future-proofing your data function. Eventually, all organisations will likely use real-time in some capacity. It’s worth getting your real-time infrastructure right from the beginning. That way you’ll be light years ahead of your competitors.
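One way to picture a hybrid approach is to write the processing logic once and feed it either a finite batch or a live stream. The sketch below is plain Python under that assumption; the function names and the `live_feed` stand-in are hypothetical and not tied to any of the tools above.

```python
from typing import Iterable, Iterator


def running_totals(amounts: Iterable[float]) -> Iterator[float]:
    """Yield the running total after each amount.

    Works on any iterable, so the same logic serves a finite batch
    and a potentially unbounded stream.
    """
    total = 0.0
    for amount in amounts:
        total += amount
        yield total


def batch_total(amounts: Iterable[float]) -> float:
    """Batch mode: consume every amount and return only the final total."""
    final = 0.0
    for final in running_totals(amounts):
        pass
    return final


def live_feed() -> Iterator[float]:
    """Stand-in for a live source; here it simply yields three values."""
    yield from (5.0, 7.5, 2.5)


# Streaming mode: a result is available after every event.
stream_results = list(running_totals(live_feed()))
# Batch mode: one result once all the data has been read.
batch_result = batch_total([5.0, 7.5, 2.5])
```

Sharing the logic this way keeps the batch and streaming results consistent with each other, which is much of the appeal of a single system that handles both.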