25.09.18

Real-time data tools to drive business outcomes.

Intro

Real-time data is unique in that its value expires quickly. To deal with this, you’ll need some specialised real-time data tools.

The value of real-time data evaporates rapidly if it’s not used within a certain timeframe. It needs to be processed continuously, with alerts and insights surfaced appropriately. In other words, it needs to be streamed.

With a purpose-built data platform suited to your strategy, real-time data tools can give your teams access to new insights, and drive better analytics across your organisation.

The tools you’ll need

You’ll need the right tools for real-time data ingestion, real-time processing and even edge analytics. Edge analytics enables quicker decision making at the edges of a system instead of within a central hub.

A till, for example, could instantly make certain promotional decisions for a transaction instead of having to relay data to a central system then wait for a response. Self-driving cars, which have to make snap decisions on whether to stop, are another example of edge analytics.
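To make the till example concrete, here is a minimal sketch in Python of a decision made at the edge. The threshold, the discount and every name in it are hypothetical, chosen purely for illustration; a real till would follow whatever rules your business sets.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical promotion rule held locally on the till (illustrative values).
PROMO_THRESHOLD = 50.00  # basket total that triggers the promotion
PROMO_DISCOUNT = 0.10    # 10% off


@dataclass
class Transaction:
    till_id: str
    basket_total: float


def decide_at_edge(tx: Transaction) -> float:
    """Return the discount for this transaction, decided locally.

    No call is made to a central system, so there is no round trip to wait for.
    """
    if tx.basket_total >= PROMO_THRESHOLD:
        return round(tx.basket_total * PROMO_DISCOUNT, 2)
    return 0.0


def summarise_for_hub(transactions: list[Transaction]) -> dict:
    """Build a compact summary that could be sent to the central hub later."""
    return {
        "count": len(transactions),
        "revenue": round(sum(t.basket_total for t in transactions), 2),
    }
```

The point of the split is that the time-critical decision happens on the device, while the central system only needs a summary after the fact.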

Leading real-time data tools

There are a few options on the market, both open source and proprietary. The number of options will likely grow as real-time analytics gains popularity. Here’s our run-down of the most widely used real-time data tools.

Flink

Apache Flink provides capabilities for distributed computation across streams of data. It’s effective for both batch and real-time data processing but prioritises streaming. It has a number of APIs, including the DataStream API, the DataSet API for Java, Scala and Python, and a SQL-like query API.

Flink is very stream-oriented, with flexible windowing for its continuous streaming model, so that batch and real-time processing are integrated into a single system. It also has FlinkML, its own machine learning library, as well as graph processing libraries.
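Windowing is easier to picture with an example. The sketch below is plain Python rather than Flink’s own API, and the function name and event format are our assumptions. It groups a stream of events into fixed, non-overlapping ("tumbling") windows and emits a count per key each time a window closes. It assumes events arrive in timestamp order, which real streams don’t always guarantee.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def tumbling_windows(
    events: Iterable[tuple[float, str]], window_seconds: float
) -> Iterator[tuple[int, dict[str, int]]]:
    """Yield (window_index, counts_per_key) each time a window closes.

    Each event is (timestamp_in_seconds, key). Events are assumed to arrive
    in timestamp order; late or out-of-order events are not handled here.
    """
    current = None
    counts: dict[str, int] = {}
    for timestamp, key in events:
        index = int(timestamp // window_seconds)
        if current is None:
            current = index
        if index != current:
            # The previous window is complete, so emit it and start a new one.
            yield current, counts
            current, counts = index, {}
        counts[key] = counts.get(key, 0) + 1
    if current is not None:
        # Emit the final, possibly partial, window when the input ends.
        yield current, counts
```

Because the function accepts any iterable, the same logic can run over a finite list (batch) or over a generator that never ends (streaming), which is the general idea behind treating the two as one system.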

Kinesis

Kinesis is an out-of-the-box data streaming solution offered by Amazon. It’s often used to aggregate data in real time before loading it into a data warehouse.

Kinesis is designed for durability and elasticity, and it’s a completely cloud-based, managed and scalable service. It can handle large real-time data streams as well.
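The aggregate-then-load pattern can be sketched in a few lines of plain Python. This is not the Kinesis API: the class name, the batch size and the `sink` callable (standing in for a warehouse load step) are all hypothetical, and a managed service would handle the buffering and delivery for you.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable


class AggregatingLoader:
    """Sum values per key, then hand each aggregated batch to a sink.

    `sink` is a stand-in for a data warehouse load step (hypothetical).
    """

    def __init__(self, sink: Callable[[dict], None], batch_size: int) -> None:
        self.sink = sink
        self.batch_size = batch_size
        self._totals: dict[str, float] = defaultdict(float)
        self._seen = 0

    def add(self, key: str, value: float) -> None:
        """Fold one record into the running totals; flush when the batch is full."""
        self._totals[key] += value
        self._seen += 1
        if self._seen >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send the current totals to the sink and start a fresh batch."""
        if self._seen:
            self.sink(dict(self._totals))
            self._totals.clear()
            self._seen = 0
```

A short usage example, with a list standing in for the warehouse:

```python
loaded = []
loader = AggregatingLoader(loaded.append, batch_size=3)
for country, amount in [("uk", 1.0), ("fr", 2.0), ("uk", 3.0)]:
    loader.add(country, amount)
# loaded now holds one aggregated batch: [{"uk": 4.0, "fr": 2.0}]
```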

Storm

Apache Storm is a distributed real-time computation system which can be used with any programming language. It’s highly scalable. Storm is often used for distributed machine learning and real-time analytics. It can integrate with Hadoop and runs on YARN.

Hadoop isn’t natively designed for stream processing, so when looking to begin real-time analytics in a Hadoop ecosystem, Storm is a good bet. Storm doesn’t have batch support, making it a purely real-time processing framework.

Kafka

Kafka is a distributed messaging system that’s used to integrate data streams. It was developed at LinkedIn before becoming an Apache project.

Kafka can handle a lot of data, terabytes upon terabytes, and it’s built to scale. If you can see your demand for real-time data skyrocketing, Kafka might be for you.

Samza

Apache Samza is a distributed stream processing framework that’s closely associated with Apache Kafka. It’s specifically designed to take advantage of Kafka’s architecture. Samza offers fault tolerance, buffering and state storage.

One downside to using Samza is that it only supports JVM languages, so it isn’t as flexible as some of the other solutions.

Things to consider

Whatever you invest in, make sure your system can support your current and future plans. If your volume of real-time data is going to increase over the next few years, then you’ll want a system that is scalable.

The different tools you choose need to integrate with each other. Depending on what you pick, you’ll end up with different combinations of functionality.

You should also assess the skills of your team, and whether they’ll need further training or have to learn a new programming language. All of this can bring additional expenses.

Then there’s the choice between proprietary and open-source. If you go down the open-source route, then you’ll need a data team who can adapt it to your organisation’s needs. Some companies find it more cost-effective to purchase out-of-the-box solutions or to use a mix of free and paid-for solutions.

Support is another key consideration. Faults matter much more in a real-time system. Because of its instant nature, people are bound to notice when something goes wrong. Decide whether you need 24/7 support, or if your data team is experienced enough to fix any issues.

Now, and for the future

Real-time data tools are becoming increasingly popular, and will continue to do so as real-time IoT devices gain mainstream adoption. But real-time can be costly in terms of time and resources.

Therefore, only use real-time when your use case really calls for it. Some common use cases for it include mapping and real-time updates, real-time chats, autonomous vehicles and remote functionality via the Internet of Things.

Many organisations would benefit from a hybrid solution that does both streaming and batch processing. This allows for a degree of flexibility whilst also future-proofing your data function. Eventually, all organisations will likely use real-time in some capacity.
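One way to picture a hybrid setup is a single piece of business logic shared by a batch path and a streaming path. The sketch below is a simplified illustration in plain Python; the record fields (`price`, `quantity`) and the function names are hypothetical.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def enrich(record: dict) -> dict:
    """Shared business logic: add an order total to a record."""
    return {**record, "total": record["price"] * record["quantity"]}


def run_batch(records: list[dict]) -> list[dict]:
    """Batch path: process a complete, finite data set in one go."""
    return [enrich(r) for r in records]


def run_stream(source: Iterable[dict]) -> Iterator[dict]:
    """Streaming path: process each record as it arrives, one at a time."""
    for record in source:
        yield enrich(record)
```

Keeping the logic in one place means the two paths give the same answer for the same record, so a team can start with batch and move individual use cases to streaming when they really call for it.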

It’s worth getting your real-time infrastructure right from the beginning. That way, you’ll be light years ahead of your competitors.