There is a huge range of choice out there when it comes to the core platform and the aim of this post isn’t to make a decision on which one is right for you. Instead, I want to arm you with some of the most important questions you need to ask when choosing a platform to go with. This article is the first in a series of posts on Building an Analytics Platform; make sure to read up on the other four articles for a comprehensive guide to your data architecture.
Given the breadth of features inherent across all the major platforms, I’m going to try and focus on the central data storage and access components. So what are the questions we need to ask when choosing the right platform?
Data Platform | What type of use cases will you deliver?
In my experience, people set out to build an analytics platform in one of two ways.
The first way is starting with a very specific set of use cases, whether that be understanding cross-channel purchase behaviour, creating predictive maintenance applications, or building NLP models to understand customer feedback.
The second way is a far more loosely defined approach, where there is an understanding that data and analytics are important, and they have lots of data they should be doing something with, but there is no specific challenge to target.
Either one of these approaches can work. That being said, you'll have to fight harder to prove value without a defined use case, as you'll be searching for something that demonstrates the platform is worthwhile. If you can begin with a good use case, it's much easier to get going.
But how does a use case impact your technology choices? The first and most important question is, how are you going to build the thing that gives you the answer? Once the data is on the platform and available to people to work with, can you answer all your questions with SQL? Do you need a language like Python or R to achieve the desired result? Perhaps you need to do machine learning at scale, and need to leverage technology like Spark?
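As a toy illustration of that question (using Python's built-in sqlite3 and invented sample data), here is the same question, average order value per channel, answered once in SQL and once in general-purpose code. SQL is fine while the logic stays set-based; the Python path is what you grow into when it doesn't:

```python
import sqlite3
from collections import defaultdict

# Invented sample data: (channel, order_value)
orders = [("web", 120.0), ("store", 80.0), ("web", 60.0), ("store", 100.0)]

# 1. Answering with SQL: enough for set-based questions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (channel TEXT, value REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", orders)
sql_result = dict(con.execute(
    "SELECT channel, AVG(value) FROM orders GROUP BY channel"
))

# 2. Answering with general-purpose code: the route you need once the
#    logic outgrows SQL (custom models, NLP, external APIs, etc.).
by_channel = defaultdict(list)
for channel, value in orders:
    by_channel[channel].append(value)
py_result = {ch: sum(vs) / len(vs) for ch, vs in by_channel.items()}

# Both approaches give the same answer for this simple question.
print(sql_result)
print(py_result)
```

The point of the sketch is not the arithmetic but the fork in the road: if every question your users have looks like the first half, a SQL-centric platform will serve you well; if they look like the second half, the platform must support a language runtime too.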
Understanding where you are going to take the platform in the long term is crucial. To start with SQL alone might be enough, but as the platform begins to take off, it soon might not be. It’s important to not back yourself into a corner that you cannot expand from. The last thing you want is to spend all the effort loading and preparing your data into a platform that cannot live up to the long-term expectations of the business.
Cloud technologies certainly ease this issue, allowing you to start with a smaller footprint on scalable technology and expand as needed, or add additional services from a marketplace of technologies to suit your growing needs.
For a lot of businesses, a simple SQL database coupled with a good visualisation tool is more than enough and a big step forward in capability. Understand what you want to do and take the simplest route to get there.
Data Platform | Who are your users?
It’s not just the use cases that determine some of these key technology choices, but the users as well. If the only skills you have in house are people who are good at writing SQL, there’s no point providing them with a platform that is intended for use with a coding language like Python or R. Nothing will get done, and unless you have an aggressive hiring strategy, you won’t get much out of the platform.
Likewise, if you have a staff of data scientists who are capable of using a vast array of technologies, don’t hamper what they can do by just giving them a SQL database and nothing else.
These days it’s not difficult to find platforms that give you a wide array of choice when it comes to data manipulation technologies. Whether it’s a big data platform like Cloudera or HD Insights or a set of cloud services such as AWS or Google, you can typically find a good balance.
Whether you go cloud or on-prem comes down to how your current IT strategy is set up. If you still run a traditional data centre, though, you might be surprised by how much you can get for your money by taking big data in-house.
Data Platform | Do you have big data?
One of the main questions a lot of people will be asking themselves is if they need to go for a big data platform, or if a ‘simple’ scalable database technology will do. There’s an awful lot of things to take into account when making this kind of decision and it could merit a book in its own right, but here are three key things to think about:
- Do you genuinely have ‘big’ data? Do you have terabyte upon terabyte of data, and a seemingly endless source of it? Is it fast moving data that requires scalable streaming solutions? Will you need a range of different storage solutions to suit all the different data sources you have? If your answer to all of these was ‘no’, then you should probably avoid choosing a big data platform.
- Do you have the desire to invest in the types of skills required to maintain this type of platform knowing they are more difficult to source and cost more than traditional technologies?
- Do you have the use cases to merit the investment, whether they involve real-time analytics, streaming, unstructured data sets, machine learning or massive batch processing needs?
I’m the first person who would advocate the power of big data, but too often it’s chosen as the solution because it’s on-trend and not because it’s the right choice for the problem. Compared to standard relational database technologies, Hadoop-based big data solutions can often lack certain features, such as referential integrity or even the ability to do updates, that you might take for granted.
On the flip side, most big data solutions out there pack a lot more tools and technology in allowing you to solve a whole host of different scenarios, giving you much more flexibility in the long run. It all comes back to understanding where you want to take the platform in the future.
While users and use cases should be at the forefront of your mind when looking at which technology to choose, there are a few other things that should help make your decision. On to the quickfire round!
Does it meet your security needs?
Does the platform offer the necessary levels of security for your business? You’d be hard-pressed to find a solution that didn’t have some level of security, but if you need really granular row-level controls or want to store PII data, then you should make sure the accompanying security setup suits your needs.
Can you automate it?
Scheduling your jobs (or allowing users to schedule theirs) quickly becomes a vital part of your infrastructure. Does the platform come with a scheduling tool, and if it does, does it hold up for more complicated use cases? If not, make sure the platform integrates with your enterprise scheduler of choice.
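Even a thin scheduler of your own quickly shows what you'd need from an enterprise tool: ordering, retries, logging. A minimal sketch using only Python's standard-library `sched` module, with invented job names, of the kind of behaviour a proper scheduler (cron, Airflow, and the like) gives you for free:

```python
import sched
import time

def run_with_retry(job, retries=2):
    """Run a job, retrying on failure -- retry handling is one of the
    first things a home-grown scheduler ends up needing."""
    for attempt in range(retries + 1):
        try:
            return job()
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")
    raise RuntimeError("job failed after all retries")

scheduler = sched.scheduler(time.monotonic, time.sleep)
results = []

# Hypothetical pipeline steps: same start time, ordered by priority.
scheduler.enter(0, 1, run_with_retry, argument=(lambda: results.append("load"),))
scheduler.enter(0, 2, run_with_retry, argument=(lambda: results.append("transform"),))
scheduler.run()
print(results)
```

Once you find yourself adding dependencies between jobs, backfills, or alerting on top of something like this, that's the signal to reach for a dedicated tool rather than keep extending the wrapper.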
How will you do continuous integration?
Continuous integration has been best practice for a while now. These tools tend to sit separately from the main platform, but sometimes they come bundled. Either way, make sure deployment tools are available, or that the external tool of your choice is compatible with the system.
Which coding language will you build with?
Beyond the tools that end-users will use, you need to consider the language that the team looking after the platform will be building in. If your internal strategy is to go with Python, make sure it is suitably compatible with the platform of your choosing.
Your choice of language goes beyond the platform, though. I’m a huge advocate of building data platforms with Python. It’s quick, it’s a data-friendly language, it’s easy to skill everyone in your team up on the basics, and there’s an abundance of skills out there.
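As a sketch of why Python suits this kind of platform plumbing (the table name, column names, and sample data here are all invented for illustration), a minimal extract-transform-load step using only the standard library:

```python
import csv
import io
import sqlite3

def etl(csv_text: str, con: sqlite3.Connection) -> int:
    """Extract rows from CSV text, normalise them, load into SQLite.
    Returns the number of rows loaded."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))           # extract
    cleaned = [(r["name"].strip().lower(), float(r["amount"]))   # transform
               for r in rows]
    con.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    con.executemany("INSERT INTO payments VALUES (?, ?)", cleaned)  # load
    return len(cleaned)

con = sqlite3.connect(":memory:")
sample = "name,amount\n Alice ,10.5\nBOB,3\n"
loaded = etl(sample, con)
print(loaded)
```

A dozen readable lines covering extract, transform, and load is the kind of leverage that makes it easy to skill a whole team up on the basics.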
Is it suitable for operational support?
If you intend to make your analytics platform operational in nature, perhaps powering personalisation on a website, you need to make sure that it can be made suitably fault tolerant and performant for the job. Most distributed platforms have a level of fault tolerance built in by default, but assessing the hardiness of the platform is vital if you’re running it in a business-critical environment.
Should you go open source or proprietary?
This one is very much down to the general attitude to software within your IT department. I’m a big advocate for open source and there’s a lot to be gained from it. If you haven’t, have a read of ‘The Cathedral and the Bazaar’ to understand some of the reasons to go open source. One of the main reasons is to avoid vendor lock-in: if you think you’ll be changing provider soon, it might be sensible to go this route.
A hybrid approach is increasingly common these days. A technology like Cloudera leverages an open source core with proprietary enterprise features, often giving you the best of both worlds.
What kind of support do you need?
This can be an especially pertinent question if you went open source. Do you need a 24/7 support line with a 15-minute SLA, or are you happy to have things break for a time while you fix them yourself?
Once again, this likely comes back to the use cases you need to run and whether the platform is being put into a business-critical role. All the major vendors offer support, so you just need to pick the level that is right for you.
Can you find a good reference customer?
A good place to start is always to ask to speak to an existing customer who’s doing similar work to you. This gives you a chance to get a non-sales view on what the platform is like to work with.
If it’s a new technology you’re working with then you should ensure that you get some input into the future roadmap of the platform.
Keeping all these things in mind will help you make a good technology choice for your data platform. Of course, the place to start is always a market review, followed by a POC with the top choices to make sure they live up to the sales pitch.
If you found this article valuable, you might be interested in our next Data Platform masterclass. This London-based session is led by James Lupton, coaching leaders in business, data and tech on how to build a data platform.