Every organisation has data that it wants to use, and one of the ways of digging through it is with data science. But every treasure hunt needs tools, and data science is no different. You don’t just need a top-notch team of data scientists. You also need an effective data science platform. It’s big business, estimated to become a $385.2 billion market by 2025.
For the purposes of this blog, I’m going to talk about selecting tools for the analytics end of data science – tools like Alteryx, Dataiku, DataRobot, and Cloudera Data Science Workbench. But don’t forget that a lot of other stuff lays the foundations for this – remember the rest of your tech stack, especially the data governance and management tech. I’m just going to talk about the bit in the middle: the tools a data scientist uses to connect already-prepared datasets into a data science pipeline that outputs results.
The lab versus the factory
When talking about a data science platform, it’s important to make a distinction between the lab and factory. In the lab, your data scientists can experiment with whatever their preferred tool is. They can upload their own files to combine with existing data in the platform, to innovate and come up with new use cases, new models, new metrics.
The factory is more structured. It has a standardised way of doing things in a supported language. It’s where lab projects launch and are rolled out across your organisation. So, if a data scientist uses R in the lab, but Python is used in the factory, then you need to consider how a project is transferred from the lab to the factory in that standard language.
Considerations for your data science platform
Here are six key factors that we think you should consider when choosing a data science platform:
1. Your use cases
This is the starting point of any platform choice. Your platform needs to function for current use cases as well as future ones. If you intend to use real-time data in the future, that will require a different platform and different integrations from one used for batch processing.
2. The skills of your data science team
Your in-house capabilities will influence your choice of platform. You’ll want to pick a tool that your team can use from the get-go and that won’t require a lengthy training period. However, you should also consider whether your team will need to upskill for future data projects and your data science platform should be able to scale up with these future needs too. Do you want people to collaborate on the same code? Then the tool needs to enable this.
3. Your maturity
The maturity of your data function also influences what type of platform you invest in. If you’re just starting out, you might need a more out-of-the-box solution compared to if your data function is well established. If you’re still communicating data value across your organisation and gaining stakeholder buy-in, then you’ll want a platform that can deliver results quickly as opposed to a platform that takes time to implement and develop.
4. Open source or proprietary?
This choice ties into the last couple of points. Your team’s skills and maturity will affect whether you choose an open source or proprietary tool. An open source tool is technically free, but it requires a lot of expert input to get it working effectively for your organisation. However, open source can still be cost-effective and platforms like Jupyter and RStudio are industry standards at this point. Meanwhile, a proprietary tool will work pretty much from the start, but it may not be fully tailored to your business needs and you’ll be reliant on a vendor for updates.
5. Do you want to reuse code?
You don’t want to keep writing code from scratch. Where appropriate, you’ll want to automate certain elements of your data science and reuse code or data models. It can help you achieve better returns and productivity in the long term.
You’ll eventually want to move towards a platform that allows your team to readily deploy elements of their work to production, and to collaborate on it as well.
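In practice, reuse often starts small: a transformation that several notebooks need gets factored into a shared, tested function instead of being copy-pasted around. As a minimal, purely illustrative sketch (the `standardise` function is a hypothetical example, not any particular platform’s API):

```python
from statistics import mean, stdev

def standardise(values):
    """Scale a list of numbers to zero mean and unit variance.

    Keeping this in a shared module means every project applies the
    same logic, instead of each notebook re-implementing it slightly
    differently.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        raise ValueError("cannot standardise a constant column")
    return [(v - mu) / sigma for v in values]

# Reused across projects instead of rewritten each time:
scaled = standardise([10, 20, 30, 40])
```

A platform that supports shared libraries and versioning makes this pattern the default rather than an act of discipline.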
6. Communicating data to non-data people
Data science insights are useless if nobody acts on them. A data model isn’t going to mean anything to a layperson, but a visual probably will. Therefore, your data science platform needs to integrate with tools that can spread the data insights across your company.
Prepare for the future
Although you’re making your choice now, you also have to prepare your data science platform for the future. You’ll require different functions and capabilities at different stages of your organisation’s maturity.
Usability is important for any data science platform. People are unlikely to use a new platform if it makes their work harder, so it’s worth testing any platforms with your data team before you invest in them. Likewise, make sure the platform has some way for your data scientists to monitor the health of their models (for example, drift detection), and deploy multiple copies of the same model to test.
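To make drift detection concrete, one crude approach is a population stability index (PSI): compare how live data distributes across buckets derived from the training data. The sketch below is an illustrative assumption, not a feature of any specific platform, and the ~0.2 alert threshold is a common rule of thumb rather than a universal standard:

```python
import math

def psi(expected, actual, bins=10):
    """Population stability index between two samples.

    Buckets both samples using quantile edges taken from the expected
    (training-time) sample, then sums (a - e) * ln(a / e) over buckets.
    Near-zero means the live data looks like the training data.
    """
    expected = sorted(expected)
    n = len(expected)
    # Quantile-based bin edges from the reference sample
    edges = [expected[int(n * i / bins)] for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for v in sample:
            i = sum(v > e for e in edges)  # which bucket v falls in
            counts[i] += 1
        # Small floor avoids log(0) for empty buckets
        return [max(c / len(sample), 1e-6) for c in counts]

    e_frac, a_frac = fractions(expected), fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))

train = list(range(1000))
drifted = [v + 500 for v in train]
# psi(train, train) is ~0; psi(train, drifted) is large,
# well above the ~0.2 "investigate" rule of thumb.
```

A good platform bakes this kind of monitoring in, so your team isn’t hand-rolling it for every model in production.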
Many parts to consider
There are moving parts to every data science project, including the code, models and outputs. You have a lot of different bits to consider, plus the abilities of your team, your resources and future plans. A good platform works with every piece of this. It can take time to find one that matches your needs, and sometimes you’ll need to mix and match a few different solutions across your tech stack. Don’t lose sight of your use cases and overall goals when you do this. Everything must link back to how the platform will generate value for your organisation.
James Lupton – CTO