
20.09.18

6 crucial considerations when selecting a data science platform.

Intro

Every organisation has data that it wants to use, and one of the ways of digging through it is with data science. However, every treasure hunt needs tools, and data science is no different. You don’t just need a top-notch team of data scientists. You also need an effective data science platform.

I’m going to talk about selecting tools used for the analytics end of data science: tools like Alteryx, Dataiku, DataRobot, and Cloudera Data Science Workbench. Don’t forget that a lot of other technology lays the foundations for this. Remember the rest of your tech stack, especially the data governance and management tech.

I’m just going to talk about the bit in the middle: the tools a data scientist uses to connect already prepared datasets and build a data science pipeline that outputs results. If you want a more complete guide to data science architecture, you might be interested in our next Data Platform Webinar.

Data Science Platform | The lab versus the factory

When talking about a data science platform, it’s important to make a distinction between the lab and the factory. In the lab, your data scientists can experiment with whichever tool they prefer. They can upload their own files to combine with existing data in the platform, to innovate and come up with new use cases, new models and new metrics.

The factory is more structured. It has a standardised way of doing things in a supported language. It’s where lab projects launch and are rolled out across your organisation. So, if a data scientist uses R in the lab but Python is the standard in the factory, you need to consider how a project will be moved from the lab to the factory in that standard language.

Data Science Platform | Considerations for your technology

Here are 6 key factors that we think you should consider when choosing a data science platform:

1. Your use cases

This is the starting point of any platform choice. Your platform needs to work for current use cases as well as future ones. If you intend to use real-time data in the future, that will require a different platform and different integrations from one used only for batch processing.

2. The skills of your data science team

Your in-house capabilities will influence your choice of platform. You’ll want to pick a tool that your team can use from the get-go and that won’t require a lengthy training period. However, you should also consider whether your team will need to upskill for future data projects. Your data science platform should be able to scale up with those future needs too. Do you want people to collaborate on the same code? Then the tool needs to enable this.

3. Your maturity

The maturity of your data function also influences what type of platform you invest in. If you’re just starting out, you might need a more out-of-the-box solution than you would if your data function were well established. If you’re still communicating the value of data across your organisation and gaining stakeholder buy-in, then you’ll want a platform that can deliver results quickly rather than one that takes time to implement and develop.

4. Open source or proprietary?

This choice ties into the last couple of points. Your team’s skills and maturity will affect whether you choose an open-source or proprietary tool. An open-source tool is technically free, but it requires a lot of expert input to get it working effectively for your organisation. However, open source can still be cost-effective, and tools like Jupyter and RStudio are industry standards at this point. Meanwhile, a proprietary tool will work pretty much from the start, but it may not be fully tailored to your business needs and you’ll be reliant on a vendor for updates.

5. Do you want to reuse code?

You don’t want to keep creating code from scratch. Where appropriate, you’ll want to automate certain elements of your data science and reuse code or data models. Doing so can help you achieve better returns and productivity in the long term.

You’ll eventually want to move towards a platform that allows your team to readily deploy elements of their work to production, and to collaborate on it as well.

6. Communicating data to non-data people

Data science insights are useless if nobody acts on them. A data model isn’t going to mean anything to a layperson, but a visual probably will. Therefore, your data science platform needs to integrate with tools that can spread the data insights across your company.

Data Science Platform | Prepare for the future

Although you’re making your choice now, you also have to prepare your data science platform for the future. You’ll require different functions and capabilities at different stages of your organisation’s maturity.

Usability is important for any data science platform. People are unlikely to use a new platform if it makes their work harder, so it’s worth testing any platform with your data team before you invest in it. Likewise, make sure the platform gives your data scientists some way to monitor the health of their models (for example, drift detection) and to deploy multiple copies of the same model for testing.
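
To make drift detection a bit more concrete, here is a minimal sketch of one common way to measure it: the Population Stability Index (PSI), which compares how a single numeric feature was distributed at training time with how it looks in recent data. This is an illustration under assumptions, not how any particular platform does it. The function name, the bin count and the small floor used for empty bins are all choices made for this example.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a recent sample of one numeric feature.

    Hypothetical helper for illustration; real platforms may bin differently.
    """
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the baseline range; fall back to 1.0 if all values match.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so values outside the baseline range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor (an assumption) avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Baseline data seen at training time versus two recent samples.
baseline = [float(x) for x in range(100)]
unchanged = list(baseline)
shifted = [x + 50.0 for x in baseline]

psi_unchanged = population_stability_index(baseline, unchanged)
psi_shifted = population_stability_index(baseline, shifted)
```

A frequently quoted rule of thumb treats a PSI below about 0.1 as stable and above about 0.25 as a significant shift, but the right thresholds depend on your data and use case. Whatever platform you choose, it should let your team run this kind of check on a schedule and alert them when a model’s inputs start to move.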

Data Science Platform | Many parts to consider

There are moving parts to every data science project, including the code, models and outputs. You have a lot of different bits to consider, plus the abilities of your team, your resources and your future plans. A good platform works with every piece of this. It can take time to find the one that matches your needs, and sometimes you will need to mix and match a few different solutions across your tech stack. Don’t lose sight of your use cases and overall goals when you do this. Everything must link back to how the platform will generate value for your organisation.

If this is a subject you’re interested in, join our webinar on