– and when to consider if it’s the right technology for your data strategy
One of the things we pride ourselves on here at Cynozure is being independent of any particular technology or vendor. We firmly believe that actual requirements, issues faced and outcomes needed should dictate which solutions and systems an organisation works with. And of course technology is only one aspect of a successful strategy for getting the most from data.
That said, there are some companies and technologies that we think are worth a look when you are weighing up options for your overall data and analytics stack.
There are a LOT of software solutions, services and providers out there – almost too many to keep track of; it’s tough knowing where you’ll get what you need. Or indeed which ones will last the distance (no one wants to invest time and money in technology that falls by the wayside).
So, who gets our vote of confidence? Here are some to get started with, and when they would be good to consider (and, just as importantly, when they may not be right for you):
Cloudera – Enterprise Data Hub
What is it:
Cloudera was one of the first enterprise-ready big data solutions on the market, competing with the likes of Hortonworks and MapR. By taking the core offering of Hadoop and wrapping it up with some excellent enterprise management features such as Cloudera Manager, Cloudera has been a huge player in the big data space for a long time.
In their own words:
Cloudera Enterprise is a modern big data platform that can be deployed wherever your data resides, whether on-premises or in the cloud. Regardless of your use case, you can expect the same high-performance, enterprise-grade solutions and premium support services only available from Cloudera.
Cloudera is easy for us to recommend to customers, especially if they are looking to build an on-premises solution. Its differentiator is all about the bits it adds on rather than the core Hadoop (and wider distributed computing) components.
Cloudera Manager lets you easily manage and configure the services on the cluster, letting even the most novice of users do some basic admin tasks. Tools like Data Science Workbench help support the rollout of well-integrated data science functions to the business at large (though this one is definitely aimed at more competent users), and Cloudera Navigator supports data governance, discovery and lineage. It’ll even help you optimise those troublesome queries running on the cluster.
Where Cloudera is at its best, though, is the support function. The lack of support in pure open source versions of Hadoop is one of the big drivers for using a vendor like Cloudera. The support team is staffed by ex-field engineers and other highly experienced users, so you always speak to someone who actually understands the software and who can get you up and running quickly.
Cloudera EDH, when deployed on-premises, is in our opinion better suited to larger-scale deployments. Unless you have massive amounts of data coming at you fast from many different sources, it might end up being the proverbial sledgehammer to crack a nut. If you are looking to prototype or run a POC, we’ll probably direct you to some of the different cloud offerings out there. That being said, if you want to try Cloudera in the cloud, check out our next pick.
Cazena – Big Data as a Service
What is it:
Cazena is a big data as a service platform hosted on AWS or Azure, running Cloudera as its core data platform. Promising reduced TCO vs DIY clusters, 24×7 support with active monitoring and issue resolution, and security and governance provided out of the box, this is the best way to run Cloudera in the cloud.
In their own words:
Cazena radically simplifies data and analytics, with the first fully-managed, Big Data as a Service platform. With Cazena, teams move, store, share and analyse big data in a few clicks, without specialized DevOps skills. Solutions are agile, secure and cost-effective. Cazena’s fully-managed solutions combine best-of-breed technologies with automation, security and cloud infrastructure (Cloudera, Microsoft Azure, AWS, etc.) to deliver production-ready data platforms.
Two of the biggest blockers to getting up and running with a new big data project are 1) the cost and 2) finding the skills and time to spin up a cluster and properly configure it. Not everyone has an abundance of in-house people with the availability and skills necessary to do this and then continue to provide a full 24×7 support wrap.
Cazena is a great place to go if you want to run a POC and need something up and running quickly, but it’s also a great place to go for your long-term solution. For a single price, they’ll wrap up cloud infrastructure, database licences, Cazena’s excellent platform automation and gateway tools, and fully-managed 24×7 support.
And with that, you get all the benefits of Cloudera with all the time-consuming support and admin bits taken care of, leaving your data teams free to focus on delivering value for the business.
Microsoft – Azure
What is it:
Microsoft Azure is the banner under which all of Microsoft’s cloud services sit. What we’re interested in is their rich suite of data tools, including HDInsight (their big data offering based on Hortonworks) and the Azure SQL Data Warehouse offering. Their marketplace is also stocked full of other services (like Event Hubs for example, which provides a pub-sub streaming service) that you can use to add additional capabilities.
In their own words:
Azure is a comprehensive set of cloud services that developers and IT professionals use to build, deploy and manage applications through our global network of data centres. Integrated tools, DevOps and a marketplace support you in efficiently building anything from simple mobile apps to Internet-scale solutions.
Microsoft has been putting a lot of effort into improving their offerings in this space for some time, and we are seeing more and more organisations looking to Microsoft for their platforms at the moment. Part of this is driven by the amount of Microsoft infrastructure those organisations already have – Office, SharePoint, Skype, Active Directory, Dynamics – the list goes on. Hosting their data in Azure helps keep a lot of their technology estate in line.
Whether Microsoft is the right place for you will largely be down to your overall IT strategy, but the familiarity of tools like SQL Data Warehouse and Power BI will certainly draw in a lot of users looking for an upgrade to legacy data warehouse solutions. HDInsight is also worth considering if you are looking to build a big data solution in the cloud, though Cloudera still edges Microsoft as top choice for us.
Microsoft isn’t the only option out there when it comes to cloud, of course. AWS is just as strong and has a few offerings that you can’t find from Microsoft yet, such as query as a service. And Google takes a slightly different approach, but one that might be a better fit for some organisations.
Across all these cloud vendors, cost is something you will need to keep an eye on. If you go the big data route, your monthly run rate can climb quickly as you add servers and storage. Cloud isn’t always cheaper, and it’s worth taking advantage of the different vendors’ pricing calculators to assess the cost.
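To give a feel for how quickly that run rate can climb, here is a rough back-of-the-envelope sketch in Python. Every number in it is a made-up placeholder rather than a real price from any vendor, and the function name and parameters are our own invention for illustration. It only adds up compute and storage, so treat it as a starting point and use the vendors’ own calculators for anything you plan to budget against.

```python
# Rough monthly run-rate estimate for a cloud big data cluster.
# All rates below are hypothetical placeholders, not real vendor prices.

HOURS_PER_MONTH = 730  # common approximation: 8,760 hours in a year / 12


def monthly_run_rate(nodes, hourly_rate_per_node, storage_tb, monthly_rate_per_tb):
    """Return an estimated monthly cost: always-on compute plus storage."""
    compute = nodes * hourly_rate_per_node * HOURS_PER_MONTH
    storage = storage_tb * monthly_rate_per_tb
    return compute + storage


if __name__ == "__main__":
    # Hypothetical cluster sizes: (number of servers, terabytes stored).
    scenarios = [(3, 10), (10, 50), (30, 200)]
    for nodes, storage_tb in scenarios:
        cost = monthly_run_rate(
            nodes,
            hourly_rate_per_node=0.50,  # assumed price per server per hour
            storage_tb=storage_tb,
            monthly_rate_per_tb=20.0,   # assumed price per TB per month
        )
        print(f"{nodes} servers, {storage_tb} TB: about {cost:,.0f} per month")
```

Even with these invented rates, the estimate grows in step with the number of always-on servers, which is why it pays to check whether a cluster can be scaled down or switched off when it isn’t in use.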
Dataiku – Data Science Studio
What is it:
Dataiku is a data science toolset combining some of the features end users ask for most. Providing a GUI-based experience that supports collaborative development, a wide range of languages and modelling techniques, version control, visualisation and deployment management, it’s the Swiss army knife of the data science world.
In their own words:
Dataiku DSS is the collaborative data science software platform for teams of data scientists, data analysts, and engineers to explore, prototype, build, and deliver their own data products more efficiently.
DSS solves a lot of the main challenges we hear from data scientists we speak to. To start with, it’s all wrapped up in an attractive GUI environment and designed for collaboration. It supports the creation of complex workflows, gives end users control of scheduling and allows easy deployment between different environments.
DSS connects to lots of different source systems, including Hadoop, relational databases and file stores like S3, and it has a versatile data preparation toolset. With the core modelling capabilities built on top of popular open source libraries and access to notebooks for Python, R and SQL, most data scientists will feel right at home. You can even create Spark jobs (Spark SQL, PySpark and Spark Scala are all supported) if you so choose.
Where this product really shines for us though is the way it can uplift the capability of analytics teams who are just getting into the world of data science and machine learning. With a UI to support the creation of machine learning (and other) models, it’s an excellent way to get your teams to develop their data science capabilities.
One of the few watch-outs is to make sure you use DSS for what it was intended for. In our opinion it’s not the right choice for your primary ETL and data ingestion tool. Other than that, you can’t go too wrong!
If DSS isn’t for you, try checking out DataRobot or Alteryx instead. While they take a slightly different approach, both can make a strong case for being your tool of choice.
This isn’t an exhaustive list by any means; we could spend all day (all week!) talking about technology choices based on strategic requirements. Getting the best value when it comes to data is our favourite topic after all. But this is just to give you some insight into some of the technology out there in the data and analytics space.
We’ll be adding to this list, so if you would like to be notified of updates, please sign up to receive our newsletter and we’ll let you know when we do.