A cloud data warehouse is a database delivered in a public cloud as a managed service that is optimized for analytics, scale and ease of use.
We’re now witnessing a third wave of innovation in data warehousing technology with the advent of cloud data warehouses. As enterprises move to the cloud, they are abandoning their legacy on-premise data warehousing technologies, including Hadoop, for these new cloud data platforms. This transformation is a huge tectonic shift in data management and has profound implications for enterprises.
THE BENEFITS OF A CLOUD DATA WAREHOUSE
Cloud-based data warehouses free up companies to focus on running their business, rather than running a room full of servers, and they allow business intelligence teams to deliver faster and better insights due to improved access, scalability, and performance.
- Data Access: Putting their data in the cloud enables companies to give their analysts access to real-time data from numerous sources, allowing them to run better analytics quickly.
- Scalability: It is much faster and less expensive to scale a cloud data warehouse than an on-premise system because it doesn’t require purchasing new hardware (and possibly over- or under-provisioning) and the scaling can happen automatically as needed
- Performance: A cloud data warehouse allows for queries to be run much more quickly than they are against a traditional on-premises data warehouse, for lower cost.
CLOUD DATA WAREHOUSE CAPABILITIES
Each of the major public cloud vendors offer their own flavor of a cloud data warehouse service: Google offers BigQuery, Amazon has Redshift and Microsoft has Azure SQL Data Warehouse. There are also cloud offerings from the likes of Snowflake that provide the same capabilities via a service that runs on the public cloud but is managed independently. For each of these services, the cloud vendor or data warehouse provider delivers the following capabilities “out of the box”:
- Data storage and management: data is stored in a cloud-based file system (i.e. S3).
- Automatic upgrades: there’s no concept of a “version” or software upgrade.
- Capacity management: it’s easy to expand (or contract) your data footprint.
FACTORS TO CONSIDER WHEN CHOOSING A CLOUD DATA WAREHOUSE
How these cloud data warehouse vendors deliver these capabilities and how they charge for them is where things get more nuanced. Let’s dive deeper into the different deployment implementations and pricing models.
Cloud Architecture: Cluster versus Serverless
There are two main camps of cloud data warehouse architectures. The first, older deployment architecture is cluster-based: Amazon Redshift and Azure SQL Data Warehouse fall into this category. Typically, clustered cloud data warehouses are really just clustered Postgres derivatives, ported to run as a service in the cloud. The other flavor, serverless, is more modern and counts Google BigQuery and Snowflake as examples. Essentially, serverless cloud data warehouses make the database cluster “invisible” or shared across many clients. Each architecture has their pros and cons (see below).
Cloud Data Pricing: Pay by the Drink or by the Server
Besides deployment architecture, another major difference between the cloud data warehouse options is pricing. In all cases, you pay some nominal fee for the amount of data stored. But the pricing differs for compute.
For example, Google BigQuery and Snowflake offer on-demand pricing options based on the amount of data scanned or compute time used. Amazon Redshift and Azure SQL Data Warehouse offer resource pricing based on the number or types of nodes in the cluster. There are pros and cons to both types of pricing models. The on-demand models only charge you for what you use which can make budgeting difficult as it is hard to predict the number of users and the number and size of the queries they will be running. I know one customer example where a user mistakenly ran a $1,000+ query.
For the node based models (i.e. Amazon Redshift and Azure SQL Data Warehouse), you pay by the server and/or server type. This pricing model is obviously more predictable but it’s “always on” so you are paying a flat price regardless of usage.
Pricing is a major consideration and requires a great deal of use case and workload modeling to find the right fit for your organization.
Challenges and Considerations for Cloud Migration (the “Gotchas”)
At Vengreen, we’ve seen lots of enterprises attempt a migration from their on-premise data lakes and/or relational data warehouses to the cloud. For many, their migrations “stall” after the first pilot project due to the following reasons:
- Disruption: downstream users (business analysts, data scientists) have to change their habits and re-tool their reports and dashboards.
- Performance: the cloud DW doesn’t match performance of highly tuned, legacy on-premise data platforms.
- Sticker shock – unanticipated or unplanned operating costs and lack of cost controls.