Traditionally, data analysis was often done on a server hosting a SQL instance of some sort. These locally hosted servers came with a range of issues:
- Physically housing the hardware somewhere
- Managing uptime and availability
- Managing server performance and storage
- Managing security
- Managing software
Cloud computing providers solved some of these issues by hosting the servers offsite. However, cloud-hosted SQL servers proved to be slow, while still requiring significant amounts of database administration time. Additionally, as companies gather and analyze larger amounts of data, traditional SQL servers (local and cloud-hosted alike) are proving difficult to scale.
To solve this, the major cloud computing providers have developed serverless, distributed cloud databases. These technologies are built to handle large amounts of data and run large analyses without the user having to manage the underlying infrastructure.
Advantages and Disadvantages
+ Do not have to worry about servers or infrastructure
+ Scalability is not an issue: queries automatically use as many resources as necessary to run quickly
+ Uses SQL syntax familiar to many data analysts and developers
+ Do not have to worry about keys or indexes, as in a traditional database
+ Easy to make visualizations from the query results
– Data must all be loaded into the cloud and cannot interoperate with data stored elsewhere
– Latency is high, so these services are better suited to large queries; they are not good for many tiny queries
Athena vs Data Lake Analytics vs BigQuery
The three main players in this space are Amazon Athena, Microsoft Azure Data Lake Analytics, and Google BigQuery. All three charge based on how much data your query scans and how much storage your data requires. They all offer partitioning of the data, so queries scan less and thus cost less. They all offer APIs to pull query results into your local environment, such as a Jupyter notebook, for further analysis. Additionally, the feature sets of the three products are very similar, and more features are being added seemingly on a daily basis.
The decision of which to use is most likely going to be predicated on which overall cloud platform provides you with the most features you need.
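To make the scan-based pricing model concrete, here is a minimal back-of-the-envelope sketch in Python. The per-TB rate and the byte counts are illustrative assumptions, not a quote of any provider's current price list, and `estimated_query_cost` is a hypothetical helper written for this post:

```python
# Back-of-the-envelope model of scan-based pricing.
# ASSUMPTION: the rate below is an illustrative figure,
# not any provider's actual published price.
PRICE_PER_TB_USD = 5.00

def estimated_query_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimate on-demand query cost from the number of bytes a query scans."""
    return (bytes_scanned / 1e12) * price_per_tb

# Partition pruning in action: scanning a single day's partition
# instead of the whole table cuts the bytes scanned, and so the cost.
full_scan = estimated_query_cost(2_000_000_000_000)    # 2 TB scan -> $10.00
partition_scan = estimated_query_cost(10_000_000_000)  # 10 GB scan -> $0.05
print(f"full table: ${full_scan:.2f}, one partition: ${partition_scan:.2f}")
```

This is why partitioning matters in these services: the query result is the same, but the bill tracks the bytes scanned, not the bytes returned.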
Simple Example with BigQuery
Let’s look at a very simple example of using Google BigQuery. BigQuery has many datasets preloaded for users to explore. One of them is a dataset containing permit data to plant trees in San Francisco. Using the BigQuery UI, we can preview the schema and the data (not shown) in the table:
Then, we can write a query in the UI. BigQuery SQL is essentially the same as other variants of SQL. It has advanced features, such as arrays, that don't exist in some other variants, but if you have used any flavor of SQL before, BigQuery SQL will not be a problem. Below is the SQL code for our San Francisco trees query:
```sql
WITH filtered AS (
  SELECT DATE(plant_date) AS plant_date
  FROM `bigquery-public-data.san_francisco_trees.street_trees`
  WHERE plant_date IS NOT NULL),
planted AS (
  SELECT plant_date, COUNT(*) AS num_planted
  FROM filtered
  GROUP BY plant_date),
calendar AS (
  SELECT date
  FROM (
    SELECT MIN(plant_date) min_dt, MAX(plant_date) max_dt
    FROM filtered
  ), UNNEST(GENERATE_DATE_ARRAY(min_dt, max_dt)) date),
aggregated AS (
  SELECT c.date, IFNULL(p.num_planted, 0) AS num_planted
  FROM calendar c
  LEFT OUTER JOIN planted p ON c.date = p.plant_date
  ORDER BY c.date)
SELECT
  date,
  AVG(num_planted) OVER (
    ORDER BY date
    ROWS BETWEEN 365 PRECEDING AND CURRENT ROW) AS avg_planted
FROM aggregated
ORDER BY date
```
Each row of the `street_trees` table is a permit to plant one tree. The query above counts the number of permits filed each day, fills in missing days with 0, and then computes a 365-day moving average, so the result is the average number of permits filed per day over the last 365 days. Running the query yields the following result in the UI:
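To make the query's logic concrete, the same computation can be sketched in plain Python: count permits per day, fill missing days with zero, then take a trailing moving average. The `avg_planted_per_day` helper is hypothetical (written for this post, not part of any BigQuery API); its window of the current day plus 365 preceding days mirrors the `ROWS BETWEEN 365 PRECEDING AND CURRENT ROW` frame:

```python
from collections import Counter
from datetime import date, timedelta

def avg_planted_per_day(plant_dates, window=365):
    """Count permits per day, fill missing days with 0, then average over
    the current day plus the `window` preceding days."""
    counts = Counter(plant_dates)
    start, end = min(plant_dates), max(plant_dates)
    # Dense calendar of every day, like GENERATE_DATE_ARRAY(min_dt, max_dt).
    calendar = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    daily = [counts.get(d, 0) for d in calendar]  # IFNULL(num_planted, 0)
    result = []
    for i, d in enumerate(calendar):
        vals = daily[max(0, i - window):i + 1]  # trailing window, inclusive
        result.append((d, sum(vals) / len(vals)))
    return result

# Tiny example: Jan 1 has 2 permits, Jan 2 has 0, Jan 3 has 1.
rows = avg_planted_per_day([date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 3)])
```

The dense calendar step is the important one: without it, days with zero permits would simply be absent and the moving average would be biased upward.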
If we want to do some simple visualizations, we can explore the data in Data Studio (by clicking the “EXPLORE IN DATA STUDIO” button). Below is a simple plot of the data over time.
While the above is an extremely simple example, it shows how serverless, distributed databases let users easily analyze large datasets without worrying about the underlying infrastructure. In this example we used a preloaded dataset. In future posts, we will talk about:
- the infrastructure and software stack used to ETL data into these distributed databases
- loading the results of queries into a Jupyter notebook for advanced analysis