Background

Traditionally, data analysis was often done on a server hosting some sort of SQL instance.  These locally hosted servers came with a number of issues:

  • Physically housing the hardware somewhere
  • Managing uptime and availability
  • Managing server performance and storage
  • Managing security
  • Managing software

Cloud computing providers solved some of these issues by hosting the servers offsite.  However, cloud-hosted SQL servers often proved slow while still demanding significant database administration time.  Additionally, as companies gather and analyze ever larger amounts of data, traditional SQL servers (local and cloud hosted) are proving difficult to scale.

To solve this, the major cloud computing providers have developed serverless, distributed databases.  These services are built to store large amounts of data and run large analytical queries without the user having to manage the underlying infrastructure.

Advantages and Disadvantages

+ Do not have to worry about servers or infrastructure

+ Scalability is not an issue.  Queries automatically use as many resources as necessary to run quickly

+ Uses SQL syntax familiar to many data analysts and developers

+ No need to manage keys or indexes as in a traditional database

+ Easy to make visualizations from the query results

– Data must all be loaded into the cloud and cannot easily interoperate with local data sources

– Latency is high, so these services are best suited to large queries.  They are not a good fit for many tiny queries.

Athena vs Data Lake Analytics vs BigQuery

The three main players in this space are Amazon Athena, Microsoft Azure Data Lake Analytics, and Google BigQuery.  All three charge based on how much data your queries scan and how much storage your data requires.  They all offer partitioning of the data so queries scan less data and thus cost less.  They all offer APIs to pull query results into your local environment, such as a Jupyter notebook, for further analysis.  The feature sets of the three products are very similar, and new features are being added seemingly daily.
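As a sketch of what pulling results via an API looks like, here is a minimal example using Google's google-cloud-bigquery Python client.  It assumes the library is installed and that credentials are already configured in the environment (e.g. via GOOGLE_APPLICATION_CREDENTIALS); Athena and Data Lake Analytics offer analogous client libraries.

```python
# Hedged sketch: pull BigQuery query results into a local pandas DataFrame.
# Assumes google-cloud-bigquery is installed and credentials/project are
# configured in the environment.

SQL = """
SELECT DATE(plant_date) AS plant_date
FROM `bigquery-public-data.san_francisco_trees.street_trees`
WHERE plant_date IS NOT NULL
LIMIT 10
"""

def fetch_results(sql: str):
    """Run a query and return the results as a pandas DataFrame."""
    from google.cloud import bigquery  # deferred import: optional dependency
    client = bigquery.Client()         # picks up project/credentials from env
    return client.query(sql).to_dataframe()
```

In a Jupyter notebook, `fetch_results(SQL)` returns a DataFrame ready for further local analysis.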

The decision of which to use is most likely going to be predicated on which overall cloud platform provides you with the most features you need.

Simple Example with BigQuery

Let’s look at a very simple example using Google BigQuery.  BigQuery comes preloaded with many public datasets for users to explore.  One of them contains permits to plant trees in San Francisco.  Using the BigQuery UI, we can preview the table’s schema and its data (not shown):

Schema of the table from the BigQuery UI

Then, we can write a query in the UI.  BigQuery SQL is essentially the same as other SQL variants.  It does have advanced features, such as arrays, that many other variants lack, but if you have used any SQL dialect before, BigQuery SQL will not be a problem.  Below is the SQL code for our San Francisco trees query:

WITH filtered AS (
  SELECT DATE(plant_date) AS plant_date
  FROM `bigquery-public-data.san_francisco_trees.street_trees`
  WHERE plant_date IS NOT NULL
),
planted AS (
  SELECT plant_date, COUNT(*) AS num_planted
  FROM filtered
  GROUP BY plant_date
),
calendar AS (
  SELECT date
  FROM (
    SELECT MIN(plant_date) AS min_dt, MAX(plant_date) AS max_dt
    FROM filtered
  ), UNNEST(GENERATE_DATE_ARRAY(min_dt, max_dt)) AS date
),
aggregated AS (
  SELECT c.date, IFNULL(p.num_planted, 0) AS num_planted
  FROM calendar c
  LEFT OUTER JOIN planted p ON c.date = p.plant_date
)
SELECT
  date,
  AVG(num_planted) OVER (ORDER BY date ROWS BETWEEN 365 PRECEDING AND CURRENT ROW) AS avg_planted
FROM aggregated
ORDER BY date

Each row of the `street_trees` table is a permit to plant one tree.  The query above counts the number of permits filed each day, fills in missing days with 0, and then computes a trailing 365-day moving average, so the result is the average number of permits filed per day over the preceding year.  Running the query yields the following result in the UI:
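To make the query's logic concrete, the same three steps (count per day, zero-fill the calendar, trailing window average) can be sketched in pure Python.  Note the SQL window `ROWS BETWEEN 365 PRECEDING AND CURRENT ROW` covers up to 366 rows including the current one, so the sketch uses a window of 366.

```python
from collections import Counter
from datetime import date, timedelta

def moving_avg_planted(plant_dates, window=366):
    """Count permits per day, zero-fill missing days, then take a trailing
    moving average over up to `window` rows (365 preceding + current row,
    mirroring the SQL query)."""
    counts = Counter(plant_dates)
    start, end = min(plant_dates), max(plant_dates)
    days = [start + timedelta(d) for d in range((end - start).days + 1)]
    planted = [counts.get(d, 0) for d in days]
    out = []
    for i, d in enumerate(days):
        lo = max(0, i - (window - 1))
        vals = planted[lo:i + 1]
        out.append((d, sum(vals) / len(vals)))
    return out

# Tiny synthetic example: three permits across four calendar days
sample = [date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 4)]
result = moving_avg_planted(sample)
# result[0] = (date(2020, 1, 1), 2.0); Jan 2 and 3 are zero-filled
```

At BigQuery scale this computation happens across the cluster; the sketch is only meant to show what each row of the result represents.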

Results of the query displayed in the BigQuery UI

If we want to do some simple visualizations, we can explore the data in Data Studio (by clicking on the “EXPLORE IN DATA STUDIO” button).  Below is a simple plot of the data over time.
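The same plot can also be reproduced locally once the results have been pulled down via the API.  A minimal matplotlib sketch, using hypothetical stand-in data in place of real query results:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
from datetime import date, timedelta

# Hypothetical stand-in for the query results: (date, avg_planted) pairs
rows = [(date(2019, 1, 1) + timedelta(d), 10 + 0.01 * d) for d in range(100)]

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot([r[0] for r in rows], [r[1] for r in rows])
ax.set_xlabel("date")
ax.set_ylabel("avg permits per day (trailing 365 days)")
ax.set_title("San Francisco tree-planting permits, moving average")
fig.savefig("avg_planted.png")
```

Data Studio is quicker for a one-off look, while a local plot like this is easier to customize and embed in a notebook.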

Data Studio visualization of BigQuery results

Summary

While the above is an extremely simple example, it shows how serverless, distributed databases let users easily analyze large datasets without worrying about the underlying infrastructure.  In this example we used a preloaded dataset.  In future posts, we will talk about:

  • the infrastructure and software stack used to ETL data into these distributed databases
  • loading the results of queries into a Jupyter notebook for advanced analysis