Laptop with BigQuery logo on the screen

BigQuery in action


Make sure that you've taken a look at the following tutorial(s) before you continue:

Using BigQuery and Data Studio requires you to enable billing. If you still have some of that free $300 credit left, you might not want to do that until you've used all your gift money.

You are going to learn

  • how to control the amount of data that your queries process
  • how to define named subqueries using the WITH clause
  • what nested and repeated fields are
  • how to declare and use User Defined Functions
  • how to create complex queries with multiple subqueries
  • how to save query results in a permanent table

A word of caution

I recommend using small datasets for learning BigQuery. The 1 TB free data-processing quota might seem like a lot, but executing even a simple query on a huge dataset can reduce it by a few gigabytes. The query from the quickstart tutorial executes in seconds, and then you will see a message like Query complete (2.0s elapsed, 3.49 GB processed)

BigQuery offers some cost control options, but they are not straightforward to set up. Limits need to be requested via a form, it takes 2-3 business days to get a reply, and the lowest limit you can set is 10 TB per day. As you can see, this is a poor option for controlling costs.

Each successful execution of a non-cached query will consume at least 10 MB of your quota. Even if the query processed only 100 kB of data from start to finish, your quota will decrease by 10 MB.
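
The arithmetic of that minimum charge can be sketched in a few lines of Python. This is just an illustration of the rounding rule described above (bytes rounded up to a whole number of megabytes, with a 10 MB minimum), not an official API:

```python
# Sketch of BigQuery's quota granularity: processed bytes are
# rounded up to a whole number of MB, with a 10 MB minimum per query.
MB = 1024 * 1024

def billed_bytes(processed_bytes: int) -> int:
    """Bytes deducted from the free quota for one non-cached query."""
    rounded_up = -(-processed_bytes // MB) * MB  # ceiling to whole MB
    return max(rounded_up, 10 * MB)

print(billed_bytes(100 * 1024) // MB)   # a 100 kB query still costs 10 (MB)
```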

1
Find out how much data is processed

Reliability: High · Updated: 23 January 2018

Go to Google Cloud Platform Console → BigQuery and click the Compose query button.

In order to control the amount of data your query will process, you should turn on the validator.

The larger the dataset and the number of selected columns, the more data your queries will process and, consequently, the more quota you will use. In the example above, removing the year column decreases the amount of data processed by 1 GB.
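
Because BigQuery stores data column by column, a query scans only the columns it references. A rough sketch of the estimate, with made-up column names and sizes (everything below is purely illustrative):

```python
# Hypothetical per-column sizes for a table, in GB. In a columnar
# store, bytes processed ~ total size of the referenced columns.
column_size_gb = {"title": 1.2, "year": 1.0, "views": 0.8}

def estimated_gb_processed(selected_columns):
    """Sum the sizes of only the columns the query touches."""
    return sum(column_size_gb[c] for c in selected_columns)

# Dropping the year column saves its full 1.0 GB from the scan.
saving = estimated_gb_processed(["title", "year"]) - estimated_gb_processed(["title"])
print(saving)
```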

After you run a query, you will have access to it from Query History. An executed query has a Details tab next to Results. Bytes Billed tells you how much of your quota was used by the execution.

Billed bytes in executed query details

BigQuery comes with easy access to public datasets that you can use for learning. To check the size of the table you are going to query, select the table and then click the Details tab.

You can also use Preview, next to the Details tab, to see what the table looks like. This does not affect your quota and allows you to look at the data inside the table. All columns are visible in the Preview.

2
Execute your first query


OpenAQ is one of the public datasets available in BigQuery. It has only one table - global_air_quality. Below you will find a sample from that table.

location city country pollutant value timestamp unit source_name latitude longitude averaged_over_in_hours
CH0031A Bern CH co 356.9 2018-01-14 21:00:00.000000 UTC µg/m³ EEA Switzerland 46.951 7.4408712 1.0
CH0031A Bern CH pm25 2.7125 2018-01-12 04:00:00.000000 UTC µg/m³ EEA Switzerland 46.951 7.4408712 1.0
CH0031A Bern CH pm10 30.5469 2018-01-14 21:00:00.000000 UTC µg/m³ EEA Switzerland 46.951 7.4408712 1.0
CH0031A Bern CH o3 18.2218 2018-01-14 21:00:00.000000 UTC µg/m³ EEA Switzerland 46.951 7.4408712 1.0
CH0031A Bern CH no2 27.9739 2018-01-14 21:00:00.000000 UTC µg/m³ EEA Switzerland 46.951 7.4408712 1.0
CH0002R Vaud CH pm25 18.681 2018-01-14 21:00:00.000000 UTC µg/m³ EEA Switzerland 46.8131 6.9444804 1.0
CH0002R Vaud CH o3 20.2328 2018-01-14 21:00:00.000000 UTC µg/m³ EEA Switzerland 46.8131 6.9444804 1.0

Each row of the table represents a reading for one of the pollutants in one of the locations at some point in time.

Let's find the top 10 places with the most dangerous smog. We will limit our search to results that came in within the last 3 hours.

Particulate matter is mostly responsible for smog-related health problems. PM10 is particulate matter 10 micrometers or less in diameter; PM2.5 is particulate matter 2.5 micrometers or less in diameter, often called fine particles. We will ignore the other pollutants in our research for now.

#standardSQL
SELECT location, city, country, SUM(value) as smog
FROM `bigquery-public-data.openaq.global_air_quality` 
WHERE (pollutant='pm25' OR pollutant='pm10') 
AND averaged_over_in_hours=1.0 
AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, HOUR)<=3  
GROUP BY location,city,country 
ORDER BY smog DESC
LIMIT 10

I guess Jakarta is not the best place for jogging.

location city country smog
US Diplomatic Post: Jakarta Central Jakarta ID 985.0
Bozhou San Guo Lan Sheng Gong Bozhou CN 469.0
US Diplomatic Post: Kolkata Kolkata IN 456.0
Bozhou Wu Shui Chu Li Han Bozhou CN 432.0
Tongling Shi Di Jiu Zhong Xue Tongling CN 385.0
Huaibei Lie Shan Qu Zheng Fu Huaibei CN 368.0
Suzhou Huo Che Zhan Suzhou CN 362.0
Huaibei Zhi Ye Ji Zhu Xue Yuan Huaibei CN 354.0
Suzhou Jian Ce Lou Suzhou CN 337.0
US Diplomatic Post: Ulaanbaatar Ulaanbaatar MN 336.0

Results are saved in a temporary table, which has a lifespan of about 24 hours. BigQuery also allows you to write results to a permanent table in one of your datasets.

BigQuery Show Options button

BigQuery select permanent table button

3
Create named subqueries using WITH clause


We are using standard SQL in our BigQuery tutorial. Standard SQL is very flexible and a pleasure to work with.

One of its features is naming your subqueries using the WITH clause. You can use named subqueries just as you use normal tables.

WITH Top10Smog AS (
  SELECT location, city, country, SUM(value) as smog
  FROM `bigquery-public-data.openaq.global_air_quality` 
  WHERE (pollutant='pm25' OR pollutant='pm10') 
  AND averaged_over_in_hours=1.0 
  AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, HOUR)<=3  
  GROUP BY location,city,country 
  ORDER BY smog DESC
  LIMIT 10
)
SELECT country, COUNT(country) as locations_count
FROM Top10Smog
GROUP BY country
ORDER BY locations_count DESC

China has a strong representation in the infamous ranking.

country locations_count
CN 7
ID 1
IN 1
BD 1
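
Outside of SQL, the second stage of this query is just a GROUP BY and COUNT over the subquery's rows. A sketch in Python, using a small made-up sample in place of the live Top10Smog result:

```python
from collections import Counter

# (location, city, country) tuples standing in for the Top10Smog rows
top10 = [
    ("Location A", "Jakarta", "ID"),
    ("Location B", "Bozhou", "CN"),
    ("Location C", "Suzhou", "CN"),
    ("Location D", "Kolkata", "IN"),
]

# GROUP BY country, COUNT(country), ORDER BY locations_count DESC
locations_count = Counter(country for _, _, country in top10)
for country, n in locations_count.most_common():
    print(country, n)
```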

4
Nested data in BigQuery


At first sight it might seem that BigQuery can only operate on flat data structures. This impression could not be more wrong. BigQuery really shines when your data is denormalized. Trying to impose relational concepts like the star schema or snowflake schema on it works against BigQuery's nature.

BigQuery supports nested and repeated fields and can understand your JSON or Avro files without a problem. Furthermore, it can express relationships, just as a normalized schema can, but without the performance impact.

Execute the query below:

#standardSQL
SELECT EXTRACT(DATE FROM timestamp) AS date, location, city, country,
  ARRAY_AGG(
    STRUCT(
      pollutant, value 
    )
  ) AS sensor_reading
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE averaged_over_in_hours=1.0 
AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, HOUR)<=3  
GROUP BY date, location, city, country

Row date location city country sensor_reading.pollutant sensor_reading.value
4 2018-01-15 DERP053 Rheinland-Pfalz DE pm25 11.8810005
o3 27.5
5 2018-01-15 İzmit-MTHM Kocaeli TR no2 47.0
pm10 66.0
co 2394.0

Each row of the result contains the date, location, city, and country columns and a sensor_reading array. Each member of the sensor_reading array is a STRUCT. You can think of a STRUCT as an ordered container for fields. Here, each STRUCT inside the sensor_reading array holds a reading for one of the pollutants.
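
What ARRAY_AGG(STRUCT(...)) with GROUP BY does can be sketched as turning flat rows into nested ones. The sample rows below mirror the result shown above (values trimmed for brevity):

```python
from collections import defaultdict

# Flat input rows: (date, location, city, country, pollutant, value)
flat_rows = [
    ("2018-01-15", "DERP053", "Rheinland-Pfalz", "DE", "pm25", 11.88),
    ("2018-01-15", "DERP053", "Rheinland-Pfalz", "DE", "o3", 27.5),
    ("2018-01-15", "İzmit-MTHM", "Kocaeli", "TR", "no2", 47.0),
]

# GROUP BY date, location, city, country;
# ARRAY_AGG(STRUCT(pollutant, value)) builds the repeated field.
grouped = defaultdict(list)
for date, location, city, country, pollutant, value in flat_rows:
    grouped[(date, location, city, country)].append(
        {"pollutant": pollutant, "value": value}   # one STRUCT per reading
    )

nested_rows = [
    {"date": d, "location": l, "city": ci, "country": co, "sensor_reading": r}
    for (d, l, ci, co), r in grouped.items()
]
print(len(nested_rows))   # two nested rows, one per group
```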

5
Advanced BigQuery features


To show you the true power of BigQuery, I've bundled a few interesting features into one large query.

#standardSQL
-- Create a temporary function to calculate the
-- accuracy of the Common Air Quality Index (CAQI) calculation.
-- There are six pollutants in total. If all pollutant readings
-- are available, the accuracy is 100%.
-- The fewer pollutant readings there are, the worse the CAQI calculation accuracy.
-- This function takes in an array of sensor readings and returns a single FLOAT64
CREATE TEMPORARY FUNCTION getCAQIAccuracy (
  sensor_reading ARRAY<STRUCT<pollutant STRING, value FLOAT64>>
)
RETURNS FLOAT64                   
AS ((
  SELECT (ARRAY_LENGTH(sensor_reading)/6)*100
));

WITH 
-- This is the first named subquery inside the WITH clause.
-- It returns sensor readings from the last 3 hours for each location
-- ARRAY_AGG will create an array that groups all sensor readings
LatestSensorReadings AS (
  SELECT EXTRACT(DATE FROM timestamp) AS date, location, latitude, longitude, 
         city, country, ARRAY_AGG(STRUCT(pollutant, value)) AS sensor_reading
  FROM `bigquery-public-data.openaq.global_air_quality`
  WHERE averaged_over_in_hours=1.0 
  AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, HOUR)<=3  
  GROUP BY date, location, latitude, longitude, city, country),
-- Second named subquery in WITH clause starts here. Notice the comma ',' above
IndexCalculationGrid AS (
-- Create an array of STRUCT fields and initialize it 
-- with index calculation grid values. FLOAT64 fields in the STRUCT represent
-- the lowest value that pollutant needs to have to get specified CAQI level.
-- Highest CAQI level for a location settles the final index value
  SELECT * 
  FROM UNNEST(
  ARRAY<STRUCT<
    caqi_level INT64, 
    pm10 FLOAT64, 
    pm25 FLOAT64,
    no2 FLOAT64,
    o3 FLOAT64,
    co FLOAT64,
    so2 FLOAT64
    >>[
      -- VERY LOW
      (0, 0, 0, 0, 0, 0, 0),
      -- LOW
      (1, 25, 15, 50, 60, 5000, 50),
      -- MEDIUM
      (2, 50, 30, 100, 120, 7500, 100),
      -- HIGH
      (3, 90, 55, 200, 180, 10000, 350),
      -- VERY HIGH
      (4, 180, 110, 400, 240, 20000, 500)
    ]
)),
-- Third named subquery in WITH clause
CAQIStatus AS (
  -- Human readable CAQI levels as UNION of generated rows
  SELECT 0 AS level, 'VERY LOW' AS status UNION ALL
  SELECT 1 AS level, 'LOW' AS status UNION ALL
  SELECT 2 AS level, 'MEDIUM' AS status UNION ALL
  SELECT 3 AS level, 'HIGH' AS status UNION ALL
  SELECT 4 AS level, 'VERY HIGH' AS status
)

SELECT date, location, latitude, longitude, city, country, 
       caqi, status, accuracy, sensor_reading 
FROM (
  -- Subquery inside FROM
  SELECT lsr.date, lsr.location, lsr.latitude, lsr.longitude, lsr.city, lsr.country, 
	   -- User Defined Function is called here
         getCAQIAccuracy(sensor_reading) AS accuracy, lsr.sensor_reading, 
     ( -- Subquery inside a SELECT statement. Single result of 
       -- this SELECT will be a column (caqi) in outer query
       -- Find CAQI level for each pollutant inside sensor_reading array
       -- Maximal CAQI level in each sensor_reading array determines the final index
       -- i.e. if the PM10 level is VERY HIGH but all other pollutants are
       -- at LOW, then the final index is VERY HIGH
      SELECT MAX(caqi_level) 
      FROM IndexCalculationGrid
      -- Below we are referencing column that originates from the outer query
      JOIN UNNEST(lsr.sensor_reading) 
      WHERE 
        CASE WHEN pollutant='pm10' AND value > pm10 THEN true END 
        OR CASE WHEN pollutant='pm25' AND value > pm25 THEN true END 
        OR CASE WHEN pollutant='no2' AND value > no2 THEN true END 
        OR CASE WHEN pollutant='o3' AND value > o3 THEN true END 
        OR CASE WHEN pollutant='co' AND value > co THEN true END 
        OR CASE WHEN pollutant='so2' AND value > so2 THEN true END 
     ) AS caqi
  FROM LatestSensorReadings AS lsr 
)
JOIN CAQIStatus ON (caqi=level)
-- caqi can be null if value of the reading is incorrect (eg. <= 0)
-- and there are no other correct readings in the set
-- We want to eliminate those results
WHERE caqi IS NOT NULL
-- We also want only the results that are 80% or more accurate 
-- (ie. based on 5 or more pollutant readings)
AND accuracy > 80
ORDER BY caqi DESC, accuracy DESC

Row date location latitude longitude city country caqi status accuracy sensor_reading.pollutant sensor_reading.value
103 2018-01-16 LU0101A 49.59769 6.13759 Luxemburg LU 1 LOW 100.0 co 220.0
pm10 10.0
o3 65.0
so2 2.3
pm25 2.0
no2 28.3

As you can see, BigQuery has tons of useful features that will give you full insight into your data. In the example above you can observe:

  • a temporary User Defined Function
  • multiple named subqueries in a WITH clause
  • the UNNEST operator, which takes an ARRAY and returns a table
  • initialization of an array of STRUCTs with values
  • subqueries inside SELECT and FROM
  • a subquery that references a correlated column (a column that originates from the outer query)
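
The core of the query's logic, the UDF plus the correlated grid lookup, can also be sketched in plain Python. The thresholds copy the IndexCalculationGrid values from the query; this is an illustration of the same calculation, not a replacement for it:

```python
# Each grid row: (caqi_level, lowest value per pollutant needed to reach it)
GRID = [
    (0, {"pm10": 0, "pm25": 0, "no2": 0, "o3": 0, "co": 0, "so2": 0}),
    (1, {"pm10": 25, "pm25": 15, "no2": 50, "o3": 60, "co": 5000, "so2": 50}),
    (2, {"pm10": 50, "pm25": 30, "no2": 100, "o3": 120, "co": 7500, "so2": 100}),
    (3, {"pm10": 90, "pm25": 55, "no2": 200, "o3": 180, "co": 10000, "so2": 350}),
    (4, {"pm10": 180, "pm25": 110, "no2": 400, "o3": 240, "co": 20000, "so2": 500}),
]

def caqi_accuracy(sensor_reading):
    """getCAQIAccuracy: fraction of the six pollutants present, in percent."""
    return len(sensor_reading) / 6 * 100

def caqi_level(sensor_reading):
    """MAX(caqi_level) over grid rows whose threshold a reading exceeds."""
    levels = [level
              for level, thresholds in GRID
              for pollutant, value in sensor_reading
              if value > thresholds[pollutant]]
    return max(levels) if levels else None   # None mirrors the SQL NULL case

# The Luxembourg row from the results above
readings = [("co", 220.0), ("pm10", 10.0), ("o3", 65.0),
            ("so2", 2.3), ("pm25", 2.0), ("no2", 28.3)]
print(caqi_level(readings), caqi_accuracy(readings))   # 1 100.0 -> LOW
```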

Other BigQuery capabilities include:

  • analytic functions (OVER, WINDOW)
  • table decorators and wildcard tables (e.g. SELECT * FROM tablename_2017_*)
  • subqueries in the WHERE clause and anywhere in the query where an expression can be used
  • User Defined Functions written in JavaScript
  • batch queries
  • support for external data sources

6
Write query result to permanent table


New dataset

Before you can write to a table, you need to create a dataset. Datasets are used to control access and organize your tables. Click the tiny arrow-down button next to your project's name and choose Create new dataset.

BigQuery create dataset button

Fill in the dataset creation form as you see fit and click OK.

BigQuery create dataset form

New table

Now create an empty permanent table inside that dataset. Choose Create empty table and name your table pollutants_daily. For the Schema section, copy and paste the JSON below into the textbox that appears after you click Edit as Text. Then click the Create Table button.

How to create a table in a dataset

Create BigQuery table - Schema import

[
    {
        "mode": "REQUIRED",
        "name": "date",
        "type": "DATE"
    },
    {
        "mode": "REQUIRED",
        "name": "location",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "latitude",
        "type": "FLOAT"
    },
    {
        "mode": "REQUIRED",
        "name": "longitude",
        "type": "FLOAT"
    },
    {
        "mode": "REQUIRED",
        "name": "city",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "country",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "caqi",
        "type": "INTEGER"
    },
    {
        "mode": "REQUIRED",
        "name": "status",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "accuracy",
        "type": "FLOAT"
    },
    {
        "mode": "REPEATED",
        "name": "sensor_reading",
        "type": "RECORD",
        "fields": [
            {
                "mode": "REQUIRED",
                "name": "pollutant",
                "type": "STRING"
            },
            {
                "mode": "REQUIRED",
                "name": "value",
                "type": "FLOAT"
            }
        ]
    }
]

Append query result to permanent table

Now you can go back to Query History and open the query from the previous step. Click the Show Options button, then the Select Table button, and enter the name of the table you have just created - pollutants_daily. For Write Preference, choose the Append to table option. Finally, execute the query and check the preview of your table.

How to save query results in permanent table

Now your permanent table is filled with the results of query execution.

Query results are cached for approximately 24 hours. If the table changes, the cache is invalidated and the query needs to process the data again.

This means that we can query pollutants_daily with another query, and the result will be retrieved from the cache until pollutants_daily is modified. If the cache is used, your data-processing quota is not affected.
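
The caching behaviour can be sketched as a result cache that any table modification invalidates. The class and names below are purely illustrative, not part of any BigQuery API:

```python
# Illustrative model of BigQuery's result cache: repeated identical
# queries are served from cache until the table is modified.
class CachedTable:
    def __init__(self, rows):
        self.rows = list(rows)
        self._cache = {}                     # query key -> result rows

    def append(self, row):
        self.rows.append(row)
        self._cache.clear()                  # any change invalidates the cache

    def query(self, key, predicate):
        """Return (result, from_cache); cached calls cost no quota."""
        if key in self._cache:
            return self._cache[key], True
        result = [r for r in self.rows if predicate(r)]
        self._cache[key] = result
        return result, False

table = CachedTable([("LU", 1), ("CN", 4)])
high = lambda r: r[1] >= 3
_, first = table.query("high_caqi", high)    # cold: processes data
_, second = table.query("high_caqi", high)   # warm: served from cache
table.append(("ID", 4))                      # pollutants_daily modified
_, third = table.query("high_caqi", high)    # cache invalidated: cold again
print(first, second, third)                  # False True False
```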

Step notes

I will release the next tutorial soon! You will learn how to visualize query results using Google's Data Studio and turn them into a beautiful report.
Good job. You deserve a bottle of Grog. Arrrr!
Thanks for reading!
