Make sure that you've taken a look at the following tutorial(s) before you continue:
You are going to learn:
- how to control the amount of data your queries process
- how to define named subqueries using the WITH clause
- what nested and repeated fields are
- how to declare and use User Defined Functions
- how to create complex queries with multiple subqueries
- how to save query results in a permanent table
A word of caution
I recommend using small datasets for learning BigQuery. The 1 TB free data-processing quota might seem like a lot, but executing even a simple query on a huge dataset can reduce it by a few gigabytes. A query from a quickstart tutorial will execute in seconds, and then you will see a message like Query complete (2.0s elapsed, 3.49 GB processed).
BigQuery does offer some cost-control options, but they are not straightforward to set up. Limits must be requested via a form, it takes 2-3 business days to get a reply, and the lowest limit you can set is 10 TB per day. As you can see, this is a poor option for controlling costs.
Each successful execution of a non-cached query will take at least 10 MB of your quota. Even if the query processed only 100 kB of data from start to finish, your quota will decrease by 10 MB.
Go to Google Cloud Platform Console → BigQuery and click the Compose query button.
In order to control the amount of data your query will process, you should turn on the validator.
The larger the dataset and the number of selected columns, the more data your queries will process and, consequently, the more quota you will use. In the example above, removing the year column decreases the amount of data processed by 1 GB.
After you run the query, you will have access to it from Query History. An executed query has a Details tab next to Results. Bytes Billed tells you how much of your quota was used by the execution.
BigQuery comes with easy access to public datasets that you can use for learning. To check the size of the table you are going to query, select the table and then click the Details tab.
You can also use Preview, next to the Details tab, to see what the table looks like. This does not affect your quota and lets you look at the data inside the table. All columns are visible in the Preview.
OpenAQ is one of the public datasets available in BigQuery. It has only one table, global_air_quality. Below you will find a sample from that table:
| location | city | country | pollutant | value | timestamp | unit | source_name | latitude | longitude | averaged_over_in_hours |
|---|---|---|---|---|---|---|---|---|---|---|
| CH0031A | Bern | CH | co | 356.9 | 2018-01-14 21:00:00.000000 UTC | µg/m³ | EEA Switzerland | 46.951 | 7.4408712 | 1.0 |
| CH0031A | Bern | CH | pm25 | 2.7125 | 2018-01-12 04:00:00.000000 UTC | µg/m³ | EEA Switzerland | 46.951 | 7.4408712 | 1.0 |
| CH0031A | Bern | CH | pm10 | 30.5469 | 2018-01-14 21:00:00.000000 UTC | µg/m³ | EEA Switzerland | 46.951 | 7.4408712 | 1.0 |
| CH0031A | Bern | CH | o3 | 18.2218 | 2018-01-14 21:00:00.000000 UTC | µg/m³ | EEA Switzerland | 46.951 | 7.4408712 | 1.0 |
| CH0031A | Bern | CH | no2 | 27.9739 | 2018-01-14 21:00:00.000000 UTC | µg/m³ | EEA Switzerland | 46.951 | 7.4408712 | 1.0 |
| CH0002R | Vaud | CH | pm25 | 18.681 | 2018-01-14 21:00:00.000000 UTC | µg/m³ | EEA Switzerland | 46.8131 | 6.9444804 | 1.0 |
| CH0002R | Vaud | CH | o3 | 20.2328 | 2018-01-14 21:00:00.000000 UTC | µg/m³ | EEA Switzerland | 46.8131 | 6.9444804 | 1.0 |
Each row of the table represents a reading for one of the pollutants in one of the locations at some point in time.
Let's find the top 10 places with the most dangerous smog. We will limit our search to readings that came in within the last 3 hours.
Particulate matter is mostly responsible for smog-related health problems. PM10 is particulate matter 10 micrometers or less in diameter; PM2.5 is particulate matter 2.5 micrometers or less in diameter. PM2.5 is often called fine particles. We will ignore other pollutants in our research for now.
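A query along these lines produces the ranking below. This is a sketch, not necessarily the article's exact query; the column choices and the `MAX(value)` aggregation are my assumptions.

```sql
-- Top 10 locations by worst particulate-matter reading in the last 3 hours
SELECT location, city, country, MAX(value) AS value
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE pollutant IN ('pm10', 'pm25')
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR)
GROUP BY location, city, country
ORDER BY value DESC
LIMIT 10
```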
I guess Jakarta is not the best place for jogging.
| location | city | country | value |
|---|---|---|---|
| US Diplomatic Post: Jakarta Central | Jakarta | ID | 985.0 |
| Bozhou San Guo Lan Sheng Gong | Bozhou | CN | 469.0 |
| US Diplomatic Post: Kolkata | Kolkata | IN | 456.0 |
| Bozhou Wu Shui Chu Li Han | Bozhou | CN | 432.0 |
| Tongling Shi Di Jiu Zhong Xue | Tongling | CN | 385.0 |
| Huaibei Lie Shan Qu Zheng Fu | Huaibei | CN | 368.0 |
| Suzhou Huo Che Zhan | Suzhou | CN | 362.0 |
| Huaibei Zhi Ye Ji Zhu Xue Yuan | Huaibei | CN | 354.0 |
| Suzhou Jian Ce Lou | Suzhou | CN | 337.0 |
| US Diplomatic Post: Ulaanbaatar | Ulaanbaatar | MN | 336.0 |
Results are saved in a temporary table with a lifespan of about 24 hours. BigQuery also allows you to write results to a permanent table in one of your datasets.
We are using standard SQL in this BigQuery tutorial. Standard SQL is very flexible and a pleasure to work with. One of its features is naming your subqueries using the WITH clause. You can use named subqueries just as you would use normal tables.
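For example, the ranking query can be restructured with a named subquery. This is a sketch; the subquery name `recent_pm` and the per-country aggregation are my own choices for illustration.

```sql
-- Name the filtered readings once, then query them like a table
WITH recent_pm AS (
  SELECT location, city, country, value
  FROM `bigquery-public-data.openaq.global_air_quality`
  WHERE pollutant IN ('pm10', 'pm25')
    AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR)
)
SELECT country, COUNT(*) AS readings, MAX(value) AS worst
FROM recent_pm
GROUP BY country
ORDER BY worst DESC
```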
China has a strong representation in the infamous ranking.
At first sight it might seem that BigQuery can only operate on flat data structures. That impression could not be more wrong. BigQuery really shines when your data is denormalized; trying to impose relational concepts like star or snowflake schemas on it works against BigQuery's nature.
BigQuery supports nested and repeated fields and can understand your JSON or Avro files without a problem. Furthermore, it can maintain relationships, just as a normalized schema can, but without the performance impact.
Execute the query below
Each row of the result contains country columns and a sensor_reading array. Each member of the sensor_reading array is a STRUCT. You can think of a STRUCT as an ordered container for fields. Here, each STRUCT inside the sensor_reading array holds a reading for one of the pollutants.
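The original query isn't reproduced here; a query along these lines produces the shape described above. The grouping by city and the fields packed into each STRUCT are my assumptions.

```sql
-- Build one row per city with an array of STRUCTs of its readings
SELECT city, country,
       ARRAY_AGG(STRUCT(pollutant, value, timestamp)) AS sensor_reading
FROM `bigquery-public-data.openaq.global_air_quality`
GROUP BY city, country
```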
To show you the true power of BigQuery I've bundled a few of the interesting features into one large query.
As you can see, BigQuery has tons of useful features that will allow you to get full insight into your data. In the example above you can observe:
- a temporary User Defined Function
- multiple named subqueries in a WITH clause
- the UNNEST operator, which takes an ARRAY and returns a table
- initialization of an array of STRUCTs
- subqueries inside the SELECT clause
- a subquery that references a correlated column (a column that originates from the outer query)
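The features listed above can be combined in a single query along these lines. This is an illustrative sketch, not the article's original query; the UDF, the WHO-style limits, and all names are mine.

```sql
-- Temporary UDF: classify a particulate-matter reading
CREATE TEMP FUNCTION severity(pm FLOAT64) AS (
  CASE WHEN pm > 300 THEN 'hazardous'
       WHEN pm > 100 THEN 'unhealthy'
       ELSE 'moderate' END
);

WITH thresholds AS (
  -- an array of STRUCTs initialized inline, turned into a table by UNNEST
  SELECT * FROM UNNEST([
    STRUCT('pm10' AS pollutant, 50.0 AS daily_limit),
    STRUCT('pm25' AS pollutant, 25.0 AS daily_limit)
  ])
),
readings AS (
  SELECT city, country, pollutant, value
  FROM `bigquery-public-data.openaq.global_air_quality`
  WHERE pollutant IN ('pm10', 'pm25')
)
SELECT r.city, r.country, r.pollutant,
       severity(r.value) AS level,
       r.value / t.daily_limit AS times_over_limit,
       -- correlated subquery: r.country comes from the outer query
       (SELECT MAX(r2.value) FROM readings AS r2
        WHERE r2.country = r.country) AS country_max
FROM readings AS r
JOIN thresholds AS t USING (pollutant)
ORDER BY r.value DESC
LIMIT 10
```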
Other BigQuery capabilities include:
- analytic functions
- table decorators and wildcard functions (e.g. SELECT * FROM tablename_2017_*)
- subqueries in the WHERE clause and anywhere else in a query where an expression can be used
- batch queries
- support for external data sources
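As a quick illustration of a subquery in the WHERE clause, here is a sketch against the same OpenAQ table (the "above the global average" criterion is my own example):

```sql
-- Fine-particle readings above the global PM2.5 average
SELECT location, city, value
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE pollutant = 'pm25'
  AND value > (SELECT AVG(value)
               FROM `bigquery-public-data.openaq.global_air_quality`
               WHERE pollutant = 'pm25')
```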
Before you can write to a table you need to create a dataset. Datasets are used to control access and to organize your tables. Click the tiny arrow-down button next to your project's name and choose Create new dataset.
Fill in the dataset creation form as you see fit and click OK.
Now create an empty permanent table inside that dataset. Choose Create empty table and give your table a name: pollutants_daily. For the Schema section, copy and paste the code from the bottom into the textbox that appears after you click Edit as Text. Then click the Create Table button.
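If you prefer SQL DDL to the web form, an equivalent table can also be created with a statement like the one below. The dataset name is a placeholder, and the four columns are my assumption based on the ranking query; adjust them to match your query's output.

```sql
-- Hypothetical DDL equivalent of the web-UI table creation
CREATE TABLE my_dataset.pollutants_daily (
  location STRING,
  city STRING,
  country STRING,
  value FLOAT64
)
```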
Append query result to permanent table
Now go back to Query History and open the query from the previous step. Click the Show Options button, then click Select Table and enter the name of the table you have just created: pollutants_daily. For Write Preference choose the Append to table option. Finally, execute the query and check the preview of your table.
Now your permanent table is filled with the results of query execution.
Query results are cached for approximately 24 hours. If the table changes, the cache is invalidated and the query needs to process the data again.
This means that we can query pollutants_daily from another query, and the result will be retrieved from the cache until pollutants_daily is modified. When the cache is used, your data-processing quota is not affected.
You will learn how to visualize query results using Google's Data Studio and turn them into a beautiful report.