This utility scans BigQuery SQL queries for several possible anti-patterns. Anti-patterns are SQL constructs that in some cases might cause a performance impact.
We recommend using this tool to scan the top 10% slot-consuming jobs of your workload. Addressing these anti-patterns will, in most cases, provide significant performance benefits.
Example of tool input and output:
Input:
SELECT
*
FROM
`project.dataset.table1`
Output:
All columns on table: project.dataset.table1 are being selected. Please be sure that all columns are needed
To run the tool, use the Cloud Shell terminal, which has all the prerequisites installed.
Build utility
# in cloud shell terminal
git clone https://github.com/GoogleCloudPlatform/bigquery-antipattern-recognition.git
cd bigquery-antipattern-recognition
mvn clean package jib:dockerBuild -DskipTests
Run tool for simple inline query
# in cloud shell terminal
docker run \
-i bigquery-antipattern-recognition \
--query "SELECT * FROM \`project.dataset.table1\`"
In the BigQuery console, run the DDL below to create the output table
-- in BQ console
CREATE OR REPLACE TABLE <my-project>.<my-dataset>.antipattern_output_table (
job_id STRING,
user_email STRING,
query STRING,
recommendation ARRAY<STRUCT<name STRING, description STRING>>,
slot_hours FLOAT64,
process_timestamp TIMESTAMP
);
To read from INFORMATION_SCHEMA and write to the output table, run the following in the command line:
# in cloud shell terminal
gcloud auth login
docker run \
-v ~/.config:/root/.config \
-i bigquery-antipattern-recognition \
--read_from_info_schema \
--info_schema_region us \
--read_from_info_schema_days 1 \
--processing_project_id <my-project> \
--output_table "<my-project>.<my-dataset>.antipattern_output_table" \
--info_schema_top_n_percentage_of_jobs 0.1
Read output in BigQuery Console
-- in BQ console
SELECT
job_id, user_email, query,
recommendation, slot_hours
FROM
`<my-project>.<my-dataset>.antipattern_output_table`
ORDER BY
process_timestamp DESC
LIMIT 10000;
Deploying the anti-pattern recognition tool to a Cloud Run job provides an easy mechanism to periodically scan INFORMATION_SCHEMA for occurrences of anti-patterns in top slot consuming queries.
Refer to this step-by-step guide on how to deploy this tool using Cloud Run.
--query="SELECT ... FROM ..."
-
Parses a SQL string provided via the CLI.
--read_from_info_schema
-
Reads input queries from INFORMATION_SCHEMA.JOBS.
--read_from_info_schema_days n
-
Specifies how many days of INFORMATION_SCHEMA to read.
Must be set along with `--read_from_info_schema`.
Defaults to 1.
--info_schema_region us
-
Region from which to read INFORMATION_SCHEMA.
--read_from_info_schema_start_time "start-timestamp"
--read_from_info_schema_end_time "end-timestamp"
-
Alternative to the `--read_from_info_schema_days` option;
specifies the start and end date or timestamp of INFORMATION_SCHEMA to read.
Must be set along with `--read_from_info_schema`.
If unset, the `--read_from_info_schema_days` option is used.
--read_from_info_schema_timeout_in_secs n
-
Specifies the timeout, in seconds, for querying INFORMATION_SCHEMA.
Must be set along with `--read_from_info_schema`.
Defaults to 60.
--info_schema_table_name "\`region-us\`.INFORMATION_SCHEMA.JOBS"
-
Specifies what variant of INFORMATION_SCHEMA.JOBS to read from.
--info_schema_top_n_percentage_of_jobs n
-
Number between 0 and 1. Specifies what fraction of the top slot-consuming jobs
the tool should consider, e.g. if set to 0.1, only the top 10% slot-consuming jobs
will be checked for anti-patterns.
--input_bq_table project_id.dataset_id.table_name
-
Specifies a BigQuery table as input; each row is a SQL string to be parsed.
Columns must be "id (String), query (String)".
--input_file_path /path/to/file.sql
-
Specifies the path to a file with a SQL string to be parsed. Can be a local file or a GCS file.
--input_folder_path /path/to/folder/with/sql/files
-
Specifies the path to a folder with SQL files to be parsed. Parses all .sql files in the directory.
Can be a local path or a GCS path.
--input_csv_file_path /path/to/input/file.csv
-
Specifies a CSV file as input; each row is a SQL string to be parsed.
Columns must be "id,query".
--output_file_path /path/to/output/file.csv
-
Specifies a CSV file as output; each row holds the recommendations for one input query.
Columns are "id,recommendation".
--output_table "my-project.dataset.antipattern_output_table"
-
Specifies the table to write results to. Assumes that the table already exists.
--processing_project_id <my-processing-project>
-
Specifies which project provides the compute used to read from INFORMATION_SCHEMA
and/or to write to the output table (i.e. the project where BigQuery jobs will execute).
Only needed if the input is INFORMATION_SCHEMA or the output is a BigQuery table.
Example:
SELECT
*
FROM
`project.dataset.table1`
Output:
All columns on table: project.dataset.table1 are being selected. Please be sure that all columns are needed
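A possible rewrite, assuming only two hypothetical columns (col1, col2) are actually needed, selects them explicitly so BigQuery scans less data:

```sql
SELECT
  col1,  -- hypothetical column names; list only the columns you need
  col2
FROM
  `project.dataset.table1`;
```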
Example:
SELECT
t1.col1
FROM
`project.dataset.table1` t1
WHERE
t1.col2 not in (select col2 from `project.dataset.table2`);
Output:
You are using an IN filter with a subquery without a DISTINCT on the following columns: project.dataset.table1.col2
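Following the tool's message, one possible fix is to de-duplicate the subquery with DISTINCT:

```sql
SELECT
  t1.col1
FROM
  `project.dataset.table1` t1
WHERE
  t1.col2 NOT IN (SELECT DISTINCT col2 FROM `project.dataset.table2`);
```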
Example:
WITH
a AS (
SELECT col1,col2 FROM test WHERE col1='abc'
),
b AS (
SELECT col2 FROM a
),
c AS (
SELECT col1 FROM a
)
SELECT
b.col2,
c.col1
FROM
b,c;
Output:
CTE with multiple references: alias a defined at line 2 is referenced 2 times
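Because BigQuery may re-evaluate a CTE each time it is referenced, one way to address this (a sketch, assuming the statements run together as a multi-statement query) is to materialize the shared CTE once as a temp table:

```sql
-- Materialize the shared CTE once, then reference it twice.
CREATE TEMP TABLE a AS
SELECT col1, col2 FROM test WHERE col1 = 'abc';

WITH
  b AS (SELECT col2 FROM a),
  c AS (SELECT col1 FROM a)
SELECT
  b.col2,
  c.col1
FROM
  b, c;
```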
Example:
SELECT
t.dim1,
t.dim2,
t.metric1
FROM
`dataset.table` t
ORDER BY
t.metric1 DESC;
Output:
ORDER BY clause without LIMIT at line 8.
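If only the top rows are needed, adding a LIMIT lets BigQuery avoid sorting the full result set:

```sql
SELECT
  t.dim1,
  t.dim2,
  t.metric1
FROM
  `dataset.table` t
ORDER BY
  t.metric1 DESC
LIMIT 1000;  -- arbitrary illustration; pick the number of rows you need
```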
Example:
SELECT
dim1
FROM
`dataset.table`
WHERE
REGEXP_CONTAINS(dim1, '.*test.*')
Output:
REGEXP_CONTAINS at line 6. Prefer LIKE when the full power of regex is not needed (e.g. wildcard matching).
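For simple wildcard matching like this, the query can be rewritten with LIKE:

```sql
SELECT
  dim1
FROM
  `dataset.table`
WHERE
  dim1 LIKE '%test%';
```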
Example:
SELECT
taxi_id, trip_seconds, fare
FROM
(
SELECT
taxi_id, trip_seconds, fare,
row_number() over(partition by taxi_id order by fare desc) rn
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
)
WHERE
rn = 1;
Output:
LatestRecordWithAnalyticFun: Seems like you might be using analytical function row_number in line 7 to filter the latest record in line 12.
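A common alternative for this pattern (a sketch, not necessarily the tool's exact suggestion) is ARRAY_AGG, which keeps only the latest record per key without a separate filtering step:

```sql
SELECT
  event.*
FROM (
  SELECT
    -- keep only the highest-fare trip per taxi
    ARRAY_AGG(t ORDER BY t.fare DESC LIMIT 1)[OFFSET(0)] event
  FROM
    `bigquery-public-data.chicago_taxi_trips.taxi_trips` t
  GROUP BY
    t.taxi_id
);
```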
Example:
SELECT
*
FROM
comments c
JOIN
users u ON c.user_id = u.id
WHERE
u.id IN (
SELECT id
FROM users
WHERE location LIKE '%New York'
GROUP BY id
ORDER BY SUM(up_votes) DESC
LIMIT 10
)
;
Output:
Dynamic Predicate: Using subquery in filter at line 10. Converting this dynamic predicate to static might provide better performance.
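Converting the predicate to static means running the subquery once on its own and inlining its result (the ids below are hypothetical placeholders):

```sql
-- Step 1: run the subquery on its own and note the resulting ids.
-- Step 2: use the literal list in the main query.
SELECT
  *
FROM
  comments c
JOIN
  users u ON c.user_id = u.id
WHERE
  u.id IN (1001, 1002, 1003);  -- hypothetical ids obtained in step 1
```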
Example:
SELECT
repo_name,
id,
ref
FROM
`bigquery-public-data.github_repos.files`
WHERE
ref like '%master%'
and repo_name = 'cdnjs/cdnjs'
;
Output:
WhereOrder: LIKE filter in line 8 precedes a more selective filter.
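Reordering the WHERE clause so the more selective equality filter comes first addresses the warning:

```sql
SELECT
  repo_name,
  id,
  ref
FROM
  `bigquery-public-data.github_repos.files`
WHERE
  repo_name = 'cdnjs/cdnjs'  -- more selective filter first
  AND ref LIKE '%master%';
```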
As a best practice, the table with the largest number of rows should be placed first in a JOIN.
This check evaluates the join order based on the number of rows of each table.
To do so, the tool must fetch table metadata, which requires the advanced_analysis flag.
Details can be found here.
Example:
SELECT
t1.station_id,
COUNT(1) num_trips_started
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_stations` t1
JOIN
`bigquery-public-data.austin_bikeshare.bikeshare_trips` t2 ON t1.station_id = t2.start_station_id
GROUP BY
t1.station_id
;
Output:
JoinOrder: JOIN on tables: [bikeshare_stations, bikeshare_trips] might perform
better if tables were joined in the following order:
[bikeshare_trips, bikeshare_stations]
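Rewritten per the suggestion, with the larger table placed first:

```sql
SELECT
  t1.station_id,
  COUNT(1) num_trips_started
FROM
  `bigquery-public-data.austin_bikeshare.bikeshare_trips` t2
JOIN
  `bigquery-public-data.austin_bikeshare.bikeshare_stations` t1
  ON t1.station_id = t2.start_station_id
GROUP BY
  t1.station_id;
```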
This is not an officially supported Google product.