A comprehensive guide to mastering Google BigQuery with practical examples, optimization techniques, and AI-powered automation using ChatGPT
In the modern data landscape, processing petabytes of data in seconds isn't just a luxury—it's a necessity. Google BigQuery has revolutionized how organizations handle massive datasets, and when combined with ChatGPT's AI capabilities, it becomes an unstoppable force for data engineers, analysts, and developers.
This isn't just another BigQuery tutorial. It's a complete roadmap from zero to advanced.
Google BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility. Think of it as SQL on steroids—capable of scanning terabytes of data in seconds without managing any infrastructure.
Serverless Architecture: No servers to provision, manage, or tune. You focus on querying data, Google handles everything else.
Separation of Storage & Compute: Store massive datasets cheaply, pay for compute only when you query. This architectural decision saves enormous costs.
Standard SQL Support: Use familiar ANSI SQL with extensions for analytics, machine learning, and geospatial functions.
Petabyte Scale: Analyze datasets that would crash traditional databases. BigQuery routinely handles petabytes across thousands of organizations.
┌─────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ APIs │ Databases │ Logs │ IoT │ Cloud Storage │ Streaming │
└───────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ GOOGLE BIGQUERY │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Storage │ │ Compute │ │ BigQuery │ │
│ │ (Columnar) │ │ (Dremel) │ │ ML │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└───────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ VISUALIZATION │
│ Looker Studio │ Tableau │ Power BI │ Custom Dashboards │
└─────────────────────────────────────────────────────────────┘
Imagine a massive library with billions of books. Traditional databases make you search through each book manually. BigQuery is like having thousands of librarians working in parallel, each scanning different sections simultaneously and returning results in seconds.
Before BigQuery and similar cloud data warehouses, organizations faced painful tradeoffs. Here's what sets BigQuery apart:
No servers, no clusters, no maintenance. Deploy production queries in minutes, not months.
Scan terabytes in seconds using Google's Dremel engine. Parallelism built-in by default.
Only pay for what you use. $5 per TB scanned. Flat-rate pricing for predictable workloads.
Native connectors to GCS, Pub/Sub, Dataflow, Looker Studio, and 100+ tools.
Train and deploy machine learning models using SQL. No Python required.
Query data across AWS S3, Azure, and Google Cloud without moving it.
Understanding BigQuery's architecture isn't just academic—it's the key to writing efficient queries and optimizing costs.
Google's distributed filesystem stores data in columnar format. This means BigQuery only reads the columns you need, dramatically reducing I/O.
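To see why columnar storage matters, here's a toy back-of-envelope model (illustrative only, not BigQuery's actual storage format): a query that references 2 of 10 equally sized columns scans roughly a fifth of the bytes a row-oriented store would read.

```python
def bytes_scanned(num_rows, col_widths, selected_cols, columnar):
    """Estimate bytes read by a scan. col_widths maps column name -> bytes per value."""
    # Row storage must read whole rows; columnar reads only referenced columns.
    cols = selected_cols if columnar else col_widths.keys()
    return num_rows * sum(col_widths[c] for c in cols)

# Hypothetical 1-billion-row events table: 10 columns, 8 bytes each.
widths = {f"col_{i}": 8 for i in range(10)}
query_cols = ["col_0", "col_1"]  # SELECT col_0, col_1 FROM ...

row_bytes = bytes_scanned(1_000_000_000, widths, query_cols, columnar=False)
col_bytes = bytes_scanned(1_000_000_000, widths, query_cols, columnar=True)
print(row_bytes // col_bytes)  # 5 -> columnar reads 5x fewer bytes in this example
```

This is also why `SELECT *` is discouraged later in this guide: every extra column widens the scan.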
The query execution engine that uses massive parallelism. Queries are broken into tree structures and executed across thousands of workers.
Google's orchestration system allocates resources, optimizes execution plans, and manages fault tolerance automatically.
Slots are the fundamental unit of BigQuery compute. Think of them as "virtual CPUs" for query execution.
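A useful mental model, sketched below: the INFORMATION_SCHEMA job views report total slot time in milliseconds, so dividing by a job's elapsed wall-clock time gives the average number of slots it kept busy (a simplification of real scheduling, where slot counts fluctuate per stage).

```python
def avg_slots(total_slot_ms, elapsed_ms):
    # Average slots kept busy = total slot time / wall-clock time.
    return total_slot_ms / elapsed_ms

# Hypothetical job: 3,600,000 slot-ms of work finished in 60 s of wall time.
print(avg_slots(3_600_000, 60_000))  # 60.0
```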
| Cost Component | On-Demand | Flat-Rate | Best For |
|---|---|---|---|
| Storage | $0.02/GB/month | $0.02/GB/month | All scenarios |
| Queries | $5/TB scanned | Fixed monthly fee | Variable workloads: On-Demand; predictable workloads: Flat-Rate |
| Streaming | $0.01/200 MB | Included | High-volume: Flat-Rate |
| Minimum | $0 (pay-per-use) | $2,000/month (100 slots) | Starting: On-Demand |
Use On-Demand if: You're starting out, have sporadic queries, or process <2 TB/day
Switch to Flat-Rate if: You consistently spend >$2,000/month, have predictable workloads, or need guaranteed capacity
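A quick way to find the break-even point, using this guide's prices ($5/TB on-demand vs $2,000/month for 100 slots):

```python
ON_DEMAND_PER_TB = 5.0       # on-demand price per TB scanned
FLAT_RATE_MONTHLY = 2_000.0  # 100 slots, flat-rate

def on_demand_cost(tb_scanned_per_month):
    return tb_scanned_per_month * ON_DEMAND_PER_TB

break_even_tb = FLAT_RATE_MONTHLY / ON_DEMAND_PER_TB
print(break_even_tb)                            # 400.0 TB/month
print(on_demand_cost(500) > FLAT_RATE_MONTHLY)  # True -> flat-rate wins above break-even
```

In other words, at these rates you need to scan roughly 400 TB/month before flat-rate pays for itself.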
1. Query Submission
│
▼
2. Query Optimization
│ - Parse SQL
│ - Create execution plan
│ - Optimize (partition pruning, predicate pushdown)
▼
3. Resource Allocation (Borg)
│ - Assign slots
│ - Distribute work across workers
▼
4. Parallel Execution (Dremel)
│ - Read columnar data from Colossus
│ - Execute in parallel across 1000s of workers
│ - Shuffle and aggregate results
▼
5. Return Results
└─> Client receives data
Getting started with BigQuery takes less than 10 minutes. Let's walk through the complete setup process.
Navigate to console.cloud.google.com and create a new project. This is your isolated workspace for all BigQuery resources.
Project ID: my-bigquery-project-2024

Go to APIs & Services → Enable APIs and search for "BigQuery API". Click Enable. This allows your project to use BigQuery services.
Navigate to Billing in the console and link a payment method. Don't worry—Google provides $300 in free credits for new users, and BigQuery offers 1 TB of free query processing per month.
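With the free tier in mind, a rough on-demand cost model looks like this (assumes the $5/TB rate used throughout this guide and the 1 TB/month free query allowance; storage costs excluded):

```python
FREE_TB_PER_MONTH = 1.0  # free query-processing allowance
PER_TB = 5.0             # on-demand rate per TB scanned

def monthly_query_cost(tb_scanned):
    # Only the TB beyond the free allowance is billed.
    billable = max(0.0, tb_scanned - FREE_TB_PER_MONTH)
    return billable * PER_TB

print(monthly_query_cost(0.5))  # 0.0  -> fully covered by the free tier
print(monthly_query_cost(3.0))  # 10.0 -> pay for 2 TB beyond the free 1 TB
```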
In the BigQuery console, click Create Dataset. A dataset is like a database—it contains tables, views, and functions.
-- Example dataset configuration:
Dataset ID: analytics_data
Data location: US (multi-region)
Default table expiration: None
Encryption: Google-managed key

BigQuery provides public datasets for practice. Try querying the bigquery-public-data project:
SELECT *
FROM `bigquery-public-data.usa_names.usa_1910_current`
LIMIT 10;

You can also use the command-line tool (the bq command) for automation.

BigQuery uses Standard SQL (ANSI SQL-2011) with powerful extensions. Let's explore essential patterns you'll use daily.
-- Standard SELECT query
SELECT
user_id,
event_name,
COUNT(*) AS event_count,
DATE(event_timestamp) AS event_date
FROM
`project.dataset.events`
WHERE
DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
AND event_name = 'purchase'
GROUP BY
user_id, event_name, event_date
HAVING
event_count > 1
ORDER BY
event_count DESC
LIMIT 100;

BigQuery's native support for STRUCT and ARRAY types is a game-changer for semi-structured data.
-- Query nested JSON/STRUCT data
SELECT
user_id,
order_details.order_id,
order_details.total_amount,
product.name,
product.category
FROM
`project.dataset.orders`,
UNNEST(products) AS product
WHERE
order_details.status = 'completed'
AND product.category = 'electronics';

-- Calculate running totals and rankings
SELECT
user_id,
order_date,
revenue,
SUM(revenue) OVER (
PARTITION BY user_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY revenue DESC
) AS revenue_rank
FROM
`project.dataset.orders`;

-- Use WITH for readable, modular queries
WITH user_cohorts AS (
SELECT
user_id,
MIN(DATE(created_at)) AS cohort_date
FROM `project.dataset.users`
GROUP BY user_id
),
monthly_revenue AS (
SELECT
user_id,
DATE_TRUNC(DATE(order_date), MONTH) AS revenue_month,
SUM(amount) AS monthly_total
FROM `project.dataset.orders`
GROUP BY user_id, revenue_month
)
SELECT
c.cohort_date,
r.revenue_month,
COUNT(DISTINCT c.user_id) AS cohort_size,
SUM(r.monthly_total) AS cohort_revenue
FROM user_cohorts c
LEFT JOIN monthly_revenue r USING(user_id)
GROUP BY c.cohort_date, r.revenue_month
ORDER BY c.cohort_date, r.revenue_month;

-- Advanced date manipulation
SELECT
CURRENT_DATE() AS today,
DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AS last_week,
DATE_TRUNC(CURRENT_DATE(), MONTH) AS month_start,
EXTRACT(DAYOFWEEK FROM CURRENT_DATE()) AS day_of_week,
FORMAT_DATE('%Y-%m-%d', CURRENT_DATE()) AS formatted_date;

-- Work with text and arrays efficiently
SELECT
CONCAT(first_name, ' ', last_name) AS full_name,
SPLIT(email, '@')[OFFSET(1)] AS email_domain,
ARRAY_LENGTH(interests) AS interest_count,
ARRAY_TO_STRING(interests, ', ') AS interests_list
FROM `project.dataset.users`;

ChatGPT can dramatically accelerate your BigQuery workflow. Here's how to use AI as your SQL co-pilot.
You: Write a BigQuery SQL query to find the top 10 customers by total purchase amount in 2024. The table is `ecommerce.orders` with columns: customer_id, order_date, total_amount, status.
ChatGPT Response:

SELECT
customer_id,
COUNT(*) AS total_orders,
SUM(total_amount) AS total_spent,
ROUND(AVG(total_amount), 2) AS avg_order_value
FROM
`ecommerce.orders`
WHERE
EXTRACT(YEAR FROM order_date) = 2024
AND status = 'completed'
GROUP BY
customer_id
ORDER BY
total_spent DESC
LIMIT 10;

You: Optimize this BigQuery query to reduce cost and improve performance:
SELECT * FROM `project.dataset.large_table`
WHERE DATE(timestamp_col) = '2024-01-15'
ChatGPT Response:

-- Optimized version (scans less data)
SELECT
-- Only select needed columns instead of *
user_id,
event_name,
timestamp_col,
value
FROM
`project.dataset.large_table`
WHERE
-- If the table uses ingestion-time partitioning, filter the pseudo-column:
_PARTITIONDATE = '2024-01-15'
-- OR if partitioned by timestamp_col:
-- timestamp_col BETWEEN '2024-01-15 00:00:00' AND '2024-01-15 23:59:59';

You: Explain what this BigQuery query does in simple terms:
[Paste your complex query here]
ChatGPT will:
✓ Break down each CTE and subquery
✓ Explain the business logic
✓ Identify potential performance issues
✓ Suggest improvements

You: I'm getting this BigQuery error:
"Syntax error: Expected end of input but got keyword WHERE at [15:1]"
Here's my query: [paste query]
ChatGPT Response:
The error indicates you have a WHERE clause after a semicolon or in an invalid position.
Looking at your query, line 14 ends the main SELECT with a semicolon, but line 15
starts a new WHERE clause. You likely meant to include that WHERE clause before the
semicolon. Here's the corrected version...

Partitioning and clustering are the most powerful optimization techniques in BigQuery. Master these to reduce query costs by 90%+.
Partitioning divides your table into segments based on a column value (usually date/timestamp). BigQuery only scans relevant partitions, dramatically reducing data processed.
Without Partitioning: Query scans entire 10 TB table = $50 cost
With Partitioning: Query scans 1 day (10 GB) = $0.05 cost
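You can verify those numbers yourself at the $5/TB on-demand rate (using 1 TB = 1,000 GB for round numbers):

```python
PER_TB = 5.0  # on-demand price per TB scanned

def query_cost(gb_scanned):
    return gb_scanned / 1_000 * PER_TB  # 1 TB = 1,000 GB for round numbers

print(query_cost(10_000))  # 50.0 -> full 10 TB table scan
print(query_cost(10))      # 0.05 -> one 10 GB daily partition
```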
| Type | Use Case | Example | Max Partitions |
|---|---|---|---|
| Time-unit column | Tables with timestamp/date columns | PARTITION BY DATE(created_at) | 10,000 |
| Ingestion time | When you load data without timestamp | PARTITION BY _PARTITIONDATE | 10,000 |
| Integer range | Partition by ID ranges (customer_id) | PARTITION BY RANGE_BUCKET(id, [0, 100, 200]) | 10,000 |
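Conceptually, RANGE_BUCKET maps a value to the number of boundaries at or below it, which is just a binary search over the boundary array. A Python sketch of that idea (illustrative; it doesn't reproduce BigQuery's NULL and out-of-range edge cases):

```python
import bisect

def range_bucket(value, boundaries):
    # Count of boundaries <= value; values below the first boundary get 0.
    return bisect.bisect_right(boundaries, value)

boundaries = [0, 100, 200]  # as in PARTITION BY RANGE_BUCKET(id, [0, 100, 200])
print(range_bucket(50, boundaries))   # 1 -> falls in [0, 100)
print(range_bucket(150, boundaries))  # 2 -> falls in [100, 200)
print(range_bucket(-5, boundaries))   # 0 -> below the first boundary
```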
-- Date-partitioned table (most common)
CREATE TABLE `project.dataset.events`
(
user_id INT64,
event_name STRING,
event_timestamp TIMESTAMP,
event_data JSON
)
PARTITION BY DATE(event_timestamp)
OPTIONS(
partition_expiration_days=90,
require_partition_filter=TRUE
);

-- Integer range partitioning (for ID-based queries)
CREATE TABLE `project.dataset.customers`
(
customer_id INT64,
name STRING,
signup_date DATE
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 10000));

Clustering sorts and co-locates data based on specified columns. It's like organizing books by genre, then author—related data is physically stored together.
Key Benefits:
-- Best practice: Combine partitioning + clustering
CREATE TABLE `project.dataset.user_events`
(
event_date DATE,
user_id INT64,
country STRING,
device_type STRING,
event_name STRING,
revenue FLOAT64
)
PARTITION BY event_date
CLUSTER BY country, device_type, user_id
OPTIONS(
description="User events partitioned by date, clustered by common filter columns"
);
PARTITIONING (Divides table into segments)
┌─────────────────────────────────────────────────────┐
│ 2024-01-01 │ 2024-01-02 │ 2024-01-03 │ ... │
│ (10GB) │ (10GB) │ (10GB) │ │
└─────────────────────────────────────────────────────┘
Query: WHERE event_date = '2024-01-02'
Scans: Only 10GB (not entire table)
CLUSTERING (Sorts data within partitions)
Partition 2024-01-02:
┌──────────────────────────────────────────┐
│ Country: CA │ Country: UK │ Country: US │
│ ├─ iOS │ ├─ Android │ ├─ iOS │
│ └─ Android │ └─ Web │ └─ Web │
└──────────────────────────────────────────┘
Query: WHERE country = 'US' AND device = 'iOS'
Scans: Only relevant sorted blocks (automatic pruning)
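The block-pruning idea above can be modeled in a few lines: clustered storage keeps rows sorted and records per-block min/max values for the cluster keys, so a filter can skip any block whose range can't match. This is a toy model, not BigQuery's actual implementation:

```python
# Each block stores (min_key, max_key, rows) for the cluster column.
blocks = [
    ("CA", "CA", ["row1", "row2"]),
    ("UK", "UK", ["row3"]),
    ("US", "US", ["row4", "row5"]),
]

def prune_blocks(blocks, value):
    """Keep only blocks whose [min, max] range could contain the filter value."""
    return [b for b in blocks if b[0] <= value <= b[1]]

# WHERE country = 'US' -> only the last block needs scanning.
survivors = prune_blocks(blocks, "US")
print(len(survivors))  # 1
```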
-- Check partition information
SELECT
partition_id,
total_rows,
total_logical_bytes / POW(10, 9) AS size_gb
FROM
`project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'user_events'
ORDER BY
partition_id DESC;

ChatGPT Prompt: Analyze my BigQuery table schema and suggest optimal partitioning and clustering strategy:
Table: user_transactions
Columns: transaction_id, user_id, merchant_id, transaction_date, amount, category, country
Common queries: Filter by date range, group by country and category, find user transaction history
Expected AI Response:
Recommended configuration:
- PARTITION BY DATE(transaction_date) - Most queries filter by date
- CLUSTER BY country, category, user_id
* country: High filter frequency, medium cardinality
* category: Common in GROUP BY, low cardinality
* user_id: Good for user-specific queries, high cardinality OK in clustering

Beyond partitioning and clustering, these battle-tested optimization techniques will make your queries lightning-fast.
-- Large table joined first (scans unnecessary data)
SELECT *
FROM huge_table h
JOIN small_table s ON h.id = s.id
WHERE s.category = 'electronics';

-- Filter small table, then join (reduces data early)
WITH filtered_small AS (
SELECT * FROM small_table
WHERE category = 'electronics'
)
SELECT h.*
FROM huge_table h
JOIN filtered_small s ON h.id = s.id;

-- DATE() function prevents partition pruning!
SELECT *
FROM partitioned_table
WHERE DATE(timestamp_col) = '2024-01-15';

-- Enables partition pruning (scans only relevant partition)
SELECT *
FROM partitioned_table
WHERE timestamp_col BETWEEN '2024-01-15 00:00:00' AND '2024-01-15 23:59:59';

-- Exact count (slow on billions of rows)
SELECT COUNT(DISTINCT user_id) FROM huge_table;
-- Approximate count (up to 1000x faster, 98%+ accurate)
SELECT APPROX_COUNT_DISTINCT(user_id) FROM huge_table;
-- Approximate quantiles (for percentiles)
SELECT APPROX_QUANTILES(revenue, 100)[OFFSET(50)] AS median_revenue
FROM huge_table;

Materialized views precompute and store query results. BigQuery automatically refreshes them and can use them to optimize queries.
-- Create materialized view for frequently-used aggregation
CREATE MATERIALIZED VIEW `project.dataset.daily_revenue_mv`
AS
SELECT
DATE(order_timestamp) AS order_date,
country,
SUM(amount) AS daily_revenue,
COUNT(*) AS order_count
FROM `project.dataset.orders`
GROUP BY order_date, country;

ChatGPT Prompt: Analyze this BigQuery execution plan and suggest optimizations:
Query scanned: 50 GB
Slot time: 5 min
Stages: [stage details from execution plan]
Table schema: [paste schema]
What optimizations would reduce cost and improve performance?

BigQuery costs can spiral quickly without proper controls. Here's how to optimize spending while maintaining performance.
Configure custom cost controls at project/user level to prevent expensive accidents.
Set maximum bytes billed: 1 TB

Use INFORMATION_SCHEMA to track spend by user, query, and project.
Auto-delete old partitions to reduce storage costs ($0.02/GB/month adds up).
Export old data to GCS ($0.004/GB) for 5x cheaper storage.
Batch multiple small queries into one to reduce per-query overhead.
Run reports during off-peak hours with lower slot contention.
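To sanity-check the archiving tip above, compare BigQuery active storage ($0.02/GB/month) with GCS archive-class storage ($0.004/GB/month):

```python
BQ_ACTIVE_STORAGE = 0.02  # $/GB/month, BigQuery active storage
GCS_ARCHIVE = 0.004       # $/GB/month, GCS archive-class storage

def annual_savings(gb_archived):
    # Monthly per-GB difference, times volume, times 12 months.
    return (BQ_ACTIVE_STORAGE - GCS_ARCHIVE) * gb_archived * 12

print(round(BQ_ACTIVE_STORAGE / GCS_ARCHIVE))  # 5 -> the "5x cheaper" claim
print(round(annual_savings(10_000)))           # 1920 -> ~$1,920/year for 10 TB of cold data
```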
-- Track query costs by user (last 30 days)
SELECT
user_email,
COUNT(*) AS query_count,
SUM(total_bytes_processed) / POW(10, 12) AS total_tb_processed,
ROUND(SUM(total_bytes_processed) / POW(10, 12) * 5, 2) AS estimated_cost_usd,
ROUND(AVG(total_slot_ms) / 1000, 2) AS avg_slot_seconds
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND state = 'DONE'
AND total_bytes_processed > 0
GROUP BY
user_email
ORDER BY
total_tb_processed DESC;

-- Identify expensive tables
SELECT
table_schema,
table_name,
ROUND(size_bytes / POW(10, 9), 2) AS size_gb,
ROUND(size_bytes / POW(10, 9) * 0.02, 2) AS monthly_storage_cost_usd,
TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), TIMESTAMP_MILLIS(creation_time), DAY) AS age_days
FROM
`project.dataset.__TABLES__`
ORDER BY
size_bytes DESC
LIMIT 20;

ChatGPT Prompt for Cost Analysis:
I have a BigQuery table with 10 billion rows and 2 TB of data.
Common queries filter by date (last 7 days) and user_country.
Current monthly cost: $500
Current schema:
- timestamp (TIMESTAMP)
- user_id (INT64)
- user_country (STRING)
- event_type (STRING)
- properties (JSON)
How can I reduce costs by 50%+ without sacrificing performance?
Expected Response:
1. Partition by DATE(timestamp) - reduces each query to scan ~14GB instead of 2TB
2. Cluster by user_country, event_type - improves filtering efficiency
3. Consider extracting frequently-used JSON fields to top-level columns
4. Set partition expiration to 90 days if older data isn't needed
5. Use materialized views for common aggregations
6. Result: Query cost drops from $10/query to $0.07/query (99% reduction)

Calculate your monthly spend. If you consistently pay $2,000+/month with on-demand pricing, flat-rate becomes cost-effective:
100 slots = $2,000/month (unlimited queries)

Flat-rate also gives predictable budgeting and guaranteed capacity during peak times.
Understanding the tradeoffs between streaming and batch ingestion is crucial for building efficient data pipelines.
| Factor | Streaming Insert | Batch Load |
|---|---|---|
| Latency | Seconds (near real-time) | Minutes to hours |
| Cost | $0.01 per 200 MB | Free |
| Use Case | Real-time dashboards, alerts, IoT | Daily ETL, historical analysis |
| Throughput | 100,000+ rows/sec per project | Unlimited (parallel loads) |
| Availability | Immediate query (buffer time: seconds) | Available after load completes |
| Best For | Event tracking, clickstreams, sensors | Data warehouse loads, migrations |
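To put the streaming price in perspective, here's a quick estimate at the $0.01 per 200 MB rate for a hypothetical pipeline streaming 5 GB/day:

```python
RATE_PER_200_MB = 0.01  # streaming insert price

def monthly_streaming_cost(mb_per_day, days=30):
    # Number of 200 MB units streamed, times the per-unit rate.
    return mb_per_day * days / 200 * RATE_PER_200_MB

# Hypothetical pipeline streaming 5 GB (5,000 MB) per day.
print(round(monthly_streaming_cost(5_000), 2))  # 7.5 -> the same volume batch-loaded is free
```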
# Python: Streaming insert with the BigQuery client library
from google.cloud import bigquery
client = bigquery.Client()
table_id = "project.dataset.events"
rows_to_insert = [
{"user_id": 123, "event": "click", "timestamp": "2024-01-15T10:30:00"},
{"user_id": 456, "event": "purchase", "timestamp": "2024-01-15T10:31:00"},
]
errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
print(f"Errors: {errors}")
else:
print("Rows streamed successfully")

-- Load from Cloud Storage (CSV/JSON/Avro/Parquet)
LOAD DATA OVERWRITE `project.dataset.daily_events`
FROM FILES (
format = 'PARQUET',
uris = ['gs://bucket/events/2024-01-15/*.parquet']
);

# Python: Batch load job
from google.cloud import bigquery
client = bigquery.Client()
table_id = "project.dataset.events"
uri = "gs://bucket/data/*.csv"
job_config = bigquery.LoadJobConfig(
source_format=bigquery.SourceFormat.CSV,
skip_leading_rows=1,
autodetect=True,
write_disposition="WRITE_APPEND"
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # Wait for completion

Many production systems use a streaming + batch hybrid:
ChatGPT Prompt:
I'm building an e-commerce analytics platform. We have:
- 1M daily transactions
- Real-time dashboard requirements for revenue tracking
- Historical analysis needs for trends (last 2 years)
- Budget constraints
Should I use streaming, batch, or hybrid approach for BigQuery ingestion?

BigQuery ML lets you create and execute ML models using SQL—no Python, no data movement, no separate infrastructure.
Logistic Regression, DNN, XGBoost, AutoML
Predict categories: churn, fraud, spam
Linear Regression, DNN, XGBoost, AutoML
Predict numbers: revenue, price, demand
K-Means
Group customers, products, segments
ARIMA Plus
Forecast sales, traffic, inventory
Matrix Factorization
Product recommendations, content
TensorFlow, ONNX
Use pre-trained models in BigQuery
-- Prepare labeled training data
CREATE OR REPLACE TABLE `project.dataset.customer_features` AS
SELECT
customer_id,
total_purchases,
avg_purchase_amount,
days_since_last_purchase,
customer_lifetime_value,
support_tickets_count,
-- Label: 1 if churned, 0 if active
CASE WHEN last_purchase_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
THEN 1 ELSE 0 END AS churned
FROM
`project.dataset.customers`;

-- Train logistic regression model
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS(
model_type='LOGISTIC_REG',
input_label_cols=['churned'],
auto_class_weights=TRUE
) AS
SELECT
total_purchases,
avg_purchase_amount,
days_since_last_purchase,
customer_lifetime_value,
support_tickets_count,
churned
FROM
`project.dataset.customer_features`;

-- Check model performance
SELECT *
FROM ML.EVALUATE(MODEL `project.dataset.churn_model`);

-- Predict churn for all active customers
SELECT
customer_id,
predicted_churned,
predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM
ML.PREDICT(MODEL `project.dataset.churn_model`,
(SELECT * FROM `project.dataset.customer_features`)
)
WHERE
predicted_churned_probs[OFFSET(1)].prob > 0.7  -- SELECT aliases can't be used in WHERE
ORDER BY
churn_probability DESC;

-- Forecast next 30 days of sales
CREATE OR REPLACE MODEL `project.dataset.sales_forecast`
OPTIONS(
model_type='ARIMA_PLUS',
time_series_timestamp_col='date',
time_series_data_col='daily_revenue',
auto_arima=TRUE,
data_frequency='DAILY'
) AS
SELECT
date,
SUM(revenue) AS daily_revenue
FROM `project.dataset.sales`
GROUP BY date;
-- Generate forecast
SELECT *
FROM ML.FORECAST(MODEL `project.dataset.sales_forecast`,
STRUCT(30 AS horizon, 0.95 AS confidence_level)
);

Train on petabytes without exporting data. Models live where your data lives.
Analysts can build ML models without learning Python or TensorFlow.
Let Google automatically select best model architecture and hyperparameters.
Pay only for query processing. No separate ML infrastructure to maintain.
ChatGPT Prompt:
I have a BigQuery table with user behavior data:
- user_id, session_date, page_views, time_on_site, bounce_rate, purchases
I want to predict which users will make a purchase in the next 7 days.
Write BigQuery ML code to create and evaluate this model.

Let's explore production-grade use cases across industries to inspire your own implementations.
Online retailer processes 10M daily events (clicks, views, purchases) and needs real-time revenue dashboards plus historical trend analysis.
B2B SaaS company tracks user feature usage across 50K customers to guide product development and identify churn risks.
-- Feature adoption funnel
WITH user_journey AS (
SELECT
user_id,
MAX(CASE WHEN event = 'signup' THEN 1 ELSE 0 END) AS signed_up,
MAX(CASE WHEN event = 'feature_a_used' THEN 1 ELSE 0 END) AS used_feature_a,
MAX(CASE WHEN event = 'invited_team' THEN 1 ELSE 0 END) AS invited_team
FROM events
WHERE event_date >= '2024-01-01'
GROUP BY user_id
)
SELECT
SUM(signed_up) AS step1_signups,
SUM(used_feature_a) AS step2_feature_usage,
SUM(invited_team) AS step3_team_invite,
ROUND(SUM(used_feature_a) / SUM(signed_up) * 100, 1) AS activation_rate
FROM user_journey;

Manufacturing company collects sensor data from 10K devices (temperature, pressure, vibration) every 10 seconds for predictive maintenance.
Digital marketing team needs to understand which channels drive conversions across a 30-day customer journey with 15+ touchpoints.
-- Linear (even-credit) attribution across journey touchpoints
WITH user_touchpoints AS (
SELECT
user_id,
ARRAY_AGG(
STRUCT(channel, timestamp)
ORDER BY timestamp
) AS journey,
MAX(converted) AS converted
FROM marketing_events
GROUP BY user_id
)
SELECT
touchpoint.channel,
COUNT(DISTINCT user_id) AS users_touched,
SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) AS conversions,
ROUND(AVG(CASE WHEN converted = 1 THEN 1.0 / ARRAY_LENGTH(journey) ELSE 0 END), 4) AS attributed_value
FROM user_touchpoints,
UNNEST(journey) AS touchpoint
GROUP BY touchpoint.channel
ORDER BY attributed_value DESC;

Fintech platform processes 1M transactions daily and needs real-time fraud detection with <1 second latency.
BigQuery's architecture makes it perfect for ELT (Extract-Load-Transform) over traditional ETL. Here's why and how to implement it.
| Aspect | Traditional ETL | Modern ELT (BigQuery) |
|---|---|---|
| Process Flow | Extract → Transform → Load | Extract → Load → Transform |
| Transformation Location | External processing engine | Inside BigQuery (SQL) |
| Infrastructure | Spark clusters, Airflow, custom servers | Serverless (BigQuery compute) |
| Scalability | Manual cluster sizing | Automatic, unlimited scale |
| Data Validation | Before loading (slower iterations) | After loading (fast experimentation) |
| Best For | Complex transformations, legacy systems | Cloud-native, SQL-friendly teams |
┌─────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ PostgreSQL │ MySQL │ APIs │ S3 │ GCS │ Salesforce │
└───────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ EXTRACT & LOAD │
│ Fivetran │ Stitch │ Airbyte │ Cloud Data Fusion │ Custom │
└───────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ BIGQUERY (Raw Layer) │
│ Landing zone: Raw data as-is from sources │
└───────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ TRANSFORM (SQL/dbt/Dataform) │
│ Stage 1: Clean, dedupe, standardize │
│ Stage 2: Business logic, joins, aggregations │
│ Stage 3: Mart layer (analytics-ready tables) │
└───────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CONSUMPTION LAYER │
│ Looker Studio │ Tableau │ ML Models │ Reverse ETL │
└─────────────────────────────────────────────────────────────┘
-- Load from Cloud Storage
LOAD DATA INTO `project.raw_data.postgres_users`
FROM FILES(
format = 'PARQUET',
uris = ['gs://data-lake/postgres/users/*.parquet']
);
-- Or use Fivetran/Airbyte for automated replication

-- Create clean, standardized versions of raw tables
CREATE OR REPLACE TABLE `project.staging.stg_users` AS
SELECT
user_id,
LOWER(TRIM(email)) AS email,
INITCAP(first_name) AS first_name,
INITCAP(last_name) AS last_name,
CAST(created_at AS TIMESTAMP) AS created_at,
-- Remove duplicates (keep latest)
ROW_NUMBER() OVER(
PARTITION BY user_id
ORDER BY updated_at DESC
) AS row_num
FROM `project.raw_data.postgres_users`
QUALIFY row_num = 1;

-- Apply business logic, join dimensions
CREATE OR REPLACE TABLE `project.intermediate.int_user_metrics` AS
SELECT
u.user_id,
u.email,
u.created_at AS signup_date,
COUNT(o.order_id) AS total_orders,
SUM(o.amount) AS lifetime_value,
MAX(o.order_date) AS last_order_date,
DATE_DIFF(CURRENT_DATE(), MAX(o.order_date), DAY) AS days_since_last_order
FROM `project.staging.stg_users` u
LEFT JOIN `project.staging.stg_orders` o USING(user_id)
GROUP BY u.user_id, u.email, u.created_at;

-- Create denormalized tables for specific use cases
CREATE OR REPLACE TABLE `project.marts.marketing_analytics` AS
SELECT
u.user_id,
u.email,
u.signup_date,
u.lifetime_value,
c.cohort_month,
s.first_touch_channel,
s.last_touch_channel,
CASE
WHEN u.lifetime_value >= 1000 THEN 'High Value'
WHEN u.lifetime_value >= 100 THEN 'Medium Value'
ELSE 'Low Value'
END AS customer_segment
FROM `project.intermediate.int_user_metrics` u
LEFT JOIN `project.intermediate.int_cohorts` c USING(user_id)
LEFT JOIN `project.intermediate.int_attribution` s USING(user_id);

Built-in scheduler for simple pipelines (no external tools needed).
Industry standard for SQL-based transformations with version control, testing, and documentation.
-- models/staging/stg_users.sql
{{ config(
materialized='table',
partition_by={'field': 'created_date', 'data_type': 'date'}
) }}
SELECT
user_id,
LOWER(email) AS email,
DATE(created_at) AS created_date
FROM {{ source('raw', 'users') }}

For complex dependencies, conditional logic, and multi-step pipelines.
ChatGPT Prompt:
Design an ELT pipeline for my e-commerce business. We have:
- Source: PostgreSQL (users, orders, products tables)
- Goal: Daily analytics dashboard showing revenue, customer segments, product performance
- Team: SQL-comfortable, no Spark experience
- Budget: Prefer low infrastructure overhead
Provide architecture and sample SQL transformations.

Connect BigQuery to Looker Studio (formerly Google Data Studio) for powerful, real-time dashboards in minutes.
Dashboards query BigQuery in real-time. No data duplication or ETL needed.
Looker Studio is free. Only pay for BigQuery queries when dashboards load.
Build professional dashboards without code using visual editor.
Charts, tables, maps, scorecards, and custom community visualizations.
Users see data based on their BigQuery IAM permissions automatically.
Embed in websites, share links, schedule email reports, export to PDF.
In Looker Studio → Create → Data Source → Select "BigQuery"
Choose your project, dataset, and table (or custom query)
Looker Studio auto-detects field types. You can:
Drag and drop visualizations onto canvas. Common components:
For complex logic, use custom queries as data sources:
-- Example: Monthly cohort retention
WITH cohorts AS (
SELECT
user_id,
DATE_TRUNC(MIN(DATE(signup_date)), MONTH) AS cohort_month
FROM users
GROUP BY user_id
),
activity AS (
SELECT
user_id,
DATE_TRUNC(DATE(activity_date), MONTH) AS activity_month
FROM user_activity
)
SELECT
c.cohort_month,
a.activity_month,
DATE_DIFF(a.activity_month, c.cohort_month, MONTH) AS months_since_signup,
COUNT(DISTINCT a.user_id) AS active_users
FROM cohorts c
LEFT JOIN activity a ON c.user_id = a.user_id
GROUP BY c.cohort_month, a.activity_month;

-- Use @DS_START_DATE and @DS_END_DATE parameters
SELECT
DATE(order_date) AS date,
SUM(revenue) AS daily_revenue
FROM orders
WHERE
order_date BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE)
AND PARSE_DATE('%Y%m%d', @DS_END_DATE)
GROUP BY date;

ChatGPT Prompt:
I need a Looker Studio dashboard for executive reporting. Requirements:
- KPIs: MRR, Customer Count, Churn Rate, CAC
- Charts: MRR trend (last 12 months), Customer cohort retention, Revenue by plan type
- Data source: BigQuery table with columns: customer_id, signup_date, plan_type, mrr, churned_date
Write the BigQuery custom query to power this dashboard.

Automate repetitive BigQuery tasks using AI-generated scripts and workflows.
Generate table schemas from descriptions or sample data
Convert queries between SQL dialects (MySQL → BigQuery)