BigQuery Ethereum Data Acquisition
Overview
Workflow for acquiring 5 years of Ethereum blockchain data (2020-2025, ~12.44M blocks) using Google BigQuery's public dataset within free tier limits. Includes column selection analysis for ML/time-series forecasting, cost optimization, and empirically validated download workflows.
Status: ✅ Empirically validated (v0.2.0, 2025-11-07)
When to Use This Skill
Use when:
- Evaluating BigQuery as a data source for Ethereum historical data
- Planning bulk historical blockchain data acquisition
- Optimizing column selection for feature engineering
- Calculating query costs and free tier utilization
- Comparing BigQuery vs RPC polling approaches
- Streaming downloads without BigQuery storage
Core Workflow
This skill follows a 5-step workflow for acquiring Ethereum data from BigQuery:
| Step | Purpose | Output | Key Metric | | ----------------------- | ------------------------------- | ----------------- | ------------------------- | | 1. Free Tier Limits | Understand query/storage limits | Limits documented | 1 TB query, 10 GB storage | | 2. Column Selection | Optimize for ML/time-series | 11 columns chosen | 0.97 GB (97% savings) | | 3. Cost Validation | Dry-run query cost | Cost estimate | 0.1% of free tier | | 4. Stream Download | Parquet download (no storage) | .parquet file | 62 bytes/row | | 5. DuckDB Import | Load for analysis | DuckDB database | <100ms query time |
Detailed workflow: See references/workflow-steps.md for complete step-by-step guide with SQL queries, bash commands, and validated results for each step.
Quick start: Run uv run scripts/test_bigquery_cost.py to validate cost, then uv run scripts/download_bigquery_to_parquet.py <start> <end> <output> to download.
Cost Analysis
Optimized selection: 11 columns = 0.97 GB (0.1% of free tier, 97% cost savings vs all 23 columns)
Key finding: BigQuery is 624x faster than RPC polling (<1 hour vs 26 days for 12.44M blocks).
Full analysis: See references/cost-analysis.md for detailed cost comparison, column selection rationale, and RPC provider comparison.
Prerequisites
One-time setup: gcloud auth, Python dependencies (google-cloud-bigquery, pandas, pyarrow, db-dtypes)
Setup guide: See references/setup-guide.md for complete authentication setup, dependency installation, and verification commands.
Scripts
Validated scripts (v0.2.0):
test_bigquery_cost.py- Dry-run cost estimation (0.97 GB for 12.44M blocks)download_bigquery_to_parquet.py- Streaming Parquet download (62 bytes/row validated)
Templates and usage: See scripts/README.md for complete usage examples, dependencies, and validated results.
References
Workflow Documentation
references/workflow-steps.md- Complete 5-step workflow with SQL queries, bash commands, and validated resultsreferences/cost-analysis.md- Detailed cost comparison, column selection rationale, RPC provider comparisonreferences/setup-guide.md- Authentication setup, dependencies, verification commands
Research Documents
references/bigquery_cost_comparison.md- Empirical cost analysis (6 vs 11 vs 23 columns)references/ethereum_columns_ml_evaluation.md- Column-by-column ML value analysisreferences/bigquery_complete_ethereum_data.md- Complete dataset catalog (11 tables)references/bigquery_cost_estimate.md- Free tier limits and methodologyreferences/littleblack-hardware-report.md- Local vs cloud hardware comparison
Scripts & Validation
scripts/README.md- Complete script usage guide with validated resultsVALIDATION_STATUS.md- Empirical test results, testing methodology, dependencies validated
Verification After Acquisition
Important: This skill covers data acquisition from BigQuery (downloading historical Ethereum blocks), but does NOT verify the data actually landed in ClickHouse.
After completing BigQuery download:
- Use the historical-backfill-execution skill
- Run
scripts/clickhouse/verify_blocks.pyto verify database state - Confirm expected block count (~23.8M blocks for 2015-2025 backfill)
Common workflow:
- Download from BigQuery using this skill (Step 4 above)
- Insert to ClickHouse via
chunked_backfill.sh - Verify ClickHouse state using historical-backfill-execution skill
- Check yearly breakdown to ensure complete coverage
See historical-backfill-execution skill for database verification and troubleshooting missing data.
Related Skills
historical-backfill-execution- ClickHouse database verification and backfill operationsblockchain-rpc-provider-research- RPC rate limit comparison and provider evaluationblockchain-data-collection-validation- Empirical validation workflow for data pipelines
Scan to join WeChat group