Texas is the largest oil-producing state in the U.S. — and one of the most data-rich. The Texas Railroad Commission (RRC) publishes free, downloadable datasets covering every wellbore drilled in the state since 1976: drilling permits, completion records, monthly production figures, well-test results, and GIS map layers. Together, these files describe over 1,100,000 wellbores across more than a dozen dataset categories.
The catch? The data is published in the formats the RRC has used for decades — EBCDIC flat files, fixed-width ASCII records, raw CSV/TXT dumps, and scanned PDFs. No REST APIs. No unified schema. No foreign keys. What should be a connected picture of a well's lifecycle — from permit application through completion, stimulation, and decades of production — arrives as a disjointed pile of files whose relationships you have to reverse-engineer yourself.
This case study describes how we built the automated ETL pipelines that ingest, clean, normalize, and link these datasets into a single, cloud-native database — the data engine that powers ODIS.
The challenge: 14 dataset categories, zero joins
The RRC organizes its public data into broad categories: Digital Map Data (GIS shapefiles with well coordinates and land survey boundaries), Drilling Permit Data (every W-1 application since 1976), Oil & Gas Field Data (field rules, spacing orders, proration schedules), Production Data (monthly lease-level volumes), and Oil & Gas Well Data — the largest and most complex category, encompassing full wellbore records, completion files (casing, perforation, squeeze, tubing, formation tops), well-test reports, and oil/gas well status data.
Beyond the RRC itself, the FracFocus Chemical Disclosure Registry provides detailed chemical-ingredient data for hydraulic fracturing jobs — information not captured in the Commission's completion files.
Each category ships in its own format. The regulatory well and production files arrive as EBCDIC or fixed-width ASCII, formats native to the mainframe systems the Commission has used for decades. Permit data comes as CSV/TXT. GIS layers come as shapefiles. And image-based records — horizontal drilling permits, scanned W-1 forms, directional survey applications — come as PDFs that require OCR or manual extraction.
Critically, the RRC does not publish explicit join keys between these categories. The American Petroleum Institute (API) number is the de facto well identifier, but its formatting varies across files (10-digit vs. 14-digit, with and without dashes), some older records predate the API system entirely, and production data is keyed by lease number rather than API — meaning a single "join" actually requires resolving a many-to-many well-to-lease mapping.
Mapping the data landscape
Before writing a single line of ETL code, we performed a systematic audit of every downloadable dataset on the RRC site. For each file we cataloged the native format (EBCDIC, ASCII, CSV, shapefile, or PDF image), update frequency (daily, monthly, or annual), record volume, available key fields, and data dictionary coverage (whether the Commission's published manuals fully documented the field layout).
From this audit we identified a tractable Phase I scope — all structured, machine-readable datasets — and deferred PDF/image files (horizontal permits, imaged W-1 forms, directional surveys) to Phase II, where we apply OCR and document-intelligence techniques. The Phase I scope alone covers the vast majority of analytical use cases: well siting, completion design benchmarking, production forecasting, capital allocation, and regulatory compliance monitoring.
Source categories ingested in Phase I
| Category | Native Format | Key Identifier | Update Cadence |
|---|---|---|---|
| Digital Map / GIS Layers | Shapefile (NAD27) / dBase | API, Abstract No. | Twice weekly |
| Drilling Permit Master | CSV / TXT | API, Permit No. | Monthly |
| Oil & Gas Field Data | ASCII / EBCDIC | Field No. | Monthly |
| Production Data (Lease) | EBCDIC / CSV | Lease No., District | Monthly |
| Wellbore Query Data | ASCII | API | Monthly |
| Oil & Gas Well Status (W-10/G-10) | EBCDIC | API | Monthly |
| Completion — Casing | EBCDIC / ASCII | API | Monthly |
| Completion — Perforation | EBCDIC / ASCII | API | Monthly |
| Completion — Formation Tops | EBCDIC / ASCII | API | Monthly |
| Completion — Squeeze | EBCDIC / ASCII | API | Monthly |
| Completion — Tubing | EBCDIC / ASCII | API | Monthly |
| Completion — Test Results | EBCDIC / ASCII | API | Monthly |
| FracFocus Registry | CSV (bulk download) | API, Disclosure ID | Quarterly |
| FracFocus Ingredients | CSV (bulk download) | Disclosure ID | Quarterly |
Schema design: the API number as universal spine
The core design decision was choosing the API number as the universal join key. Every RRC dataset (except lease-level production) references API in some form. We standardized to both 10-digit API (state + county + well sequence) and 14-digit API14 (which adds the sidetrack and completion suffixes), normalizing dashes, leading zeros, and padding inconsistencies at ingestion time.
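To make that concrete, here is a minimal sketch of the kind of normalization involved. The function name, padding rules, and example are illustrative, not the production code:

```python
import re

def normalize_api(raw: str, state_code: str = "42") -> tuple[str, str]:
    """Return (API10, API14) from a raw RRC identifier string.

    Illustrative sketch: strips dashes and whitespace, restores a
    dropped state prefix, and zero-pads the sidetrack/completion
    suffixes. Texas API numbers begin with state code 42.
    """
    digits = re.sub(r"\D", "", raw)        # drop dashes, spaces, etc.
    if len(digits) <= 8:                   # county + well sequence only
        digits = state_code + digits.zfill(8)
    api10 = digits[:10].zfill(10)
    api14 = digits.ljust(14, "0")[:14]     # pad missing suffixes
    return api10, api14

# normalize_api("42-329-37218") -> ("4232937218", "42329372180000")
```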
The resulting schema follows a star-like topology with two dimension hubs — well_general (lifecycle metadata, location, operator, dates) and well_completions_general (completion-level metadata) — radiating outward to fact tables for production, permits, and every completion sub-record. Spatial dimensions (Texas Land Survey abstracts, PLSS sections, shale play/basin boundaries, county polygons) provide the geographic fabric for map-based queries.
A few design choices worth noting. Production data exists at two granularities: well_production (monthly per-well volumes — oil, gas, water, condensate) and lease_production (monthly per-lease aggregates that match the RRC's official reporting format). We also compute a well_production_summary table with pre-aggregated IP-window metrics — cumulative oil, gas, water, and BOE at 3, 6, 12, and 36 months — so that type-curve and benchmarking queries don't require scanning the full production history every time.
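As a sketch of how those IP windows can be pre-computed from the monthly table (column names here are assumed for illustration; the BOE figure uses the conventional 6:1 MCF-to-barrel gas conversion):

```python
import pandas as pd

def ip_window_summary(prod: pd.DataFrame,
                      windows=(3, 6, 12, 36)) -> pd.DataFrame:
    """Cumulative volumes over the first N producing months per well.

    Assumes `prod` has columns api10, prod_month (datetime),
    oil_bbl, gas_mcf, water_bbl. Column names are illustrative.
    """
    prod = prod.sort_values(["api10", "prod_month"])
    # Rank each month within its well's producing history (1-based).
    prod["month_n"] = prod.groupby("api10").cumcount() + 1
    frames = []
    for n in windows:
        w = (prod[prod["month_n"] <= n]
             .groupby("api10")[["oil_bbl", "gas_mcf", "water_bbl"]]
             .sum())
        w["boe"] = w["oil_bbl"] + w["gas_mcf"] / 6.0  # 6 MCF ~ 1 BOE
        w["window_months"] = n
        frames.append(w.reset_index())
    return pd.concat(frames, ignore_index=True)
```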
Hydraulic fracturing data bridges two sources. The well_completions_frac_info table captures job-level metadata (dates, total water volume, vertical depth) from FracFocus disclosures, while well_completions_frac_ingredients holds the ingredient-level detail (chemical names, CAS numbers, mass, supplier, concentration percentages). Both tables join back to well_general via API and to each other via disclosure ID.
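In pandas terms, the relationship looks roughly like this (a sketch; the column names are assumed for illustration):

```python
import pandas as pd

def frac_job_detail(frac_jobs: pd.DataFrame,
                    ingredients: pd.DataFrame,
                    wells: pd.DataFrame) -> pd.DataFrame:
    """Join FracFocus job metadata to ingredient detail and well context.

    Assumed columns (illustrative): frac_jobs[disclosure_id, api10, ...],
    ingredients[disclosure_id, cas_number, mass_lb, ...],
    wells[api10, operator, county, ...].
    """
    detail = frac_jobs.merge(ingredients, on="disclosure_id", how="left")
    return detail.merge(wells, on="api10", how="left")
```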
The ETL pipeline: EBCDIC to cloud
Each source category has its own ingestion path, but the general pipeline follows four stages: Extract → Decode & Parse → Transform & Normalize → Load & Validate.
Extract
Automated download functions pull files from the RRC's HTTPS download portal. Files are versioned by download date, and a change-detection layer compares file hashes to avoid re-processing unchanged datasets. FracFocus bulk data follows the same pattern from their separate download endpoint.
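A minimal version of that change-detection step, assuming a simple JSON hash log (the log location and function names are illustrative):

```python
import hashlib
import json
from pathlib import Path

HASH_LOG = Path("download_hashes.json")  # illustrative location

def file_sha256(path: Path) -> str:
    """Hash a downloaded file in chunks so large extracts fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def is_unchanged(path: Path) -> bool:
    """Compare against the hash recorded on the previous download."""
    seen = json.loads(HASH_LOG.read_text()) if HASH_LOG.exists() else {}
    digest = file_sha256(path)
    unchanged = seen.get(path.name) == digest
    seen[path.name] = digest
    HASH_LOG.write_text(json.dumps(seen, indent=2))
    return unchanged
```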
Decode & parse
EBCDIC files — a character encoding standard from the IBM mainframe era — require byte-level decoding before any parsing can happen. We built custom decoders that read the RRC's published record layouts (fixed-width field positions, packed-decimal numerics, signed overpunch fields) and emit clean pandas DataFrames. ASCII fixed-width files use the same layout-driven parser with a simpler encoding step. CSV/TXT files are parsed with schema-aware readers that handle the RRC's inconsistent quoting, delimiter, and header conventions.
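For readers who have not met EBCDIC: Python ships EBCDIC codecs (cp037, cp500), and packed-decimal (COMP-3) fields store two BCD digits per byte with a sign nibble at the end. The sketch below shows the shape of the decoding step; the code page and field offsets are assumptions, not the RRC's actual layout:

```python
def unpack_comp3(raw: bytes, scale: int = 0):
    """Decode an IBM packed-decimal (COMP-3) field.

    Two BCD digits per byte; the final nibble carries the sign
    (0xD = negative, 0xC/0xF = positive/unsigned).
    """
    digits = []
    for b in raw[:-1]:
        digits += [b >> 4, b & 0xF]
    digits.append(raw[-1] >> 4)
    value = int("".join(str(d) for d in digits))
    if raw[-1] & 0xF == 0xD:
        value = -value
    return value / 10 ** scale if scale else value

def parse_record(rec: bytes) -> dict:
    """Slice one fixed-width record per a published layout.

    Offsets here are illustrative, not the RRC's actual layout.
    """
    return {
        "api10": rec[0:10].decode("cp037").strip(),    # EBCDIC text field
        "oil_bbl": unpack_comp3(rec[10:15], scale=0),  # packed numeric
    }
```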
Transform & normalize
This is where the bulk of the data quality work happens. Key transformations include API number standardization (10-digit and 14-digit normalization with checksum validation), date parsing across multiple formats (YYYYMMDD, MM/DD/YYYY, and packed Julian dates), unit normalization (production volumes to BBL and MCF, depths to feet, pressures to PSI), null/sentinel handling (the RRC uses various conventions for missing data: blanks, zeros, 9999, and "N/A" strings), and string cleaning for operator names, field names, and formation names (case normalization, whitespace trimming, dba/alias deduplication).
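Two of those transformations sketched together: sentinel handling and multi-format date parsing. The sentinel set is illustrative, and the "%Y%j" branch covers year-plus-day-of-year Julian dates once the packed decimal has been decoded to a digit string:

```python
from datetime import datetime, date

# Illustrative sentinel set; the real list is per-field and per-dataset.
SENTINELS = {"", "N/A", "NA", "9999", "99999999"}

def parse_rrc_date(value: str) -> date | None:
    """Parse the date formats seen across RRC files; None if unparseable."""
    value = value.strip()
    if value in SENTINELS:
        return None
    # Dispatch by shape: slashes = MM/DD/YYYY, 7 digits = YYYYDDD
    # (Julian), otherwise YYYYMMDD.
    if "/" in value:
        fmt = "%m/%d/%Y"
    elif len(value) == 7:
        fmt = "%Y%j"
    else:
        fmt = "%Y%m%d"
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None
```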
Load & validate
Transformed DataFrames are loaded into the cloud database with row-level validation: primary key uniqueness checks, foreign key referential integrity against the well_general dimension, range constraints on physical quantities (depths must be positive, volumes non-negative, coordinates within Texas bounding box), and data-type conformance. Failed rows are quarantined to an error log for manual review rather than silently dropped.
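A sketch of that quarantine logic in pandas terms (column names and the bounding box values are illustrative; the production checks run in the warehouse load path):

```python
import pandas as pd

# Approximate Texas bounding box for coordinate sanity checks.
TX_BOUNDS = {"lon": (-106.7, -93.5), "lat": (25.8, 36.6)}

def validate_wells(df: pd.DataFrame, well_dim: pd.DataFrame):
    """Split a transformed batch into loadable rows and quarantined rows.

    Failed rows are returned with the names of the checks they failed,
    not silently dropped. Column names are illustrative.
    """
    checks = pd.DataFrame({
        "pk_unique": ~df["api10"].duplicated(keep=False),
        "fk_exists": df["api10"].isin(well_dim["api10"]),
        "depth_ok": df["total_depth_ft"] > 0,
        "lon_ok": df["longitude"].between(*TX_BOUNDS["lon"]),
        "lat_ok": df["latitude"].between(*TX_BOUNDS["lat"]),
    })
    ok = checks.all(axis=1)
    failed = checks[~ok].apply(
        lambda r: ",".join(c for c in checks.columns if not r[c]), axis=1)
    return df[ok], df[~ok].assign(failed_checks=failed)
```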
From proof-of-concept to production
The project began as a research proof-of-concept: a local SQLite database hosted on an Azure Virtual Machine, manually triggered, with a handful of Python query functions. That prototype validated the concept — you could unify the RRC's disparate files into a single relational model — but it wasn't production-ready for multiple concurrent users, automated refresh, or downstream ML workflows.
The production system evolved in several dimensions. The data warehouse moved to an Azure cloud-native stack designed for analytical workloads at scale. Ingestion became fully automated on a scheduled cadence that matches each source's update frequency. The prototype's Python query functions were replaced by a proper REST API layer with authentication, rate limiting, and response pagination. And the schema itself expanded from the original 12 tables in the prototype to the current 22-table model as we added FracFocus integration, production summaries, spatial dimensions, and multi-state extensibility (the state_code field on every table is there because the architecture is designed to ingest New Mexico, Oklahoma, and other state regulatory data into the same model).
What's queryable now
The integrated database enables queries that would be impractical or impossible to answer from the raw RRC files alone. Some examples of what analysts can do out of the box:
Full wellbore profiles. Select any API number and get a unified view: permit history, completion details (casing strings, perforation intervals, formation tops, squeeze jobs, tubing configuration), initial production test results, and the full monthly production curve — all from one query endpoint, as sketched in the example after this list.
Completion design benchmarking. Compare hydraulic fracturing parameters (water volume, proppant mass, chemical composition) across operators, formations, or time periods by joining FracFocus data to well completions and production outcomes.
Production analytics at any aggregation level. Monthly volumes are available at the well, lease, field, district, and state level. Pre-computed IP-window summaries (3/6/12/36 months) power type-curve analysis without full scans. Historical production trends stretch from the late 1980s through the present month.
Geospatial overlay. Well coordinates joined to Texas Land Survey abstracts, shale play/basin boundaries, and county polygons enable map-based filtering and spatial analytics — for example, identifying all Wolfcamp completions within a specific abstract boundary.
Interactive dashboards. All datasets feed into configurable dashboards that support drill-down from state-level summaries to individual wellbore detail. Filters include district, operator, lease, field, and county — covering the 1.1M+ Texas wells tracked by the system.
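For example, the full wellbore profile lookup described in the first item reduces to a straightforward join against the integrated schema. The sketch below uses SQLite and invented column names purely for illustration; the production system runs on the cloud warehouse behind the ODIS API, not this exact SQL:

```python
import sqlite3
import pandas as pd

# Illustrative local sample; the join pattern is what matters.
conn = sqlite3.connect("odis_sample.db")

profile = pd.read_sql(
    """
    SELECT w.api10, w.operator, w.spud_date,
           p.prod_month, p.oil_bbl, p.gas_mcf
    FROM well_general w
    LEFT JOIN well_production p ON p.api10 = w.api10
    WHERE w.api10 = ?
    ORDER BY p.prod_month
    """,
    conn,
    params=("4232937218",),  # hypothetical API number
)
```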
What's next
This pipeline is the data backbone of ODIS, but the integration work is ongoing in several directions.
Image-based records. Horizontal drilling permits, scanned W-1 forms, and directional survey applications are currently excluded because they ship as PDFs and scanned images. We are building document-intelligence pipelines to extract structured data from these files using OCR and layout-aware parsing.
Multi-state expansion. The schema's state_code field on every table is not decorative — it is the hook for ingesting well data from New Mexico's OCD, Oklahoma Corporation Commission, and other state regulators into the same model. The RRC integration established the patterns; extending to additional states is a matter of building per-state decoders that emit the same normalized schema.
AI-ready features. With clean, linked data in place, the next layer is machine learning: production forecasting, decline-curve analysis, completion optimization, and the natural-language query interface powered by EnergyGPT. Clean data is the prerequisite for trustworthy models — and this pipeline is designed to deliver exactly that.
Real-time alerting. Several RRC datasets update daily (drilling permits, pending applications). Moving from batch ingestion to near-real-time change detection opens the door for automated alerts — new permits filed in your area of interest, production anomalies on tracked leases, or regulatory status changes on monitored operators.
The full data model and query capabilities are available through ODIS. If you're working with RRC data and want to see what an integrated, queryable version looks like, let's talk.
Ready to see this in action?
Let's discuss how TerraNavitas can work with your data.