Automated quality control and validation system for solarimetric and meteorological data from INPE SONDA (Sistema de Organização Nacional de Dados Ambientais) stations across Brazil.
The system reads station data from Parquet files, computes solar geometry, joins climatic normals, runs per-variable DQC algorithms, and writes monthly validated CSV files per station. Both solarimetric and meteorological variables are validated in a single optimized DuckDB query per domain.
```
sonda-validation/
├── main.py                  # Main execution script (processes all stations)
├── pyproject.toml           # Project configuration and dependencies
├── uv.lock                  # Dependency lock file
├── core/
│   ├── __init__.py
│   ├── sondaValidator.py    # SolarimetricValidator and MeteoValidator classes
│   └── sondaUtils.py        # auxFunctions: data loading, preprocessing, orchestration
├── data/
│   ├── metadata/
│   │   ├── INPESONDA_stations.csv  # Station coordinates (station, latitude, longitude, altitude)
│   │   └── INPESONDA_normais.csv   # Climatic normals per station (tp_min/max, press_min/max, rain_max)
│   ├── raw/                 # Input Parquet files (Solarimetrica-001.parquet)
│   └── output/              # Validated output CSVs, organized by station/month
└── validation.ipynb         # EDA and DQC visualization notebook
```
- Python 3.12 or higher
- UV package manager (recommended) or pip
```bash
# Clone and enter the repository
git clone <repository-url>
cd sonda-validation

# Install with UV (recommended)
uv sync

# Or with pip
pip install pandas duckdb numpy astral timezonefinder pyarrow
```

Run the full pipeline with:

```bash
python main.py
```

`main.py` reads all unique `acronym` values from the Parquet file, calls `rodar_validacao()` for each station sequentially, and prints a timing summary at the end.
```python
PARQUET_FILE = os.path.join(SCRIPT_DIR, "data", "raw", "Solarimetrica-001.parquet")
OUTPUT_DIR = os.path.join(SCRIPT_DIR, "data", "output")
N_ROWS = None  # Set to an integer to limit rows (useful for testing)
```

To run a single station programmatically:

```python
from core.sondaUtils import auxFunctions

aux = auxFunctions()
aux.rodar_validacao(
    parquet_file="data/raw/Solarimetrica-001.parquet",
    OUTPUT_DIR="data/output",
    n_rows=None,    # None = all rows
    station="SBR",  # None = all stations in file
)
```

Each station is processed independently:
- Load data — `carregar_dados()`: reads the Parquet file into a DuckDB in-memory table `solar_raw`, optionally filtered by station, row limit, or random sample.
- Preprocess — `preprocess_conversion_data_fill_time()`: casts all measurement columns to `DOUBLE`, fills the time series to a regular 10-minute grid (inserting `NULL` rows for missing intervals), and flags complete rows with a `valid` boolean.
- Join metadata — creates table `solar_with_meta` by joining `INPESONDA_stations.csv` (adds `latitude`, `longitude`) and `INPESONDA_normais.csv` (adds `tp_min`, `tp_max`, `press_min`, `press_max`, `rain_max`).
- Solar geometry — `add_mu0_to_duckdb()`: computes the per-row cosine of the solar zenith angle `mu0` and the solar azimuth `azs` using the `astral` library. Processes in adaptive chunks (25K–100K rows) to control memory. Timezone is auto-detected from station coordinates via `timezonefinder`.
- Extraterrestrial radiation — `add_sa_sum()`: adds columns `Sa = S0 / UA²`, where `S0 = 1361 W/m²` (solar constant) and `UA = 1.0` AU, and `Sum = dif_avg + dir_avg × mu0` (theoretical global from components).
- Solar validation — `SolarimetricValidator.run_solar_validation()`: builds a single DuckDB `CREATE TABLE AS SELECT` that computes DQC flags for all present solar columns in one query pass.
- Meteorological validation — `MeteoValidator.run_all()`: same pattern, a single-query validation for all present meteorological columns.
- Consolidation — merges the two validated tables (`solar_validated_solar` + `solar_validated_meteo`) with a `FULL OUTER JOIN` on `(acronym, timestamp)` into `final_consolidated`.
- Output — saves one semicolon-delimited CSV per calendar month into `data/output/<station>/solar_validated_<station>_<YYYY-MM>.csv`.
Each validated variable receives a DQC code: a 2- or 3-character string, one digit per algorithm.
| Digit value | Meaning |
|---|---|
| 9 | Passed validation (good quality) |
| 2 | Failed validation (suspicious/flagged) |
| 5 | Missing data or insufficient context for validation |
Examples: `999` = all three algorithms passed; `295` = algorithm 1 flagged, algorithm 2 passed, algorithm 3 insufficient data; `55` = both algorithms have missing data.
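Since each position maps to one algorithm, a DQC code can be unpacked digit by digit. A minimal helper (not part of the project API) illustrating the mapping:

```python
# Digit meanings from the DQC code table above.
MEANINGS = {"9": "passed", "2": "flagged", "5": "missing/insufficient"}

def decode_dqc(code: str) -> list:
    """Map each digit of a DQC code to its algorithm's verdict, in order."""
    return [MEANINGS[digit] for digit in code]
```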
Input columns require both `*_avg` and `*_std` (standard deviation within the 10-minute interval). A standard deviation of exactly zero (`std = 0`) triggers a flag (`2`) on the first algorithm, indicating a stuck sensor.
| Alg | Check | Description |
|---|---|---|
| 1 | NULL value/NULL std → 5; std = 0 → 2; glo_avg < −4 or > Sa × 1.5 × mu0^1.2 + 100 → 2 | Physically possible limit |
| 2 | NULL → 5; glo_avg < −2 or > Sa × 1.2 × mu0^1.2 + 50 → 2 | Extremely rare limit |
| 3 | NULL or Sum ≤ 50 → 5; azs < 75° and \|glo_avg/Sum − 1\| above limit → 2 | Global vs. component-sum consistency |
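The project evaluates these checks as `CASE` expressions inside a single DuckDB query; a pure-Python sketch of the first two algorithms (hypothetical function names, same limits as the table above):

```python
def glo_dqc_alg1(glo_avg, glo_std, sa, mu0):
    """Algorithm 1: physically possible limit for global irradiance."""
    if glo_avg is None or glo_std is None:
        return "5"                          # missing data
    if glo_std == 0:
        return "2"                          # zero std: stuck sensor
    m = max(mu0, 0.0)                       # guard: avoid fractional power of negative night-time mu0 (assumption)
    upper = sa * 1.5 * m ** 1.2 + 100       # physically possible upper limit
    return "2" if (glo_avg < -4 or glo_avg > upper) else "9"

def glo_dqc_alg2(glo_avg, sa, mu0):
    """Algorithm 2: extremely rare limit for global irradiance."""
    if glo_avg is None:
        return "5"
    m = max(mu0, 0.0)
    upper = sa * 1.2 * m ** 1.2 + 50        # extremely rare upper limit
    return "2" if (glo_avg < -2 or glo_avg > upper) else "9"
```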
| Alg | Check | Description |
|---|---|---|
| 1 | NULL value/NULL std → 5; std = 0 → 2; dir_avg < −4 or > Sa → 2 | Physically possible limit |
| 2 | NULL → 5; dir_avg < −2 or > Sa × 0.95 × mu0^0.2 + 10 → 2 | Extremely rare limit |
| 3 | NULL → 5; (dir_avg × mu0 − 50) > (glo_avg − dir_avg) or (glo_avg − dir_avg) > (dir_avg × mu0 + 50) → 2 | Diffuse component consistency |
| Alg | Check | Description |
|---|---|---|
| 1 | NULL value/NULL std → 5; std = 0 → 2; dif_avg < −4 or > Sa × 0.95 × mu0^1.2 + 50 → 2 | Physically possible limit |
| 2 | NULL → 5; dif_avg < −2 or > Sa × 0.75 × mu0^1.2 + 30 → 2 | Extremely rare limit |
| 3 | NULL or glo_avg ≤ 50 → 5; azs < 75° and dif_avg / glo_avg > 1.05 → 2; 75° ≤ azs < 93° and dif_avg / glo_avg > 1.10 → 2 | Diffuse fraction check |
| Alg | Check | Description |
|---|---|---|
| 1 | NULL value/NULL std → 5; std = 0 → 2; lw_avg < 40 or > 700 → 2 | Physically possible limit (W/m²) |
| 2 | NULL → 5; lw_avg < 60 or > 500 → 2 | Extremely rare limit |
Limits are scaled from the solar irradiance limits using a conversion factor of 2.07 (µmol/m²/s per W/m²).

| Alg | Check | Description |
|---|---|---|
| 1 | NULL value/NULL std → 5; std = 0 → 2; par_avg < −4 or > 2.07 × (Sa × 1.5 × mu0^1.2 + 100) → 2 | Physically possible limit |
| 2 | NULL → 5; par_avg < −2 or > 2.07 × (Sa × 1.2 × mu0^1.2 + 50) → 2 | Extremely rare limit |
Limits are scaled from the solar irradiance limits using a conversion factor of 0.1125 (klux per W/m²).

| Alg | Check | Description |
|---|---|---|
| 1 | NULL value/NULL std → 5; std = 0 → 2; lux_avg < −4 or > 0.1125 × (Sa × 1.5 × mu0^1.2 + 100) → 2 | Physically possible limit |
| 2 | NULL → 5; lux_avg < −2 or > 0.1125 × (Sa × 0.95 × mu0^1.2 + 50) → 2 | Extremely rare limit |
Limits (`tp_min`, `tp_max`, `press_min`, `press_max`, `rain_max`) come from `INPESONDA_normais.csv`, joined per station. Temporal checks use DuckDB window functions with `LAG()` partitioned by station.
Note on source column names: the Parquet file uses `tp_sfc`, `humid`, `press`, `ws10_avg`, `wd10_avg`, and `rain`. The output CSV renames them to `temp_avg`, `rh_avg`, `press_avg`, `ws_avg`, `wd_avg`, and `rain`.
| Alg | Window | Check | Description |
|---|---|---|---|
| 1 | — | NULL → 5; tp_sfc < tp_min or > tp_max → 2 | Climatic normals bounds |
| 2 | 6 steps back (1 h) | NULL diff → 5; \|Δ1h\| above limit → 2 | Δ1h step limit |
| 3 | 72 steps back (12 h) | NULL diff → 5; \|Δ12h\| above limit → 2 | Δ12h step limit |
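The project expresses the step checks with SQL `LAG()` window functions; an equivalent pure-Python sketch (hypothetical `step_flags` helper, with the window given in rows on the 10-minute grid, e.g. 6 rows = 1 h):

```python
def step_flags(values, steps, max_delta):
    """Flag '2' where |value − value[steps back]| exceeds max_delta,
    '5' where either side of the difference is missing, '9' otherwise."""
    flags = []
    for i, v in enumerate(values):
        prev = values[i - steps] if i >= steps else None  # LAG(value, steps)
        if v is None or prev is None:
            flags.append("5")          # NULL diff: cannot evaluate
        elif abs(v - prev) > max_delta:
            flags.append("2")          # step too large: flagged
        else:
            flags.append("9")          # passed
    return flags
```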
| Alg | Check | Description |
|---|---|---|
| 1 | humid ≥ 0 and ≤ 100 → 9; otherwise → 5 | Physical range |
| Alg | Window | Check | Description |
|---|---|---|---|
| 1 | — | NULL → 5; press < press_min or > press_max → 2 | Climatic normals bounds |
| 2 | 18 steps back (3 h) | NULL diff → 5; \|Δ3h\| above limit → 2 | Δ3h step limit |
| Alg | Window | Check | Description |
|---|---|---|---|
| 1 | — | NULL → 5; ws10_avg < 0 or > 25 m/s → 2 | Physical range |
| 2 | 18 steps back (3 h) | NULL diff → 5; \|Δ3h\| above limit → 2 | Δ3h step limit |
| 3 | 72 steps back (12 h) | NULL diff → 5; \|Δ12h\| above limit → 2 | Δ12h step limit |
| Alg | Window | Check | Description |
|---|---|---|---|
| 1 | — | NULL → 5; wd10_avg < 0° or > 360° → 2 | Physical range |
| 2 | 18 steps back (3 h) | NULL diff → 5; \|Δ3h\| above limit → 2 | Δ3h step limit |
| 3 | 108 steps back (18 h) | NULL diff → 5; \|Δ18h\| above limit → 2 | Δ18h step limit |
| Alg | Window | Check | Description |
|---|---|---|---|
| 1 | — | NULL → 5; rain < 0 or > rain_max → 2 | Climatic normals ceiling |
| 2 | 6 rows (1 h accumulation) | NULL sum → 5; Σ1h > 25 mm → 2 | Hourly accumulation limit |
| 3 | 144 rows (24 h accumulation) | NULL sum → 5; Σ24h > 100 mm → 2 | Daily accumulation limit |
Required columns:

| Column | Type | Description |
|---|---|---|
| `acronym` | string | Station identifier (e.g., SBR, NAT) |
| `timestamp` | datetime | Measurement timestamp (UTC, 10-minute resolution) |
Optional solarimetric columns (validated if present):

| Column | Unit | Description |
|---|---|---|
| `glo_avg`, `glo_std` | W/m² | Global horizontal irradiance |
| `dir_avg`, `dir_std` | W/m² | Direct normal irradiance |
| `dif_avg`, `dif_std` | W/m² | Diffuse horizontal irradiance |
| `lw_avg`, `lw_std` | W/m² | Longwave radiation |
| `par_avg`, `par_std` | µmol/m²/s | Photosynthetically active radiation |
| `lux_avg`, `lux_std` | klux | Illuminance |
Optional meteorological columns (validated if present):

| Column | Unit | Description |
|---|---|---|
| `tp_sfc` | °C | Air temperature at surface |
| `humid` | % | Relative humidity |
| `press` | hPa | Atmospheric pressure |
| `ws10_avg` | m/s | Wind speed at 10 m |
| `wd10_avg` | ° | Wind direction at 10 m |
| `rain` | mm | Precipitation |
`data/metadata/INPESONDA_stations.csv` — comma-separated:

```
station,latitude,longitude,altitude
SBR,-15.601,-47.713,1023
...
```

`data/metadata/INPESONDA_normais.csv` — semicolon-separated:

```
acronym;tp_min;tp_max;press_min;press_max;rain_max
SBR;12.0;32.0;880.0;960.0;150
...
```
Monthly CSV files per station, semicolon-delimited:

```
data/output/
├── SBR/
│   ├── solar_validated_SBR_2024-01.csv
│   ├── solar_validated_SBR_2024-02.csv
│   └── ...
├── NAT/
│   └── ...
```
```
acronym;timestamp;year;day;min;
glo_avg;glo_avg_dqc;
dir_avg;dir_avg_dqc;
dif_avg;dif_avg_dqc;
lw_avg;lw_avg_dqc;
par_avg;par_avg_dqc;
lux_avg;lux_avg_dqc;
temp_avg;temp_avg_dqc;
rh_avg;rh_avg_dqc;
press_avg;press_avg_dqc;
ws_avg;ws_avg_dqc;
wd_avg;wd_avg_dqc;
rain;rain_dqc
```
Columns are only present if the corresponding source column existed in the input Parquet file.
```
acronym;timestamp;year;day;min;glo_avg;glo_avg_dqc;dir_avg;dir_avg_dqc;...
SBR;2024-01-01 00:00:00;2024;1;0;0.0;555;0.0;555;...
SBR;2024-01-01 06:10:00;2024;1;370;850.3;999;620.1;999;...
SBR;2024-01-01 06:20:00;2024;1;380;-5.1;299;0.0;999;...
```
Jupyter notebook for exploratory analysis of validated output:
- Reads a monthly CSV from `data/output/`
- Filters by date range
- Plots time series of any variable (e.g., `glo_avg`)
- DQC heatmap: a 2D grid of days × time-of-day for each DQC column — blue cells = `999`/`99` (good data), red cells = any other code (suspect or missing)
- Prints per-variable DQC distribution and data availability percentage
- DuckDB in-memory: all processing is done in DuckDB without writing intermediate files
- Single-query validation: both solar and meteo validations use a single `CREATE TABLE AS SELECT` with all algorithm `CASE` expressions evaluated in one pass
- Chunked solar angle calculation: `mu0`/`azs` are computed in chunks of 25K–100K rows (adaptive based on dataset size) to prevent memory exhaustion
- Per-station isolation: each station uses its own DuckDB connection, closed and garbage-collected after processing
```sql
SET memory_limit = '16GB'
SET threads = 4
SET preserve_insertion_order = false
SET enable_object_cache = true
```

FileNotFoundError for metadata CSVs — the system looks in `data/metadata/` first, then falls back to the project root. Ensure `INPESONDA_stations.csv` and `INPESONDA_normais.csv` are in `data/metadata/`.
Missing latitude/longitude after join — the `acronym` in the Parquet file must match the `station` column in `INPESONDA_stations.csv` (case-insensitive, whitespace-trimmed).

Memory issues — reduce `N_ROWS` in `main.py` for testing, or reduce the chunk size in `add_mu0_to_duckdb()`.

Timezone detection failure — `timezonefinder` requires valid coordinates. Check that the station exists in `INPESONDA_stations.csv` with non-null latitude/longitude.
| Package | Purpose |
|---|---|
| `duckdb` | In-memory SQL engine for all data processing |
| `pandas` | DataFrame handling for metadata and chunked updates |
| `numpy` | Numeric operations in solar angle calculations |
| `astral` | Solar elevation and azimuth calculations |
| `timezonefinder` | Automatic timezone detection from coordinates |
| `pyarrow` | Parquet file reading |
GNU General Public License v3.0 — see LICENSE.
- INPE (Instituto Nacional de Pesquisas Espaciais)
- SONDA (Sistema de Organização Nacional de Dados Ambientais)