Scale and benchmarks
This project includes harnesses for reasoning about very large runs; it does not include results from live billion-row runs checked into Git.
In-repo tools
- sparkrules.runtime.perf.run_perf_harness - measures elapsed time and rows/sec for a callable.
- sparkrules.runtime.perf.scale_evidence - produces a structured estimate (rows, rows/sec, target rows, estimated duration) for documentation and SLO planning.
- sparkrules.obs.health - classifies per-stage health from duration, shuffle volume, and task failures (for UI and ops dashboards).
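The arithmetic behind a throughput measurement and a scale estimate is small enough to sketch. The block below is a minimal, self-contained illustration only: ThroughputSample, measure_throughput, and estimate_duration_s are hypothetical names, not the signatures of run_perf_harness or scale_evidence, which may differ.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ThroughputSample:
    """Hypothetical container for one timed run (not a sparkrules type)."""
    rows: int
    elapsed_s: float

    @property
    def rows_per_sec(self) -> float:
        # Guard against a zero reading from a very coarse clock.
        return self.rows / self.elapsed_s if self.elapsed_s > 0 else float("inf")


def measure_throughput(fn: Callable[[], int]) -> ThroughputSample:
    """Time one call of fn, which must return the number of rows it processed."""
    start = time.perf_counter()
    rows = fn()
    elapsed = time.perf_counter() - start
    return ThroughputSample(rows=rows, elapsed_s=elapsed)


def estimate_duration_s(sample: ThroughputSample, target_rows: int) -> float:
    """Linear extrapolation: assumes throughput stays constant at target scale."""
    return target_rows / sample.rows_per_sec


def evaluate_batch() -> int:
    # Stand-in workload: a trivial predicate over 100,000 synthetic rows.
    rows = range(100_000)
    _ = sum(1 for r in rows if r % 7 == 0)
    return len(rows)


sample = measure_throughput(evaluate_batch)
estimate = estimate_duration_s(sample, target_rows=1_000_000_000)
print(f"{sample.rows_per_sec:,.0f} rows/sec; 1e9 rows ~ {estimate:,.0f} s in one process")
```

A linear extrapolation like this ignores shuffle, skew, and I/O, so the result is a planning number for documentation and SLO discussions, not a measurement of cluster behavior.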
Default API path: not distributed Spark
For default HTTP simulations and the Workbench Simulate view, evaluation runs in pure Python in the API process: SparkSession.getActiveSession() is typically None, and no rule package is automatically broadcast or applied to a DataFrame through mapPartitions. That is by design, to keep the integration surface simple; it is not evidence of billion-row Spark throughput.
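To confirm which path a given process is on, a check along the following lines can help. This is a sketch under assumptions: active_spark_session and execution_path are hypothetical helper names, not part of sparkrules, and the code treats a missing PySpark install the same as having no active session.

```python
from typing import Optional


def active_spark_session() -> Optional[object]:
    """Return the active SparkSession, or None if PySpark is absent or no session is active."""
    try:
        from pyspark.sql import SparkSession  # available in PySpark 3.0+
    except ImportError:
        return None
    return SparkSession.getActiveSession()


def execution_path() -> str:
    """Label the current process as 'spark' or 'pure-python' (hypothetical labels)."""
    return "spark" if active_spark_session() is not None else "pure-python"


print(execution_path())
```

In a default API process this would be expected to report the pure-Python path; seeing that label on a benchmark run is a sign the numbers do not describe cluster execution.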
For architecture scope, execution paths, and the wiring needed for real cluster execution (mapPartitions, CompiledRulePackage broadcast, sparkrules/spark/dataframe.py), see KNOWN_LIMITATIONS.md. For when to choose Spark vs staying on Python, see SPARK_INTEGRATION.md.
What “production evidence” means
A real billions-of-rows proof requires your Spark cluster, storage (Iceberg/Delta/Hudi/Parquet), and network. Capture:
- Job config: merged EngineConfig / runtime_conf() and Spark versions.
- Input size: row count, partition count, and format.
- Metrics: stage duration, shuffle read/write, executor skew (Spark UI or your metrics store).
- Outcome: end-to-end runtime and cost.
Store those artifacts in your internal wiki or data platform; the repository stays vendor-neutral.
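One way to keep these four groups together is a single structured record per run. The sketch below is illustrative only: EvidenceRecord and all of its field names are hypothetical, the skew indicator is just one possible choice, and the values are placeholders rather than measurements.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class EvidenceRecord:
    """Hypothetical per-run record mirroring the capture list (not a sparkrules type)."""
    # Job config
    spark_version: str
    runtime_conf: Dict[str, str]
    # Input size
    row_count: int
    partition_count: int
    storage_format: str
    # Metrics
    stage_duration_s: Dict[str, float]
    shuffle_read_bytes: int
    shuffle_write_bytes: int
    max_to_median_task_ratio: float  # assumed skew indicator; pick your own
    # Outcome
    end_to_end_runtime_s: float
    cost_usd: float

    def to_json(self) -> str:
        # Sorted keys keep diffs between runs readable.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; these are not results from any real job.
record = EvidenceRecord(
    spark_version="<your Spark version>",
    runtime_conf={"<conf key>": "<conf value>"},
    row_count=0,
    partition_count=0,
    storage_format="parquet",
    stage_duration_s={"<stage name>": 0.0},
    shuffle_read_bytes=0,
    shuffle_write_bytes=0,
    max_to_median_task_ratio=1.0,
    end_to_end_runtime_s=0.0,
    cost_usd=0.0,
)
print(record.to_json())
```

The JSON output can be attached to a wiki page or loaded into a metrics table, which keeps the evidence outside the repository.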
Reproducible local checks
Run the full test suite and coverage gate:
Opt-in performance tests (if present) use pytest -m perf.
Phase 4 - lakehouse benchmarks
For governance and promotion (see GOVERNANCE.md), the repository documents behavior only; your lakehouse is where you prove latency and cost for rule evaluation at scale.
- Baseline: same EngineConfig / runtime_conf() and Spark version you use in production.
- Data: one or more Iceberg/Delta/Parquet tables; record row counts, file sizes, and partition layout.
- Rule set: a fixed namespace and promoted prod pin versions (or your packaging format).
- Run: full job (or representative slice) with Spark UI and/or your metrics (Datadog, Databricks, etc.).
- Record: job duration, shuffle GB, executor CPU, and cost estimate; file under your org’s performance evidence process.
Reuse the in-repo perf harness for micro-benchmarks; lakehouse billions-of-rows evidence stays outside the repo, as described under What “production evidence” means.
Appendix: Req 37 / native wheel CI order-of-magnitude
Optional Rust + cibuildwheel jobs are budgeted separately from the default Python CI matrix (~3 × Python minors × lint + tests). Expect additional runner minutes proportional to (Python minors) × (OS targets: manylinux, macOS arm64, Windows) × cold-cache builds (tens to low hundreds of minutes per release wave until incremental caches warm). Track wall time alongside pure-Python fallback correctness (same test gate).
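The proportionality above can be turned into a rough budget figure. In this sketch, extra_runner_minutes is a hypothetical helper, and both the count of three Python minors and the 10 minutes per cold-cache build are assumptions chosen for illustration; substitute your own measured values.

```python
def extra_runner_minutes(
    python_minors: int, os_targets: int, minutes_per_cold_build: float
) -> float:
    """Order-of-magnitude estimate: one cold-cache wheel build per (minor, OS) pair."""
    return python_minors * os_targets * minutes_per_cold_build


# OS targets named in the appendix.
OS_TARGETS = ["manylinux", "macos-arm64", "windows"]

# Assumed inputs: 3 Python minors, 10 minutes per cold build.
estimate = extra_runner_minutes(
    python_minors=3,
    os_targets=len(OS_TARGETS),
    minutes_per_cold_build=10.0,
)
print(estimate)  # 90.0
```

With these assumed inputs the result falls inside the "tens to low hundreds of minutes" range; warm incremental caches would lower the per-build term rather than the matrix size.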