Spark: optional by design¶
TL;DR - where Spark actually runs
You are here Spark used? POST /simulations, Workbench Simulate, most REST trafficNo - pure Python in the API process ( SparkSession.getActiveSession()is usuallyNone).Your PySpark job calling apply_drl(df, drl)on aDataFrameYes - work is distributed by Spark (your cluster, your SparkSession).No server environment variable flips the API into Spark mode. Integrate in code; see
apply_drlbelow.
SparkRules is named for Spark-oriented data platforms, but using Apache Spark (PySpark) is optional. You choose the execution mode for your use case.
| Mode | When to use it | How rules run |
|---|---|---|
| Pure Python (default API) | Local development, POST /simulations, Workbench Simulate, fast feedback, CI |
One CPython process; no SparkSession created by those endpoints. |
| Spark-integrated (your job) | Large partitioned tables, many executors, production lakehouse jobs | You build a SparkSession, apply rules on a DataFrame (see below); work runs on the cluster. |
There is no single SPARKRULES_ENABLE_SPARK=true switch that turns the HTTP server into a distributed Spark app. Enabling Spark means writing (or reusing) a PySpark job that calls the library, not only starting Uvicorn.
When you do not need Spark¶
- Authoring, validating, and simulating rules via
/rules/validate,/ide/lsp/analyze,/simulations, and the Workbench - all use the in-process engine. - Unit tests and most integration tests.
- Scenarios where one machine and Python throughput are enough.
This is normal and expected; it does not mean the product is “broken.”
When you do use Spark (PySpark)¶
Use Spark when you need distributed execution over large row sets already in the Spark ecosystem:
- Create a
SparkSessionin your driver (EMR, Databricks, Dataproc, YARN, Kubernetes, etc.). - Load facts as a
DataFrame(e.g. from Parquet, Iceberg, Delta - your catalog and IAM). - Apply DRL to that DataFrame using the helpers in
src/sparkrules/spark/dataframe.py: apply_drl(df, drl, ...)-broadcast’s the DRL string, usesmapPartitionsover the DataFrame’s RDD, returns a newDataFrameoffact_id,fired,out_json.- Lower-level:
iter_rule_rows,mpartition_rowsfor custom pipelines. use_v2=True(recommended): the Spark executor merges action fields across strategies with deterministic ordering (salience, then rule-name code points, then DRL declaration order). You may briefly seeaction_<field>__s<salience>_o<order>staging columns internally; outputs surface asaction_<field>after merge.RulePack.serialize()embeds aSRRPversion header before the pickle (seedocs/REQUIREMENTS_V2_ENGINE.mdReq 34).- Optionally combine with a
CompiledRulePackageandsparkrules/transport/broadcaster.pyfor broadcasting larger artifacts in advanced setups (see that module’s docstring and tests).
Requirements on the workers: JARs / Python environment must include this package, PySpark, and compatible Spark version - same as any other PySpark app.
Relationship to the REST API¶
- The FastAPI process (
uvicorn, Docker, etc.) is separate from a batch Spark job. You typically do not run the cluster inside the same process as/simulations. - You can use the API to govern and store rules, then your Spark job loads the DRL (or package) and calls
apply_drlor the executor path you design.
For architecture scope and execution paths, see KNOWN_LIMITATIONS.md and BENCHMARKS.md.
Quick reference (code)¶
| Component | Path |
|---|---|
| DataFrame DRL application | sparkrules/spark/dataframe.py - apply_drl, iter_rule_rows, rows_from_session |
| Broadcast bytes | sparkrules/transport/broadcaster.py |
| Runnable E2E samples | examples/usecases/ domain packs + examples/spark/ harness (apply_drl_local.py, iter_rule_rows_no_jvm.py) |
| Spark iterator tests (need JVM) | tests/spark/, tests/unit/test_spark_iter.py |
If you are unsure, start with pure Python simulations; add Spark when you have a real DataFrame and a cluster to run on.
Tests and line coverage (src/sparkrules/spark)¶
The sparkrules.spark package is covered to 100% line coverage in CI style when you run:
How that is achieved:
| Test file | What it covers |
|---|---|
tests/unit/test_spark_iter.py |
iter_rule_rows with dict rows and row-like asDict() (no JVM). |
tests/unit/test_cov_spark_apply_mocks.py |
apply_drl with a mock DataFrame / SparkSession (no JVM), rows_from_session, and non-dict partition elements. |
tests/unit/test_cov_spark_mpartition.py |
mpartition_rows with real PySpark Row (imports pyspark; skipped if import fails). |
tests/spark/test_pyspark_dataframe.py |
Optional end-to-end with SparkSession (local[1]) - requires JVM; marked spark, not in default pytest selection if you exclude the directory. |
You get full sparkrules.spark coverage from unit tests alone (mocks + importorskip PySpark for mpartition_rows); the tests/spark/ file is for real Spark smoke runs in environments with Java.