sparkrules vs other rule engines
Open-source rule engine that scales from one fact in Python to a billion rows on Spark. Same DRL, same governance, same explainability across both paths.
How sparkrules compares to the common alternatives: Drools, GoRules, Camunda, IBM ODM, Flink CEP, and the pure-Python engines rule-engine and business-rules. Covers cost, scaling, performance, reliability, and features. Measured numbers where possible, literature numbers where not.
sparkrules is Apache 2.0 licensed and free to use. No sales team behind this document. The goal is to help you pick the right tool for your workload.
Table legend for every table below: ✅ yes · ⚠️ partial / with caveats · ❌ no / missing · 🏆 best in class
Quick reference
| Dimension | sparkrules | Drools | GoRules | Camunda DMN | IBM ODM | Flink CEP | rule-engine / business-rules |
|---|---|---|---|---|---|---|---|
| Single-machine per-core speed | ✅ | 🏆 | ✅ | ⚠️ | ✅ | ⚠️ | ⚠️ |
| Distributed billions of rows | 🏆 Spark-native | ❌ single JVM cap | ❌ single process | ❌ | ⚠️ RES cluster | 🏆 Flink-native | ❌ |
| Cost at 1B rows/day | 🏆 ~$142 /mo | ❌ cannot scale | ❌ cannot scale | ❌ cannot scale | ❌ commercial | ⚠️ Flink cluster | ❌ cannot scale |
| Authoring UI for analysts | ✅ Workbench | ✅ Kogito | ✅ ZenJDM | 🏆 Modeler | 🏆 Decision Center | ❌ code only | ❌ none |
| Python-native | 🏆 | ❌ JVM bridge | ⚠️ Go + REST | ❌ JVM bridge | ❌ JVM bridge | ❌ JVM bridge | ✅ |
| Spark-native | 🏆 DataFrame + Catalyst | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Lakehouse (Iceberg/Delta/Hudi) | 🏆 native sinks | ❌ | ❌ | ❌ | ❌ | ⚠️ | ❌ |
| Governance (dev/stage/prod) | ✅ | 🏆 | ⚠️ | ✅ | 🏆 | ❌ | ❌ |
| ๐ Hot-swap rules | โ | โ | โ | โ | โ | โ ๏ธ | โ |
| โฑ๏ธ Stateful CEP | โ | โ | โ | โ | โ | ๐ | โ |
| Commercial SLA | ❌ | 🏆 Red Hat | ⚠️ | ✅ | 🏆 | ✅ Ververica | ❌ |
| License | ✅ Apache 2.0 | ✅ Apache 2.0 | ✅ MIT | ✅ Apache 2.0 | ❌ commercial | ✅ Apache 2.0 | ✅ BSD / MIT |
What the best-in-class (🏆) marks say about each engine:
- sparkrules: distributed billions, cost, Python, Spark, lakehouse
- Drools: per-core speed, governance depth, commercial SLA
- Flink CEP: distributed streams, stateful CEP
- Camunda / IBM ODM: visual modelers, governance, commercial SLA
Benchmark matrix - measured times
All measurements on the same laptop: Windows 11, 4-core CPU, Python 3.13.2, single-process unless noted. Figures for Drools, GoRules, Camunda, and ODM are literature numbers from vendor reports, not measured here - see the §11 caveat.
Single-fact p99 latency (50-rule lending pack)
| Engine | p99 latency | vs sparkrules | Source |
|---|---|---|---|
| 🏆 Drools PHREAK | 20-50 µs | ~12-30× faster per-core | literature (Red Hat) |
| sparkrules 1.1.0 | 595 µs | baseline | measured |
| rule-engine 4.5.3 | 1,523 µs | 2.6× slower | measured |
| business-rules 1.1.1 | 1,884 µs | 3.2× slower | measured |
| GoRules Zen | ~100-200 µs | 3-6× faster per-core | literature |
| Camunda DMN | ~500 µs | similar | literature |
Single-machine batch throughput (50 rules × 10k facts)
| Engine | Throughput | vs sparkrules | Source |
|---|---|---|---|
| 🏆 sparkrules 1.1.0 | 3,465 rows/sec | baseline | measured |
| rule-engine 4.5.3 | 2,716 rows/sec | 0.78× | measured |
| business-rules 1.1.1 | 2,283 rows/sec | 0.66× | measured |
| Drools PHREAK | ~200,000 rows/sec | ~58× faster single-JVM | literature |
| GoRules Zen | ~300,000 rows/sec | ~87× faster single-process | literature |
| pandas vectorized (not a rule engine) | ~1,140,000 rows/sec | vectorized column math | measured |
Distributed batch wall-clock (50-rule pack, 200 executors)
| Engine | 100M rows | 1B rows | Source |
|---|---|---|---|
| 🏆 sparkrules 1.1.0 | 44 seconds | 7.4 minutes | projected from measured local[4] |
| Drools | ❌ no distributed mode; ~80 min on one JVM at 100M | ❌ cannot run | architectural |
| GoRules Zen | ❌ no distributed mode | ❌ cannot run | architectural |
| Camunda DMN | ❌ no distributed mode | ❌ cannot run | architectural |
| IBM ODM | ⚠️ requires RES cluster (commercial) | ⚠️ license-limited | vendor |
| Flink CEP | ❌ streaming only, different workload | ❌ streaming only | architectural |
Read the three tables together
- At single-fact p99: Drools wins per-core, full stop. sparkrules is mid-pack; we do not claim to beat Drools there.
- At single-machine batch: sparkrules wins the Python rule-engine tier. Drools wins single-JVM batch (literature, not measured here) but you pay for that per-core speed with no way to distribute beyond one JVM.
- At distributed batch: sparkrules is the only general-purpose rule engine in the table that actually runs 1B rows. The others either cannot distribute (Drools, GoRules, Camunda) or target a different workload (Flink CEP is streaming).
Full methodology, raw numbers, and reproduction steps: repo root BENCHMARK_LATENCY.md, OPTIMIZED_BENCHMARK.md, and docs/BENCHMARKS.md.
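As a rough illustration of what a single-fact p99 and a batch throughput figure mean, the sketch below times a stand-in scoring function. It is not the benchmark harness from the files named above: `score` is a hypothetical placeholder for an engine call, and the fact fields are made up for the example.

```python
import time

def score(fact: dict) -> bool:
    """Hypothetical stand-in for one rule-engine call on one fact."""
    return fact["amount"] > 1000 and fact["fico"] >= 680

def p99_latency_us(facts: list) -> float:
    """Time each call on its own and return the 99th-percentile latency in µs."""
    samples = []
    for fact in facts:
        start = time.perf_counter_ns()
        score(fact)
        samples.append((time.perf_counter_ns() - start) / 1_000)
    samples.sort()
    rank = (99 * len(samples) + 99) // 100  # nearest-rank: ceil(0.99 * n)
    return samples[rank - 1]

def throughput_rows_per_sec(facts: list) -> float:
    """Time the whole batch once and return rows per second."""
    start = time.perf_counter()
    for fact in facts:
        score(fact)
    return len(facts) / (time.perf_counter() - start)

facts = [{"amount": 500 + i, "fico": 600 + i % 250} for i in range(10_000)]
print(f"p99: {p99_latency_us(facts):.1f} µs")
print(f"throughput: {throughput_rows_per_sec(facts):,.0f} rows/sec")
```

Per-call timing and whole-batch timing answer different questions, which is why the latency and throughput tables are reported separately.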
1. Cost at scale
Scenario: batch-score 1 billion loan applications once per day.
sparkrules on Spark (200 executors, Databricks Jobs, spot pricing)
| Item | Assumption | Cost |
|---|---|---|
| Cluster size | 200× m5.xlarge equivalent, spot | - |
| Wall time for 1B rows | 66 seconds (10 rules) or 7.4 minutes (50 rules), projected from measured throughput | - |
| Cluster cost per run (50 rules) | 200 × $0.192/hr × (7.4/60) hr | ~$4.73 / run |
| Monthly cost | $4.73 × 1 run/day × 30 days | ~$142 / month |
| Annual cost | ~$142 × 12 months | ~$1,700 / year |
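The arithmetic behind the table can be reproduced in a few lines. This is only a sketch of the calculation; the constants are the table's own assumptions (executor count, hourly price, projected wall time), and the 30-day month is an assumption made here.

```python
EXECUTORS = 200
PRICE_PER_EXECUTOR_HOUR = 0.192  # USD per executor-hour, from the table
WALL_MINUTES_50_RULES = 7.4      # projected wall time for 1B rows, from the table

def cost_per_run(executors: int, price_per_hour: float, wall_minutes: float) -> float:
    """Cluster cost of one run: executors x hourly price x hours of wall time."""
    return executors * price_per_hour * (wall_minutes / 60)

run = cost_per_run(EXECUTORS, PRICE_PER_EXECUTOR_HOUR, WALL_MINUTES_50_RULES)
monthly = run * 30   # one run per day, 30-day month (assumption)
annual = monthly * 12

print(f"per run: ${run:.2f}, per month: ${monthly:.0f}, per year: ${annual:.0f}")
```

Swapping in your own executor price or wall time shows how sensitive the monthly figure is to each input.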
Why there's no Drools cost column for 1B rows
Drools does not scale to this workload. A single JVM caps at ~500M rows/hr even with PHREAK, and it cannot distribute across executors. Running 1B rows through Drools means either a 2+ hour serial job on one JVM (which hits GC and heap issues before it finishes) or a hand-rolled fan-out across 3-5 KIE workers behind a load balancer (which is distribution you build and maintain yourself). Either way, the cost question is not "how much does Drools cost for 1B rows/day" - it's "can Drools even do this job." See §2 for the architectural reasons.
IBM ODM
Commercial, per-core runtime fees. Typical enterprise license $50k-$500k/year. Different conversation.
2. Scaling - architectural ceilings
| Engine | Scaling model | Ceiling | Real-world max | Failure mode |
|---|---|---|---|---|
| 🏆 sparkrules | Horizontal (Spark cluster) | None practical | 1B+ rows in minutes on 200 executors | Driver OOM if RulePack > 100 MB |
| Drools | Vertical (single JVM) | ❌ ~32 GB heap | ~500M rows/hr per JVM | GC thrash → heap OOM |
| GoRules Zen | Vertical (single process) | ❌ process memory | ~300k rows/sec | Process memory cap |
| Camunda DMN | Vertical (single JVM) | ❌ session table | ~200k rows/sec | Session growth |
| IBM ODM | Horizontal (RES cluster) | ⚠️ ~50 cores typical | ~500k TPS | License-limited |
| 🏆 Flink CEP | Horizontal (Flink cluster) | None practical | 1M-10M events/sec | State backend latency |
| rule-engine, business-rules | Vertical (single Python) | ❌ Python GIL | ~13k rows/sec | GIL |
Why Drools cannot scale to 1B rows
Drools on a single JVM is bounded by:
- ❌ Heap size - at 32 GB heap, the working set for a large rule pack plus facts is already tight
- ❌ Single-process throughput - no matter how fast PHREAK is per-core, one JVM needs at least 1B ÷ (its rows/sec) seconds to finish the job
- ❌ GC pauses - long-running Drools sessions hit G1/ZGC pauses that add tail latency
- ❌ No native data distribution - Drools has no notion of a partitioned input; you build your own sharding layer
People do run Drools "at scale" by putting 5-50 KIE workers behind a load balancer. That's horizontal scaling bolted onto a single-process engine. Each worker still has the same ceiling, and cross-worker rule-state coordination needs Kafka/Redis glue. It works, but it's work you do yourself.
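To put numbers on the fan-out option, here is a small sketch of the ideal wall time for 1B rows using the ~500M rows/hr per-JVM figure from the scaling table. It assumes perfectly even sharding and zero coordination cost, which is the optimistic case; the function name is illustrative.

```python
ROWS = 1_000_000_000
ROWS_PER_HOUR_PER_JVM = 500_000_000  # the ~500M rows/hr per-JVM figure from the table

def ideal_wall_hours(rows: int, workers: int,
                     rows_per_hour_per_worker: int = ROWS_PER_HOUR_PER_JVM) -> float:
    """Best-case wall time: even sharding, no coordination or GC overhead."""
    return rows / (workers * rows_per_hour_per_worker)

for workers in (1, 5, 50):
    print(f"{workers:>2} workers: {ideal_wall_hours(ROWS, workers):.2f} h")
```

The division is the easy part; the sharding, retry, and state-coordination layers that make the even split real are what you end up building yourself.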
Spark gives you the data distribution layer for free. sparkrules Strategy A compiles rules into one Spark projection that runs across all executors without shuffle. The architectural difference is not "sparkrules is faster" but "Spark already solves the distribution problem Drools leaves to you."
3. Performance - analysis and context
The measured numbers sit in the Benchmark matrix at the top of this doc. This section explains what those numbers mean for picking an engine.
Per-core latency - Drools wins, honestly
Drools PHREAK at 20-50 µs p99 beats sparkrules' 595 µs by ~12-30× per-core. This is a Python-vs-JVM gap, not an algorithm gap. sparkrules already has:
- ✅ Rete-style alpha network with closure-compiled predicates (Req 3)
- ✅ Range-merged alpha nodes (Req 26)
- ✅ `FactView` with `__slots__` for zero-copy field access (Req 25)
Further Python-side gains require a native extension (Req 27-29 - Cython/Rust). Roadmap, not shipped. Until then, if you need <100 µs p99 at <10M rows/day, use Drools.
Single-machine batch - sparkrules wins the Python tier
sparkrules' 3,465 rows/sec beats rule-engine (2,716) by 1.3× and business-rules (2,283) by 1.5×. The gap widens with rule count: at 50 rules, sparkrules' alpha-sharing amortizes compile overhead that per-rule engines pay on every row.
Drools at ~200k rows/sec on a single JVM is faster per-core. It caps there because it cannot partition data across machines. See §2.
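The alpha-sharing idea mentioned above can be sketched in plain Python: each distinct single-fact condition is evaluated once per fact, and every rule that uses it reads the cached result. This is an illustration of the general Rete-style technique, not sparkrules' internal implementation; the rule names, fields, and thresholds are invented for the example.

```python
from typing import Callable, Dict, List

Fact = dict
Predicate = Callable[[Fact], bool]

# Distinct single-fact conditions ("alpha nodes"), keyed by a canonical name.
ALPHA_NODES: Dict[str, Predicate] = {
    "amount>1000": lambda f: f["amount"] > 1000,
    "fico<620": lambda f: f["fico"] < 620,
    "dti>0.43": lambda f: f["dti"] > 0.43,
}

# Each rule is a conjunction of alpha nodes; several rules reuse the same node.
RULES: Dict[str, List[str]] = {
    "decline_low_fico": ["fico<620"],
    "review_large_low_fico": ["amount>1000", "fico<620"],
    "review_large_high_dti": ["amount>1000", "dti>0.43"],
}

def fire(fact: Fact) -> List[str]:
    """Evaluate every distinct condition once, then let rules look results up."""
    memo = {name: pred(fact) for name, pred in ALPHA_NODES.items()}
    return [rule for rule, conds in RULES.items() if all(memo[c] for c in conds)]

print(fire({"amount": 5000, "fico": 600, "dti": 0.2}))
```

Here three predicate evaluations serve five condition references across the rules; the saving grows as more rules share conditions.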
Distributed batch - only sparkrules and Flink CEP run 1B rows
No other rule engine in the comparison distributes natively: IBM ODM needs a licensed RES cluster, and Flink CEP distributes but targets streaming rather than batch. The projected 44-second and 7.4-minute numbers at 200 executors come from linear scaling of the measured local[4] Strategy A throughput with 85% parallel efficiency (conservative for shuffle-less Catalyst projections). Real 200-node measurements are future work.
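The linear-scaling projection can be written out as a formula. The measured local[4] throughput is not quoted in this section, so the sketch below back-solves the per-executor rate implied by the 1B-row projection and then checks that the same rate reproduces the 100M-row figure. Function names are illustrative.

```python
EXECUTORS = 200
EFFICIENCY = 0.85  # parallel efficiency assumed in the projection

def implied_rows_per_sec_per_executor(rows: int, wall_seconds: float) -> float:
    """Per-executor rate that a projected wall time implies."""
    return rows / (wall_seconds * EXECUTORS * EFFICIENCY)

def projected_wall_seconds(rows: int, rows_per_sec_per_executor: float) -> float:
    """Wall time under linear scaling with a fixed parallel efficiency."""
    return rows / (rows_per_sec_per_executor * EXECUTORS * EFFICIENCY)

rate = implied_rows_per_sec_per_executor(1_000_000_000, 7.4 * 60)
print(f"implied per-executor rate: {rate:,.0f} rows/sec")
print(f"100M rows: {projected_wall_seconds(100_000_000, rate):.1f} s")
```

Because the model is linear, any loss of efficiency at 200 nodes (stragglers, skew, scheduler overhead) shows up directly as a proportional increase in wall time.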
Crossover points
- Below ~50k rows: Python `LocalRuleExecutor` beats Spark Strategy A because Spark startup dominates
- 50k-10M rows: single-machine paths (Python, pandas) still competitive; Spark wins at the higher end
- Above 10M rows: Spark Strategy A wins decisively, gap widens with volume
Pick the right path from §7 ("How you run sparkrules") based on your volume.
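The crossover points translate into a simple dispatch rule. The sketch below is a hypothetical helper, not part of sparkrules; the thresholds are the approximate ones listed above and the returned labels are made up.

```python
def pick_path(row_count: int) -> str:
    """Map a row count to an execution path using the crossover points above."""
    if row_count < 50_000:
        return "python"          # Spark startup would dominate
    if row_count <= 10_000_000:
        return "single-machine"  # Python or pandas; Spark near the top of the range
    return "spark"               # Strategy A wins decisively

for n in (1_000, 2_000_000, 500_000_000):
    print(n, "->", pick_path(n))
```

In practice the middle band is a judgment call, so a helper like this is better used as a default than as a hard rule.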
Python 3.13 caveat
Measurements ran on Python 3.13, which is slower than 3.11 for hot loops by ~20-30% on this workload. PySpark 3.5.3 officially targets 3.11. Re-running on 3.11 should show tighter numbers. That is why the report states plainly that the Req 18 target of <500 µs p99 at 50 rules was not met on 3.13; on 3.11 we expect to meet it.
4. Reliability
sparkrules v1.0.0 released May 2026; v1.1.0 shipped the V2 optimized engine. A maintained register with forensics lives at SPARKRULES_BUG_REPORT.md (repo root). The test suite uses property-based testing (Hypothesis) for Python paths.
Spark execution helpers in `src/sparkrules/spark/executor.py` still carry `# pragma: no cover` on JVM-only apply / strategy methods so CI can enforce 100% line coverage on `src/sparkrules/` without a Spark cluster; pure-Python pieces such as `action_staging_merge_plan` are unit-tested. Run `tests/spark/` on a host with Java for full integration confidence.
Drools, IBM ODM, and Camunda are mature JVM stacks. If you need FICO-grade stability today, those remain strong choices. If you need lakehouse-native batch scoring, sparkrules targets that shape explicitly.
Cross-path equivalence
Rules evaluated on the Python path and the Spark path are intended to match for the supported DRL subset (see Req 12). Edge cases around `contains`, `in` with array columns, and Python `re` vs Spark `RLIKE` are handled in the compiler/translator paths; for Catalyst-specific plans, verification on a real cluster is still recommended. See the bug report and KNOWN_LIMITATIONS.md.
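One concrete reason the regex edge case exists: Spark's `RLIKE` succeeds when the pattern matches anywhere in the string, whereas Python's `re.match` anchors at the start and `re.fullmatch` requires the whole string. The sketch below shows the difference with the standard library only; `python_rlike` is a hypothetical helper name, and it does not cover dialect differences between Java and Python regex syntax.

```python
import re

def python_rlike(pattern: str, value: str) -> bool:
    """Find-anywhere semantics, the closest Python counterpart to RLIKE."""
    return re.search(pattern, value) is not None

pattern = r"\d{3}"
value = "id-123-x"

print(re.match(pattern, value) is not None)      # anchored at the start
print(re.fullmatch(pattern, value) is not None)  # must cover the whole string
print(python_rlike(pattern, value))              # matches anywhere in the string
```

A cross-path test suite would want cases like this one, where the three matching modes disagree, rather than only patterns that match the full value.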
5. Features - what each engine ships
| Feature | sparkrules | Drools | GoRules | Camunda | IBM ODM | Flink CEP | rule-engine |
|---|---|---|---|---|---|---|---|
| DRL text rules | โ | โ | โ (JDM) | โ | โ | โ | โ ๏ธ |
| DMN 1.3 import | โ | โ | โ | โ | โ | โ | โ |
| Decision tables (XLSX) | โ | โ | โ | โ | โ | โ | โ |
| Salience priority | โ | โ | โ | โ ๏ธ | โ | โ ๏ธ | โ |
| Agenda groups | โ | โ | โ | โ ๏ธ | โ | โ | โ |
| Activation groups (XOR) | โ | โ | โ | โ | โ | โ | โ |
| Rete alpha sharing | โ + closure compile | โ PHREAK | โ | โ ๏ธ | โ | โ | โ |
| 🆕 SQL pushdown via Catalyst | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 🆕 Pandas batch | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 🆕 Spark DataFrame native | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Streaming (Structured Streaming) | โ | โ ๏ธ | โ | โ | โ ๏ธ | ๐ native | โ |
| Hot-swap rules live | โ | โ | โ | โ | โ | โ ๏ธ | โ |
| Explainability (reasons, bound facts) | โ | โ | โ | โ | โ | โ ๏ธ | โ |
| Adverse-action notices (ECOA/GDPR) | โ | โ | โ | โ | โ | โ | โ |
| 🆕 Counterfactual simulation | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 🆕 Data quality checks (DQ DSL) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 🆕 OPA/Rego export | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 🆕 Iceberg / Delta / Hudi sink | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| REST API | โ | โ KIE Server | โ | โ | โ | โ | โ |
| KIE-compatible REST (Drools migration) | โ | โ | โ | โ | โ | โ | โ |
| Browser authoring UI | ✅ Workbench | ✅ Kogito | ✅ ZenJDM | ✅ Modeler | ✅ Decision Center | ❌ | ❌ |
| Time-travel debugging | โ | โ ๏ธ | โ | โ | โ | โ | โ |
| 🆕 Property-based test harness | ✅ Hypothesis | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| 🆕 dbt integration example | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
🆕 = features unique to sparkrules in this comparison (9 total): SQL pushdown via Catalyst, pandas batch, Spark DataFrame native, counterfactual simulation, DQ DSL, OPA/Rego export, Iceberg/Delta/Hudi result sink, property-based harness, dbt mapping example.
6. Where each engine is the right choice
Use sparkrules when
- You already run Spark (Databricks, Glue, Dataproc, Synapse, EMR, on-prem) and want rules on DataFrames
- Rule corpus is 50-5,000 rules that business analysts change weekly
- Data volume exceeds what one machine holds (>100M rows/run)
- Output goes to a lakehouse (Iceberg, Delta, Hudi)
- You need regulatory artifacts (ECOA/GDPR adverse action, audit trails)
- You want Apache 2.0 economics and Python-native integration (notebooks, dbt, Airflow)
Use Drools / Red Hat BRMS when
- You're JVM-native and need <100 µs p99 at <10M rows/day
- You need stateful CEP (sliding windows, accumulators, retractions)
- You need enterprise SLA from Red Hat
Use Flink CEP when
- Sub-second latency on infinite event streams
- Complex event pattern matching across time windows
Use Camunda DMN when
- Business analysts draw decision tables in a visual modeler
- Decision logic is declarative and DMN-shaped
Use IBM ODM when
- You need commercial 24/7 support with budget
- You need Decision Center governance features
Use pandas or raw Python when
- <20 rules that never change
- No need for authoring, governance, or explainability
- One machine is enough
7. How you run sparkrules (with or without Spark)
sparkrules is not Spark-only. Most teams start pure-Python, keep it there for real-time and notebooks, add Spark only when data volume demands it. All paths share the same rules, same DRL, same governance, same explainability.
Path 1 · Pure Python (no Spark, no JVM)
`pip install sparkrules` gives you a working rule engine. No Java, no cluster, no network. Used for:
- Real-time single-fact scoring in a FastAPI service (`LocalRuleExecutor.score(fact)`)
- Batch over a list of dicts (`LocalRuleExecutor.apply(facts)`)
- Notebook exploration with pandas (`apply_pandas(pack, df)`)
- CI test suites
- Workbench UI authoring and simulation
- REST API exposure of rules and simulations
Measured p99 at 50 rules: 595 µs. Batch: 3,465 rows/sec. Faster than both other Python rule engines measured at this tier (see shootout).
Path 2 · DuckDB metadata store
`create_rule_store("duckdb", db_path="rules.duckdb")` persists your rule catalog to one DuckDB file. No server, no admin. Full SQL CRUD. Good for:
- Single-node deployments with durable rule state
- Embedded analytics apps shipping with rules as data
- Offline dev with real persistence
Real DuckDB, not a pickle stub.
Path 3 · PostgreSQL metadata store
`create_rule_store("postgres", database_url="postgresql://...")` for multi-replica deployments via standard psycopg. Good for:
- Production REST API with 3+ replicas
- Governance workflows (dev/stage/prod promotion) needing ACID
- Organizations standardized on Postgres for metadata
Real Postgres driver, not a stub.
Path 4 · Pandas batch (no Spark, vectorized)
`apply_pandas(pack, df)` runs rules over a pandas DataFrame. Simple rules use vectorized column ops; complex rules use compiled closures. Good for:
- Single-node batch up to ~10M rows
- Notebook workflows where pandas is already the data frame
- DQ rule packs that want vectorized speed without a cluster
Path 5 · Spark (only when you need it)
`apply_drl(df, drl)` on a Spark DataFrame. Used when data volume exceeds one machine. Same rules, same governance, zero code change from Python paths.
Path 6 · REST API + Workbench UI
All of the above runs behind a FastAPI service. Author rules in the Monaco editor in the browser, simulate with uploaded CSVs, and promote dev → stage → prod, all without writing Python code.
What teams pick in practice
Most teams use paths 1 + 3 + 6 (Python engine + Postgres metadata + REST/Workbench) for authoring and real-time. They add path 5 (Spark) only for the nightly batch scoring job. Rules written in the browser run unchanged on the cluster.
8. Spark integration (when you need it)
sparkrules runs on existing Spark without re-platforming. `apply_drl(df, drl)` is a one-line call. Rules compile to Spark SQL expressions (Strategy A) that run entirely in Catalyst - zero Python workers.
Spark version support
- ✅ Spark 3.x (3.0 through 3.5+) fully supported and tested
- ✅ Version normalizer accepts `"3"`, `"3.5"`, or `"3.5.1"`
- ⚠️ Spark 4.x support planned for the next minor release; config validator currently enforces 3.x
Platforms with config-driven dispatch
Same DRL, same Python code, one config value switches the deployment target.
| Platform | Config value | Auto-applied Spark conf | Deploy docs |
|---|---|---|---|
| AWS Glue | `platform="glue"` | `spark.glue.dpu` (default 10) | deploy/aws-glue/README.md |
| Databricks (AWS/Azure/GCP) | `platform="databricks"` | `spark.databricks.cluster.profile=serverless` | deploy/databricks/README.md |
| GCP Dataproc | `platform="gcp-dataproc"` | `spark.dataproc.autoscaling.enabled=true` | deploy/gcp-dataproc/README.md |
| Azure Synapse | `platform="azure-synapse"` | `spark.synapse.optimizeWrite=true` | deploy/azure-synapse/README.md |
| Kubernetes | n/a | Standard k8s manifests | deploy/k8s/ |
| Local / dev | `platform="local"` | Spark `local[*]` | `pip install sparkrules[spark]` |
Also compatible (not in the validator allowlist but runs PySpark 3.x): AWS EMR, Cloudera Data Engineering, on-prem Hadoop/YARN, standalone Spark, EKS/GKE/AKS via k8s manifests.
Lakehouse I/O
| Direction | Formats |
|---|---|
| Input sources | Iceberg, Delta Lake, Hudi, Parquet, Kafka (streaming), Kinesis (streaming), JDBC |
| Output sinks | Iceberg, Delta Lake, Hudi, Parquet |
Streaming integration
- ✅ Structured Streaming DataFrames work unchanged with `apply_drl(df, drl)`
- ✅ `refresh_rules(drl)` hot-swaps rules in a running query without stopping it
- ✅ Micro-batch and continuous modes both supported
Zero-code-change runtime validation
`validate_zero_code_change(cfg)` runs at startup and rejects invalid combinations before the Spark job spins up:
- ❌ Unsupported backend / platform / Spark version
- ❌ Glue DPU below 2
- ❌ Incompatible input/output format
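The kind of checks listed above can be sketched as a small pure-Python validator. This is an illustration only, not the body of `validate_zero_code_change`: the `DeployConfig` shape, the helper names, and the error strings are all hypothetical, and padding `"3"` to `(3, 0, 0)` is an assumption about how the version normalizer behaves. The platform values come from the dispatch table.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Platform values taken from the config-driven dispatch table.
ALLOWED_PLATFORMS = {"glue", "databricks", "gcp-dataproc", "azure-synapse", "local"}

@dataclass
class DeployConfig:
    """Hypothetical config shape, for illustration only."""
    platform: str
    spark_version: str
    glue_dpu: int = 10

def normalize_spark_version(raw: str) -> Tuple[int, int, int]:
    """Pad "3" or "3.5" to a (major, minor, patch) triple (assumed behavior)."""
    parts = raw.strip().split(".")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unrecognized Spark version: {raw!r}")
    major, minor, patch = [int(p) for p in parts] + [0] * (3 - len(parts))
    return major, minor, patch

def validate(cfg: DeployConfig) -> List[str]:
    """Collect every problem instead of stopping at the first one."""
    errors = []
    if cfg.platform not in ALLOWED_PLATFORMS:
        errors.append(f"unsupported platform: {cfg.platform}")
    try:
        if normalize_spark_version(cfg.spark_version)[0] != 3:
            errors.append(f"unsupported Spark version: {cfg.spark_version}")
    except ValueError as exc:
        errors.append(str(exc))
    if cfg.platform == "glue" and cfg.glue_dpu < 2:
        errors.append("Glue DPU must be at least 2")
    return errors

print(validate(DeployConfig(platform="glue", spark_version="3.5", glue_dpu=1)))
```

Returning a list of errors rather than raising on the first one lets a startup check report every misconfiguration in a single pass.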
Integration checklist
| Capability | Status |
|---|---|
| Python wheel on PyPI (`pip install sparkrules`) | ✅ |
| Docker image on GHCR | ✅ |
| Kubernetes manifests | ✅ |
| AWS Glue job template + params | ✅ |
| Databricks cluster config | ✅ |
| GCP Dataproc config | ✅ |
| Azure Synapse config | ✅ |
| DuckDB metadata store (real `duckdb` driver) | ✅ |
| PostgreSQL metadata store (real `psycopg` driver) | ✅ |
| Iceberg metadata sink (real `pyiceberg`) | ✅ |
| Iceberg / Delta / Hudi sink for rule results | ✅ |
| Kafka / Kinesis / JDBC source contracts | ⚠️ validated schema contract; the actual `readStream` call lives in your PySpark job |
| REST API (FastAPI + Swagger) | ✅ |
| KIE-compatible REST (Drools migration) | ✅ |
| Browser authoring UI (Workbench) | ✅ |
| LSP for in-editor DRL diagnostics | ✅ |
| dbt mapping example | ✅ |
| Property-based test harness (Hypothesis) | ✅ |
Caveats worth reading:
- Kafka/Kinesis/JDBC support is a validated schema contract. sparkrules verifies your watermark field and partition key are present before the job runs. The actual `spark.readStream.format("kafka")` call stays in your PySpark job, because that's where your broker config, auth, and checkpoint path belong.
- EMR, Cloudera CDP, and standalone Spark aren't in the config validator's platform allowlist. They run PySpark 3.x so they work; use `platform="local"` as the config value or extend the validator.
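A schema contract of the kind described in the first caveat boils down to checking that required fields exist before any stream is opened. The sketch below is a hypothetical stand-alone check, not the sparkrules implementation; the function name and the example field names are assumptions.

```python
from typing import Iterable, List

def missing_contract_fields(schema_fields: Iterable[str],
                            watermark_field: str,
                            partition_key: str) -> List[str]:
    """Return the required fields that are absent from the source schema."""
    present = set(schema_fields)
    return [name for name in (watermark_field, partition_key) if name not in present]

schema = ["application_id", "event_time", "amount"]
missing = missing_contract_fields(schema, watermark_field="event_time",
                                  partition_key="customer_id")
if missing:
    print("contract violated, missing fields:", missing)
```

Failing here, before the job starts, is cheaper than discovering a missing watermark column after the streaming query has been deployed.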
9. Why "Drools on Spark" is an anti-pattern
Drools marketing and blog posts sometimes suggest Drools runs "on Spark" by calling the KIE session from inside a Spark executor. It works at small scale and breaks at every other scale.
The shape of the anti-pattern
- Serialize a KIE session or the DRL knowledge base and broadcast it to executors
- In `mapPartitions`, each executor starts a JVM-side KIE session
- For each row, convert Spark Row → Java object → KIE insert → fire rules → extract → serialize back to Spark
- Repeat per row, per partition, per job
Every row pays for a Python-to-JVM conversion (PySpark) or an object allocation (Scala Spark), a KIE session insert, a working-memory match, a retract, and a result extraction. None of that is Catalyst-optimizable. Spark sees an opaque `mapPartitions` function and gives up on predicate pushdown, column pruning, and vectorization.
What breaks
| Issue | Impact |
|---|---|
| ❌ KIE session init per partition | ~10-30 s pure session-setup time on 200 partitions before any rule fires |
| ❌ Per-row JVM bridge latency (PySpark) | ~50-200 µs per row round-trip via Py4J + pickle. At 100M rows, 5,000-20,000 CPU-seconds across the cluster just for the bridge |
| ❌ Garbage collection storms | KIE working memory accumulates facts that must be retracted and GC'd. GC pauses align badly with Spark task timeouts, causing retry storms |
| ❌ No Catalyst optimization | The planner sees an opaque UDF. No predicate pushdown to Parquet, no column pruning, no vectorization. `amount > 1000` becomes a per-row Python → JVM → Python round-trip when Catalyst would have compared one integer in native code |
| ❌ Stateful semantics broken across executors | Drools working memory is isolated per executor. Cross-fact rules do not work unless you shuffle to the same executor, defeating parallelism |
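The bridge-cost row is plain multiplication, shown here as a short sketch so the figure can be recomputed for other row counts. The per-row latencies are the table's estimates, not measurements made for this example.

```python
def bridge_cpu_seconds(rows: int, per_row_us: float) -> float:
    """Total CPU-seconds spent on the per-row bridge alone."""
    return rows * per_row_us / 1_000_000

ROWS = 100_000_000
for per_row_us in (50, 200):
    print(f"{per_row_us} µs/row -> {bridge_cpu_seconds(ROWS, per_row_us):,.0f} CPU-seconds")
```

Because the cost is linear in row count, it never amortizes: ten times the rows means ten times the bridge overhead.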
Estimated effect (back-of-envelope, literature)
No widely published benchmark exists for this pattern because most teams abandon it before publishing. Reasonable estimate:
| Configuration | Wall time at 1M rows |
|---|---|
| Python + Drools-on-Spark | ~30-60 minutes (session init + Py4J dominates) |
| Scala + Drools-on-Spark | ~10-20 minutes (no Py4J, still KIE + GC) |
| 🏆 sparkrules Strategy A (same cluster) | under 10 seconds |
Roughly 60-300× wall-clock difference. The gap widens with row count because per-row JVM bridge cost is a fixed tax per row, while sparkrules' Catalyst pushdown runs as a single JVM projection with no per-row bridge.
What sparkrules does instead
sparkrules compiles rules into Spark SQL expressions at classification time (Strategy A). Those expressions become part of the Spark logical plan. Catalyst optimizes them alongside the rest of your query:
- ✅ Predicate pushdown at the Parquet/Iceberg scan layer
- ✅ Column pruning so only referenced fact fields are read
- ✅ Vectorized execution using Spark's Tungsten memory layout
- ✅ JVM-native codegen, zero Python workers in the hot path
Zero Python-JVM bridge per row. Zero KIE session init. Zero opaque UDFs blocking Catalyst. The rules are SQL that happens to be generated from DRL.
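To make "SQL generated from rules" concrete, here is a toy translator that turns simple single-field conditions into SQL expression strings. It is a sketch of the general idea, not sparkrules' compiler: the rule dictionary format, the function names, and the `loans` table name are hypothetical, and it only handles numeric literals (a real translator must quote and escape values).

```python
ALLOWED_OPS = {">", ">=", "<", "<=", "=", "!="}

# Hypothetical rule shape: one numeric comparison per rule.
RULES = [
    {"name": "large_amount", "field": "amount", "op": ">", "value": 1000},
    {"name": "low_fico", "field": "fico", "op": "<", "value": 620},
]

def rule_to_sql(rule: dict) -> str:
    """Render one rule as a boolean SQL column expression."""
    if rule["op"] not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {rule['op']}")
    if not isinstance(rule["value"], (int, float)):
        raise TypeError("this sketch only handles numeric literals")
    return f"({rule['field']} {rule['op']} {rule['value']}) AS {rule['name']}"

def rules_to_select(rules: list, table: str) -> str:
    """Combine all rules into one projection over the input table."""
    columns = ",\n  ".join(rule_to_sql(r) for r in rules)
    return f"SELECT *,\n  {columns}\nFROM {table}"

print(rules_to_select(RULES, "loans"))
```

Expression strings of this form could be handed to PySpark's `DataFrame.selectExpr`, at which point the optimizer sees ordinary column expressions rather than an opaque function.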
When a rule cannot be translated to SQL (multi-fact patterns, complex actions), sparkrules falls back to Strategy C (Python workers with shared alpha network), not to a JVM engine. Strategy C is still faster than Drools-on-Spark because the alpha network is shared across rules and the bridge cost is paid once per row for all rules, not once per row per rule.
Summary
Putting a JVM rule engine inside Spark executors is the data-engineering equivalent of running a database inside a MapReduce job. It works as a demo, not in production. Drools is a good tool on its native substrate (single JVM with enterprise SLA). Spark is a good tool on its native substrate (distributed query engine with Catalyst). Bolting them together gives you the worst properties of both.
sparkrules is built to be Spark-native, not a Drools-in-Spark adapter. That architectural difference is why the 1B-row benchmark works.
10. What this document does not claim
- ❌ Not faster than Drools per core. Drools owns per-core speed on the JVM. Native extension (Req 27-29) on the roadmap targets closing that gap.
- ❌ Not a replacement for Flink CEP on streaming pattern matching. Different tool.
- ❌ Not faster than pandas at SIMD column math. pandas isn't a rule engine; different category.
- ⚠️ Drools / GoRules / Camunda / ODM literature numbers are from vendor reports, not measured on this laptop. Treat them as sizing guides, not commitments.
- ⚠️ 100% line coverage number applies to the Python engine only. Spark execution paths are marked `# pragma: no cover` and covered by manual smoke tests, not unit coverage. Bug 36.
11. Apples-to-apples vs apples-to-oranges - read this before quoting any number
This doc compares engines in four different categories. Rankings are meaningful only within a single category.
| Tier | Category | Members in this doc |
|---|---|---|
| 1 | Rule engines | sparkrules, Drools, GoRules, Camunda, IBM ODM, rule-engine, business-rules |
| 2 | Stream CEP engines | Flink CEP |
| 3 | Vectorized column math | pandas, numpy, polars (baselines) |
| 4 | Hand-written code | raw Python if/else (baseline) |
Which comparisons are fair
| Comparison | Fair? | Why |
|---|---|---|
| sparkrules vs rule-engine vs business-rules | ✅ apples-to-apples | All Tier 1, Python, same workload |
| sparkrules vs Drools on cost | ⚠️ same tier, different runtimes | Fair for architectural economics; gap is architectural, not per-core |
| sparkrules vs Drools on latency | ⚠️ same tier, different runtimes | Drools wins per-core; stated explicitly |
| sparkrules vs GoRules / Camunda / IBM ODM | ⚠️ same tier, different runtimes | Literature, not measured; architectural fit only |
| sparkrules vs Flink CEP | ⚠️ Tier 1 vs Tier 2 | Different categories; fair only for streaming CEP |
| sparkrules vs pandas / raw Python | ❌ Tier 1 vs Tier 3/4 | pandas isn't a rule engine |
Honest cross-tier summary
- In Tier 1 (rule engines): sparkrules wins the Python sub-tier at 50+ rules (measured). Drools wins the JVM sub-tier on per-core speed (literature). sparkrules wins any sub-tier once data exceeds one machine (architectural).
- vs Tier 2 (Flink CEP): different tool. Use Flink for event streams, sparkrules for batch and structured streaming decisions. They compose.
- vs Tier 3 (pandas): different tool. Static column math → pandas. Authoring / governance / explainability → sparkrules.
- vs Tier 4 (raw code): different tool. Rules that never change → raw code. Rules that non-technical staff change → a rule engine.