Features¶
For production scope, unsupported DRL/Spark edges, and execution caveats, see KNOWN_LIMITATIONS.md.
Core engine¶
- DRL-style rule parsing and evaluation
- Salience-based priority control
- Agenda group and activation group execution controls
- Explainable outputs with bound data and reason codes
- Optional `SQL_JOIN` / multi-pattern path for list-valued fact bindings
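To illustrate how salience and activation groups interact, here is a minimal sketch, assuming Drools-style semantics (higher salience fires first, and at most one rule fires per activation group). The `Rule` dataclass and `fire()` function are hypothetical names for this example, not the library's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule record used only for this illustration."""
    name: str
    salience: int
    when: Callable[[dict], bool]
    activation_group: Optional[str] = None


def fire(rules: List[Rule], fact: dict) -> List[str]:
    """Return the names of fired rules, highest salience first."""
    fired = []
    used_groups = set()
    # sorted() is stable, so rules with equal salience keep declaration order.
    for rule in sorted(rules, key=lambda r: -r.salience):
        if rule.activation_group in used_groups:
            continue  # another rule in this activation group already fired
        if rule.when(fact):
            fired.append(rule.name)
            if rule.activation_group is not None:
                used_groups.add(rule.activation_group)
    return fired


RULES = [
    Rule("standard_tier", 10, lambda f: f["score"] >= 600, "tier"),
    Rule("premium_tier", 50, lambda f: f["score"] >= 750, "tier"),
    Rule("flag_large_amount", 20, lambda f: f["amount"] > 10_000),
]

print(fire(RULES, {"score": 800, "amount": 20_000}))
```

In this sketch `standard_tier` also matches the fact, but it is skipped because `premium_tier` has higher salience and shares its activation group.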
V2 optimized engine (new)¶
- AST-to-SQL translator - DRL predicates translated to Spark SQL for Catalyst pushdown
- Closure compiler - predicates compiled to Python closures at parse time (5-10x faster)
- Alpha network - shared predicate evaluation across rules (Rete-style deduplication)
- RulePack - structured, salience-ordered, classified rule collection
- Three execution strategies: SQL_PUSHDOWN, ALPHA_SHARED, PYTHON_FALLBACK
- LocalRuleExecutor - Python-native scoring with compiled closures + alpha network
- NativeRuleExecutor (optional `sparkrules-native` wheel; build/CI artifact, not on PyPI yet; `[native]` extra empty until publish) - Rust Tier-1 scalar scorer, JSON fact I/O, parity with `LocalRuleExecutor.score()`; see NATIVE_TIER1.md
- SparkRuleExecutor - three-strategy Spark dispatch with typed output columns (Spark `RLIKE` vs Python `re`: see KNOWN_LIMITATIONS.md)
- ReteNetwork - FactView with `__slots__`, range-merged alpha nodes, frozenset membership
- Pandas batch evaluation - `apply_pandas()` for vectorized evaluation without Spark
- Cross-path equivalence - unit + property coverage in `tests/unit/test_cross_path_equivalence.py` and selected `tests/property/` cases
- DRL parse caching (LRU 256) for repeated evaluations
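The closure compiler and alpha network ideas above can be sketched in a few lines. This is an illustrative approximation only: `compile_predicate`, `AlphaNetwork`, and `score` are hypothetical names, and the real engine's data structures are not shown in this document. The point is that each distinct predicate is compiled once into a closure and evaluated once per fact, even when several rules share it.

```python
import operator
from typing import Callable, Dict, List, Tuple

OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}

Key = Tuple[str, str, object]


def compile_predicate(field: str, op: str, value) -> Callable[[dict], bool]:
    """Resolve the operator once, then return a closure over it."""
    compare = OPS[op]

    def predicate(fact: dict) -> bool:
        return field in fact and compare(fact[field], value)

    return predicate


class AlphaNetwork:
    """Deduplicates predicates so each one runs at most once per fact."""

    def __init__(self) -> None:
        self._nodes: Dict[Key, Callable[[dict], bool]] = {}

    def node(self, field: str, op: str, value) -> Key:
        key = (field, op, value)
        if key not in self._nodes:
            self._nodes[key] = compile_predicate(field, op, value)
        return key

    @property
    def size(self) -> int:
        return len(self._nodes)

    def evaluate(self, fact: dict) -> Dict[Key, bool]:
        return {key: pred(fact) for key, pred in self._nodes.items()}


def score(rules: Dict[str, List[Key]], network: AlphaNetwork, fact: dict) -> List[str]:
    """A rule matches when all of its (shared) predicates are true."""
    results = network.evaluate(fact)
    return [name for name, keys in rules.items() if all(results[k] for k in keys)]


net = AlphaNetwork()
rules = {
    "adult_us": [net.node("age", ">=", 18), net.node("country", "==", "US")],
    "adult_any": [net.node("age", ">=", 18)],  # reuses the existing node
}
print(net.size, score(rules, net, {"age": 30, "country": "DE"}))
```

Two rules reference three predicates here, but the network holds only two nodes because `age >= 18` is shared.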
Regulatory compliance (new)¶
- Adverse-action notices - `build_adverse_action_notice()` for ECOA/FCRA/GDPR Art 22
- Principal reasons capped at 4 per ECOA standard
- Deduplicated, priority-ordered reason codes with audit metadata
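The reason-code handling described above (deduplicate, order by priority, cap at 4) might look roughly like the following. This is a sketch under assumptions: `principal_reasons` is a hypothetical helper, not the signature of `build_adverse_action_notice()`, and the `(code, priority)` input shape and the reason codes are invented for the example.

```python
from typing import Iterable, List, Tuple

MAX_PRINCIPAL_REASONS = 4  # cap stated in the feature list above


def principal_reasons(
    fired: Iterable[Tuple[str, int]],
    limit: int = MAX_PRINCIPAL_REASONS,
) -> List[str]:
    """Deduplicate reason codes, order by priority (high first), cap the list."""
    best = {}
    for code, priority in fired:
        # Keep the highest priority seen for a repeated code.
        if code not in best or priority > best[code]:
            best[code] = priority
    # Ties are broken alphabetically so the output is deterministic.
    ordered = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [code for code, _ in ordered[:limit]]


fired = [
    ("DTI_HIGH", 90), ("SHORT_HISTORY", 40), ("DTI_HIGH", 70),
    ("RECENT_DELINQ", 80), ("LOW_INCOME", 60), ("TOO_MANY_INQ", 30),
]
print(principal_reasons(fired))
```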
Data quality and profiling (new)¶
- Statistical profiling - `profile_rows()` for completeness, uniqueness, mean/stddev/percentiles
- DQ checks: not-null, range, in-set, regex, uniqueness, freshness, column sum, row count, table counts
- Severity levels: INFO, WARN, ERROR, CRITICAL with tolerance thresholds
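As an illustration of how a severity level and a tolerance threshold can combine in a check, here is a minimal sketch. The `Severity`, `CheckResult`, `check_not_null`, `check_range`, and `worst_failure` names are hypothetical and assume that tolerance means "maximum allowed fraction of violating rows"; the library's actual check API may differ.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    INFO = 0
    WARN = 1
    ERROR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class CheckResult:
    name: str
    severity: Severity
    violation_rate: float
    passed: bool


def check_not_null(rows, column, severity, tolerance=0.0) -> CheckResult:
    """Pass when the share of null values does not exceed the tolerance."""
    bad = sum(1 for row in rows if row.get(column) is None)
    rate = bad / len(rows) if rows else 0.0
    return CheckResult(f"not_null({column})", severity, rate, rate <= tolerance)


def check_range(rows, column, low, high, severity, tolerance=0.0) -> CheckResult:
    """Pass when the share of non-null values outside [low, high] is tolerated."""
    values = [row[column] for row in rows if row.get(column) is not None]
    bad = sum(1 for v in values if not (low <= v <= high))
    rate = bad / len(values) if values else 0.0
    return CheckResult(f"range({column})", severity, rate, rate <= tolerance)


def worst_failure(results: List[CheckResult]) -> Optional[Severity]:
    """Highest severity among failed checks, or None if all passed."""
    failed = [r.severity for r in results if not r.passed]
    return max(failed) if failed else None


rows = [{"age": 30}, {"age": None}, {"age": 150}, {"age": 41}]
results = [
    check_not_null(rows, "age", Severity.WARN, tolerance=0.25),
    check_range(rows, "age", 0, 120, Severity.ERROR),
]
print(worst_failure(results))
```

Here the not-null check passes because one null in four rows is within the 25% tolerance, while the range check fails with zero tolerance.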
Policy export (new)¶
- OPA/Rego export - `export_to_rego()` converts DRL to Open Policy Agent format
- DMN 1.3 import - parse Camunda-style decision table XML
Authoring formats¶
- DRL text
- Decision table JSON model
- XLSX decision table import/export
- Template-driven guided field schema generation for UI/editor surfaces
Execution and runtime¶
- Single-fact and batch-style execution paths
- Spark dataframe helper paths for partition processing - optional; default API path is pure Python (SPARK_INTEGRATION.md)
- Replay metadata model for deterministic re-runs
- Spark version targeting for Spark 3.x / 4.x runtimes with normalization (`3`, `3.5`, `4`, `4.2`, …); default target `4.0` in `EngineConfig`
- Config-only platform switching across Glue/Databricks/GCP Dataproc/Azure Synapse/local
- Configurable executor resources (cores, workers, memory, Glue DPU)
- Streaming rule refresh orchestration for micro-batch pipelines
- Input-source contract validation (batch and streaming profiles)
- Format policy classification for supported source types
- Output sink abstraction with `iceberg`/`delta`/`hudi`/`parquet` targets
- Export service with manifest and SHA-256 output integrity hash
- Zero-code-change runtime configuration contract validation
- Performance harness and scale evidence estimation utilities
- UDF registry with versioned resolution and replay-time pinning
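The export manifest with a SHA-256 integrity hash mentioned in the list above can be approximated with the standard library. This is a sketch only: `build_manifest`, `verify_export`, and the manifest field names are assumptions for illustration, not the export service's actual schema.

```python
import hashlib
import json


def build_manifest(payload: bytes, rule_count: int) -> dict:
    """Describe an exported payload and record its SHA-256 digest."""
    return {
        "rule_count": rule_count,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def verify_export(payload: bytes, manifest: dict) -> bool:
    """Recompute the digest and compare it with the manifest."""
    return (
        len(payload) == manifest["size_bytes"]
        and hashlib.sha256(payload).hexdigest() == manifest["sha256"]
    )


# Serialize with sorted keys so the same rules always hash the same way.
payload = json.dumps(
    [{"name": "premium_tier", "salience": 50}], sort_keys=True
).encode("utf-8")
manifest = build_manifest(payload, rule_count=1)

print(verify_export(payload, manifest))         # unchanged payload
print(verify_export(payload + b" ", manifest))  # tampered payload
```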
Service surfaces¶
- FastAPI endpoints for health, rules, rule-pack import/export, version diff, governance (pins + deprecations with enforce), LSP (`/ide/lsp/analyze`), simulations (default, shadow, coverage, counterfactual, chain), time-travel debug capture/replay, deployment config, DQ, Workbench helper routes
- Browser Rules Workbench at `/workbench/`: Monaco DRL editor, validate (parse) + LSP diagnostics, Overview (stats, charts), light/dark theme synced with editor, assets with filters, per-version activate/deactivate (see API), simulation, deployment readout, template helper, Phase 3 pack + diff, Phase 4 governance pane
- Optional browser login for Workbench (`SPARKRULES_WORKBENCH_AUTH`) - the shipped static shell hides the sign-in form by default (`WORKBENCH_LOGIN_UI_ENABLED = false`); use `SPARKRULES_API_KEY` or leave workbench auth unset for typical dev (WORKBENCH_LOGIN.md)
- Infrastructure as code (Terraform) - reusable AWS modules (`s3-artifacts-aws`, `emr-ec2-roles-aws`), roots for EMR / Glue / EKS / Databricks / GCP / Azure, `terraform.tfvars.example` per root, and a production deployment runbook - INFRASTRUCTURE_TERRAFORM.md, examples/infrastructure/
- Python package APIs for parser, compiler, executor, store, and runtime modules
- Data quality API endpoint for check evaluation and summarized violation outputs
- Optional `SPARKRULES_API_KEY`: also required for sensitive GETs on rules, deployment, and governance when set (public: `/health`, OpenAPI, `OPTIONS`, static `/workbench/` shell)
- Docker `Dockerfile` and `docker compose`; CI can push images to GHCR and publish sdist/wheel to PyPI (trusted publishing) - PUBLISHING.md
- Deploy documentation for AWS Glue, Databricks, GCP Dataproc, and Azure Synapse (config-driven)
- Phase 3: rule pack, asset search, group/namespace filter, DRL version diff, API key (writes + sensitive reads)
- Phase 4: rule namespace, dev/stage/prod promotion pins (in-memory), deprecation records and enforce to deactivate live versions - GOVERNANCE.md; lakehouse benchmark checklist: BENCHMARKS.md
- Release: PUBLISHING.md (local build, PyPI on `v*` tags or manual, ghcr.io images on branch/tag push)
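The API-key behaviour described above (no enforcement when the key is unset; `/health`, OpenAPI, `OPTIONS`, and the static Workbench shell stay public when it is set) can be sketched as a pure function. Everything here is an assumption for illustration: the function names, the `/openapi.json` path, and the simplification that every non-public request needs the key. The real service may scope the check differently.

```python
import hmac
from typing import Optional

# Assumed public path prefixes; only /health and /workbench/ are named above.
PUBLIC_PREFIXES = ("/health", "/openapi.json", "/workbench/")


def is_public(method: str, path: str) -> bool:
    """OPTIONS requests and the public prefixes never need a key."""
    return method.upper() == "OPTIONS" or path.startswith(PUBLIC_PREFIXES)


def is_authorized(
    method: str,
    path: str,
    presented_key: Optional[str],
    configured_key: Optional[str],
) -> bool:
    """Decide whether a request may proceed under the sketched policy."""
    if not configured_key:
        return True  # key not configured: nothing is enforced
    if is_public(method, path):
        return True
    if presented_key is None:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(
        presented_key.encode("utf-8"), configured_key.encode("utf-8")
    )


print(is_authorized("GET", "/rules", None, "s3cret"))
print(is_authorized("GET", "/rules", "s3cret", "s3cret"))
```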
Metadata lifecycle¶
- Versioned rule metadata lifecycle operations
- Active window overlap detection and conflict protection
- Pluggable store backends: `in_memory`; DuckDB and Postgres SQL metadata stores; Iceberg-hydrating store with optional pyiceberg append sink or pickle-on-disk fallback when no sink is configured (`create_rule_store` in `sparkrules.store.backends`)
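Active-window overlap detection, as listed above, reduces to an interval-intersection test. The sketch below assumes half-open windows `[start, end)` where `end=None` means "open-ended"; `windows_overlap` and `find_conflicts` are hypothetical names, and the store's real conflict rules are not documented here.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# (version, active_from, active_until); active_until=None means no end date.
Window = Tuple[str, datetime, Optional[datetime]]


def windows_overlap(
    a_start: datetime, a_end: Optional[datetime],
    b_start: datetime, b_end: Optional[datetime],
) -> bool:
    """Two half-open windows overlap unless one ends before the other starts."""
    a_ends_first = a_end is not None and a_end <= b_start
    b_ends_first = b_end is not None and b_end <= a_start
    return not (a_ends_first or b_ends_first)


def find_conflicts(
    existing: List[Window], start: datetime, end: Optional[datetime]
) -> List[str]:
    """Return versions whose active window intersects the candidate window."""
    return [
        version
        for version, s, e in existing
        if windows_overlap(s, e, start, end)
    ]


existing = [
    ("v1", datetime(2024, 1, 1), datetime(2024, 2, 1)),
    ("v2", datetime(2024, 2, 1), None),
]
print(find_conflicts(existing, datetime(2024, 1, 15), datetime(2024, 2, 15)))
```

With half-open windows, `v1` ending exactly when `v2` starts is not a conflict, which allows back-to-back versions.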
Observability¶
- Structured logging helpers
- Metrics endpoint support
- Runtime health analysis for UI integration payloads
- Slow-stage, high-shuffle, and task-failure issue detection
Delivery quality¶
- Full test suite with unit, property, and integration coverage
- 100% line coverage gate on `src/sparkrules` (`pytest tests/unit/ --cov=src/sparkrules`, `fail_under=100` in `pyproject.toml`)
- DRL parse caching (LRU 256) for repeated evaluations
- Architecture scope and extension points: KNOWN_LIMITATIONS.md
Regulatory compliance¶
- Adverse-action reason aggregation - `build_adverse_action_notice()`, `adverse_action_record()`, and `adverse_action_counterfactual_summary()` (base vs hypothetical context) for ECOA/FCRA (US) and GDPR Art 22 (EU) counsel-reviewable templates
- Principal reasons capped at 4 per ECOA standard
- Deduplicated, priority-ordered reason codes with audit metadata
Data profiling¶
- `profile_rows()` - per-field statistics over a batch of rows
- Completeness (% non-null), uniqueness (% distinct)
- Numeric: mean, stddev, min, max, p25, p50, p75
- Categorical: top-N value counts
- Structured `DataProfile` with `.to_dict()` for API/JSON output
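The per-field statistics listed above can be computed with the standard library alone. The following is a rough sketch, not the implementation of `profile_rows()`: `profile_field` is a hypothetical helper, it returns a plain dict rather than a `DataProfile`, and it assumes completeness and uniqueness are reported as fractions and that quartiles use Python's default `statistics.quantiles` method.

```python
import statistics
from collections import Counter
from typing import List


def profile_field(rows: List[dict], field: str, top_n: int = 3) -> dict:
    """Profile one field: completeness, uniqueness, then numeric or categorical stats."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v is not None]
    profile = {
        "completeness": len(present) / len(values) if values else 0.0,
        "uniqueness": len(set(present)) / len(present) if present else 0.0,
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Treat the field as numeric only if every present value is a number;
    # stdev and quantiles need at least two data points.
    if len(numeric) == len(present) and len(numeric) >= 2:
        p25, p50, p75 = statistics.quantiles(numeric, n=4)
        profile.update(
            mean=statistics.mean(numeric),
            stddev=statistics.stdev(numeric),
            min=min(numeric),
            max=max(numeric),
            p25=p25, p50=p50, p75=p75,
        )
    else:
        profile["top_values"] = Counter(present).most_common(top_n)
    return profile


rows = [
    {"age": 20, "state": "CA"},
    {"age": 30, "state": "CA"},
    {"age": 40, "state": "NY"},
    {"age": None, "state": "CA"},
]
age_profile = profile_field(rows, "age")
state_profile = profile_field(rows, "state")
print(age_profile["completeness"], state_profile["top_values"])
```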