Bethel Hall

Bethel Hall (ቤቴል ሆል)

Ph.D. student in Computer Science · Stevens Institute of Technology · Expected 2028

Current AI agents are powerful but fundamentally unverifiable. My work focuses on making them reliable for high-stakes domains.

I'm a Ph.D. student in Computer Science at Stevens Institute of Technology, advised by Prof. William Eiers. My research improves the reliability of large language models with neurosymbolic methods, focusing on systems that can provide hard correctness guarantees. I study how to bring these guarantees to high-stakes domains where LLMs are increasingly deployed, including finance, healthcare, and cloud infrastructure.

Outside research, I enjoy biking, hiking, and running.

Publications

Neurosymbolic Auditing of Natural-Language Software Requirements

Bethel Hall, William Eiers

Preprint, arXiv:2605.13817

We show that disagreements between independently-generated LLM formalizations of the same requirement reliably signal latent ambiguity, and that SMT-generated counterexamples can localize and repair it.

CloudFix: Automated Policy Repair for Cloud Access Control Policies Using Large Language Models SANER 2026

Bethel Hall, Owen Ungaro, William Eiers

SANER 2026 (Research Track)

We show that SMT-based fault localization, which pinpoints the exact policy statement at fault before invoking the LLM, more than doubles repair rates over LLM-only baselines on real-world AWS policies.

Code

Exploring Large Language Models for Access Control Policy Synthesis and Summarization

Adarsh Vatsa, Bethel Hall, William Eiers

Preprint, arXiv:2510.20692

We show that LLMs fail to reason about policy semantics with reliability-grade precision, even when given their own explanations as input, and that pairing automata-derived characterizations with LLM simplification resolves this gap.

Projects

VERIMED

VERIMED audits natural-language software requirements before they become formal models or implementations. It checks for ambiguity, inconsistency, vacuousness, redundancy, and safety violations, producing solver-backed evidence instead of subjective review.

Natural-language requirements SMT Z3 Counterexample-guided repair

100% formalization rate — 64/64 requirements converted without manual intervention QA accuracy lifted from 55.4% to 98.5% using SMT counterexample feedback (ABZ hemodialysis benchmark, n=64) Caught 12 latent ambiguities in real hemodialysis machine specifications

Paper

VERIMED method overview — Key insight: ambiguity in a requirement causes independent LLMs to produce contradictory formalizations — a disagreement SMT can detect and localize with a concrete counterexample.

VERIMED experimental results — Key insight: ambiguity in a requirement causes independent LLMs to produce contradictory formalizations — a disagreement SMT can detect and localize with a concrete counterexample.

CloudFix

CloudFix repairs AWS IAM access-control policies with a verified LLM workflow. It localizes the faulty policy statement with SMT, asks an LLM for a targeted patch, and formally checks the result before accepting it.

AWS IAM Policy repair Fault localization Quacky + Z3

Evaluated on 282 real-world AWS IAM policies 84.0% repair rate vs. 48.2% baseline — a 36-point improvement on 10-request tasks 54.3% vs. 22.3% baseline on harder 30-request tasks — more than double the baseline

Paper Code

CloudFix repair method overview — Key insight: SMT fault localization identifies which specific policy statement is wrong before the LLM is asked to repair it, dramatically narrowing the repair search space.

CloudFix experimental results — Key insight: SMT fault localization identifies which specific policy statement is wrong before the LLM is asked to repair it, dramatically narrowing the repair search space.

POLICYSUMMARIZER

POLICYSUMMARIZER characterizes what cloud access-control policies actually allow. It converts AWS, Azure, or GCP policies into automata, simplifies the result with an LLM, and verifies the summary against the formal representation before accepting it.

LLM evaluation Neurosymbolic AI Regex simplification Human-AI collaboration

Non-reasoning LLMs top out at 45.8% semantic equivalence, highlighting deployment reliability limits Reasoning LLMs reach 86.4%–93.7% — highlighting the gap between model classes Formal summarizer achieves 0.889 mean similarity on semantic request descriptions

Paper Tool

Policy summarization method overview — Key insight: automata-based equivalence checking verifies that a natural-language summary is semantically correct, not just fluent — guaranteeing the description matches what the policy actually enforces.

Policy summarization experimental results — Key insight: automata-based equivalence checking verifies that a natural-language summary is semantically correct, not just fluent — guaranteeing the description matches what the policy actually enforces.

News

2026

May 2026Preprint released, Neurosymbolic Auditing of Natural-Language Software Requirements (arXiv).

Mar 2026Presented CloudFix at SANER 2026, Limassol, Cyprus.

Mar 2026Passed Ph.D. oral qualifying exam.

2025

Dec 2025Paper accepted to SANER 2026 (Research Track).

2024

May 2024Passed Ph.D. written qualifying exam.

2021

Aug 2021Selected as a Global UGRAD Scholar by the U.S. Department of State.