Can AI Agents Actually Run Business Operations? — Automating Attendance Management with OpenClaw

Challenge

AI agents are being marketed as replacements for existing SaaS workflows. But can they really handle complex internal rules, multi-phase approval processes, and edge cases that took years of institutional knowledge to manage — all while keeping costs realistic and security under control?

Solution

Built a multi-agent system on OpenClaw (5 agents, 3 models) to automate attendance rule checking. Designed a three-layer architecture: Python for execution, LLMs for judgment and adaptation, and the agent platform for orchestration. Deployed on a cloud server for 24/7 operation.

Result

Achieved 100% benchmark accuracy (729/729 records), $0 per pipeline run. Confirmed that AI agents can autonomously handle clear specification changes, including HTML structure changes and rule updates.

OpenClaw — Automating Attendance Management with AI Agents

About This Project

The Problem

Since late 2025, AI agent platforms have been appearing everywhere. OpenClaw, Claude Cowork, Microsoft Copilot Cowork, OpenAI Frontier — one after another, companies are shipping platforms that promise to let AI handle business tasks. The term “SaaS Apocalypse” has entered the conversation, with the claim that AI agents will replace the workflows that traditional SaaS products have been handling.

But how much of this holds up in practice?

  • Can AI agents truly understand company-specific rules and workflows?
  • Can they complete tasks reliably when the context is complex?
  • Are API costs realistic for production use?
  • How do you manage security concerns?
  • Can they run 24/7 on cloud servers?

There are not many reports that test these questions in a production-like environment, beyond simple demos and proof-of-concept experiments.

What I Did

I worked on this project as an FDE (Forward Deployed Engineer) — embedded at the client site, understanding their business processes deeply, then building AI solutions to automate them. The contract is performance-based: payment is tied to achieving specific KPIs. This model is well known from Palantir’s approach to client engagements.

The client had a major pain point in their attendance management process.

First, the rules are complex. Detailed labor regulations, department-specific rules, position-based exceptions, years of accumulated tribal knowledge — all intertwined. “This rule only applies to Department A.” “Skip for section chiefs and above.” “If on-call duty and night shift fall on the same day, use a different judgment.” There are countless conditional branches like these.

Second, the workflow has multiple phases. Attendance checking goes through first approval, second approval, and final approval, each handled by different HR staff at different levels. The first phase handles general checks (missing clock-ins, work hour consistency), and later phases handle more critical checks (overtime limit violations, legally required break times). Each phase applies different rules, and each phase depends on the results of the previous one.

Currently, about 10 HR staff spend 10 days every month doing this entire process manually for each employee. When issues are found, they send individual emails requesting corrections, wait for responses, and re-check — all concentrated at the end of each month.

The mission of this project was to compress 100 person-days of work into 10 hours. The approval steps progress daily, so everything cannot be finished in a single day. But by reducing each phase’s execution time to under 1 hour, the full 10-day approval cycle should take only 10 hours total — a 100x reduction in labor.

Simple rule checking can be done in Python. But applying different rules per phase, branching later steps based on earlier results, and switching exception conditions by department and position — orchestrating the entire workflow is beyond what Python scripts alone can manage. A single LLM also struggles to control multi-phase business flows reliably. This is where an AI agent platform like OpenClaw becomes necessary.

This series focuses primarily on automating the rule checking itself. Workflow control (managing approval phases and switching rule sets between phases) will be covered in future articles.

I already had an attendance checking system built with Python and Playwright, which gave me a solid foundation to test whether OpenClaw’s AI agents could automate the process.

Why Run It on a Cloud Server?

In the OpenClaw community, the common approach seems to be running agents on a dedicated local machine with throwaway accounts — so if the agent goes out of control, the damage stays contained. That makes sense. Running on a cloud server is less common.

I chose a cloud server (Hetzner Cloud) for practical reasons:

  • Dedicated hardware was hard to procure for a client project — purchasing and installing physical machines required approval processes that did not match the pace of experimentation
  • A fully isolated environment was required for security — the system handles data that includes personal information, so it needed to run in a sandboxed environment, separate from the internal network, processing only anonymized data
  • 24/7 uptime — local machines sleep or lose power, but a cloud server stays running

Running on a cloud server has risks too: the agent can access the internet, server management costs money, and you have to handle security configuration yourself. I will cover these in detail in the individual articles.

What Was Built

Using OpenClaw, I built the following in stages:

  1. AI-driven rule coding — Feed attendance rules to AI and have it autonomously generate Python validation code
  2. Multi-agent architecture — Split responsibilities across 5 AI agents to prevent cost explosions and context bloat
  3. Playwright pipeline — Automate browser operations and run comparisons against the existing system at $0 per execution
  4. Autonomous adaptation testing — Test how well AI agents can autonomously handle HTML structure changes and rule additions

This is a client project, so specific company names and detailed business rules are not disclosed. The client has approved publication within the scope where no company or individual can be identified. Technical architecture, configuration, costs, and failure stories are shared as openly as possible.

Results

MetricValue
Benchmark accuracy100% (729/729 records)
Pipeline execution cost$0/run (no LLM needed — Playwright + Python)
Autonomous adaptationClear spec changes are handled autonomously
HTML structure changeCoder agent autonomously refactored to dynamic column detection
Monthly operating costHeartbeat 0(localLLM)+scheduledruns0 (local LLM) + scheduled runs 0 (scripts)

Note: The $0 figures above refer to the pipeline execution (Python/Playwright) and scheduled tasks (local LLM). Agent operations incur separate API costs — the main agent runs on Gemini Flash, the coder agent runs on Claude Sonnet/Haiku, and the analyze agent runs on ChatGPT-5.4. See the “Agent Architecture” section below for cost tiers.


Why Not “Do Everything with AI”?

Looking at the results — “$0 pipeline cost,” “Python rule checking” — you might wonder how this differs from a traditional rule-based system. And yes, the execution itself is just Python scripts. There is no need to have an LLM judge rules every time.

But running business operations with AI agents is not only about execution.

The Limits of Python Rule-Based Systems

Python rule-based systems are excellent at repeating fixed rules accurately. But in real business environments, problems like these arise:

  • When labor regulations are revised, who updates the code?
  • When the HTML structure changes, who notices the parser is broken?
  • When a new work type is added, who handles everything from updating master data to running tests?

Rule-based systems assume that the person who wrote them keeps maintaining them. When that person transfers to another department or leaves the company, the system gradually falls out of sync with reality.

The Limits of LLMs Alone

So should you just hand everything to an LLM? Through experimentation, I found this is not realistic either.

  • There are dozens of attendance rules. Passing them all to the prompt every time bloats the context and causes costs to explode
  • LLM judgments are probabilistic — there is no guarantee the same input produces the same output every time
  • For tasks that require reproducibility and precision — numerical comparisons, time calculations — code is far more reliable

In this project, when I had Gemini Flash generate the validation code, it failed repeatedly. A single task consumed over $8 in API costs without producing usable code. If this ran across all departments every month, API costs would pile up with no improvement in accuracy.

What an AI Agent Platform (OpenClaw) Solves

What OpenClaw provides is a system that combines Python’s precision with LLM flexibility, and governs the whole thing as a business process.

Python handles execution. LLMs handle judgment and adaptation. The agent platform handles orchestration. This three-layer structure is the core of a business AI agent system.

The platform provides these building blocks:

ComponentRoleWhy It Matters
SOUL.mdAgent personality and behavioral rulesExplicitly states what the agent must not do, preventing runaway behavior
AGENTS.mdRouting table and delegation rulesFixes which tasks go to which agent/model via config files
MEMORYDaily notes and long-term memoryPreserves context across sessions. Captures tribal knowledge in files
Skill (SKILL.md)Business skill definitions and proceduresGives the AI a clear understanding of what to modify and how
Session isolationIndependent sub-agent executionKeeps heavy processing contexts out of the main agent. Prevents cost explosions
Model routingPer-task model selectionSonnet for code generation, Flash for simple tasks, local LLM for scheduled jobs

For example, when labor regulations are revised: with a Python-only system, a developer has to modify the code. With an LLM alone, it does not accurately remember the rules. With OpenClaw, a user can simply say “this rule only applies to Department C, skip for section chiefs and above,” and the main agent routes it to the coder agent, which follows the definition-first principle in SKILL.md to update the TSV definitions, modify the code, and run the tests. Knowledge stays in files, execution stays in code, and judgment and adaptation stay with the LLM.

The same applies when HTML structure changes. In this project, two columns were actually added to the attendance page. I told the coder agent to “switch to detecting column positions by header name instead of fixed indices,” and it autonomously completed the entire process — from adding the design principle to SKILL.md, to refactoring the code, to running the tests.

The Three-Layer Structure

LayerHandlesGood AtNot Good At
Execution (Python)Rule checking, numerical comparison, time calculationPrecision, reproducibility, $0 costAdapting to change, handling ambiguity
Judgment (LLM)Code generation, rule interpretation, structural changesFlexibility, natural language understandingReproducibility, cost control
Orchestration (OpenClaw)Routing, memory, session management, runaway preventionEnd-to-end workflow consistency— (it is infrastructure, not a decision-maker)

The idea is not to “do everything with AI.” Instead, each layer handles only what it is best at. I believe this design is what makes business AI agents practical.


Why I Am Sharing This

Automating business operations with AI agents is technically possible. But there is a wide gap between “possible” and “practical.”

What I learned through this project is that AI agents have clear strengths and equally clear limits. They are good at executing well-defined specification changes. But investigative tasks — “something is wrong but I do not know what” — are still a human job. Throwing everything at an AI agent without structure does not produce the results you expect. Neglecting cost management leads to API bills of several dollars overnight. Misconfiguring the agent makes it go rogue.

I am sharing these experiences — with actual config files, cost records, and failure stories, not abstract arguments — in the hope that it helps others who are considering AI agents for their own business operations.


Tech Stack

LayerTechnology
AI Agent PlatformOpenClaw (v2026.3.24)
High-precision model (code generation, analysis)Claude Sonnet 4.6
Efficient model (orchestration, web tasks)Gemini 3 Flash
Local model (scheduled tasks, heartbeat)Gemma3 4B (Ollama)
Browser automationPlaywright (Python)
ServerHetzner Cloud (4vCPU / 16GB)
Debugging and verificationClaude Code (Opus)

Agent Architecture

Instead of having one AI agent handle everything, I split responsibilities across 5 agents.

AgentModelRoleCost
mainGemini 3 FlashOrchestration, task routingLow
heartbeatGemma3 4B (local)Health check response$0
monitorGemini 3 FlashWeb summaries, periodic checksLow
coderClaude Sonnet / Claude HaikuCode generation, skill updates / source verificationMedium
analyzeChatGPT-5.4Deep analysis, comparison reportsMedium

Getting to this architecture involved failures — cost explosions, runaway agents, and more. Those details are covered in the individual articles.


What I Learned

Where AI Agents Are Strong

  • They can autonomously execute clear spec changes — A single instruction like “this rule only applies to Department C, skip for section chiefs and above” triggers the full cycle: updating definition files, modifying code, and running tests
  • They can automate browser interactions — Login, form submission, and result retrieval can be handled autonomously by the LLM (though once stabilized, these can be locked down to Python scripts at $0)
  • Multi-model setups keep costs under control — Instead of running every task on an expensive model, you route tasks to different models based on their nature

Where AI Agents Fall Short

  • Investigative tasks are still difficult — Finding the root cause of a benchmark mismatch is an exploratory task where a human with a high-precision model is faster and more accurate
  • They cannot distinguish visually identical characters — Unicode traps (○ vs 〇) are especially common in AI-generated code. Cross-checking against real data is essential
  • Cost management requires active attention — Left unchecked, a few dollars can vanish in a single day. Local LLMs and session isolation help

Operational Lessons

  • Definition-first principle — For AI to autonomously modify things, knowledge must live in definition files (TSV/YAML/SKILL.md), not hardcoded in source code
  • Put everything in config files — Relying on “just ask the agent nicely” is fragile. Routing, model selection, and session settings should all be explicit in configuration
  • Set up runaway protection on day one — Heartbeat settings, limits on proactive behavior, and session isolation should be configured before you start the agent

Series Articles

The full details of this project are documented in an 8-part blog series, published progressively.

PartTopic
Part 1Building the AI Agent Infrastructure — Server Design and Cost Strategy
Part 2Making AI “Code the Rules” — Comparing 3 Models’ Autonomous Code Generation
Part 3Controlling a Runaway AI Agent
Part 45-Agent Architecture — Designing Against Context Explosion
Part 5From 93% to 100% — The Full Debugging Record of AI-Generated Code
Part 6Definition-First — Design Principles That Let AI Self-Modify
Part 7Playwright Pipeline — Automation at $0 Execution Cost
Part 8When the HTML Changes, Can AI Adapt Autonomously?

Who This Is For

  • Engineers considering AI agents for business operations
  • Anyone interested in OpenClaw or similar AI agent platforms
  • People who want to know if AI can realistically replace SaaS workflows
  • Those looking for real examples of multi-model architectures and cost optimization

Disclaimer

This project was conducted as part of a client engagement. Specific company names, detailed business rules, and authentication credentials are not disclosed. The focus is on technical architecture, configuration, costs, and failure stories.

Attendance management is a relatively simple workflow. Applying AI agents to more complex business processes would require additional consideration. I hope this serves as a practical reference for those at the early stages of introducing AI agents into their operations.

I am still relatively new to OpenClaw, so there may be settings, approaches, or best practices that I have overlooked. If you have suggestions or corrections, I would be grateful to hear them — please leave a comment or reach out on LinkedIn.

Share this article

Join the conversation on LinkedIn — share your thoughts and comments.

Discuss on LinkedIn