Foundations 5 June 2026 · 6 min

Idempotency Keys: The Old Trick Every Agent Workflow Needs

By wGrow Project Team · 5 June 2026

The Exactly-Once Myth in Agent Pipelines

An LLM times out. The orchestration layer retries. You just sent the same invoice twice.

This is not a theoretical failure mode. It is Tuesday. Agentic frameworks are designed to retry — that is a feature, not a bug. Network partitions happen. External APIs stall under load. The orchestration layer treats a timeout as a recoverable error and fires the request again. What it does not know is whether the first attempt landed.

Developers wire agents to production databases assuming the code will run exactly once. It will not. Retrying orchestration gives you at-least-once execution, not exactly-once. Exactly-once effects — where every write lands once regardless of how many times the code runs — only exist if you pay for idempotency, transactional storage, and protocol machinery at every layer. Without that investment, a retry is not a recovery mechanism. It is a corruption mechanism. The agent charges the card, logs the quote, creates the support ticket. Twice. Every time.

Idempotency means that running the same operation ten times produces the same system state as running it once. The duplicates are absorbed. That is the whole idea.

The 2014 Wallet Double-Crediting Incident

We learned this the hard way building a payment gateway in 2014. The system listened for bank webhooks to confirm user transactions and credited wallets on receipt.

Bank servers hiccuped — transient network delays, perfectly standard behaviour. They resent the webhook payloads. Our application had no deduplication layer. It processed both payloads and credited wallets twice for a single transaction.

This cost real money. We caught the problem from a reconciliation mismatch, then spent several hours retrofitting unique transaction IDs across live tables with active users still on them. Not a pleasant afternoon.

This failure mode is not new. It runs through decades of concurrent-systems work on synchronization and repeatable operations — the lesson predates agent frameworks by a long margin. Yet in 2026 the same error is surfacing again in AI development, now with LLM-generated tool calls doing the corrupting instead of a raw HTTP handler.

The abstraction layer changed. The failure mode did not.

Timeout Retries and the Duplicate CRM Quote

Singapore Chinese engineer focused on multiple computer screens in an office.

Modern agent architectures recreate the webhook problem with more indirection. Last year we built an agentic workflow to generate vendor quotations. The agent had access to a PDF generation tool and a CRM write tool, called in sequence.

Under load, PDF generation routinely ran past the configured tool timeout. The orchestration layer interpreted the timeout as a failure and retried the tool call.

The first process completed in the background — PDF generated, CRM record written. The retry also completed — second PDF generated, second CRM record written. The vendor received two identical quotes.

Hallucinations are bad. Deterministic duplicate writes are worse. A hallucination at least looks wrong. A duplicate quote looks exactly like a process that ran correctly, twice — no obvious failure signal, which means the error propagates quietly until someone notices the numbers do not add up.

The fix was straightforward once we recognised the class of problem. The implementation took one afternoon. The recognition took longer.

Generate Operation IDs Before the Agent Acts

Context Injection

1	const opId = crypto.randomUUID();	← ①
2
3	const response = await agent.execute({
4	prompt: userPrompt,
5	tools: [crmQuoteTool],
6	context: { operation_id: opId }	← ②
7	});

① Generate once per logical operation
② Bind directly to tool context

Do not ask the LLM to invent an execution ID. A model-generated ID is not an enforced operation boundary: it may shift to a different value on a retry attempt, collide with a concurrent request, or be dropped from the tool call entirely — none of which your application can intercept reliably.

Generate an Operation ID in your application layer — before the prompt is constructed, before the agent context is assembled, before the tool list is attached — and carry it through every retry attempt. A UUID v4 works as long as it is created once per logical operation and reused on subsequent retries, not regenerated each time. A hash of the job inputs works if you want natural deduplication across semantically identical requests.

Pass this ID directly into the tool’s input schema. Whether the agent executes the tool on the first attempt or the fifth, your application code receives the same Operation ID every time.

operation_id = uuid4()
tool_input = {
  "operation_id": operation_id,
  "vendor_id": vendor_id,
  "quote_data": quote_data
}

The orchestration layer can retry as aggressively as it likes. The ID is stable. The downstream logic can check against it.

Enforcing State at the Database Level

Technical illustration of a database architecture blocking a duplicate data entry.

Constraint Handling

1	ALTER TABLE quotes ADD CONSTRAINT uniq_op UNIQUE (op_id);	← ①
2
3	// Inside tool logic:
4	try {
5	await db.insert(quote).values({ op_id: ctx.opId });
6	} catch (err) {
7	if (err.code === '23505') {	← ②
8	return 'Success: already processed';	← ③
9	}
10	throw err;
11	}

① Hard constraint at the storage layer
② Intercept duplicate writes
③ Return success to halt LLM retries

Passing an ID into the tool is not enough on its own. Application-level checks fail under race conditions — two retry threads can both pass the “does this operation exist?” check before either has written the record.

Add a unique constraint to the operation_id column in your database table.

ALTER TABLE quotes ADD CONSTRAINT uq_quotes_op_id UNIQUE (operation_id);

When the agent attempts a duplicate write, the database rejects the insert at the structural level — not at the application level, not at the “hope no two threads run simultaneously” level. The constraint holds regardless of concurrency.

Catch the unique constraint violation in your tool logic and return a success response to the agent — “operation already completed” — so it stops retrying and moves on. Before returning that response, confirm the existing record represents a completed operation, not a partial failure. The two are not the same. From the agent’s perspective, the tool succeeded. From the system’s perspective, nothing changed. That is the correct outcome.

This pattern has one real boundary: unique constraints on an operation_id column guard insert paths cleanly. Update operations — modifying an existing record rather than creating one — require a different approach, typically checking against a version or status field rather than presence alone.

Design for At-Least-Once Execution

Execution

WITHOUT IDEMPOTENCY

WITH IDEMPOTENCY

Timeout behavior

Duplicate writes

Safe retries

System state on retry

Corrupted

Consistent

Defense strategy

Application checks

Database schema

Stop trying to build systems that never retry. Accept that your agent will fire duplicate requests. Design your tools accordingly.

At-least-once execution with idempotent handlers is a coherent system. Exactly-once execution is not a guarantee that agent frameworks provide — you get at-least-once by default, and exactly-once effects only when your storage layer, protocol design, and handler logic all enforce them end-to-end. The gap between those two positions is where production incidents live.

For multi-step workflows that cross system boundaries — create order, charge card, dispatch fulfilment — operation IDs alone are not sufficient. Each step needs its own ID, and you need a strategy for compensating a partially completed chain. That is the domain of saga patterns, which are out of scope here but worth understanding before you give an agent autonomous write access to a billing system.

As agents gain access to CRMs, ERPs, and scheduling infrastructure, operation IDs are not optional hardening. They are table stakes. The 2014 wallet incident and the 2025 duplicate quote were the same problem separated by eleven years and a language model. The fix was available decades earlier.

Treat every state-changing tool call as a transaction that may run more than once. Build the constraint in before the agent touches production. The fashionable frameworks will not do this for you.

← All field notes Brief a crew →