
Inside the WaterDoctor Crew: a research desk and a sensor-to-PDF agent on a weekly cadence.

By Scott Li · AI & Agents · 2 May 2026 · 13 min

Two pipelines, one weekly cadence, one human gate. How the WaterDoctor crew reads ten aquaculture journals, fact-checks every paper, then turns each pond's pH/DO/ammonia stream into a bilingual PDF the farm manager, the vet and the regulator can all work from.

The WaterDoctor crew runs inside WaterDoctor’s backend on a Monday cadence. Two pipelines. The first is a deep-research desk — three agents reading ten aquaculture journals, regional market feeds and government policy announcements, fact-checking every item, assembling a curated bilingual brief. The second is a per-pond report agent that takes a week of sensor data — pH, dissolved oxygen, temperature, ORP, ammonia, nitrite, nitrate, turbidity, algae — folds in the research bundle and the next seven days of weather, and produces a PDF a farm manager, an aquaculture vet and an environmental regulator can all work from.

I’m one of the two wGrow engineers embedded with the WaterDoctor team on the ground. This is the crew we built, the gates that hold it, and the parts I’d build differently if we were starting next Monday.

What runs Monday morning

[FIG. 01 · Two pipelines, one weekly cadence, one reviewer. Pipeline 01 (deep research): RESEARCH.AGENT (4 grounded sweeps) → FACTCHECK.AGENT (verify, gate) → EDITOR.AGENT (EN ⇌ 简中) → verified bundle with per-item verdicts. Pipeline 02 (weekly report, per pond): sensor stream (7 days, 9 params) + weather forecast (next 7 days) + research bundle from pipeline 01 → REPORT.AGENT (water quality, diagnoses, aeration, feeding, FCR, phenomenon → cause → remedy) → REVIEWER (human; approves, overrides) → EN PDF + ZH PDF, read by the farm manager, the aquaculture vet and the environmental regulator.]

The cadence is the architecture. The deep-research pipeline runs Sunday night so the verified bundle is sitting in the database when the report agents wake up Monday. Each report agent works one pond at a time — pulling the seven-day sensor history off the field gateway, pulling the next-seven-day forecast from the regional weather provider, pulling the verified research bundle from the previous step. Out comes a numbered report. The reviewer reads. The reviewer signs. The PDFs go out.

Two pipelines, one cadence, one human gate. That’s the shape.

The research desk: research → fact-check → editor

The research agent runs four parallel grounded searches per cycle. Two paper sweeps across ten journals — Aquaculture, Fish & Shellfish Immunology, Aquaculture Reports, Reviews in Aquaculture, Frontiers in Marine Science, Journal of Fish Diseases, Aquacultural Engineering, Marine Biotechnology, Nature Communications, Water Research. Plus region-tuned sweeps for China and Southeast Asia market and policy news. Every paper carries a real DOI. Every news item carries the publisher’s exact date and URL — copied from the search grounding, not reconstructed from the model’s training memory.

That last constraint is load-bearing. Early drafts of the research agent would happily emit a DOI that looked right for a paper that didn’t exist, or reconstruct a URL with the right shape but the wrong path. We have one rule the agent cannot break: every URL in its output must appear in the search-grounding citations the search call actually returned. If the model wants to cite something the grounding didn’t surface, the citation is dropped. We’d rather lose the item than ship a fabrication.
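The rule is mechanical enough to sketch. A minimal version with hypothetical field names: draft items are filtered against the exact URL set the grounding call returned, and an item whose every citation fails the match is dropped whole.

```python
# Minimal sketch of the grounding gate; item/field names are hypothetical.
from urllib.parse import urlsplit

def normalise(url: str) -> str:
    parts = urlsplit(url)
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"

def enforce_grounding(draft_items: list[dict], grounding_urls: set[str]) -> list[dict]:
    """Keep only citations the search grounding actually returned."""
    grounded = {normalise(u) for u in grounding_urls}
    kept = []
    for item in draft_items:
        cited = [u for u in item["urls"] if normalise(u) in grounded]
        if cited:
            kept.append({**item, "urls": cited})
        # else: drop the whole item; lose the item rather than ship a fabrication
    return kept
```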

The fact-check agent then verifies item by item. For each, an independent grounded search runs to find live corroborators. Every candidate URL is HTTP-checked for liveness — we keep up to six live URLs per item. The verdict is one of verified, unverified, or flagged. There’s a 70%-flag retry rule: if more than seven of every ten items in the batch come back flagged on the first attempt, the entire batch retries once before persisting verdicts. Search-grounding flakiness — the kind that shows up as transient connection refusals or rate-limit surges — should not persist as a quality signal. We learned this the hard way after a Sunday in February when an upstream provider had a bad two hours and the next morning’s bundle came in with nine of eighteen items flagged, a ratio that wasn’t real.
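A hedged sketch of that pass. The grounded_search helper and the field names are illustrative, not the production API, and the unverified middle state is elided for brevity.

```python
# Hedged sketch of the verdict pass; grounded_search and field names are
# illustrative, and the "unverified" middle state is elided.
import httpx

def is_live(url: str) -> bool:
    """HTTP-check a candidate corroborator; network failure counts as dead."""
    try:
        resp = httpx.head(url, follow_redirects=True, timeout=10)
        return resp.status_code < 400
    except httpx.HTTPError:
        return False

def verify_batch(items: list[dict], grounded_search, max_live_urls: int = 6,
                 flag_retry_threshold: float = 0.7) -> list[dict]:
    def run_once() -> list[dict]:
        verdicts = []
        for item in items:
            candidates = grounded_search(item)               # independent search per item
            live = [u for u in candidates if is_live(u)][:max_live_urls]
            verdicts.append({**item, "corroborators": live,
                             "verdict": "verified" if live else "flagged"})
        return verdicts

    verdicts = run_once()
    flagged = sum(v["verdict"] == "flagged" for v in verdicts)
    if items and flagged / len(items) > flag_retry_threshold:
        # Transient grounding flakiness should not persist as a quality signal:
        # retry the whole batch once before writing verdicts.
        verdicts = run_once()
    return verdicts
```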

The editor agent does only bilingual cleanup. Tidy titles, harmonise EN ⇌ 简体中文 across every metadata field, never invent. Flagged items pass through untouched so the reviewer sees exactly what fact-check saw. The editor cannot promote a flagged item to verified; only the reviewer can.
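That invariant is easiest to hold structurally rather than by instruction. A sketch with assumed field names: the editor's response can only overwrite presentation fields, and a flagged item returns before the editor's output is consulted at all.

```python
# Sketch of the editor invariant, with assumed field names: only presentation
# fields can change, and flagged items bypass the editor entirely.
PRESENTATION_FIELDS = {"title_en", "title_zh", "summary_en", "summary_zh"}

def apply_editor_output(item: dict, edited: dict) -> dict:
    if item["verdict"] == "flagged":
        return item                 # untouched: the reviewer sees what fact-check saw
    cleanup = {k: v for k, v in edited.items() if k in PRESENTATION_FIELDS}
    return {**item, **cleanup}      # verdict and corroborators survive from fact-check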

The reviewer panel shows the per-item verdict, the corroborating URLs, and an Override toggle. Overrides are on the record with a reason. The bundle the report writer reads is the bundle the reviewer signed off — there is no parallel version of the truth in the system.

The weekly report: sensor stream → bilingual PDF

This is the part that matters operationally. The research desk is rigour; the report is what an aquaculture client pays for.

Each Monday, for each pond under the contract, the report agent pulls three inputs (assembled into the payload sketched after the list):

  • Seven days of sensor readings — pH, dissolved oxygen, temperature, ORP, ammonia, nitrite, nitrate, turbidity, algae density. The sensor stream comes in at 5-minute intervals from the field gateway; the agent reads aggregates plus the raw alarm-event log.
  • The next seven days of weather for the pond’s location — temperature high/low, rainfall probability, wind, sunlight hours.
  • The week’s region-tuned research bundle — the verified items from pipeline 01 that match the pond’s species and culture phase.
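Assembled, the input might look like the payload below; the shape is illustrative, not the production schema.

```python
# Illustrative shape of the per-pond report input; not the production schema.
from dataclasses import dataclass

@dataclass
class ReportInput:
    pond_id: str
    species: str                  # from the pond profile
    culture_phase: str
    sensor_history: list[dict]    # seven days of aggregates per parameter
    alarm_events: list[dict]      # raw alarm-event log from the field gateway
    weather: list[dict]           # next seven days: temp high/low, rain, wind, sunlight
    research_items: list[dict]    # verified bundle items matching species + phase
```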

It produces a numbered report: overview, water quality (core judgment plus per-parameter sparkline and range), disease screening, weather risk and advice, aeration strategy, disease checklist, feeding schedule, FCR analysis, energy comparison, cost breakdown. Every section is rendered EN and 简体中文 side by side. Every diagnosis ties phenomenon → cause → remedy, with a HIGH/MEDIUM/LOW risk level a vet can defend.

The “phenomenon → cause → remedy” structure is the part I’d defend hardest if I were rebuilding this. Sensor anomalies do not interpret themselves. A DO drop at 3am is phenomenon; the cause might be aerator failure, nighttime algal respiration, or fish biomass exceeding the system’s oxygen budget; the remedy depends on which. The agent must propose a cause and a remedy together, with the risk level keyed to how confident the cause attribution is. A “MEDIUM” cause attribution gets a “MEDIUM-RISK action” — usually a check rather than an intervention. The reviewer can promote, demote or rewrite any of the three. The structure forces the agent to commit; the reviewer’s edits are the training signal for next week’s prompt.
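The contract is small enough to type. A sketch, where the risk levels are the report's own and the class itself is illustrative:

```python
# Sketch of the diagnosis contract; the risk levels come from the report
# itself, the class is illustrative.
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"

@dataclass
class Diagnosis:
    phenomenon: str   # "DO dropped below 4 mg/L between 02:40 and 03:55"
    cause: str        # the agent must commit: aerator failure? algal respiration?
    remedy: str       # keyed to the cause; a MEDIUM attribution gets a check, not an intervention
    risk: Risk        # confidence in the cause attribution, defensible by a vet
```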

What this won’t do

Five constraints we won’t bend:

No invented citations. The research agent’s URLs are constrained to ones the search grounding actually returned. Reconstructed-from-training-memory DOIs are dropped before fact-check ever sees them.

No date-massaging. A paper outside the search window keeps its real date. The report records reality, not a tidy fiction that fits the week.

No silent flags. Flagged items survive the editor untouched and surface in the reviewer panel. The human sees exactly what the agent saw.

No publish without a human. Every weekly report passes a reviewer before the EN/ZH PDFs go out. There is no auto-publish path. We wired the auth around this — the publish endpoint requires a reviewer-signed token; agents cannot mint one.
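A hedged sketch of that gate using PyJWT, assuming a signed-token scheme; the claim names and key handling are illustrative, not WaterDoctor's actual auth.

```python
# Hedged sketch of the publish gate; claim names and keys are illustrative.
import jwt  # PyJWT

REVIEWER_PUBLIC_KEY = "..."  # only the reviewer service holds the private signing key

def authorise_publish(token: str, report_id: str) -> bool:
    """Agents cannot mint this token: publishing requires a reviewer signature."""
    try:
        claims = jwt.decode(token, REVIEWER_PUBLIC_KEY, algorithms=["EdDSA"])
    except jwt.InvalidTokenError:
        return False
    return claims.get("role") == "reviewer" and claims.get("report_id") == report_id
```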

No single-agent demo. A “do-everything” agent is unevaluable. What runs in production is five narrow agents, each with one job. We have prototyped the alternative; it never beat the crew on the eval.

How we built it

We built the report writer first. The research desk came later, after we had the report writer good enough that “research-bundle quality” was the next bottleneck.

Phase 1 — sensor stream to a useful PDF. Six weeks. One agent, one pond, no research bundle, no weather. Just take seven days of readings and produce a paragraph that read like a vet’s interpretation. The first version was bad — confident on data the sensors were known to drift on, hand-wavy on actual pH violations. We added a per-parameter trust map: each sensor type gets a known-failure profile (the DO probe biofouls, the pH probe drifts after the third week without calibration, the ORP probe is noisy at low values). The agent’s prompt now reads the trust map and weights its conclusions accordingly. Any reading flagged as untrustworthy gets a “sensor maintenance recommended” line in the report, not a confident interpretation.
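The trust map itself is just data. A minimal sketch; the failure modes are the ones above, the thresholds and field names are assumptions.

```python
# Minimal sketch of the trust map; failure modes are the ones above,
# thresholds and field names are assumptions.
from datetime import date

TRUST_MAP = {
    "do":  {"failure_mode": "probe biofouling",                "max_days_since_service": 14},
    "ph":  {"failure_mode": "drift after week 3 uncalibrated", "max_days_since_service": 21},
    "orp": {"failure_mode": "noisy at low values",             "max_days_since_service": 30},
}

def sensor_trusted(param: str, last_serviced: date, today: date) -> bool:
    profile = TRUST_MAP.get(param)
    if profile is None:
        return True   # no known failure profile for this parameter
    return (today - last_serviced).days <= profile["max_days_since_service"]
```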

Phase 2 — multiple ponds, bilingual. Four weeks. We extended the agent to read the pond’s species, culture phase, and operating parameters from a profile. Bilingual rendering came in this phase — and we discovered that translating after drafting introduced subtle drift between the two language versions. The fix was to render both languages in a single agent call from a typed payload — the agent emits the structured report, then both English and Chinese are generated from the structure, not from each other.
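A sketch of that fix; the Section model and the llm_call signature are illustrative.

```python
# Sketch of the single-call bilingual render; Section and llm_call are illustrative.
from dataclasses import dataclass

@dataclass
class Section:
    key: str    # e.g. "water_quality"
    en: str     # English rendering of the structured facts
    zh: str     # 简体中文 rendering of the same facts, never translated from the EN

def render_bilingual(structured_report: dict, llm_call) -> list[Section]:
    """One agent call; both languages generated from the typed structure."""
    sections = llm_call(structured_report)
    # Guard against drift: both renderings must cover the same section keys.
    assert {s.key for s in sections} == set(structured_report), "EN/ZH drift"
    return sections
```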

Phase 3 — research desk. Eight weeks. This was the heaviest phase. Three agents, fact-check verdicts, the 70%-flag retry, the reviewer override panel. The first version of fact-check had no retry logic and no liveness check; we added both after a single bad Sunday persisted twelve “verified” items with dead URLs.

Phase 4 — research-into-report. Two weeks. Connecting the verified bundle into the report writer. Surprisingly small. The hard work was in pipeline 01; pipeline 02 just had to read the bundle.

Phase 5 — eval and override telemetry. Ongoing. Every reviewer override goes back into a labelled corpus we use to tune both agents. The eval rubric — claim coverage on the research desk, sensor-trust coverage on the report writer — is reviewed every quarter. We have not had to retrain any model; the lifts have all come from prompt and memory work.

Total: about five months from first sensor read to a Monday cadence the team trusted.

What I’d build differently

Three.

First, the trust map should have been on the schema, not in the prompt. We carry sensor-trust as a per-pond config record now. For the next embedded engagement, that record would be in the schema from day one — every reading row joined to a trust-state row at write time. The agent should never see a reading without seeing whether the sensor that produced it is currently trusted.
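What that could look like at the write path; table and column names are assumed, and db is a generic wrapper, not a specific driver.

```python
# Sketch of trust-on-the-schema: a reading row is joined to its trust state
# at write time. Table and column names assumed; db is a generic wrapper.
def write_reading(db, pond_id: str, param: str, value: float, ts) -> None:
    trust = db.fetch_one(
        "SELECT trusted, reason FROM sensor_trust_state "
        "WHERE pond_id = %s AND param = %s",
        (pond_id, param),
    )
    db.execute(
        "INSERT INTO readings (pond_id, param, value, ts, trusted, trust_reason) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (pond_id, param, value, ts, trust["trusted"], trust["reason"]),
    )
```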

Second, the reviewer panel should ship with the agent, not after it. We wrote the report writer for two weeks before there was a UI to review its output in. Reviewer-readable output is the actual deliverable; we should have built the surface first and wired the agent into it, not the other way around.

Third, the per-pond profile is the unit of personalisation, not the prompt. The temptation when scaling to many ponds is to bake more into the agent’s tier-02 memory. The right answer is to keep the agent generic and put the personalisation in a typed profile the agent reads. Same agent, different profiles, different output. We are halfway there; I’d be all the way there.
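An illustrative profile, with assumed field names; the point is that the personalisation lives in typed data the generic agent reads.

```python
# Illustrative per-pond profile; field names are assumptions.
from dataclasses import dataclass

@dataclass
class PondProfile:
    pond_id: str
    species: str
    culture_phase: str                               # e.g. "grow-out"
    stocking_density: float                          # kg/m³
    target_ranges: dict[str, tuple[float, float]]    # per-parameter acceptable range
```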

The shape ports. Swap aquaculture journals for clinical guidelines, sensor streams for telemetry, ponds for sites. A research desk that learns the customer’s domain on a weekly cadence; a report agent that turns operational data into something a customer or a regulator can read; a human gate that signs every output. That’s the embedded-delivery model wGrow runs, and the WaterDoctor crew is the most-instrumented version we have on a Monday cadence.

— Scott Li, wGrow