PDPC GenAI Guidelines Turn Training Data Into A Delivery Question
By wGrow Project Team ·
Here is the humanized article:
An engineer debugging a failed LLM inference call pulls up the raw HTTP request payload in CloudWatch. The JSON contains a user’s NRIC — S1234567A — alongside a highly specific symptom they typed into a chatbot interface: “recurring chest pain after eating.” That is not a debug trace. That is an unmapped personal data store sitting in CloudWatch — where log groups retain data indefinitely unless a retention policy has been explicitly configured — readable by anyone in the AWS account with CloudWatch Logs read permissions.
PDPC’s Proposed Advisory Guidelines on the Use of Personal Data in Generative AI Systems (June 2026) give concrete shape to an accountability expectation already embedded in the PDPA: organisations should be able to account for how personal data is collected, used, disclosed, retained, and protected across the full model workflow. The text is clear. The implementation typically is not.
Most engineering leads run tight access controls on their primary databases — locked-down PostgreSQL schemas, row-level security, audited service accounts. Those same leads simultaneously pipe raw LLM prompts into CloudWatch, Datadog, and analytics dashboards with default or infinite retention and no PII filter anywhere in sight. The gap is architectural, not attitudinal. Integrating an LLM means every user input field becomes a potential unstructured PII ingest point. Unlike a structured database column, there is no schema enforcement stopping an NRIC from landing anywhere.
Redacting PII Before Network Transit

| 1 | { | |
| 2 | "timestamp": "2024-10-12T08:31:02Z", | |
| 3 | "model": "gpt-4o", | |
| 4 | "messages": [ | |
| 5 | {"role": "user", "content": "Patient S1234567A reporting severe chest pain since 0600hrs."} | ← ① |
| 6 | ] | |
| 7 | } | |
| 8 |
- ① NRIC and health data bypass traditional schema locks.
We encountered this on a healthcare visitor-logging system at a Singapore public-sector site. The requirement was to parse unstructured visitor requests — reasons for visit, relationship to patient, requested access duration — without leaking sensitive identifiers to an external LLM API. Names, NRICs, and ward numbers were appearing in raw prompt strings destined for a cloud endpoint. The client had no data residency agreement with the API provider. Each prompt was, in effect, a cross-border personal data transfer.
The architecture we deployed was a gateway layer that intercepts user input before any external API call. A locally hosted, tightly scoped NER model handles the scrubbing: NRICs matching the Singapore checksum format, Singapore mobile numbers (8-digit, starting 8 or 9), and street address patterns are replaced with typed placeholders before the string leaves the VNet. “Patient S1234567A presenting at Ward 7B” becomes “Patient [REDACTED_NRIC] presenting at [REDACTED_LOCATION]”. The LLM receives enough semantic context to route the request. The personal data never crosses the network boundary.
Why this matters under the proposed guidelines: “disclosure” is not limited to intentional sharing. Sending a raw prompt to an external API endpoint is a disclosure event. If you cannot demonstrate that disclosure was necessary and proportionate, you have a purpose limitation problem. The gateway layer makes that a defensible boundary — the external model sees only what it needs to function, and your data flow documentation maps to a specific, auditable perimeter. It does introduce latency and adds a local inference dependency. Those are bounded trade-offs, worth weighing against the compliance gain for each deployment context. The compliance exposure without the gateway is not bounded at all.
Vector Stores Are Not Cryptographic Hashes
A common engineering assumption holds that converting a document into an embedding anonymises the underlying data. It does not. Embeddings are not one-way cryptographic hashes. They preserve semantic meaning and, under the right conditions, can be inverted to reconstruct source text with high fidelity. Morris et al. (2023), in “Text Embeddings Reveal (Almost) As Much As Text” (arXiv:2310.06816), demonstrate that an adversary with API access to the same embedding model can recover short-to-medium-length source text from the vector alone — the attack degrades on longer documents but remains accurate enough to reconstruct personal identifiers.
Apply that to a typical RAG architecture. Your ingestion pipeline reads a PDF containing patient discharge summaries, chunks it into 512-token segments, embeds each chunk, and stores the vectors in Milvus or Pinecone. From a PDPC perspective, that vector store is a personal data store. It requires retention rules, access controls, and a deletion mechanism — the same as any relational table holding the same underlying information.
The deletion problem is harder than it looks. If a user withdraws consent, or if the organisation no longer has a legal or business purpose to retain the data under the PDPA’s retention limitation obligation, deleting the source PDF from S3 is not sufficient. The orphaned vectors remain queryable and semantically invertible. Teams must maintain a mapping table: document ID to chunk IDs to vector IDs. When the source document is deleted, every derived vector must be explicitly removed from the index. In Milvus, that is a delete call by primary key on every chunk ID. In Pinecone, it is a batch delete by metadata filter. Neither happens automatically. Neither is implemented by default in any off-the-shelf RAG framework I have evaluated.
If your current ingestion pipeline has no deletion path, you do not have a compliant RAG system. You have a personal data store with an append-only write path.
Production Rows Do Not Belong in Evaluation Loops

Building a reliable LLM evaluation corpus is operationally necessary. Outputs shift as model providers update their APIs, and teams need hundreds to thousands of labelled test cases to catch regressions before they reach production. The fastest path is exporting real customer interactions from the production database into a staging environment.
That path is the wrong default for most deployments. Without a documented purpose, a valid legal basis, defined retention, and access controls comparable to the production system, pulling real customer rows into staging creates a purpose-limitation problem: users consented to their data being processed for the original service, not for QA evaluation of a different system on a staging cluster. It creates a secondary personal data lake with weaker controls, longer retention, and broader developer access than the production system it was copied from. The PDPC proposed guidelines are explicit that purpose limitation applies to downstream processing, not just initial collection.
The fix is building synthetic test fixtures. For a Singapore context, that means generating synthetic NRICs that pass the UIN checksum algorithm without corresponding to real individuals, fabricating residential addresses against the Singapore postal code directory structure, and scripting edge-case queries that cover the failure modes you care about. The tooling exists: Faker.js includes a Singapore locale, and a custom NRIC generator enforcing the checksum is a modest implementation effort. This is not exotic engineering — it is standard data hygiene, the same discipline applied in healthcare and financial systems long before LLMs entered the picture.
The standard objection is that synthetic data cannot replicate the long tail of real user behaviour. That is a genuine limitation. Synthetic corpora miss rare phrasing patterns, dialect-influenced inputs, and novel failure modes that only emerge at scale. But accepting that limitation does not justify the alternative. The long tail of real user behaviour also contains real NRICs, real medical histories, and real addresses. Housing that data in a staging environment for LLM evaluation is not a technical trade-off. It is a compliance failure. The coverage gap is a problem to manage through better synthetic generation and targeted human review — not by pulling production rows.
The DPIA Mindset for LLM Architecture
None of this requires inventing new compliance infrastructure. Regulated sectors — financial services, healthcare, government — have spent decades building Data Protection Impact Assessment processes for relational systems: mapping data flows, classifying sensitivity, setting retention periods, documenting the legal basis for each processing activity. That process is mature. It applies here without modification.
Every LLM integration deserves the same scrutiny before it ships. Where does personal data enter the prompt? Where does it exit? What logs capture it in transit, and what retention period applies to those logs? If the model is RAG-augmented, what is the deletion mechanism for the vector store? What is the synthetic-data strategy for evaluation? These are not new questions. They are DPIA questions, asked of a new class of system.
GenAI features do not receive a compliance exemption because the data is unstructured. Unstructured personal data is still personal data. The PDPC guidelines make that explicit. Treat LLM prompt paths with the same rigour as database query paths: data flow documented, retention bounded, deletion tested.
Synthetic test data is not optional. If your CI/CD evaluation loop runs against copied production rows, you are not carrying technical debt — you are carrying a data protection liability. The architectural changes — a PII gateway, a vector deletion mapping, a prompt logger gated behind the same scrubber as the API — are not trivial for every existing system, but they are well-defined. The compliance gap they close is not.