PHI Scanning vs. Redaction: What Actually Protects Data

Scanning finds PHI in a prompt. Redaction removes it. They are not the same thing, and redaction is not the safe default most teams assume it is.

PHI scanning and redaction get used as if they were the same product feature. They are not. One tells you where regulated data is. The other removes it. Confusing the two leads regulated teams to deploy AI in a way that is both less useful and less safe than they think.

Scanning is detection — finding where PHI sits inside a prompt, a document, or a model response. Redaction is removal — stripping or masking the PHI that scanning found, before the text leaves your environment. You can scan without redacting. You usually should not redact without scanning. And neither one, on its own, is de-identification in the legal sense HIPAA defines.

This post draws those distinctions and then makes a contrarian point: redaction is often sold as the safe default for AI in healthcare, and for most regulated workflows it is the wrong default. The better model is to send real PHI under a proper Business Associate Agreement, use scanning for policy enforcement and audit, and reach for redaction deliberately — for the specific flows where it actually fits.

What PHI scanning is, and what it is not

PHI scanning is the step that reads a piece of text and identifies the protected health information in it — patient names, dates of service, medical record numbers, diagnoses expressed in free text, and the rest. It produces a map: here is the PHI, here is its type, here is its location.
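To make the "map" concrete, here is a minimal sketch of a pattern-based scanner. The three regular expressions, the Finding type, and the sample note are all hypothetical and chosen for illustration; a production scanner would combine many more recognizers and would still miss PHI expressed in free text.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only. They cover a few structured
# identifiers and nothing else.
PATTERNS = {
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


@dataclass(frozen=True)
class Finding:
    kind: str   # type of PHI detected
    start: int  # offset where the match begins
    end: int    # offset just past the match
    text: str   # the matched substring


def scan(text: str) -> list[Finding]:
    """Return the detected PHI as a list of findings; the input is not modified."""
    findings = [
        Finding(kind, m.start(), m.end(), m.group())
        for kind, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda f: f.start)


note = "Seen 03/14/2024, MRN: 00123456, callback 555-867-5309."
for f in scan(note):
    print(f.kind, f.start, f.end, f.text)
```

The output is only a description of the text: a type, a location, and the matched value for each finding. What happens next (allow, redact, block) is a separate decision.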

Scanning does not change the text. It is a detection layer, and detection is useful even when nothing gets removed afterward. A few things scanning gives you on its own:

  • Policy enforcement. If your organization’s policy says a particular workflow should never carry PHI, scanning is what makes that policy enforceable instead of aspirational. The system can see the PHI and act — block, flag, or route — rather than relying on a user to notice.
  • Audit evidence. Scanning produces a record of what kind of regulated data passed through a given prompt. That record is evidence a compliance officer can point to when reconstructing what happened.
  • Catching PHI where it should not be. A staff member pastes a chart note into a workflow built for general business questions. Scanning catches it. Without scanning, that disclosure is invisible until something goes wrong.

None of those uses involve removing anything. That is the point. Scanning earns its place in a compliant deployment whether or not you ever redact. Treating scanning as merely “the thing that happens before redaction” understates what it does.

What redaction actually does — and what it costs

Redaction takes the PHI that scanning found and removes or masks it. The patient’s name becomes [NAME]. The date of birth becomes [DATE]. The redacted text is what reaches the model.
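As a rough sketch of that masking step, assume scanning has already produced a list of (kind, start, end) spans. The redact function and the sample prompt below are illustrative, not any particular product's implementation, and the sketch assumes the spans do not overlap.

```python
def redact(text: str, findings: list[tuple[str, int, int]]) -> str:
    """Replace each (kind, start, end) span with a [KIND] placeholder."""
    # Work right to left so the offsets of earlier spans stay valid
    # as the text changes length. Assumes spans do not overlap.
    for kind, start, end in sorted(findings, key=lambda f: f[1], reverse=True):
        text = text[:start] + f"[{kind}]" + text[end:]
    return text


prompt = "Jane Doe, born 04/02/1961, reports chest pain."
findings = [("NAME", 0, 8), ("DATE", 15, 25)]
print(redact(prompt, findings))
# prints: [NAME], born [DATE], reports chest pain.
```

Note that redaction can only mask what the findings list contains. Anything the scanner missed passes through untouched.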

The appeal is obvious. If the PHI never leaves your environment, the model provider never sees it, and the surface for a disclosure shrinks. That is a real benefit, and for some flows it is exactly right.

But redaction has a cost that gets undersold: it is lossy. Removing identifiers also removes context, and clinical context is frequently the thing that makes an AI response useful. A model asked to draft a prior-authorization justification needs the actual treatment history. A model summarizing a patient’s record needs the dates that establish a timeline. Redact aggressively and you can hand the model a prompt too thin to produce anything worth reviewing. Redact conservatively and you have left PHI in. There is no setting that is both maximally safe and maximally useful.

The second cost is subtler and worse: imperfect redaction gives false confidence. Automated detection is good, not perfect. It is strongest on structured identifiers — a formatted medical record number, a recognizable date — and weakest on free-text clinical notes, where PHI hides in phrasing rather than in fields. A note that says “the patient, a retired schoolteacher from a small town who was widowed last spring” contains no name and no number, yet a determined reader could narrow it considerably. A redaction pass that strips the obvious identifiers and leaves that sentence has not de-identified anything. It has produced text that looks safe.
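The gap is easy to demonstrate with a toy detector. The two patterns below are hypothetical stand-ins for structured-identifier recognizers; real systems add statistical models on top, which narrows the gap but does not close it.

```python
import re

# Hypothetical structured-identifier patterns, for illustration only.
STRUCTURED = {
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def detect(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) for every structured identifier found."""
    return [
        (kind, m.group())
        for kind, pattern in STRUCTURED.items()
        for m in pattern.finditer(text)
    ]


structured_note = "MRN: 00123456, admitted 01/09/2024."
free_text_note = (
    "The patient, a retired schoolteacher from a small town "
    "who was widowed last spring."
)

print(detect(structured_note))  # finds the MRN and the date
print(detect(free_text_note))   # prints: []
```

An empty result for the second note does not mean the note is safe. It means this detector has nothing to say about it, which is exactly the false confidence described above.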

That is the trap. A team that redacts believes the PHI problem is handled. A team that sends PHI under a BAA knows the PHI is there and treats it accordingly. The second posture is often the more honest one.

“We removed the name” is not de-identification

There is a specific legal claim hiding inside a lot of redaction marketing: that redacted data is no longer regulated. It usually is not true.

HIPAA defines de-identification precisely. Under 45 CFR 164.514, health information is de-identified only by one of two methods.[1] The Safe Harbor method requires removing 18 specified identifiers — names, geographic subdivisions smaller than a state, all date elements more specific than a year, contact details, account and record numbers, and a catch-all for any other unique identifying number, characteristic, or code — and requires that the covered entity have no actual knowledge that the remaining information could identify an individual.[1] The Expert Determination method requires a person with appropriate statistical knowledge to determine, and document, that the risk of re-identification is very small.[1]

Data that clears one of those bars is genuinely outside the Privacy Rule: HIPAA no longer applies to it.[1] That is a real and valuable status. But it is a high bar, and a redaction pass that catches the obvious identifiers does not reach it. Safe Harbor’s catch-all clause and its actual-knowledge requirement, which reaches quasi-identifiers (combinations of facts that are not on the list but still narrow to an individual), are exactly where automated redaction is weakest.

One category sits between regulated and de-identified. A limited data set strips direct identifiers but keeps dates and certain geographic detail; it is still PHI, and disclosing it still requires a data use agreement.[2] So “we redacted the names and the obvious numbers” most likely produces something like a limited data set at best — still regulated — not de-identified data. Calling redacted text “de-identified” because a name was masked is not a small inaccuracy. It is a claim that a regulator can check.

The better default: real PHI under a BAA, scanning for control

If redaction is lossy and imperfect, what is the alternative? For most regulated AI work, it is to send the real PHI — to a provider that is covered.

A Business Associate Agreement is the contract that makes PHI disclosure to a vendor lawful. When the AI tool and the model provider behind it are both covered by a BAA, sending a full clinical prompt is not a workaround. It is the intended path. The model gets the context it needs, the output is worth reviewing, and nothing has been stripped on the hope that what remains is safe.

In that model, scanning does not disappear; it changes jobs. Instead of feeding a redaction step, scanning becomes the enforcement and evidence layer. Every prompt is checked for PHI before it reaches a model. The organization sets the policy, and the gateway applies it: allow the PHI through under the BAA, redact it for this particular flow, or block it. Scanning is what makes any of those three options enforceable. It also produces the record of what was decided, which feeds the audit trail.
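One way to picture the gateway's decision is a per-flow policy table consulted after scanning. The flow names, the POLICY table, and the choice to block unknown flows are all assumptions made for this sketch; an organization would define its own flows and defaults.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"    # send the PHI through under the BAA
    REDACT = "redact"  # mask the PHI for this flow
    BLOCK = "block"    # refuse to send the prompt


# Hypothetical per-flow policy, set by the organization.
POLICY = {
    "clinical-summary": Action.ALLOW,
    "experimental-model": Action.REDACT,
    "general-business": Action.BLOCK,
}


@dataclass(frozen=True)
class Decision:
    flow: str
    phi_kinds: tuple[str, ...]  # what scanning detected
    action: Action              # what the gateway did about it


def decide(flow: str, phi_kinds: list[str]) -> Decision:
    """Map a flow and its scan result to a policy outcome."""
    kinds = tuple(sorted(set(phi_kinds)))
    if not kinds:
        action = Action.ALLOW  # nothing regulated was detected
    else:
        # Assumption of this sketch: flows with no policy entry fail closed.
        action = POLICY.get(flow, Action.BLOCK)
    return Decision(flow, kinds, action)


print(decide("clinical-summary", ["MRN", "DATE"]).action)
print(decide("general-business", ["NAME"]).action)
```

The Decision object doubles as the audit record: it captures what was detected and which outcome was applied, whichever of the three outcomes that was.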

This is the framing worth holding onto. Redaction is one policy outcome a team can choose, per flow. It is not a compliance requirement, and it is not something a platform should do to your data by default. Sending PHI under a BAA is the core promise; redaction is a tool you pick up when a specific situation calls for it.

When redaction is the right call

None of this means redaction is wrong. It means redaction is a deliberate choice for specific flows, not a blanket default. The clearest cases:

  • A model or provider you have decided not to cover. If a particular workflow needs a model that is not under your BAA — an experimental tool, a niche provider — redaction before the data reaches it is the right control. You have decided the provider should not see PHI; redaction enforces that decision.
  • Output that leaves the covered boundary. If an AI-generated summary is going somewhere a limited data set or de-identified data is the appropriate standard — a research context, an external analytics tool — redaction is part of getting there, alongside the legal method that actually clears the bar.
  • Minimum-necessary trimming. HIPAA’s minimum necessary standard asks that PHI use be held to what the purpose requires.[3] For a flow that genuinely does not need a given identifier, removing it is sound practice.

In each case redaction is doing a defined job for a defined reason. What it should not be is the thing a team turns on everywhere because it sounds safer than the alternative — when the alternative, a properly covered BAA path, is both safer and more useful.

Where redaction runs matters too. If a flow does call for it, the scanning and the re-identification should happen inside the same compliance boundary that holds the rest of your data — not handed to a third party you would then have to cover under its own BAA. HASP runs its own PHI scanning pipeline: healthcare-specific recognizers on managed compute inside the HASP compliance boundary. When a customer chooses redaction, both the detection and the re-identification stay inside that boundary.
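Assuming "re-identification" here means restoring the masked values in the model's response, the mechanism can be sketched as a reversible mask. The mask and restore functions, the numbered placeholder format, and the sample reply are illustrative assumptions, not a description of HASP's pipeline. The point of the sketch is that the mapping never has to leave the boundary.

```python
def mask(text: str, spans: list[tuple[str, int, int]]) -> tuple[str, dict[str, str]]:
    """Swap each (kind, start, end) span for a numbered placeholder.

    Returns the masked text plus a placeholder-to-original mapping that
    stays inside the compliance boundary. Assumes spans do not overlap.
    """
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}
    pieces: list[str] = []
    cursor = 0
    for kind, start, end in sorted(spans, key=lambda s: s[1]):
        counters[kind] = counters.get(kind, 0) + 1
        token = f"[{kind}_{counters[kind]}]"
        mapping[token] = text[start:end]
        pieces.append(text[cursor:start])
        pieces.append(token)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back wherever a placeholder appears."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


prompt = "Jane Doe was seen on 04/02/2024."
masked, mapping = mask(prompt, [("NAME", 0, 8), ("DATE", 21, 31)])
print(masked)  # prints: [NAME_1] was seen on [DATE_1].

# Stand-in for a model response that echoes the placeholders.
reply = "Summary for [NAME_1]: visit on [DATE_1]."
print(restore(reply, mapping))  # prints: Summary for Jane Doe: visit on 04/02/2024.
```

Whoever holds the mapping holds the PHI, which is why the location of this step matters as much as the location of detection.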

What this means for evaluating an AI vendor

When a vendor pitches PHI handling, the distinction between scanning and redaction tells you what questions to ask.

Ask whether the tool can send PHI under a BAA at all, or whether redaction is the only path it offers. A platform that can only redact is telling you it cannot be trusted with real PHI — which limits what its AI can actually do.

Ask whether scanning runs even when you are not redacting. If scanning only exists as a precursor to redaction, you lose the policy enforcement and audit value of detection on the flows where you send PHI through.

Ask where redaction runs and who operates it. If detection and re-identification are delegated to a separate service, that service is now in your data path and needs to be covered.

Ask whether the PHI decision is logged. Whatever the policy outcome — allowed, redacted, blocked — there should be a tamper-evident record of it. An audit trail that captures the PHI decision for every prompt is what lets a compliance officer reconstruct not just what the AI did, but how regulated data was handled getting there.
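One common way to make a log tamper-evident is a hash chain, where each record commits to the one before it. The sketch below is a generic illustration under that assumption, with hypothetical field names; it is not a description of any vendor's audit trail. A chain like this only detects edits if the latest hash is also stored somewhere the log's operator cannot quietly rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, entry: dict) -> str:
    """Hash an entry together with the hash of the record before it."""
    payload = json.dumps({"prev": prev_hash, "entry": entry}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list[dict], entry: dict) -> None:
    """Add a record whose hash covers the entry and the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "prev": prev_hash, "hash": _digest(prev_hash, entry)})


def verify(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True


log: list[dict] = []
append(log, {"prompt_id": "p-1", "phi_kinds": ["DATE", "MRN"], "action": "allow"})
append(log, {"prompt_id": "p-2", "phi_kinds": ["NAME"], "action": "redact"})
print(verify(log))  # prints: True

log[0]["entry"]["action"] = "block"  # simulate after-the-fact tampering
print(verify(log))  # prints: False
```

The entries record the PHI decision (what was detected and which outcome was applied), not the PHI itself, so the audit trail does not become a second copy of the regulated data.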

A vendor that frames redaction as the safe default and cannot answer those questions has sold a feature, not a compliance posture. Scanning and redaction are both real tools. Used precisely — scanning for control and evidence, redaction as a deliberate per-flow choice, real PHI under a BAA as the foundation — they protect data. Used as marketing, they protect the appearance of it. See how HASP draws the line in practice on the Audit & Trust page, or read how the BAA covers the full inference path.


Sources

  1. U.S. Department of Health & Human Services. “Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the HIPAA Privacy Rule.” HHS.gov.

  2. U.S. Department of Health & Human Services. “Limited Data Set — HIPAA FAQ.” HHS.gov.

  3. U.S. Department of Health & Human Services. “Minimum Necessary Requirement.” HHS.gov.