Designing Sensitive Information Types
A detection pattern is only as good as the thinking behind it. This guide covers how to design sensitive information types (SITs) that work in production, and how to work out which ones your organisation needs in the first place.
Anatomy of a SIT
Every sensitive information type has three layers:
- Primary pattern: what you're looking for. A regex, keyword list, or built-in function with checksum validation.
- Corroborative evidence: how sure you are. Keywords near the match that distinguish, for example, a Medicare number from any other 10-digit number.
- Confidence tiers: what happens. Different evidence combinations map to different confidence levels and policy actions.
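The first two layers can be sketched in a few lines of Python. This is only an illustration of the idea, not how Purview evaluates a SIT: the regex, the keyword list, and the 300-character window are all assumptions chosen for the example, and the pattern does no checksum validation.

```python
import re

# Hypothetical primary pattern: 10 digits, optionally grouped 4-5-1.
PRIMARY = re.compile(r"\b\d{4}[ -]?\d{5}[ -]?\d\b")

# Hypothetical corroborative keyword, matched case-insensitively.
KEYWORD = re.compile(r"\bmedicare\b", re.IGNORECASE)

# Assumed proximity window, in characters either side of the match.
WINDOW = 300


def find_matches(text: str) -> list[tuple[str, bool]]:
    """Return (matched_text, has_nearby_keyword) for each primary-pattern hit."""
    results = []
    for m in PRIMARY.finditer(text):
        start = max(0, m.start() - WINDOW)
        end = min(len(text), m.end() + WINDOW)
        corroborated = KEYWORD.search(text[start:end]) is not None
        results.append((m.group(), corroborated))
    return results
```

A hit with the flag set to True is a candidate for a higher confidence tier; a hit without it is still reported, but only as a pattern-only match.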
Detection methods
- R (regex only): The format itself is a fingerprint, with a distinctive prefix or structure. Confidence potential: High.
- R+C (regex + checksum): Numeric identifiers with built-in validation (TFN mod-11, ABN mod-89, Luhn). Confidence potential: High.
- R+K (regex + keywords): The pattern also matches common, unrelated formats, so it needs nearby keywords to distinguish true matches. Confidence potential: Medium-High.
- K (keyword only): No structural pattern. Detection relies on correlated keyword groups. Confidence potential: Low-Medium.
- L (logic-based): Requires multiple signals combined. Confidence potential: Medium-High.
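To make the R+C method concrete, here is a minimal sketch of the three checksums named above. The weights are the commonly published ones for 9-digit TFNs and 11-digit ABNs; treat them as an assumption and confirm them against the authoritative specification before relying on them. The function names are illustrative, and 8-digit TFNs are not handled.

```python
def _digits(value: str) -> list[int]:
    """Keep only the digits, so spaces and hyphens are ignored."""
    return [int(c) for c in value if c.isdigit()]


def luhn_valid(value: str) -> bool:
    """Luhn check: double every second digit from the right, sum, mod 10."""
    digits = _digits(value)
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Assumed weights for a 9-digit TFN (weighted sum must be divisible by 11).
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def tfn_valid(value: str) -> bool:
    digits = _digits(value)
    if len(digits) != 9:
        return False
    return sum(w * d for w, d in zip(TFN_WEIGHTS, digits)) % 11 == 0


# Assumed weights for an 11-digit ABN (weighted sum must be divisible by 89).
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def abn_valid(value: str) -> bool:
    digits = _digits(value)
    if len(digits) != 11:
        return False
    digits[0] -= 1  # the ABN algorithm subtracts 1 from the first digit
    return sum(w * d for w, d in zip(ABN_WEIGHTS, digits)) % 89 == 0
```

A regex finds candidates cheaply; the checksum then discards most random digit strings that happen to have the right length, which is why this method can reach high confidence without keywords.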
Confidence levels
- High (85-95): Pattern + domain-specific keyword + additional evidence. Use for blocking, encryption, automatic labelling.
- Medium (75): Pattern + general category keyword. Use for alerting, sensitivity labelling.
- Low (65): Pattern only or pattern + broad keyword. Use for discovery, reporting, audit logging.
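The tiers above can be expressed as a small decision function. This is a sketch under stated assumptions: the Evidence fields and action names are hypothetical, 85 is used as the floor of the high band, and a domain-specific keyword without additional evidence is treated as medium, which the list above does not spell out.

```python
from dataclasses import dataclass

HIGH, MEDIUM, LOW, NO_MATCH = 85, 75, 65, 0


@dataclass(frozen=True)
class Evidence:
    """Hypothetical summary of what was found around one candidate match."""
    pattern: bool
    domain_keyword: bool = False    # e.g. a keyword specific to this data type
    category_keyword: bool = False  # e.g. a general category keyword
    broad_keyword: bool = False     # e.g. a very generic keyword
    additional: bool = False        # e.g. a checksum or second identifier


def confidence(ev: Evidence) -> int:
    """Map an evidence combination to a confidence level."""
    if not ev.pattern:
        return NO_MATCH
    if ev.domain_keyword and ev.additional:
        return HIGH
    # Assumption: a domain keyword alone is at least as strong as a category keyword.
    if ev.domain_keyword or ev.category_keyword:
        return MEDIUM
    # Pattern only, or pattern plus a broad keyword.
    return LOW


# Illustrative action names taken from the uses listed for each tier.
ACTIONS = {
    HIGH: ("block", "encrypt", "auto-label"),
    MEDIUM: ("alert", "sensitivity-label"),
    LOW: ("discover", "report", "audit-log"),
}


def actions_for(level: int) -> tuple[str, ...]:
    return ACTIONS.get(level, ())
```

Keeping the evidence-to-confidence mapping separate from the confidence-to-action mapping means a policy can be tightened or relaxed without touching the detection logic.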
Design principles
- Start with the format spec, not with regex — Find the authoritative source for the data format before writing regex.
- Write test cases first — If you can't write concrete examples, you don't understand the data well enough.
- Document false positives honestly — Every pattern has false positives. Document what you know.
- Use corroborative evidence to reduce noise — Keyword proximity is the most effective tool for reducing false positives.
- Deploy wide, then narrow — Start with broader patterns at lower confidence, measure, then tighten.
- One pattern, one data type — Keep patterns atomic. Combine them in policies, not in regex.
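As an illustration of writing test cases first, the sketch below lists should-match and should-not-match examples before settling on a pattern. The "EMP-" plus six digits format is entirely hypothetical; substitute the format from your authoritative spec.

```python
import re

# Test cases written before the regex: (sample text, should it match?).
# The employee ID format here is a made-up example.
CASES = [
    ("Badge EMP-004512 issued", True),
    ("emp-004512", False),    # lowercase prefix is not in the assumed spec
    ("EMP-4512", False),      # too few digits
    ("TEMP-004512", False),   # prefix embedded in a longer word
    ("EMP-0045123", False),   # too many digits
]

# Candidate pattern, written only after the cases above.
PATTERN = re.compile(r"\bEMP-\d{6}\b")


def failures(pattern: re.Pattern, cases: list[tuple[str, bool]]) -> list[tuple[str, bool]]:
    """Return every case where the pattern's result differs from the expectation."""
    return [
        (text, expected)
        for text, expected in cases
        if (pattern.search(text) is not None) != expected
    ]
```

Running failures(PATTERN, CASES) against each revision of the pattern shows immediately whether a change fixed one false positive by introducing another. The should-not-match cases double as documentation of the false positives you already know about.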
Complex SITs
Purview's XML schema supports composable logic through the <Any> element:
- OR: <Any minMatches="1">. At least one child must match.
- AND: <Any minMatches="N">, where N is the number of children. All N children must match.
- NOT: <Any minMatches="0" maxMatches="0">. None of the children may match.
- N-of-M: <Any minMatches="2"> with 4 children. At least 2 of the 4 must match.
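The counting behaviour described above can be modelled in Python to check a design before writing any XML. This models only the semantics as listed here; it does not parse rule packages, and the function and parameter names are illustrative stand-ins for the minMatches and maxMatches attributes.

```python
from typing import Optional, Sequence


def any_group(
    children: Sequence[bool],
    min_matches: int,
    max_matches: Optional[int] = None,
) -> bool:
    """Return True if the number of matching children is within the bounds."""
    matched = sum(1 for child in children if child)
    if matched < min_matches:
        return False
    if max_matches is not None and matched > max_matches:
        return False
    return True


# OR: at least one child matches.
or_result = any_group([False, True], min_matches=1)

# AND: min_matches equals the number of children.
and_result = any_group([True, True, False], min_matches=3)

# NOT: zero children may match.
not_result = any_group([False, False], min_matches=0, max_matches=0)

# N-of-M: at least 2 of 4 children match.
n_of_m_result = any_group([True, False, True, False], min_matches=2)
```

Because every case reduces to a lower and an optional upper bound on one count, groups can be nested by passing the result of one any_group call as a child of another.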
Defining sensitive data for your organisation
The hardest part of DLP isn't writing regex. It's figuring out what you need to detect in the first place.
The fastest way to build a comprehensive sensitive data inventory is a structured workshop. Key questions:
- What data would cause the most harm if it leaked?
- What data do you handle that isn't covered by current DLP rules?
- What data do your staff regularly share via email or collaboration tools?
- What data types appeared in your last incident or near-miss?
- What data do your regulators specifically call out?
The Open Top 500, a list of 500 sensitive information types across 25 categories, is available at https://testpattern.dev/design.