Why RedactiPHI?

Frequently asked questions about our approach to PHI de-identification.

The Short Version

What makes RedactiPHI different?

We offer enterprise-grade de-identification accuracy (95% F1 score) at a developer-friendly price point, plus features that others charge extra for or don't offer at all:

  • Joinable Tokenization - Same patient = same token across all documents. Enable longitudinal analytics without exposing PHI.
  • One-Click Re-identification - De-identify for LLM processing, then restore original values. Critical for AI scribes.
  • Cryptographic Audit Receipts - Tamper-evident proof of what was processed, when, and what was found.
  • Developer Dashboard - Usage analytics, API key management, logs - not just a CLI or library.
  • Built-in Webhooks - Get notified when async processing completes.
  • 5-Minute Setup - pip install, get API key, start de-identifying. No Spark clusters or on-prem servers.

How We Compare

How does RedactiPHI compare to Private AI?

Private AI is excellent - they support 50+ entity types, 52 languages, and have strong accuracy. They're a great choice for enterprises with complex requirements.

We differentiate on:

Feature                | Private AI                         | RedactiPHI
-----------------------|------------------------------------|---------------------------------
Setup Time             | Hours to days (on-prem containers) | 5 minutes (managed API)
Pricing Model          | $10k+ (reports of $50k+ at scale)  | Free tier + transparent pricing
Developer Dashboard    | No                                 | Yes
Joinable Tokens        | Unclear                            | Yes - deterministic per subject
Re-identification API  | Limited                            | Full API
Webhooks               | No                                 | Built-in
Cryptographic Receipts | No                                 | Yes
Languages              | 52 languages                       | English (multi-language planned)

Choose Private AI if: You need 50+ languages, have dedicated DevOps for container deployment, or have enterprise procurement budget.

Choose RedactiPHI if: You want to start immediately, need joinable tokens for analytics, want transparent pricing, or need re-identification for LLM workflows.

How does RedactiPHI compare to John Snow Labs?

John Snow Labs has the best accuracy in the industry (96-98% F1). Their Healthcare NLP library is incredibly powerful and customizable.

The tradeoffs:

Feature           | John Snow Labs                          | RedactiPHI
------------------|-----------------------------------------|---------------------------
Accuracy (F1)     | 96-98%                                  | 95%
Setup Complexity  | Spark cluster required                  | pip install + API key
Pricing           | $1.86-$253/hr (AWS), per-server license | Free tier, $0.04/doc
Interface         | Python library (Spark)                  | REST API + SDK + Dashboard
Re-identification | Build your own                          | Built-in API
Customization     | Highly flexible                         | Policy-based
Bulk Processing   | Spark-native (massive scale)            | API batch endpoint

Choose John Snow Labs if: You need the absolute highest accuracy, have Spark infrastructure, need deep customization, or are processing billions of records.

Choose RedactiPHI if: You want competitive accuracy without managing Spark, need a simple REST API, want built-in re-identification, or have moderate volume (<1M docs/month).

How does RedactiPHI compare to AWS Comprehend Medical / Azure Health?

Cloud provider APIs are convenient but have limitations:

  • Lower accuracy - AWS scores ~83% F1, Azure ~91% F1 vs our 95%
  • No re-identification - Once redacted, the data is gone forever
  • No joinable tokens - Can't link the same patient across documents
  • Vendor lock-in - Tied to their cloud ecosystem
  • Usage-based pricing adds up - $0.01/unit sounds cheap until you're at scale

We're cloud-agnostic, offer re-identification, and provide joinable tokens out of the box.

HIPAA note: AWS/Azure are "HIPAA-eligible" but require you to sign a BAA and configure everything correctly. We're HIPAA-compliant out of the box, with a BAA available.

How does RedactiPHI compare to open source (Presidio, Philter)?

Microsoft Presidio is a solid framework, but it's exactly that - a framework you need to build on.

  • Accuracy gap - Vanilla Presidio scores 70-75% F1. Can be improved with tuning, but that's your job.
  • No healthcare optimization - General PII, not clinical text. Misses things like "Dr. Smith" in medical context.
  • You manage everything - Hosting, scaling, monitoring, updates, security.
  • No re-identification - Build your own token storage and mapping.
  • No HIPAA compliance - You're 100% responsible for making it compliant. No BAA available.

Open source is great for learning or highly custom needs. For production healthcare use, you'll spend more on DevOps than our subscription costs.

Feature Deep Dives

What are "joinable tokens" and why do they matter?

When we de-identify "John Smith" in document A, we create a token like [NAM_abc123]. When "John Smith" appears in document B (for the same subject), we create the same token.

Why this matters: You can run analytics across de-identified data. "How many visits did [NAM_abc123] have?" works because the token is consistent. Most de-identification tools create random tokens each time, breaking any ability to link records.

This is critical for:

  • Longitudinal patient analysis
  • Training ML models on de-identified cohorts
  • Quality metrics across encounters
  • Research datasets that need linkability
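
To make the idea concrete, here is a minimal sketch of how a deterministic, per-subject token could be derived. It is an illustration only, not RedactiPHI's actual scheme: the keyed-hash (HMAC) approach, the SECRET_KEY value, the joinable_token function, and the 12-character token suffix are all assumptions.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would load this from a key manager.
SECRET_KEY = b"example-secret-key"


def joinable_token(subject_id: str, entity_type: str, value: str) -> str:
    """Derive a deterministic token for one value belonging to one subject."""
    # Normalize case and whitespace so trivial variations map to the same token.
    normalized = " ".join(value.lower().split())
    message = f"{subject_id}|{entity_type}|{normalized}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"[{entity_type}_{digest[:12]}]"


token_a = joinable_token("subject-001", "NAM", "John Smith")   # document A
token_b = joinable_token("subject-001", "NAM", "john  smith")  # document B
token_c = joinable_token("subject-002", "NAM", "John Smith")   # another subject

print(token_a == token_b)  # same subject and name: tokens match
print(token_a == token_c)  # different subject: tokens are expected to differ
```

Because the token depends only on the key, the subject, and the normalized value, counting visits per token works across documents without a lookup table. Keying on the subject keeps two different patients who share a name from collapsing into one token.
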

What is re-identification and when would I use it?

Re-identification restores the original values from tokens. The primary use case is LLM workflows:

  1. Clinical note comes in: "Patient John Smith, DOB 03/15/1980..."
  2. De-identify before LLM: "Patient [NAM_abc123], DOB [DAT_def456]..."
  3. Send to GPT/Claude for summarization, coding, etc.
  4. Get response with tokens: "[NAM_abc123] presents with..."
  5. Re-identify for clinician: "John Smith presents with..."

The LLM never sees real PHI, but the final output is human-readable.

Access control: Re-identification requires the document ID and is audit-logged. You control who can re-identify.
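
The round trip above can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: TOKEN_VAULT, AUDIT_LOG, deidentify, and reidentify are hypothetical names, the PHI values are supplied by hand rather than detected, and the LLM call is replaced by a hard-coded string.

```python
import re

# Hypothetical in-memory vault: document ID -> {token: original value}.
# A real service would keep this encrypted and access-controlled.
TOKEN_VAULT = {}
AUDIT_LOG = []


def deidentify(doc_id: str, text: str, phi: dict) -> str:
    """Replace each known PHI value with its token and remember the mapping."""
    mapping = TOKEN_VAULT.setdefault(doc_id, {})
    for value, token in phi.items():
        text = text.replace(value, token)
        mapping[token] = value
    return text


def reidentify(doc_id: str, text: str, user: str) -> str:
    """Restore original values for one document and record who asked."""
    mapping = TOKEN_VAULT[doc_id]  # raises KeyError for an unknown document ID
    AUDIT_LOG.append((user, doc_id))
    # Tokens look like [NAM_abc123]; unknown tokens are left untouched.
    return re.sub(
        r"\[[A-Z]{3}_[0-9a-z]+\]",
        lambda m: mapping.get(m.group(0), m.group(0)),
        text,
    )


note = "Patient John Smith, DOB 03/15/1980, reports chest pain."
phi = {"John Smith": "[NAM_abc123]", "03/15/1980": "[DAT_def456]"}

safe_note = deidentify("doc-1", note, phi)
# Stand-in for the LLM call: the model only ever sees safe_note.
llm_summary = "[NAM_abc123] presents with chest pain."
final = reidentify("doc-1", llm_summary, user="dr-lee")

print(safe_note)  # Patient [NAM_abc123], DOB [DAT_def456], reports chest pain.
print(final)      # John Smith presents with chest pain.
```

Scoping the mapping to a document ID mirrors the access-control note: a caller who does not know the document ID cannot resolve its tokens, and every lookup leaves an audit entry.
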

Can I use RedactiPHI for data pipelines and ML training?

Yes! Common use cases include:

  • Analytics pipelines: De-identify clinical data, maintain joinable tokens for cohort analysis
  • ML pre-training data: Prepare clinical text for model training without PHI exposure
  • Data warehousing: Build de-identified data lakes for research teams
  • LLM fine-tuning: Create training datasets from clinical notes safely

Coming soon: Data Pipeline Mode - bulk ingestion with configurable output formats (Parquet, JSONL, CSV), streaming to cloud storage (S3, GCS, Azure Blob), and integration with data orchestration tools (Airflow, Prefect, Dagster).

What are cryptographic audit receipts?

Every de-identification produces a signed receipt containing:

  • Hash of input document
  • Hash of output document
  • PHI types detected and counts
  • Policy applied
  • Timestamp
  • Cryptographic signature

This creates tamper-evident records for compliance audits. You can prove what was processed, when, and that the output hasn't been modified since.
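
As a rough sketch of how a receipt with these fields could be built and checked: the field names, the SIGNING_KEY value, and the make_receipt and verify_receipt helpers below are assumptions for illustration, and HMAC-SHA256 stands in for whatever signature scheme the service actually uses (a production system might well use asymmetric signatures instead).

```python
import hashlib
import hmac
import json

# Hypothetical signing key, used only for this illustration.
SIGNING_KEY = b"example-signing-key"


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _sign(fields: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same fields
    # always serialize to the same bytes before signing.
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hmac.new(SIGNING_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def make_receipt(original, redacted, phi_counts, policy, timestamp) -> dict:
    fields = {
        "input_hash": sha256_hex(original),
        "output_hash": sha256_hex(redacted),
        "phi_counts": phi_counts,
        "policy": policy,
        "timestamp": timestamp,
    }
    return {**fields, "signature": _sign(fields)}


def verify_receipt(receipt: dict, redacted: str) -> bool:
    """Check the signature and that the output still matches its recorded hash."""
    fields = {k: v for k, v in receipt.items() if k != "signature"}
    signature_ok = hmac.compare_digest(receipt["signature"], _sign(fields))
    output_ok = fields["output_hash"] == sha256_hex(redacted)
    return signature_ok and output_ok


redacted_note = "Patient [NAM_abc123], DOB [DAT_def456]"
receipt = make_receipt(
    "Patient John Smith, DOB 03/15/1980",
    redacted_note,
    {"NAME": 1, "DATE": 1},
    "safe_harbor",
    "2024-01-01T00:00:00Z",
)

print(verify_receipt(receipt, redacted_note))                           # True
print(verify_receipt(receipt, "Patient John Smith, DOB [DAT_def456]"))  # False
```

Only hashes and counts go into the receipt, so the receipt itself carries no PHI and can be handed to an auditor.
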

What PHI types do you detect?

We detect all 18 HIPAA Safe Harbor identifiers plus clinical extensions:

  • Names: Patient, provider, family members
  • Dates: DOB, admission, discharge, procedure dates (ages 89+ generalized)
  • Identifiers: SSN, MRN, insurance ID, account numbers
  • Contact: Phone, fax, email, address, ZIP code
  • Digital: IP address, URLs, device identifiers
  • Clinical: Facility names, provider credentials

Technical Questions

What's your detection accuracy?

On our internal benchmark of real clinical notes:

  • Precision: 95.7% (low false positives)
  • Recall: 95.2% (low false negatives)
  • F1 Score: 95.4%
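
F1 is the harmonic mean of precision and recall, so the figure above can be checked directly from the two numbers listed:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.957, 0.952)
print(f"{f1:.1%}")  # 95.4%
```
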

We use a multi-engine approach: pattern matching, transformer-based NER, medical terminology filtering, and name detection heuristics. Results are reconciled with confidence scoring.
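
One way such reconciliation could work is sketched below. The Span class, the engine names, the confidence values, and the greedy "highest confidence wins" rule are all assumptions for illustration, not a description of RedactiPHI's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int         # inclusive character offset
    end: int           # exclusive character offset
    label: str
    confidence: float
    engine: str


def reconcile(spans):
    """Resolve overlaps greedily: highest confidence wins, longer span breaks ties."""
    ranked = sorted(spans, key=lambda s: (-s.confidence, -(s.end - s.start)))
    kept = []
    for candidate in ranked:
        overlaps = any(candidate.start < k.end and k.start < candidate.end for k in kept)
        if not overlaps:
            kept.append(candidate)
    return sorted(kept, key=lambda s: s.start)


text = "Dr. Smith saw John Smith on 03/15/1980."
candidates = [
    Span(4, 9, "NAME", 0.70, "heuristic"),   # "Smith"
    Span(0, 9, "NAME", 0.92, "ner"),         # "Dr. Smith"
    Span(14, 24, "NAME", 0.95, "ner"),       # "John Smith"
    Span(28, 38, "DATE", 0.99, "pattern"),   # "03/15/1980"
]

result = reconcile(candidates)
print([text[s.start:s.end] for s in result])
# ['Dr. Smith', 'John Smith', '03/15/1980']
```

Here the heuristic engine's partial match on "Smith" is dropped because it overlaps the higher-confidence NER span covering "Dr. Smith".
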

Note: The demo on our homepage uses only the pattern engine for speed. The full API uses all engines.

What file formats do you support?

Currently supported:

  • Text: Plain text, JSON, HTML, Markdown
  • Documents: PDF (native + scanned via OCR), password-protected PDFs
  • Healthcare: FHIR R4, HL7v2, C-CDA/CCD XML
  • Archives: ZIP (processes all files inside)
  • Images: PNG, JPG, TIFF (via OCR)

Coming soon:

  • Password-protected ZIP and 7z archives
  • Full C-CDA structured parsing (preserve XML structure)
  • EHI export bundles (eClinicalWorks, Epic MyChart, Cerner)
  • FHIR Bulk Data ($export) NDJSON streaming
  • EMR-specific format handling (Epic, Cerner, Athena, NextGen)

Is my data secure?

  • Encryption: AES-256-GCM for stored data, TLS 1.3 in transit
  • Token storage: Token mappings are encrypted and isolated per customer
  • No training on your data: We never use customer data to train models
  • SOC 2 Type II: In progress
  • BAA available: For healthcare customers

Ready to try it?

Start with 25 free documents. No credit card required.
