Why RedactiPHI?

Frequently asked questions about our approach to PHI de-identification.

The Short Version

What makes RedactiPHI different?

We offer de-identification accuracy competitive with enterprise tools (95.4% F1 on our internal benchmark) at a developer-friendly price point, plus features that others charge extra for or don't offer at all:

Joinable Tokenization - Same patient = same token across all documents. Enable longitudinal analytics without exposing PHI.
One-Click Re-identification - De-identify for LLM processing, then restore original values. Critical for AI scribes.
Cryptographic Audit Receipts - Tamper-proof proof of what was processed, when, and what was found.
Developer Dashboard - Usage analytics, API key management, logs - not just a CLI or library.
Built-in Webhooks - Get notified when async processing completes.
5-Minute Setup - pip install, get API key, start de-identifying. No Spark clusters or on-prem servers.

How We Compare

How does RedactiPHI compare to Private AI?

Private AI is excellent - they support 50+ entity types, 52 languages, and have strong accuracy. They're a great choice for enterprises with complex requirements.

We differentiate on:

Feature	Private AI	RedactiPHI
Setup Time	Hours to days (on-prem containers)	5 minutes (managed API)
Pricing Model	$10k+ (reports of $50k+ at scale)	Free tier + transparent pricing
Developer Dashboard	No	Yes
Joinable Tokens	Unclear	Yes - deterministic per subject
Re-identification API	Limited	Full API
Webhooks	No	Built-in
Cryptographic Receipts	No	Yes
Languages	52 languages	English (multi-language planned)

Choose Private AI if: You need 50+ languages, have dedicated DevOps for container deployment, or have enterprise procurement budget.

Choose RedactiPHI if: You want to start immediately, need joinable tokens for analytics, want transparent pricing, or need re-identification for LLM workflows.

How does RedactiPHI compare to John Snow Labs?

John Snow Labs has the best accuracy in the industry (96-98% F1). Their Healthcare NLP library is incredibly powerful and customizable.

The tradeoffs:

Feature	John Snow Labs	RedactiPHI
Accuracy (F1)	96-98%	95%
Setup Complexity	Spark cluster required	pip install + API key
Pricing	$1.86-$253/hr (AWS), per-server license	Free tier, $0.04/doc
Interface	Python library (Spark)	REST API + SDK + Dashboard
Re-identification	Build your own	Built-in API
Customization	Highly flexible	Policy-based
Bulk Processing	Spark-native (massive scale)	API batch endpoint

Choose John Snow Labs if: You need the absolute highest accuracy, have Spark infrastructure, need deep customization, or are processing billions of records.

Choose RedactiPHI if: You want competitive accuracy without managing Spark, need a simple REST API, want built-in re-identification, or have moderate volume (<1M docs/month).

How does RedactiPHI compare to AWS Comprehend Medical / Azure Health?

Cloud provider APIs are convenient but have limitations:

Lower accuracy - AWS scores ~83% F1, Azure ~91% F1 vs our 95%
No re-identification - Once redacted, the data is gone forever
No joinable tokens - Can't link the same patient across documents
Vendor lock-in - Tied to their cloud ecosystem
Usage-based pricing adds up - $0.01/unit sounds cheap until you're at scale

We're cloud-agnostic, offer re-identification, and provide joinable tokens out of the box.

HIPAA note: AWS and Azure are "HIPAA-eligible" but require you to sign a BAA and configure everything correctly. We're HIPAA-compliant out of the box, with a BAA available.

How does RedactiPHI compare to open source (Presidio, Philter)?

Microsoft Presidio is a solid framework, but it's exactly that - a framework you need to build on.

Accuracy gap - Vanilla Presidio scores 70-75% F1. Can be improved with tuning, but that's your job.
No healthcare optimization - Built for general PII, not clinical text. Misses things like "Dr. Smith" in a medical context.
You manage everything - Hosting, scaling, monitoring, updates, security.
No re-identification - Build your own token storage and mapping.
No HIPAA compliance - You're 100% responsible for making it compliant. No BAA available.

Open source is great for learning or highly custom needs. For production healthcare use, you'll spend more on DevOps than our subscription costs.

Feature Deep Dives

What are "joinable tokens" and why do they matter?

When we de-identify "John Smith" in document A, we create a token like [NAM_abc123]. When "John Smith" appears in document B (for the same subject), we create the same token.

Why this matters: You can run analytics across de-identified data. "How many visits did [NAM_abc123] have?" works because the token is consistent. Most de-identification tools create random tokens each time, breaking any ability to link records.

This is critical for:

Longitudinal patient analysis
Training ML models on de-identified cohorts
Quality metrics across encounters
Research datasets that need linkability
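
A minimal sketch of how deterministic tokens can work, assuming a keyed hash (HMAC-SHA256) over the subject, entity type, and normalized value. The names SECRET_KEY, make_token, and subject_id are hypothetical and chosen for illustration; this is not the RedactiPHI implementation or SDK.

```python
import hashlib
import hmac

# Assumption: a per-customer secret keeps tokens unguessable from the outside.
SECRET_KEY = b"replace-with-a-per-customer-secret"


def make_token(subject_id: str, entity_type: str, value: str) -> str:
    """Derive a deterministic token for one PHI value within one subject."""
    # Normalize case and whitespace so "John Smith" and "john  smith" match.
    normalized = " ".join(value.lower().split())
    message = f"{subject_id}|{entity_type}|{normalized}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    # Six hex characters mirrors the [NAM_abc123] shape used above; a real
    # system would likely keep more of the digest to avoid collisions.
    return f"[{entity_type}_{digest[:6]}]"


doc_a = make_token("subject-42", "NAM", "John Smith")   # document A
doc_b = make_token("subject-42", "NAM", "john  smith")  # document B, same subject
other = make_token("subject-77", "NAM", "John Smith")   # a different subject

print(doc_a == doc_b)  # same subject and name: the tokens join
print(doc_a == other)  # different subject: the tokens do not join
```

Because the token is derived rather than random, a query that groups by token works across documents without storing or exposing the underlying name.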

What is re-identification and when would I use it?

Re-identification restores the original values from tokens. The primary use case is LLM workflows:

Clinical note comes in: "Patient John Smith, DOB 03/15/1980..."
De-identify before LLM: "Patient [NAM_abc123], DOB [DAT_def456]..."
Send to GPT/Claude for summarization, coding, etc.
Get response with tokens: "[NAM_abc123] presents with..."
Re-identify for clinician: "John Smith presents with..."

The LLM never sees real PHI, but the final output is human-readable.

Access control: Re-identification requires the document ID and is audit-logged. You control who can re-identify.
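
The round trip above can be sketched in a few lines. This is an illustration under stated assumptions: deidentify, reidentify, and the in-memory token_map are hypothetical helpers, the LLM response is hard-coded, and a real integration would call the API instead of doing string replacement locally.

```python
import re

# Matches tokens shaped like [NAM_abc123] or [DAT_def456].
TOKEN_PATTERN = re.compile(r"\[[A-Z]{3}_[0-9a-z]+\]")


def deidentify(text: str, token_map: dict[str, str]) -> str:
    """Replace each known PHI value with its token."""
    for token, original in token_map.items():
        text = text.replace(original, token)
    return text


def reidentify(text: str, token_map: dict[str, str]) -> str:
    """Swap tokens back to original values; unknown tokens are left as-is."""
    return TOKEN_PATTERN.sub(
        lambda match: token_map.get(match.group(0), match.group(0)), text
    )


# Hypothetical mapping stored at de-identification time.
token_map = {"[NAM_abc123]": "John Smith", "[DAT_def456]": "03/15/1980"}

note = "Patient John Smith, DOB 03/15/1980..."
safe_note = deidentify(note, token_map)         # what the LLM is allowed to see
llm_response = "[NAM_abc123] presents with..."  # stand-in for the LLM output
final = reidentify(llm_response, token_map)     # what the clinician reads

print(safe_note)
print(final)
```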

Can I use RedactiPHI for data pipelines and ML training?

Yes! Common use cases include:

Analytics pipelines: De-identify clinical data, maintain joinable tokens for cohort analysis
ML pre-training data: Prepare clinical text for model training without PHI exposure
Data warehousing: Build de-identified data lakes for research teams
LLM fine-tuning: Create training datasets from clinical notes safely

Coming soon: Data Pipeline Mode - bulk ingestion with configurable output formats (Parquet, JSONL, CSV), streaming to cloud storage (S3, GCS, Azure Blob), and integration with data orchestration tools (Airflow, Prefect, Dagster).

What are cryptographic audit receipts?

Every de-identification produces a signed receipt containing:

Hash of input document
Hash of output document
PHI types detected and counts
Policy applied
Timestamp
Cryptographic signature

This creates tamper-proof evidence for compliance audits. You can prove what was processed, when, and that the output hasn't been modified since.
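
To make the idea concrete, here is a small sketch of a signed receipt and its verification. It assumes HMAC-SHA256 with a shared key as a stand-in for the signature (a production system would more likely use an asymmetric scheme), and every field and function name is hypothetical rather than the actual receipt format.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

# Assumption: a shared secret stands in for a real signing key.
SIGNING_KEY = b"replace-with-a-real-signing-key"


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sign(payload: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal payloads
    # always serialize to the same bytes before signing.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(SIGNING_KEY, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def make_receipt(input_text: str, output_text: str, phi_counts: dict, policy: str) -> dict:
    payload = {
        "input_hash": sha256_hex(input_text),
        "output_hash": sha256_hex(output_text),
        "phi_counts": phi_counts,
        "policy": policy,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return {**payload, "signature": sign(payload)}


def verify_receipt(receipt: dict, output_text: str) -> bool:
    """Check the signature and that the output still matches its recorded hash."""
    payload = {k: v for k, v in receipt.items() if k != "signature"}
    signature_ok = hmac.compare_digest(receipt["signature"], sign(payload))
    return signature_ok and payload["output_hash"] == sha256_hex(output_text)


output = "Patient [NAM_abc123]"
receipt = make_receipt("Patient John Smith", output, {"NAME": 1}, "safe_harbor")

print(verify_receipt(receipt, output))               # unmodified output
print(verify_receipt(receipt, output + " (edited)"))  # output changed after the fact
```

Changing either the output text or any field of the receipt makes verification fail, which is what lets an auditor trust the record.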

What PHI types do you detect?

We detect all 18 HIPAA Safe Harbor identifiers plus clinical extensions:

Names: Patient, provider, family members
Dates: DOB, admission, discharge, procedure dates (ages over 89 generalized to 90+)
Identifiers: SSN, MRN, insurance ID, account numbers
Contact: Phone, fax, email, address, ZIP code
Digital: IP address, URLs, device identifiers
Clinical: Facility names, provider credentials
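
The age rule in the Dates entry follows the Safe Harbor convention of reporting every age over 89 as a single "90 or older" category. A minimal sketch, where generalize_age is a hypothetical helper name and not part of any SDK:

```python
def generalize_age(age: int) -> str:
    """Report ages over 89 as one '90+' bucket; leave other ages unchanged."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return "90+" if age > 89 else str(age)


print(generalize_age(89))   # 89
print(generalize_age(90))   # 90+
print(generalize_age(104))  # 90+
```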

Technical Questions

What's your detection accuracy?

On our internal benchmark of real clinical notes:

Precision: 95.7% (low false positives)
Recall: 95.2% (low false negatives)
F1 Score: 95.4%
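
F1 is the harmonic mean of precision and recall, so the third figure follows from the first two. A quick check:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.957, 0.952)
print(f"F1: {f1:.1%}")  # F1: 95.4%
```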

We use a multi-engine approach: pattern matching, transformer-based NER, medical terminology filtering, and name detection heuristics. Results are reconciled with confidence scoring.
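
One plausible way to reconcile overlapping detections is a greedy pass that keeps the highest-confidence span wherever engines disagree. The sketch below shows that general technique only; the Span fields, engine names, and confidence values are made up for illustration, and the production reconciliation logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int         # inclusive character offset
    end: int           # exclusive character offset
    label: str
    confidence: float
    engine: str


def reconcile(spans: list[Span]) -> list[Span]:
    """Keep the highest-confidence span among any set of overlapping spans."""
    kept: list[Span] = []
    for span in sorted(spans, key=lambda s: s.confidence, reverse=True):
        overlaps = any(span.start < k.end and k.start < span.end for k in kept)
        if not overlaps:
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


text = "Dr. Smith saw the patient on 03/15/1980."
candidates = [
    Span(0, 9, "NAME", 0.60, "heuristic"),  # "Dr. Smith"
    Span(4, 9, "NAME", 0.92, "ner"),        # "Smith"
    Span(29, 39, "DATE", 0.99, "pattern"),  # "03/15/1980"
]

for span in reconcile(candidates):
    print(span.engine, span.label, text[span.start:span.end])
```

Here the two name candidates overlap, so only the higher-confidence one survives, while the date span is kept because nothing competes with it.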

Note: The demo on our homepage uses only the pattern engine for speed. The full API uses all engines.

What file formats do you support?

Currently supported:

Text: Plain text, JSON, HTML, Markdown
Documents: PDF (native + scanned via OCR), password-protected PDFs
Healthcare: FHIR R4, HL7v2, C-CDA/CCD XML
Archives: ZIP (processes all files inside)
Images: PNG, JPG, TIFF (via OCR)

Coming soon:

Password-protected ZIP and 7z archives
Full C-CDA structured parsing (preserve XML structure)
EHI export bundles (eClinicalWorks, Epic MyChart, Cerner)
FHIR Bulk Data ($export) NDJSON streaming
EMR-specific format handling (Epic, Cerner, Athena, NextGen)

Is my data secure?

Encryption: AES-256-GCM for stored data, TLS 1.3 in transit
Token storage: Token mappings are encrypted and isolated per customer
No training on your data: We never use customer data to train models
SOC 2 Type II: In progress
BAA available: For healthcare customers

Ready to try it?

Start with 25 free documents. No credit card required.
