Text anonymization is a critical task for enabling research and development in high-stakes domains containing private data, such as medicine, law, and social services. While much research has focused on redacting sensitive content from text, substantially less work has addressed what to replace redacted content with, a choice that can enhance privacy and becomes increasingly important at greater levels of redaction. In this work, we formulate predicting replacements for sensitive spans as a research task with principled, use-inspired evaluation criteria. We further propose a multi-token completion method for this task that is designed to preserve consistency with low compute requirements, thus allowing practitioners to anonymize data locally before sharing it externally. Human and automated annotations demonstrate that our approach produces more realistic text and better preserves utility than alternative infilling methods and differentially private mechanisms across multiple domains without retraining. Overall, our work explores the under-studied task of choosing replacements for redacted content and contributes grounded evaluations that capture utility, facilitating future work.