Evaluating LLM Watermarking Robustness Across Model Scales

When does watermarking break? An empirical scale-floor study across three algorithms and three OPT model sizes.

Zahran Yahia Khan  ·  Sofia Cobo Navas

University of South Florida

TL;DR

We benchmark three LLM watermarking algorithms (KGW, Unigram, EXP) across three OPT model scales (125M, 1.3B, 2.7B), two datasets (C4, WMT16), and three adversarial attacks using the MarkLLM framework. Our central finding is that the low-entropy "scale floor" problem is algorithm-specific: KGW fails on OPT-125M (TPR = 0.09 at 10% FPR) while Unigram and EXP achieve near-perfect detection at the same scale. GPT-3.5 paraphrasing is by far the most damaging attack, yet Unigram remains markedly more paraphrase-resistant and imposes the lowest perplexity cost, making it the strongest practical choice under adversarial deployment.
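For concreteness, one generate-and-detect iteration of our pipeline follows MarkLLM's published quick-start. The sketch below is adapted from that example, swapping in OPT-1.3B; the import paths assume you run from a MarkLLM checkout, and the config path and generation arguments are the library's defaults rather than our exact harness settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from watermark.auto_watermark import AutoWatermark        # MarkLLM
from utils.transformers_config import TransformersConfig  # MarkLLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# Wrap the base model the way MarkLLM expects (50272 = OPT vocabulary size).
transformers_config = TransformersConfig(
    model=AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device),
    tokenizer=AutoTokenizer.from_pretrained("facebook/opt-1.3b"),
    vocab_size=50272,
    device=device,
    max_new_tokens=200,
    do_sample=True,
)

# Swap "KGW" for "Unigram" or "EXP" (and the matching config) to change algorithms.
watermark = AutoWatermark.load(
    "KGW",
    algorithm_config="config/KGW.json",
    transformers_config=transformers_config,
)

prompt = "The city council met on Tuesday to discuss"
watermarked_text = watermark.generate_watermarked_text(prompt)
print(watermark.detect_watermark(watermarked_text))  # detection score and decision
```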

Detectability Across Scales

[Figure: grouped bar chart of TPR at 10% FPR for KGW, Unigram, and EXP across OPT-125M, OPT-1.3B, and OPT-2.7B on C4 and WMT16]
Figure 1. Detectability across model scales (TPR at 10% FPR). KGW collapses to chance at OPT-125M while Unigram and EXP retain near-perfect detection. Above 125M, all three algorithms plateau.

Key Findings

Algorithm-Specific Scale Floor

KGW fails to embed reliably on OPT-125M (TPR = 0.09 on C4, 0.16 on WMT16) due to the low-entropy embedding problem: when the small model's next-token distribution is sharply peaked, KGW's soft green-list logit bias rarely changes which token is sampled, so the green fraction stays at chance. Unigram and EXP, with different embedding mechanisms, are immune to this failure mode. Above the scale floor, detectability plateaus: going from 1.3B to 2.7B yields no measurable benefit.
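This failure shows up directly in the detection statistic: KGW detection is a one-proportion z-test on the green-token count (following Kirchenbauer et al.). The toy function below applies that standard formula to hypothetical counts, not numbers from our runs.

```python
import math

def kgw_z_score(green_hits: int, total_tokens: int, gamma: float = 0.25) -> float:
    """One-proportion z-test: under the no-watermark null, each scored token
    lands in the green list independently with probability gamma."""
    expected = gamma * total_tokens
    return (green_hits - expected) / math.sqrt(total_tokens * gamma * (1 - gamma))

# High-entropy text lets the logit bias shift sampling toward green tokens:
print(round(kgw_z_score(green_hits=110, total_tokens=200), 1))  # 9.8 -> easily detected

# Low-entropy text leaves the green count near gamma * T:
print(round(kgw_z_score(green_hits=54, total_tokens=200), 1))   # 0.7 -> indistinguishable from chance
```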

Paraphrasing Is the Dominant Attack

Lexical attacks (word deletion, synonym substitution) leave detection largely intact (≤ 6 pp average TPR drop). GPT-3.5 paraphrasing causes a 57.6 pp average TPR drop, roughly 9× more damaging than lexical attacks. Both KGW and EXP suffer comparably; the shared vulnerability is prefix-context dependence, not the embedding mechanism itself.
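To see why paraphrasing hits KGW and EXP alike, consider a deliberately simplified context-seeded green-list draw: the partition is derived from the preceding token, so rewording a prefix re-randomizes the list against which every downstream token is scored. The hashing below is an expository stand-in, not MarkLLM's actual key schedule.

```python
import hashlib
import random

def context_green_list(prev_token_id: int, vocab_size: int = 50272,
                       gamma: float = 0.25) -> set:
    """KGW-style (simplified): seed a PRNG with the previous token id and
    draw the green list, so the partition depends on local prefix context."""
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % 2**32
    rng = random.Random(seed)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

# Two different prefix tokens yield nearly independent green lists:
a = context_green_list(101)
b = context_green_list(102)
print(len(a & b) / len(a))  # ~gamma (~0.25): chance-level overlap, so a
                            # paraphrased prefix scores downstream tokens
                            # against an effectively fresh random partition
```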

Unigram Is the Deployment Winner

Unigram loses roughly half as much detection under paraphrasing (a 36 pp TPR drop vs. 67–70 pp for KGW/EXP), retains TPR = 0.92 on OPT-2.7B + C4 even after paraphrasing, and imposes the lowest perplexity cost across scales. It is also the simplest of the three algorithms.
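By contrast, a Unigram-style detector (after Zhao et al.) needs only one fixed, key-seeded partition for the whole deployment and no prefix hashing. Below is a hedged sketch with an illustrative key and a γ = 0.5 split (the Unigram paper's default); because each token's color never changes, any paraphrase that reuses enough of the original vocabulary keeps the signal.

```python
import random

VOCAB_SIZE = 50272  # OPT vocabulary size
GAMMA = 0.5         # Unigram's default half-vocabulary split (Zhao et al.)
KEY = 42            # illustrative; the real key stays secret

# One global, fixed partition -- computed once, reused for every generation.
GREEN = set(random.Random(KEY).sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))

def green_fraction(token_ids):
    """Detection needs only the multiset of tokens: reordering, insertion,
    or local rewriting leaves each surviving token's color unchanged."""
    return sum(t in GREEN for t in token_ids) / len(token_ids)

# Unwatermarked (here: random) tokens sit at the ~0.5 chance level; a paraphrase
# that preserves much of the watermarked text's green-biased vocabulary stays above it.
print(green_fraction(random.Random(7).sample(range(VOCAB_SIZE), 200)))
```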

Robustness to Paraphrasing

[Figure: bar chart of surviving TPR after GPT-3.5 paraphrasing for KGW, Unigram, and EXP across OPT-1.3B and OPT-2.7B on C4 and WMT16]
Figure 3. Surviving detection after GPT-3.5 paraphrasing. Unigram dominates across all model–dataset combinations.

Reproducibility

All 3,600 generated samples, attack outputs, and detection scores are checked into the repository. The full pipeline replays in approximately 30 minutes on a Colab T4 without rerunning the 3-hour generation phase. See the README for instructions.