# Watermarks in Large Language Models


This is my note for the talk *Statistical Inference in Large Language Models: A Statistical Framework of Watermarks*, given by Weijie Su at JSM 2024.

I loved the following two figures when Weijie shared the talk again in a topic session on Aug 07.

## Statistical Framework of Watermarks for LLM

watermarking has been used as a principled approach to provably distinguishing LLM-generated text from its human-written counterpart.

the talk presents a general and flexible framework for reasoning about the statistical efficiency of watermarks and designing powerful detection rules.

- select a pivotal statistic of the text and a secret key
- evaluate the power of watermark detection rules by obtaining a closed-form expression of the asymptotic false negative rate

- derived optimal detection rules for these watermarks under our framework
- these rules are competitive and sometimes enjoy higher power than existing detection approaches in numerical experiments.

### Introduction

LLMs have emerged in recent years as a disruptive technology to generate human-like text and other media

while this reality enhances productivity in many sectors, this mismatch between ownership and generation of content could lead to several unwanted outcomes

- exacerbating misinformation
- facade of AI-assisted education
- data pollution

a viable approach is to watermark text by embedding a signal into the LLM-generated text in a manner that allows the watermark to be provably detected

- allow the verifier to identify LLM-generated text used for malicious purposes without relying on assumptions about the text
- a reasonable watermarking scheme should be approximately unbiased, meaning it does not significantly distort the meaning or style of the original text

numerous watermarks have been proposed

leveraging the probabilistic nature of LLMs

- incorporate pseudorandomness into the text-generation process of LLMs
- the coupling between the LLM-generated text and the pseudorandomness serves as a signal
- the constructed signal becomes pronounced for detection only when the pseudorandom numbers are known

the size of the token vocabulary is of the order of $10^4$ and varies with language models

$\cW$: the vocabulary of all tokens

- $\vert \cW\vert = 50{,}272$ for OPT-1.3B
- $\vert \cW\vert = 32{,}000$ for LLaMA
- $\vert \cW\vert = 50{,}257$ for GPT-2 and GPT-3.5

after generating text in the form of a token sequence $w_1w_2\cdots w_{t-1}$, the (unwatermarked) LLM generates the next token $w_t$ according to a multinomial distribution $P_t = (P_{t, w})_{w\in \cW}$ on the vocabulary $\cW$, satisfying $\sum_{w\in \cW}P_{t, w}=1$.

- $P_t$: the next-token prediction (NTP) distribution at step $t$

a watermarked LLM generates the next token that is jointly determined by a pseudorandom variable and the NTP distribution

- $\zeta_t$: the pseudorandom variable at step $t$, which is available only to the verifier.

formally, the watermarked LLM samples a token according to the rule

\[w_t = \cS(P_t, \zeta_t)\]

where

- $\cS$ is a (deterministic) decoder function
- to achieve approximate unbiasedness, require that the probability distribution of $w_t$ over the randomness embodied in $\zeta_t$ is close to $P_t$, conditional on all previous tokens, $w_1\cdots w_{t-1}$

#### Gumbel-max watermark

Let $\zeta = (U_w)_{w\in \cW}$ consist of $\vert \cW\vert$ i.i.d. copies of $U(0, 1)$.

Gumbel-max trick: $\argmax_{w\in\cW} \frac{\log U_w}{P_w}$ follows the NTP distribution $P\equiv (P_w)_{w\in \cW}$

Scott Aaronson proposed the following decoder

\[\cS^{gum}(\zeta, P) := \argmax_{w\in \cW} \frac{\log U_w}{P_w}\]

Aaronson suggested declaring the presence of the watermark if the following statistic is above a certain threshold:

\[T_n^{ars} = -\sum_{t=1}^n \log(1-U_{t, w_t})\]

intuition: when the watermark is employed, a token $w_t$ is more likely to be selected when its associated pseudorandom number $U_{t, w_t}$ is large.
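As a toy illustration, the Gumbel-max decoder and Aaronson's statistic can be simulated. This is only a sketch: the tiny vocabulary, the fixed uniform NTP distribution, and the use of true randomness in place of hashed pseudorandomness are simplifications for the demo, not part of the actual scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5          # toy vocabulary size; real vocabularies are ~10^4-10^5
n = 2000       # number of generated tokens

def gumbel_max_decode(P, U):
    """Decoder S^gum: pick argmax_w log(U_w) / P_w."""
    return int(np.argmax(np.log(U) / P))

def aaronson_statistic(tokens, Us):
    """T_n^ars = -sum_t log(1 - U_{t, w_t})."""
    return -sum(np.log(1.0 - U[w]) for w, U in zip(tokens, Us))

P = np.full(V, 1.0 / V)        # fixed uniform NTP distribution (simplification)
Us = rng.uniform(size=(n, V))  # zeta_1, ..., zeta_n, each (U_w)_{w in W}

# watermarked text: each token jointly determined by P_t and zeta_t
wm_tokens = [gumbel_max_decode(P, U) for U in Us]
# human text: tokens drawn independently of the pseudorandomness
hu_tokens = rng.integers(0, V, size=n)

print(aaronson_statistic(wm_tokens, Us) / n)  # clearly above 1
print(aaronson_statistic(hu_tokens, Us) / n)  # close to 1 = E[Exp(1)]
```

with the watermark, $U_{t, w_t}$ is the maximum of $\vert\cW\vert$ i.i.d. uniforms, so the per-token average of $-\log(1-U_{t, w_t})$ exceeds 1, its value under human text.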

#### Inverse Transform Watermark

given an NTP distribution $P$ and $\pi$ that maps all tokens in $\cW$ to a permutation of $1, 2, \ldots, \vert \cW\vert$, consider the multinomial distribution with probability mass $P_{\pi^{-1}(i)}$ at $i$ for $1\le i\le \vert\cW\vert$.

the CDF takes the form

\[F(x; \pi) = \sum_{w'\in \cW} P_{w'}\cdot \one_{\pi(w')\le x}\]

taking as input $U \sim U(0, 1)$, the generalized inverse of this CDF is defined as

\[F^{-1}(U; \pi) = \min\{i: \sum_{w'\in \cW}P_{w'}\cdot \one_{\pi(w')\le i} \ge U\}\]

this yields the inverse transform watermark with the following decoder

\[\cS^{inv}(P, \zeta) = \pi^{-1}(F^{-1}(U;\pi))\]

detection rule: examine the absolute difference between $U_t$ and the (suitably normalized) rank of the generated token, $\pi_t(w_t)$, and sum the differences across the token sequence

- when the text is watermarked, the difference turns out to be small because the two random variables involved are highly dependent.
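A toy simulation of the inverse transform watermark and this detection rule is sketched below; the uniform NTP distribution, the fresh random permutation per step, and the particular normalization of the rank by $\vert\cW\vert - 1$ are illustrative choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
V = 20    # toy vocabulary size
n = 2000  # number of tokens

def inv_transform_decode(P, U, pi):
    """Decoder S^inv: w = pi^{-1}(F^{-1}(U; pi)); pi[w] is the 0-indexed rank of w."""
    inv = np.argsort(pi)               # inv[i] = token with rank i
    cdf = np.cumsum(P[inv])            # F(i; pi) over ranks 0..V-1
    i = int(np.searchsorted(cdf, U))   # generalized inverse: min{i : F(i) >= U}
    return int(inv[min(i, V - 1)])     # guard against float round-off near U = 1

def detection_statistic(tokens, Us, pis):
    """Sum of |U_t - normalized rank of w_t|; small under the watermark."""
    V = len(pis[0])
    return sum(abs(U - pi[w] / (V - 1)) for w, U, pi in zip(tokens, Us, pis))

P = np.full(V, 1.0 / V)                       # fixed uniform NTP distribution
Us = rng.uniform(size=n)                      # U_t
pis = [rng.permutation(V) for _ in range(n)]  # pi_t

wm_tokens = [inv_transform_decode(P, U, pi) for U, pi in zip(Us, pis)]
hu_tokens = rng.integers(0, V, size=n)        # independent of (U_t, pi_t)

print(detection_statistic(wm_tokens, Us, pis) / n)  # small
print(detection_statistic(hu_tokens, Us, pis) / n)  # around 1/3
```

under the watermark, $U_t$ falls in the probability-mass band of the selected rank, so the per-token gap is at most about $1/\vert\cW\vert$ here; for human text the two quantities are independent and the average gap is near $1/3$.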

from a statistical viewpoint, this relates watermarks to hypothesis testing. the effectiveness of the Gumbel-max watermark and the inverse transform watermark is not clear.

- even though they come with detection rules that seem intuitive, it is not clear if they are statistically optimal in the sense of having the optimal trade-off between Type I and Type II errors
- if they are not optimal, it is of interest to find a better detection rule

Challenges:

- how one can provably control the Type I error for any given watermark, considering that the NTP distributions, which are not accessible to the verifier, vary from token to token

the paper:

- a statistical framework for watermarks.
- Type I error is controlled by leveraging a pivotal statistic that is distributionally independent of the NTP distributions under the null

- application to the Gumbel-max watermark
- Aaronson's detection rule is suboptimal in the sense that its class-dependent efficiency is relatively low

- application to the inverse transform watermark
- overcome a significant challenge in applying the framework to analyze this watermark by deriving an asymptotic distribution when the text is watermarked
- obtain the optimal detection rule for the inverse transform watermark in a closed-form expression by maximizing its class-dependent efficiency

### Related work

- in a different direction, many methods have been proposed to detect text generated by LLMs (often not watermarked) from human-written counterparts
- the simplest method is to build a classifier using synthetic and human text data

## A Statistical Framework for Watermark Detection

a unified way to represent $\zeta_t$:

\[\zeta_t = \cA(w_{1:t-1}, \text{Key})\]

- Key: a secret key
- $\cA$: a deterministic hash function that maps any token sequence and a key to a pseudorandom number

Soundness of pseudorandomness. In the watermarked LLM, the pseudorandom variables $\zeta_{1:n}$ constructed above are i.i.d. copies of a random variable. Furthermore, $\zeta_t$ is (statistically) independent of $w_{1:t-1}$.

although $\zeta_t$ is completely determined by the prior text $w_{1:t-1}$ (and the secret key), a run of the hash function $\cA$ could effectively introduce fresh randomness, thereby making $\zeta_t$ statistically independent of $w_{1:t-1}$.
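A minimal sketch of how such an $\cA$ might be realized: hash the prefix together with the secret key and use the digest to seed a PRNG. The SHA-256 construction here is purely illustrative, not the one from the talk.

```python
import hashlib

import numpy as np

def hash_zeta(prev_tokens, key, vocab_size):
    """zeta_t = A(w_{1:t-1}, Key): deterministic, but fresh-looking per prefix."""
    digest = hashlib.sha256(key + repr(list(prev_tokens)).encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    return np.random.default_rng(seed).uniform(size=vocab_size)

key = b"secret-key"
z1 = hash_zeta([3, 1, 4], key, vocab_size=8)
z2 = hash_zeta([3, 1, 4], key, vocab_size=8)   # same prefix, same key
z3 = hash_zeta([3, 1, 5], key, vocab_size=8)   # prefix differs in one token

print(np.array_equal(z1, z2))  # True: the verifier can reconstruct zeta_t
print(np.array_equal(z1, z3))  # False: a new prefix yields fresh-looking randomness
```

determinism is what lets the verifier, who holds the key, recompute each $\zeta_t$ from the observed text alone.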

regard $\zeta$ as a random variable, which allows us to formally define the unbiasedness of a watermark.

We say a watermark is unbiased if, for any multinomial distribution $P$, $\cS(P, \zeta)$ follows $P$. That is, for any NTP distribution $P$ and token $w\in \cW$,

\[\bbP(\cS(P, \zeta) = w) = P_w\]

Let $w_{1:n}$ be a sequence of tokens generated by a human who has no knowledge of the secret key. Then, the human-generated token $w_t$ and $\zeta_t$ are (statistically) independent conditional on $(w_{1:t-1}, \zeta_{1:t-1})$, for all $1\le t\le n$.
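unbiasedness of the Gumbel-max decoder, for instance, can be checked by Monte Carlo: over the randomness of $\zeta$, the decoded token should follow $P$ (the distribution below is an arbitrary example).

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([0.5, 0.3, 0.15, 0.05])  # an arbitrary NTP distribution
n = 200_000

U = rng.uniform(size=(n, P.size))     # n i.i.d. copies of zeta = (U_w)_w
w = np.argmax(np.log(U) / P, axis=1)  # Gumbel-max decoder, vectorized
freq = np.bincount(w, minlength=P.size) / n

print(freq)  # empirical frequencies close to P: S^gum(P, zeta) follows P
```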

### Pivotal Statistics

- under $H_0$, $w_t$ and $\zeta_t$ are independent given $(w_{1:t-1}, \zeta_{1:t-1})$
- under $H_1$, $w_t$ and $\zeta_t$ are dependent given $(w_{1:t-1}, \zeta_{1:t-1})$

By the Neyman-Pearson lemma, the most powerful test is based on the likelihood ratio.

challenge: $P_t$ is unknown and, worse, can vary with $t$

seek a pivotal statistic $Y_t = Y(w_t, \zeta_t)$ such that its distribution is the same for any NTP distribution $P_t$ under the null.

\[H_0: Y_t \sim \mu_0 \quad \text{i.i.d. for } 1\le t\le n\qquad H_1: Y_t\sim \mu_{1, P_t}\quad \text{for } 1\le t\le n\]

where

- $\mu_0$: denotes the (known) distribution of $Y_t$ when the text is human-written
- $\mu_{1, P_t}$: denotes the unknown distribution of $Y_t$ conditional on $(w_{1:t-1}, \zeta_{1:t-1})$ when the text is generated by the watermarked LLM
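for the Gumbel-max watermark, $Y_t = -\log(1 - U_{t, w_t})$ is such a pivotal statistic: under $H_0$, a human-chosen $w_t$ is independent of $\zeta_t$, so $Y_t$ is Exp(1) no matter what $P_t$ is. a quick simulation (toy setup with true randomness standing in for $\zeta_t$) suggests this pivotality:

```python
import numpy as np

rng = np.random.default_rng(3)
n, V = 100_000, 10

def null_pivotal_samples(P):
    """Y_t = -log(1 - U_{t, w_t}) under H_0: w_t ~ P, independent of zeta_t."""
    U = rng.uniform(size=(n, V))        # zeta_t = (U_{t, w})_w
    w = rng.choice(V, size=n, p=P)      # human token, independent of zeta_t
    return -np.log(1.0 - U[np.arange(n), w])

# two very different NTP distributions induce the same null law
for P in (np.full(V, 1 / V), np.arange(1, V + 1) / (V * (V + 1) / 2)):
    Y = null_pivotal_samples(P)
    print(Y.mean(), Y.var())  # both near (1, 1), matching Exp(1)
```

this is exactly what makes Type I error control possible without knowing the $P_t$'s.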