ACM SIGKDD 2026 · Lecture-style Hands-on Tutorial
A hands-on forensic tutorial on copyright infringement and plagiarism detection in large language models.
Large language models increasingly reproduce copyrighted passages and paraphrase protected sources — raising urgent legal, ethical, and scientific questions about how to evidence such misuse. This tutorial reframes copyright infringement and plagiarism as an evidence-discovery process rather than a binary classification task, and equips attendees with practical, reproducible forensic tools to audit models even under black-box access.
Across three hours we cover two complementary halves. Part I — Copyright Infringement introduces the legal framing and walks through Copyright Detective, an interactive forensic system unifying content-recall testing, paraphrase-level similarity analysis, persuasive-jailbreak probing, and unlearning verification. Part II — Plagiarism Detection builds on “Do Language Models Plagiarize?” to examine verbatim, paraphrase, and idea-level plagiarism in model outputs — and how to measure each. Attendees leave able to run these audits themselves.
Surface verbatim memorization through next-passage prediction and direct probing of source texts.
Catch leakage that survives rewording using ROUGE, semantic, Jaccard, and MinHash signals.
Stress-test refusal mechanisms with Ethos, Alliance-Building, and Reciprocity strategies.
Probe whether “forgotten” content is truly erased via representational drift analysis.
Distinguish verbatim, paraphrase, and idea-level plagiarism in generated text.
Frame infringement as discoverable evidence and produce defensible, reproducible audit reports.
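As a minimal sketch of two of the lexical signals named above (Jaccard and MinHash), the following Python snippet compares a model output against a source passage using word shingles. It is illustrative only and is not taken from Copyright Detective: the function names, the shingle size of 3, and the 128 hash functions are assumptions chosen for this example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in the lowercased text (k=3 is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def _hash(item, seed):
    """64-bit hash of a string, salted with the seed to simulate distinct hash functions."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value per seed; an empty set gives an empty signature."""
    if not items:
        return []
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates the Jaccard similarity."""
    if not sig_a or len(sig_a) != len(sig_b):
        return 0.0
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical inputs: a public-domain source line and a reworded model output.
source = "It was the best of times, it was the worst of times, it was the age of wisdom"
output = "It was the best of times and the worst of times, an age of wisdom"

src_set, out_set = shingles(source), shingles(output)
print("exact Jaccard:   ", jaccard(src_set, out_set))
print("MinHash estimate:", minhash_similarity(minhash_signature(src_set),
                                              minhash_signature(out_set)))
```

The exact Jaccard score needs both shingle sets in memory, while the MinHash signature has a fixed length and can be stored once per source text, which is why the two are often reported side by side when auditing against a large corpus.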
Tentative agenda · all times in Korea Standard Time (UTC+9) · subject to change.
Why LLM copyright and plagiarism are evidence problems, not classification problems. Legal landscape and scope of the tutorial.
Content-recall testing, paraphrase-level similarity, persuasive-jailbreak probing, and unlearning verification — demonstrated live in the interactive forensic system.
From “Do Language Models Plagiarize?” — measuring verbatim, paraphrase, and idea-level plagiarism, with hands-on detection workflows.
Responsible deployment, limitations of current methods, and where the research goes next.
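One simple way to score the content-recall step in the agenda above is to measure the longest run of words that a model continuation shares verbatim with the true next passage. The sketch below assumes the continuation has already been obtained from the model under audit. The helper names and the normalization by reference length are assumptions for illustration, not the metric used in the tutorial or in Copyright Detective.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def longest_verbatim_run(reference, generated):
    """Return the longest contiguous word sequence shared by both texts."""
    ref, gen = tokenize(reference), tokenize(generated)
    # autojunk=False disables difflib's popularity heuristic on long inputs.
    matcher = SequenceMatcher(None, ref, gen, autojunk=False)
    match = matcher.find_longest_match(0, len(ref), 0, len(gen))
    return ref[match.a:match.a + match.size]


def recall_score(reference, generated):
    """Length of the longest shared run divided by the reference length."""
    ref = tokenize(reference)
    if not ref:
        return 0.0
    return len(longest_verbatim_run(reference, generated)) / len(ref)


# Hypothetical example using a public-domain opening line.
reference = "Call me Ishmael. Some years ago, never mind how long precisely"
generated = "The narrator says: call me Ishmael, some years ago, and then drifts off."

print(longest_verbatim_run(reference, generated))
print(round(recall_score(reference, generated), 3))
```

A long shared run is evidence worth recording in an audit report, but a short one does not show that nothing was memorized, since sampling settings and prompt wording can change what the model emits.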
The full accepted proposal for the KDD 2026 tutorial.

Assistant Professor
Stevens Institute of Technology
Corresponding tutor
Algorithm Engineer
Pine AI
Lead author, Copyright Detective
Professor
The Pennsylvania State University
Plagiarism & trustworthy AI
Zhang, G., Zhu, J., Qian, C., Gong, N., Mihalcea, R., Xu, Z., He, J., Ma, J., Huang, Y., Xiao, C., Li, B., Abbasi, A., Lee, D., Ji, H., & Zhang, D. (2026). Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks. arXiv preprint arXiv:2602.05252 [cs.CL].
arxiv.org/abs/2602.05252
Lee, J., Le, T., Chen, J., & Lee, D. (2023). Do Language Models Plagiarize? In Proceedings of the ACM Web Conference 2023 (WWW ’23), pp. 3637–3647.
doi.org/10.1145/3543507.3583199
@misc{zhang2026copyrightdetective,
title = {Copyright Detective: A Forensic System to Evidence LLMs
Flickering Copyright Leakage Risks},
author = {Guangwei Zhang and Jianing Zhu and Cheng Qian and Neil Gong
and Rada Mihalcea and Zhaozhuo Xu and Jingrui He and Jiaqi Ma
and Chaowei Xiao and Bo Li and Ahmed Abbasi and Dongwon Lee
and Heng Ji and Denghui Zhang},
year = {2026},
eprint = {2602.05252},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2602.05252}
}