GPT-4 Found to Reproduce Copyrighted Content from Memory, Stirring Debate over AI Norms


By Global Team

GPT-4, the model behind ChatGPT, has been found to reproduce sentences from novels and news articles verbatim from memory. A newly published paper is the first to document the model regurgitating content it memorized during training.

According to a paper published by researchers from the University of Washington and the Allen Institute for AI (AI2), GPT-4 tends to remember and regenerate sentences it encountered during training. Researchers Abhilasha Ravichander, Yejin Choi, Chandra Bhagavatula, and others demonstrated this in a paper published in March 2025 using an analysis technique called “information-guided probing.” The method statistically determines whether GPT-4 has output specific memorized sentences, without access to the model’s internal structure or weights.
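The general idea of such a black-box probe can be illustrated with a toy sketch: show a model the opening of a passage and measure how closely its continuation matches the true continuation. This is only an illustration of the principle, not the paper’s actual method; the `generate` callable, the prefix length, and the similarity threshold are all assumptions for the example.

```python
import difflib

def memorization_probe(generate, passage, prefix_len=50, threshold=0.9):
    """Black-box memorization check: feed a model the opening words of a
    passage and compare its continuation to the real continuation.
    High similarity suggests verbatim memorization. `generate` is any
    text-completion callable; no model internals are needed."""
    words = passage.split()
    prefix = " ".join(words[:prefix_len])
    true_suffix = " ".join(words[prefix_len:])
    completion = generate(prefix)
    # Character-level similarity between the model's continuation
    # and the passage's actual continuation (1.0 = identical).
    score = difflib.SequenceMatcher(None, completion, true_suffix).ratio()
    return score, score >= threshold

# Toy stand-in for a model that has memorized the passage verbatim.
passage = ("Call me Ishmael. Some years ago, never mind how long precisely, "
           "having little or no money in my purse, I thought I would sail.")
echo_model = lambda prefix: passage[len(prefix) + 1:]  # returns the exact suffix

score, memorized = memorization_probe(echo_model, passage, prefix_len=10)
print(round(score, 2), memorized)
```

A model that had not memorized the passage would produce a plausible but different continuation, yielding a much lower similarity score.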

An analysis of the BookMIA ebook dataset revealed numerous instances of GPT-4 reproducing passages from novels verbatim, down to word order, proper nouns, and punctuation. Some news articles, including content from the New York Times, were reproduced in the same manner, albeit less frequently, suggesting they may have been memorized in the same way.

OpenAI’s Sam Altman

Copyright infringement lawsuit

The research findings bear directly on the copyright infringement lawsuit the New York Times filed against OpenAI. In late 2023, the Times sued, alleging unauthorized use of its articles in OpenAI’s model training. The core issue is whether GPT-4 can, at a user’s request, output copyrighted content it memorized during training.

The “information-guided probing” technique used in the study is seen as technical evidence that could help resolve this question. The finding that GPT-4 remembered and reproduced sentences it encountered during training suggests that such output may be the original author’s expression rather than the AI’s own creation.

Under U.S. copyright law, it is the specific expression, not the underlying idea, that is protected. If a sentence generated by GPT-4 closely mirrors the structure of the original, it could therefore be deemed infringing despite a fair-use defense. Moreover, if the AI produces copyrighted sentences at a user’s request, the company could face liability for indirect infringement.

Copyright dispute between New York Times and OpenAI

Technological alternatives, institutional shortcomings

Researchers have also proposed technological countermeasures against memorization. “SUV” (selective unlearning) blocks the model from retaining specific data, such as copyrighted content, while “DE-COP” is an analysis technique that identifies after the fact whether generated sentences originated in the training data. There is, however, no confirmed evidence that these techniques have been applied to commercial models like GPT-4.
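The simplest form of after-the-fact provenance checking can be sketched as a verbatim n-gram overlap test (to be clear, DE-COP itself uses a different, more sophisticated approach; the function below is a generic illustration, and its names and parameters are assumptions):

```python
def ngram_overlap(candidate, corpus_docs, n=8):
    """Fraction of the candidate text's word n-grams that appear verbatim
    in any corpus document. A high fraction indicates the candidate likely
    copies from the corpus; 0.0 indicates no verbatim n-gram reuse."""
    def ngrams(text):
        w = text.split()
        return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}
    cand = ngrams(candidate)
    if not cand:
        return 0.0  # candidate shorter than n words
    corpus = set().union(*(ngrams(d) for d in corpus_docs))
    return len(cand & corpus) / len(cand)

corpus = ["the quick brown fox jumps over the lazy dog while the cat sleeps"]
copied = "the quick brown fox jumps over the lazy dog while the cat sleeps"
fresh = "completely different words that never appeared in any training document at all"
print(ngram_overlap(copied, corpus), ngram_overlap(fresh, corpus))
```

Real attribution systems must also handle paraphrase and near-duplicates, which verbatim n-gram matching cannot catch on its own.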

The idea of embedding a watermark in generated content to trace its origin is also under discussion. But watermarking applies only after generation; it cannot detect or remove memorized expressions already embedded in the model. For models like GPT-4, which have already been trained on vast amounts of data, the technology to completely eliminate memorized content remains in its infancy.

OpenAI logo

Call for redefinition of norms

Debate over how far content retained by large language models during training should count as memorization is expanding beyond technology into law and policy. Without clear standards on data provenance, copyright-holder consent, and the scope of permissible training, similar disputes are bound to recur.

The GPT-4 case is only the beginning, as large language models worldwide are not free of copyright exposure. The memorization findings presented in the paper go beyond anecdotal examples and could serve as empirical evidence shaping future court rulings.

How AI technology should balance freedom of expression against creators’ rights is now a question for legislatures and courts in each country. As the technology advances, the legal responsibilities and ethical standards that govern it must grow correspondingly sophisticated.
