OpenAI: Copyrighted material is crucial for training AI chatbots

(Image credit: Daniel Rubino)

What you need to know

OpenAI has found itself in the corridors of justice after being slapped with multiple lawsuits over copyright infringement.
The company admits that it's impossible to create AI chatbots without using copyrighted material from the internet.
In its submission, the company argued that copyright law doesn't forbid training.

While the fiasco that led OpenAI's board of directors to strip Sam Altman of his position as CEO is out of the way, the company can't catch a break, as more trouble is seemingly brewing. As 2023 came to an end, The New York Times publicly announced its plans to sue Microsoft and OpenAI for unfairly using its copyrighted material to train AI, a practice that negatively impacted the outlet financially.

Recently joining the fray, two non-fiction authors filed a class-action lawsuit against Microsoft and OpenAI for intellectual property theft, further staking a claim of $150,000 as restitution for damages. For those unaware, AI-powered chatbots like OpenAI's ChatGPT or Microsoft's Copilot (formerly Bing Chat) heavily ~~steal~~ rely on already existing information and resources from the internet (predominantly from websites) for training purposes.

The issue here is that the AI chatbots use the information to curate specific and detailed responses to queries, with "subtle" attribution to the source. What's more, no compensation is provided to content creators for using their work to train these models.

OpenAI recently admitted, in its submission to the House of Lords Communications and Digital Select Committee, that it's literally "impossible" to create tools like ChatGPT without copyrighted material from the internet. For an AI chatbot to provide users with accurate information, it has to refer to vast resources already existing on the internet. However, the twist is that virtually everything on the internet right now is copyrighted.

Because copyright today covers virtually every sort of human expression – including blogposts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials.
OpenAI

OpenAI indicated that limiting its training data set to copyright-free material would create AI chatbots that cannot meet the average user's minimum requirements. Per the company's submission and defense strategy, it's apparent that "fair use" of copyrighted content is its entire lifeline.

Fair use of copyright resources creates a gray area, ultimately presenting a scenario where chatbots can obtain and use copyrighted information without necessarily seeking permission from the owner first. "Legally, copyright law does not forbid training," OpenAI added.

There's no AI without copyrighted content

OpenAI and ChatGPT — (Image credit: Daniel Rubino)

OpenAI, one of the most sought-after companies when it comes to generative AI, has openly admitted that it's next to impossible to create AI-powered chatbots like ChatGPT without using copyrighted material to train the models. This is despite having unlimited access to Microsoft resources, on top of its initial multibillion-dollar investment in the technology.

In the past few months, ChatGPT has suffered several setbacks, including reports that it's getting dumber and a decline in its user base. This is amid speculation that OpenAI is running on fumes and on the verge of bankruptcy. Granted, running a chatbot daily is quite a costly affair. In figures, it reportedly costs to the tune of $700,000 per day, plus one water bottle per query for cooling. A report highlighted that generative AI could consume enough energy to power a small country for a year by 2027.

While the matter is still in court, it'll be interesting to see how things pan out. President Biden issued an Executive Order addressing safety and privacy concerns revolving around AI, but guardrails for the technology remain a major concern among most users.

AI chatbots have been spotted having lucid hallucinations, erroneously recommending a food bank as a tourist attraction, and even asking readers to take part in a poll to determine the cause of a woman's unfortunate passing. If this happened while the chatbots had access to copyrighted material, it raises a lot of concern about how much damage the technology would cause when restricted to copyright-free data. In the meantime, Google's Bard could potentially rise through the ranks, given its unlimited access to the entire internet.

What are your thoughts on AI chatbots using copyrighted resources without compensation and sweeping the issue under the rug as "fair use"? Let us know in the comments.

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.

1 comment from the forums

naddy69

Copyright laws are some of the clearest, easiest to enforce laws on the books. You simply can't use/reprint/distribute copyrighted material without the written consent of and/or paying the copyright holder. Period.

It is important to note that it is up to the copyright holder to defend the copyright. If a copyright holder knowingly lets someone use the material without written consent and/or payment, the copyrighted material becomes public domain and is no longer copyrighted.

This is why we are seeing these lawsuits. I guarantee you that more will come.

"OpenAI recently admitted that it's literally "impossible" to create tools like ChatGPT without copyrighted material from the internet"

Then you better re-think your business model. All of this "AI" junk is going to be seriously derailed by this. "Fair Use" does not mean using any amount - that YOU deem acceptable - of copyrighted material for free. The copyright holders determine this, not you. You will HAVE to pay up.

That's the whole purpose of copyrights. It is - literally - the Right To Copy.

Which means you will have to charge everyone that uses "AI", every time they use it. Or you will have to NOT include copyrighted material from everyone who sues you. And if you are not paying the copyright holders, the number of people suing you is only going to grow.

"Legally, copyright law does not forbid training," OpenAI added.

Really? What if schools used illegally copied books for "training" students? Do you really think they could get away with that? The schools BUY the required books and lend them to the students. In college, each student BUYS the required books.

In neither case are the students provided free, bootleg copies by the school/college. The copyright holders ARE PAID for their copyrighted materials. Period.

Otherwise it does not get used by the school/college. Period.