ChatGPT’s Carbon Footprint and Illegal Data Scraping

ChatGPT’s Alleged Illegal Data Scraping Risks a New York Times Lawsuit

In a new turn of events, The New Yorker reported that ChatGPT uses roughly 500,000 kilowatt-hours of electricity a day. Meanwhile, a possible move by The New York Times to sue OpenAI could spell doom for ChatGPT, with penalties of up to $150,000 for each piece of infringing content.

ChatGPT’s woes seem endless. New research shows that ChatGPT uses about 17,000 times more energy per day than an average US household. At this rate, OpenAI will soon join the list of the commercial sector’s biggest electricity consumers, such as Samsung, Google, and Microsoft, which consume roughly 23, 12, and 10 terawatt-hours per year respectively.

The Times Sues OpenAI

This is not the first time the big tech company has come under the spotlight over copyright concerns. Insiders reported that after weeks of tense negotiations, OpenAI and the NYT failed to reach an agreement on a possible licensing deal.

As an initial step, the NYT updated its terms of service to prohibit AI companies from using its content to train AI models. What is intriguing about the possible lawsuit is that should the NYT press charges and win, OpenAI would bear a heavy burden in penalties. First, OpenAI could be forced to wipe out the datasets behind ChatGPT. Experts further point out that the ChatGPT maker could pay up to $150,000 for every piece of infringing content.

Well, that’s scary, right?

OpenAI’s success with ChatGPT has brought a myriad of legal woes that could bring down the large language model. Sarah Silverman and two other authors pressed charges over copyright infringement by OpenAI and Meta. Similarly, top international publishers have pressed charges against OpenAI and Google over intellectual property rights. NYT representatives attending the launch of Google’s Genesis pointed out that it “seemed to take for granted the effort that went into producing accurate and artful news stories.”

So, How Exactly Is ChatGPT Trained?

ChatGPT is trained through a two-step process: pretraining and fine-tuning. This process allows it to generate coherent and contextually relevant responses based on existing content. Here’s a simplified overview of how the training works:

Pretraining

During pretraining, the model is trained on a large corpus of publicly available text from the internet. This corpus includes a wide range of topics and writing styles, allowing the model to learn grammar, facts, reasoning abilities, and even some level of common sense. The training data consists of parts of books, articles, websites, and more.

The model learns to predict the next word in a sentence based on the preceding words. It learns to capture patterns, context, and linguistic relationships in the text.
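To make the next-word objective concrete, here is a deliberately tiny, hypothetical sketch: a count-based bigram predictor in plain Python. It is nothing like GPT’s actual transformer architecture, but it illustrates the same idea of predicting the next word from the preceding context by learning patterns from a corpus.

```python
from collections import Counter, defaultdict

# Toy illustration (not OpenAI's actual code): count how often each
# word follows each preceding word, then predict the most common
# successor -- the bigram version of "predict the next word".
corpus = "the model learns to predict the next word in a sentence".split()

followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word):
    """Return the word most frequently seen after `word`, or None."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]

print(predict_next("next"))  # the corpus only ever follows "next" with "word"
```

Real pretraining replaces the counts with a neural network trained by gradient descent over hundreds of billions of tokens, but the prediction target is the same.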

Fine-tuning

After pretraining, the model is not directly ready for generating specific responses. It has learned a lot from the internet data, but it also learned some biases and factual inaccuracies present in that data. Fine-tuning is the crucial step to make the model safe and controlled.

Fine-tuning involves training the pre-trained model on a narrower dataset that is carefully curated with the help of human reviewers. These reviewers follow guidelines provided by OpenAI to review and rate possible model outputs for a range of example inputs. The model generalizes from this reviewer feedback to respond to a wide array of user inputs safely and coherently.
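The reviewer-rating step can be sketched as follows. This is a hypothetical simplification (the responses, ratings, and function names are illustrative, not OpenAI’s actual pipeline): reviewers score candidate outputs, and the highest-rated candidate becomes the preferred target the model is nudged toward.

```python
from statistics import mean

# Hypothetical reviewer ratings (1-5) for three candidate responses
# to the same prompt. In practice, signals like these feed a reward
# model rather than a simple lookup.
candidate_ratings = {
    "Here is some made-up information...": [1, 2, 1],    # penalized
    "I can't help with that request.": [4, 5, 4],        # safe refusal
    "Here is a sourced, accurate answer...": [5, 5, 4],  # preferred
}

def preferred_response(ratings):
    """Pick the candidate with the highest mean reviewer rating."""
    return max(ratings, key=lambda resp: mean(ratings[resp]))

print(preferred_response(candidate_ratings))
```

The key point is that human judgment, not raw internet text, shapes which behaviors the fine-tuned model reproduces.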

OpenAI maintains a strong feedback loop with reviewers, involving weekly meetings to address questions, provide clarifications on guidelines, and iteratively improve the model’s behavior.

A keen look at the process of training ChatGPT quickly shows that content creators and the owners of publicly available information have little to no involvement.

Generative AI Violating Copyright Laws

There is no doubt that large language models survive by copying millions of works to train their AI tools. This implies that a successful exit from the court corridors largely depends on how AI companies negotiate solutions with content owners to escape intellectual property penalties. In the words of Daniel Gervais, “Copyright law is a sword that’s going to hang over the heads of AI companies for several years unless they figure out how to negotiate a solution.”

As much as it is in no one’s interest to wipe out ChatGPT’s datasets, OpenAI appears to be in direct competition with giants like the NYT after illegally scraping their data. This unfair competition puts publishers and content creators in an awkward position. The NYT’s lawyers are optimistic that fair use cannot protect the ChatGPT maker, as the tool acts more as a replacement for their content.

Aparajeeeta Das

©2023 InstaMart.AI Inc. All rights reserved.
