ChatGPT loses its quality, according to new studies

OpenAI took the world by surprise in late 2022 with the release of ChatGPT, a breakthrough AI language model. The service’s success set off an AI race, with hundreds of tech startups rushing to replicate it.

While the service has received some criticism, OpenAI has upgraded it, improving the language model to be as error-free as possible. After a couple of tweaks, ChatGPT appears to have found its stride.

The most recent version of the pioneering AI language model prompted a currency surge, spurring calls for research to be halted. However, a new study appears to indicate that the AI bot has suffered a setback, with its performance declining.

The ChatGPT study

Stanford and UC Berkeley researchers ran a study in which they systematically assessed the versions of ChatGPT available in March and June 2023. They set strict criteria for evaluating the chatbot’s ability in coding, math, and visual reasoning tasks. The results were unfavorable.

According to the testing results, there was a concerning decrease in performance between the versions evaluated. ChatGPT accurately answered 488 of 500 questions on prime numbers during a math challenge in March, obtaining a 97.6% accuracy rate. The proportion had dropped to 2.4% by June, with only 12 questions properly answered.
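The accuracy figures above follow directly from the raw counts. As a quick check, here is a minimal Python sketch; the function name `accuracy` is only illustrative and is not taken from the study.

```python
def accuracy(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Counts reported for the prime-number questions.
march = accuracy(488, 500)
june = accuracy(12, 500)

# The drop is expressed in percentage points, not as a relative change.
drop = march - june

print(f"March: {march:.1f}%  June: {june:.1f}%  drop: {drop:.1f} points")
```

Put differently, the June version answered correctly on roughly one question for every forty the March version got right.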

The decline was also apparent when the chatbot’s software development abilities were tested.

“For GPT-4, the percentage of generations that are directly executable dropped from 52.0% in March to 10.0% in June,” the study said.

The discoveries were made using the models’ pure versions, which did not incorporate any code interpreter plugins.
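To give a rough sense of what a “directly executable” rate measures, the sketch below counts how many generated answers at least compile as Python without any cleanup. This is a simplified stand-in and an assumption on our part: the study’s actual test harness is not described here, and the sample generations and function names are hypothetical.

```python
def is_directly_executable(generation: str) -> bool:
    """Return True if the raw text compiles as Python with no cleanup.

    Assumption: compiling is used as a cheap proxy for "runs as-is".
    A real evaluation would also run the code against test cases.
    """
    try:
        compile(generation, "<generation>", "exec")
    except (SyntaxError, ValueError):
        return False
    return True


def executable_rate(generations: list[str]) -> float:
    """Percentage of generations that compile without modification."""
    if not generations:
        raise ValueError("need at least one generation")
    passed = sum(is_directly_executable(g) for g in generations)
    return 100.0 * passed / len(generations)


# Hypothetical model outputs: the second wraps the same code in a
# markdown fence, so the raw text is no longer valid Python.
plain = "def add(a, b):\n    return a + b\n"
fenced = "```python\ndef add(a, b):\n    return a + b\n```"

rate = executable_rate([plain, fenced])
print(f"{rate:.1f}% directly executable")
```

The fenced sample shows how correct code can still fail such a check when extra text surrounds it.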

For visual reasoning, the researchers used visual prompts and a dataset from the Abstract Reasoning Corpus. There was a noticeable decrease, although it was not as severe as in math and coding.

“GPT-4 in June made mistakes on queries on which it was correct for in March,” the study said.

Possible reasons for the decline

The decline was unexpected, prompting the question, “What could explain ChatGPT’s painfully obvious downgrades in recent months?” One hypothesis from the researchers is that it could be a side effect of OpenAI’s optimizations.

Another likely explanation is that the changes were made as a precaution to prevent ChatGPT from responding to potentially dangerous queries. However, this safety alignment may restrict ChatGPT’s usefulness for other purposes.

According to the researchers, the model has a tendency to provide wordy, indirect solutions rather than straightforward ones.

On Twitter, AI researcher Santiago Valderrama stated, “GPT-4 is getting worse over time, not better.” He also speculated that the original ChatGPT framework may have been replaced by a cheaper, faster mix of models.

“Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run,” he noted.

Valderrama further remarked that while smaller models may produce faster responses, they do so at the cost of reduced knowledge.

“There are hundreds (maybe thousands already?) of replies from people saying they have noticed the degradation in quality,” Valderrama continued. “Browse the comments, and you’ll read about many situations where GPT-4 is not working as before.”

Other insights

Dr. Jim Fan, another AI researcher, shared some of his own observations on Twitter after attempting to make sense of the data. Fan related the findings to how OpenAI refines its models.

“Unfortunately, more safety typically comes at the cost of less usefulness, leading to a possible degrade in cognitive skills,” he wrote.

“My guess (no evidence, just speculation) is that OpenAI spent the majority of efforts doing lobotomy from March to June, and didn’t have time to fully recover the other capabilities that matter.”

Fan also pointed out that the safety alignment made the generated code needlessly long by including extraneous information regardless of the prompt.

“I believe this is a side effect of safety alignment,” he offered. “We’ve all seen GPTs add warnings, disclaimers, and back-pedaling.”

Fan speculated that cost-cutting efforts, as well as the addition of warnings and disclaimers, contributed to ChatGPT’s decline. In addition, the lack of extensive community feedback might have played a role. Although additional testing is needed, the results appear to confirm users’ fears concerning the deterioration of ChatGPT’s once highly praised outputs.

To avoid future degradation, supporters have called for open-source solutions like Meta’s LLaMA, which allows for community debugging. They also emphasized the need for continuous benchmarking to detect regressions.
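As a rough illustration of what continuous benchmarking could look like, the Python sketch below compares scores from two model snapshots on the same tasks and flags any task whose score fell by more than a chosen tolerance. The task names, the `find_regressions` helper, and the one-point tolerance are assumptions made for this example; only the March and June percentages come from the study cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    task: str
    before: float
    after: float

    @property
    def drop(self) -> float:
        """Size of the decline in percentage points."""
        return self.before - self.after


def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 1.0) -> list[Regression]:
    """Flag tasks whose score fell by more than `tolerance` points.

    Tasks missing from `candidate` are skipped rather than guessed.
    """
    found = []
    for task, before in baseline.items():
        after = candidate.get(task)
        if after is None:
            continue
        if before - after > tolerance:
            found.append(Regression(task, before, after))
    # Largest drops first, so the worst regressions are seen first.
    return sorted(found, key=lambda r: r.drop, reverse=True)


# Scores (in percent) reported for the March and June versions.
march = {"prime_numbers": 97.6, "executable_code": 52.0}
june = {"prime_numbers": 2.4, "executable_code": 10.0}

regressions = find_regressions(march, june)
for r in regressions:
    print(f"{r.task}: {r.before:.1f}% -> {r.after:.1f}% "
          f"(down {r.drop:.1f} points)")
```

Running a check like this on every model update, against a fixed set of questions, would surface declines of this kind as soon as they appear.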

Meanwhile, ChatGPT enthusiasts should temper their expectations, as the quality of the pioneering AI chatbot appears to have declined.