Stanford Study: ChatGPT Is Suddenly Much Worse at Math

OpenAI logo seen on screen with ChatGPT website displayed on mobile in this illustration photo. (Jonathan Raa/NurPhoto via Getty Images)

A recent study conducted by Stanford University has revealed significant performance fluctuations in OpenAI’s AI chatbot, ChatGPT, over a span of just a few months. When the researchers tested OpenAI’s GPT-4 model, it correctly identified the number 17077 as prime 97.6 percent of the time in March, but just 2.4 percent of the time in June.

Fortune reports that the AI chatbot ChatGPT from OpenAI has experienced significant performance changes over a period of a few months, according to a recent Stanford University study.

OpenAI CEO Sam Altman speaks in Abu Dhabi, United Arab Emirates, Tuesday, June 6, 2023. Altman suggested an international agency like the International Atomic Energy Agency could oversee artificial intelligence worldwide while visiting the United Arab Emirates. (AP Photo/Jon Gambrell)

The study, led by Stanford computer science professor James Zou, examined the performance of two versions of the chatbot, GPT-3.5 and GPT-4, across four distinct tasks: solving math problems, answering sensitive questions, generating software code, and visual reasoning. The results revealed unexpected shifts in the chatbot’s ability to execute certain tasks.

One of the most striking findings was the drastic change in GPT-4’s ability to solve math problems. According to the study, GPT-4 was able to recognize that the number 17077 is a prime number 97.6 percent of the time when it was asked in March. But only three months later, its accuracy fell to a pitiful 2.4 percent. In contrast, the GPT-3.5 model showed an opposite trajectory, improving its accuracy from 7.4 percent in March to 86.8 percent in June on the same task.
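For readers who want to check the arithmetic themselves, the number in question is easy to verify: the minimal trial-division sketch below (written in Python, and not part of the Stanford study’s methodology) confirms that 17077 is indeed prime, so the correct answer in both March and June would have been “yes.”

    # Minimal trial-division check, used here only to confirm that 17077,
    # the number cited in the study, is in fact prime.
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        if n % 2 == 0:
            return n == 2
        d = 3
        while d * d <= n:  # divisors only need to be checked up to sqrt(n)
            if n % d == 0:
                return False
            d += 2  # even candidates were already ruled out above
        return True

    print(is_prime(17077))  # True: 17077 has no divisors besides 1 and itself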

Zou and his team attribute these fluctuations to the unpredictable effects of changes in one part of the model on others. “When we are tuning a large language model to improve its performance on certain tasks, that can actually have a lot of unintended consequences, which might actually hurt this model’s performance on other tasks,” Zou said.

The researchers also noted that the exact nature of these unintended side effects remains elusive due to the lack of visibility into the models powering ChatGPT. “These are black-box models,” Zou stated, “So we don’t actually know how the model itself, the neural architectures, or the training data have changed.”

Another key finding from the study was ChatGPT’s failure to properly explain its reasoning process. In March, the chatbot would lay out its “chain of thought” when answering, but by June it no longer did. “For reasons that are not clear,” Zou said, “ChatGPT stopped showing its step-by-step reasoning.”

Zou emphasized the importance of transparency in AI models, explaining that a chatbot should display its work so that researchers can examine how it arrives at specific conclusions, in this case, whether or not 17077 is a prime number. He also highlighted the need for continuous monitoring of AI models’ performance over time, stating, “The main message from our paper is to really highlight that these large language model drifts do happen. It is prevalent. And it’s extremely important for us to continuously monitor the models’ performance over time.”

Read more at Fortune here.

Lucas Nolan is a reporter for Breitbart News covering issues of free speech and online censorship. Follow him on Twitter @LucasNolan
