- Researchers from leading American universities warn that extending pre-training can harm performance
- Too much pre-training can lead to worse performance, due to something akin to the butterfly effect
- The longer models are trained, the more sensitive they become to small changes that can disrupt the final result
Researchers from Carnegie Mellon, Stanford, Harvard, and Princeton are challenging one of the central accepted beliefs of AI development: that the more training data, the better the performance.
As reported by HPCwire, a new paper describes the concept of "catastrophic overtraining," whereby extended pre-training can damage a model's performance after fine-tuning.
The researchers compared two versions of the OLMo-1B model, one trained on 2.3 trillion tokens and another on 3 trillion. Despite the larger training set, the more heavily trained model performed up to 3% worse on benchmarks such as AlpacaEval and ARC.
Reaching the inflection point
This performance drop, according to the study, is linked to a phenomenon called “progressive sensitivity.”
As the token count increases, the model becomes more fragile. Even small adjustments, such as tweaks made during fine-tuning or the introduction of noise, can reverse earlier gains.
The authors demonstrated this by injecting Gaussian noise into pre-trained models, noting that performance degraded more sharply the longer a model had been trained.
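The gist of that probe can be illustrated with a minimal Python sketch. This is not the authors' code: the GPT-2 checkpoint, the noise scale, and the loss-on-a-sample proxy for benchmark performance are all illustrative assumptions; the paper itself compared OLMo-1B checkpoints at different pre-training budgets.

```python
# Hedged sketch: perturb a pre-trained model's weights with Gaussian noise and
# compare a simple loss proxy before and after. Checkpoint and noise scale are
# illustrative assumptions, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # assumption: substitute checkpoints trained on different token budgets
SAMPLE = "Language models are trained to predict the next token."

def eval_loss(model, tokenizer, text=SAMPLE):
    """Cross-entropy loss on a sample text, used as a stand-in for benchmark scores."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs, labels=inputs["input_ids"]).loss.item()

def add_gaussian_noise(model, std=0.01):
    """Perturb every parameter in place with zero-mean Gaussian noise."""
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * std)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

before = eval_loss(model, tokenizer)
add_gaussian_noise(model, std=0.01)
after = eval_loss(model, tokenizer)

# A larger jump in loss for a longer-trained checkpoint is the signature of
# the "progressive sensitivity" described above.
print(f"loss before noise: {before:.3f}, after noise: {after:.3f}")
```

Running the same perturbation on checkpoints saved at different pre-training budgets, and comparing how much each one degrades, mirrors the comparison the study reports.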
The point at which this additional training begins to degrade performance is called the "inflection point."
Once it is reached, the benefits of further training begin to be outweighed by the risk of internal instability. The study found that this inflection point often occurs beyond 2.5 trillion tokens in smaller models such as OLMo-1B.
"Catastrophic overtraining may be inevitable … especially when the pre-training and fine-tuning tasks are misaligned," the authors warn in their paper, which is available via the arXiv preprint server.
While the researchers do not suggest an end to pre-training, they do believe developers should consider how much pre-training is enough. As the paper concludes, "our findings call for a renewed focus on model scaling that considers the entire training pipeline."
For developers chasing scale, the message seems clear: sometimes, less is more.