New sources of better AI training data

Large Language Models (LLMs) are no longer trained solely on data scraped from the Internet: that vast pool has largely been exhausted, and simply adding more of it yields diminishing returns. To push LLMs further, companies like OpenAI are turning to new kinds of data. Targeted annotation and filtering raise the quality of existing corpora, human feedback shapes model behavior, and proprietary data, such as chat histories and internal documents, widens the training base. The biggest shift, however, comes from two newer approaches: synthetic data generated by LLMs themselves, and purpose-built human-generated datasets that fill gaps the Internet leaves open. These make it possible to train skills that could not be adequately learned before. As a result, LLMs are no longer mere “Internet simulators”; they learn to master complex tasks that are barely represented on the web.
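To make the synthetic-data idea concrete, here is a minimal sketch of a self-instruct-style generation loop, in the spirit of published bootstrapping pipelines. Everything in it is illustrative: the `generate` function is a placeholder for any LLM completion call, and the seed tasks and filtering thresholds are assumptions; real pipelines apply far stronger quality and diversity filters.

```python
import random
from typing import Callable

def synthesize_examples(
    generate: Callable[[str], str],  # placeholder for any LLM completion API
    seed_tasks: list[str],
    n_rounds: int = 100,
) -> list[str]:
    """Self-instruct-style loop: show the model existing tasks, ask it to
    propose a new one, and keep only candidates passing simple filters."""
    pool = list(seed_tasks)
    seen = {t.strip().lower() for t in pool}
    for _ in range(n_rounds):
        # Sample a few existing tasks and ask the model to continue the list.
        shots = "\n".join(f"- {t}" for t in random.sample(pool, min(3, len(pool))))
        prompt = (
            "Here are examples of training tasks:\n"
            f"{shots}\n"
            "Write one new, clearly different task:\n- "
        )
        candidate = generate(prompt).strip()
        # Cheap quality filters (illustrative): non-trivial length, no duplicates.
        key = candidate.lower()
        if 20 < len(candidate) < 300 and key not in seen:
            seen.add(key)
            pool.append(candidate)
    return pool[len(seed_tasks):]  # return only the newly generated tasks
```

The point of the sketch is the feedback loop itself: the model's own outputs grow the pool it is prompted from, which is what lets synthetic data reach beyond what the original corpus covered.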
