Large language models (LLMs) sit at the centre of today's AI boom. They underlie chatbots like OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Gemini. These models are improving all the time: just a few weeks ago, OpenAI released what it claims is a “majorly improved” version of GPT-4 Turbo. But how do they work?
LLMs, like most modern AI systems, are powered by complex machine-learning algorithms that use statistical techniques to discern patterns in huge amounts of data. They are built from three components: the computing power required to train and run them (imagine warehouses full of advanced computer chips); the data on which they are trained; and the algorithmic architecture that enables the system to learn patterns and make predictions. While all three are important, it is advances in algorithmic architecture that have laid the foundation for the current generation of LLMs.
In 2017, researchers at Google developed the “transformer architecture”. This improved on the previous state of the art in natural language processing — the subfield of AI concerned with allowing computers to comprehend and produce language — by introducing the mechanism of “self-attention”. Whereas previous models processed words in a sentence sequentially, the transformer architecture allows a model to analyse and weigh the importance of all words in a sentence simultaneously. This has enabled LLMs to grasp the nuances of language with far greater depth and accuracy than was possible in the previous paradigm.
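To make this concrete, here is a minimal sketch of the self-attention calculation in Python with NumPy. The projection matrices W_q, W_k and W_v and the toy dimensions are illustrative stand-ins rather than the parameters of any real model, and a production transformer adds many refinements (multiple attention heads, masking, stacked layers) that are omitted here.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors."""
    Q = X @ W_q  # queries: what each token is looking for
    K = X @ W_k  # keys: what each token offers
    V = X @ W_v  # values: the information each token carries
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise relevance of every token to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: attention weights per token
    return weights @ V  # each output mixes information from all tokens at once

# Toy example: a "sentence" of 4 tokens, each an 8-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8): every token attends to all others
```

The key point is in the final matrix multiplication: every output row is a weighted mixture of the whole sequence, computed all at once rather than token by token.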
LLMs are trained in two stages. In the first stage (“pre-training”) the model is fed huge amounts of data, which it breaks down into a series of small chunks called tokens. It is then trained to predict the probability of the next token, given those that precede it. In doing so, the model develops an understanding of language and grammar: which words appear alongside each other, and in what contexts. It does this by comparing its predictions against the actual data and adjusting its internal weights — numerical values within the model's architecture — until the discrepancy between what it predicts and what the data contains is minimal. This process produces what is often called a “base model”. Given any prompt, a base model will respond simply by continuing to predict what comes next.
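The prediction idea can be caricatured in a few lines of Python. The toy “model” below conditions on just one preceding token and learns by tallying rather than by adjusting billions of weights, so it is a sketch of the objective, not of how an LLM is actually trained; the corpus is an invented example.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "huge amounts of data".
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each preceding token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Probability of each possible next token, given the preceding one."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

print(next_token_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
# A real LLM learns the same kind of next-token distribution, but over a vast
# vocabulary, conditioned on long contexts, with billions of weights adjusted
# by gradient descent to minimise prediction error.
```

A real base model learns the same kind of distribution, with the weight-adjustment process described above standing in for these simple counts.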
In the second stage (“fine-tuning”), the model is tailored for specific tasks and preferences through exposure to a smaller, task-relevant dataset. This stage involves not just the algorithmic adjustment of the model’s weights but also human intervention. People are hired to evaluate the responses generated by the base model, rating them as good or bad based on accuracy, relevance, and appropriateness for the task at hand. In this way, base models are sculpted into a particular personality, such as the helpful, harmless assistant persona presented by Claude, GPT, and Gemini. In other cases, as with Meta’s open-source model Llama 2, the weights are released publicly so that users can fine-tune the model themselves.
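Continuing the toy counting model from the previous sketch, fine-tuning can be caricatured as further training on a smaller, task-specific corpus that shifts the model’s predictions. This is only an analogy: real fine-tuning adjusts the model’s weights by gradient descent, often guided by the human ratings described above (a process known as reinforcement learning from human feedback), and both corpora here are invented examples.

```python
from collections import Counter, defaultdict

def train(corpus, counts=None):
    """Update next-token counts from a corpus (our stand-in for weight updates)."""
    counts = counts or defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def probs(counts, prev):
    """Next-token probabilities after a given token, rounded for readability."""
    total = sum(counts[prev].values())
    return {tok: round(n / total, 2) for tok, n in counts[prev].items()}

# "Pre-training" on a broad corpus, then "fine-tuning" on task-specific data.
base = train("the cat sat on the mat the dog sat on the rug".split())
print("base:      ", probs(base, "sat"))   # {'on': 1.0}
tuned = train("the assistant sat quietly and answered politely".split(), base)
print("fine-tuned:", probs(tuned, "sat"))  # {'on': 0.67, 'quietly': 0.33}
```

Note how the fine-tuned distribution still reflects the broad pre-training data but leans towards the new, task-specific examples.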
To the surprise of experts, scaling up the amount of computing power and data used to train these models has led to improvements in their capacities to reason, evidenced by their ever-improving performance across a range of technical benchmarks and real-world tests. The transformer architecture appears to have further room to grow, too — suggesting that LLMs will continue improving at pace, at least in the near future.