What is ChatGPT?
ChatGPT has been in the news a lot recently, and we’re going to hear a lot more about it in the near future. GPT stands for Generative Pre-trained Transformer, meaning that it uses a transformer network (a type of Deep Learning neural network) trained on a very large corpus of text to produce human-like text based on a prompt. GPT is a type of Large Language Model (LLM) and there are several other competing models in various states of development. One way to think of this is that GPT is a neural network implementation of a predictive text algorithm.
ChatGPT adds an interactive user interface to the front end of the GPT-3 model, allowing users to quickly and easily interact with the tool, and even to fine-tune how it responds based on previous interactions and examples of different writing styles. Give ChatGPT a prompt, and it will build an answer, an essay, or even a joke on a given topic by “analyzing” what has been written before in its training data (and any data you have provided) and generating predictive text on that topic.
This post will break down what a predictive text algorithm is generally, what neural networks are and how they work, why ChatGPT is built using a neural network, and some background on ChatGPT’s development and how the underlying models were trained. The next posts will begin to look at some of the problems with LLMs, including GPT, and with ChatGPT specifically.
Predictive Text Algorithms
One way to conceptualize a predictive text algorithm is to imagine looking at every book that has ever been published (GPT also uses a much larger body of internet text, which presents additional problems that I’ll look at in a future post) and building a table of probabilities for every common English word (there are about 40,000 of them) that captures the odds of what the next word will be. We get a 40,000 x 40,000 table of words, with each cell filled with a probability representing how often in the training text the second word follows the first. As you might imagine, many of the entries will be zeroes.
With this table in hand, we can “prompt” the table by giving it a keyword, and then hop through the table by picking the entry with the highest probability of being the next word. If we ask our model to generate text based on the prompt “cat” we might get something like “cat through shipping variety is made the aid emergency can the.” Hardly a useful sentence.
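To make this concrete, here is a minimal Python sketch of the idea, using a ten-word toy corpus in place of every book ever published. The corpus, the `next_word` helper, and the greedy pick-the-most-likely strategy are all illustrative assumptions, not how GPT actually works:

```python
from collections import defaultdict, Counter

# A toy corpus standing in for "every book ever published".
corpus = "the cat sat on the mat the cat chased the dog".split()

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word(word):
    """Greedily pick the word that most often followed `word` in the corpus."""
    counts = follow_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

# "Prompt" the table with a keyword and hop from word to word.
word, output = "cat", ["cat"]
for _ in range(6):
    word = next_word(word)
    if word is None:
        break
    output.append(word)
print(" ".join(output))
```

Even on a real corpus, chaining one-word lookups like this produces the kind of disjointed output shown above, which is why the next step is to look further back.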
In order to get an output that makes more sense (seems more like something a human would write), we might extend our table to include the probabilities for the most likely next word given the previous two words. Our output in this case would be a little better, but now our table has 64 trillion entries (40,000^3), rather than just 1.6 billion (40,000^2). If we want to continue to improve the output of our algorithm, we need to extend the table to greater and greater depths of the words that came before. The problem with this approach is that it becomes too computationally expensive before it starts to produce good-quality text. (For example, if we want to consider the previous 13 words, still not enough to produce a coherent essay-length output, we need a table with one entry for each hydrogen atom in 100 stars the size of the Sun.) This problem is compounded by the fact that we also don’t have enough written text to fill out much of the table at that depth.
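A quick back-of-the-envelope check of those table sizes (each additional word of context multiplies the number of entries by 40,000):

```python
vocab_size = 40_000
for words_per_entry in (2, 3, 4):
    # One entry per possible sequence of this many words.
    print(words_per_entry, f"{vocab_size ** words_per_entry:,}")
```

This prints 1.6 billion for word pairs and 64 trillion for triples, and the numbers only get worse from there.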
Because of these limitations, we need a better approach: one that is more computationally efficient, and one that can achieve reasonable results using the training data we do have. GPT (like all LLMs) uses a neural network to estimate the probabilities that we don’t have the computational capacity, or the volume of training data, to tabulate directly.
Neural Networks
The idea of Neural Networks dates back to 1943, when Warren McCulloch and Walter Pitts formulated a simple model of networks of neurons and attempted to analyze the model mathematically. The first discussion of using neural networks as a possible approach to artificial general intelligence (AGI, or teaching machines to think “like” humans) was in the 1948 paper Intelligent Machinery by Alan Turing. In 1949, Donald Hebb argued that such networks would be capable of learning. Computer simulations of neural networks began in the 1950s, but it soon became apparent that we didn’t have enough computing power to model enough neurons at a fast enough speed to do anything interesting. Neural Networks were put on the back burner of computing and AI for a few decades. They were studied rigorously again in the 1980s and 90s, but researchers decided that the only interesting things they could do were things that could be done more simply with probabilistic models. Neural networks finally broke out of the background between 2009 and 2012 when Swiss researchers successfully applied recurrent neural networks and deep feed-forward neural networks to win a series of international pattern recognition and machine learning competitions.
Neural networks work by simulating the type of computation that biological neurons perform. You can think of a simulated neuron as having a set of inputs (analogous to the connections, or dendrites, coming to a biological neuron from other neurons); a function that converts those inputs to an output value (analogous to the body of the neuron); and a set of output connections that carries that output to some number of other neurons. Each output connection can be thought of as a signal with a particular strength, or weight. In a neural network, these units are usually organized into layers, which are frequently (but not always) two-dimensional arrays of interconnected neurons, and a number of layers are connected in ways that generate the type of output we are seeking. There are many different connection topologies within layers, many different layer architectures, and many different ways of “stacking” the layers. ChatGPT, which is built on GPT-3, uses something like 10 million neurons with about 175 billion connections arranged in about 400 layers.
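As a rough sketch of that neuron model (inputs, a function that turns them into an output, and weighted connections out), here is a tiny NumPy example of one layer of simulated neurons. The sizes and the tanh activation are arbitrary choices for illustration, not GPT’s actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(inputs, weights, biases):
    """One layer of simulated neurons: each neuron takes a weighted sum of
    its inputs and passes it through a nonlinear function (the 'body')."""
    return np.tanh(inputs @ weights + biases)

# 8 inputs feeding a layer of 4 neurons; the weights are connection strengths.
x = rng.normal(size=8)
w = rng.normal(size=(8, 4))
b = np.zeros(4)

print(layer(x, w, b))  # one output value per neuron, passed on to the next layer
```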
Training the neural network is done by feeding it lots of data and providing feedback in the form of a difference signal between the actual output and the desired output (at the final output layer, or sometimes at a layer close to it). The network then adjusts the weights across the entire network to try to reduce the distance between the actual output and the desired output. As it turns out, it takes roughly as many training runs as there are connections in the network to “fully” train it. In the case of GPT-3, that means about 175 billion words of training data.
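Here is a deliberately simplified sketch of that training loop for a single linear “neuron”: compare the actual output to the desired output, then nudge the weights in the direction that shrinks the difference. Real networks like GPT do the same thing at vastly larger scale, using backpropagation through many layers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: the "desired output" is a fixed linear function of the inputs.
true_weights = np.array([2.0, -3.0, 0.5])
X = rng.normal(size=(1000, 3))
y = X @ true_weights

weights = np.zeros(3)   # the weights we are training
learning_rate = 0.1

for step in range(200):
    actual = X @ weights             # the network's actual output
    error = actual - y               # difference signal from the desired output
    gradient = X.T @ error / len(X)  # which way to nudge each weight
    weights -= learning_rate * gradient

print(weights)  # ends up close to [2.0, -3.0, 0.5]
```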
GPT-3 was trained on about 500 billion words of text. (Technically, tokens, which differ from words a little, but not enough to matter for our purposes.) The sources of that data are described in the table below (from Wikipedia):

Dataset         Tokens         Weight in training mix
Common Crawl    410 billion    60%
WebText2        19 billion     22%
Books1          12 billion     8%
Books2          55 billion     8%
Wikipedia       3 billion      3%
Common Crawl is a dataset produced by the Common Crawl non-profit organization that contains over 3 billion web pages (as of October 2022). WebText2 is a corpus of the text of web pages linked from highly upvoted Reddit submissions. Books1 and Books2 contain digitized published books. You may notice that the data sets are weighted within the model in a way that isn’t proportional to their token counts.

Note: for a more detailed account of exactly how ChatGPT works, and an excellent overview of the underlying technologies, see this post by Stephen Wolfram: What Is ChatGPT Doing … and Why Does It Work?—Stephen Wolfram Writings
One of the things that neural networks are good at is generalizing patterns from very large but incomplete data sets, more quickly and sometimes more usefully than a brute-force algorithm can. And that’s what ChatGPT does – it is a neural network that generalizes patterns from a large corpus of text and iteratively predicts what word should come next given the previous words (including the text it has already generated). By doing so, it has reached a level of capability that, in many ways, looks like useful human-written text, though it still has some significant limitations.
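That iterative prediction loop is easy to sketch: feed in the prompt, take the model’s most likely next token, append it, and repeat, so each prediction is conditioned on everything generated so far. The `predict_next_token` function below is a hypothetical stand-in for the whole neural network, not OpenAI’s actual API:

```python
def predict_next_token(tokens):
    """Stand-in for the neural network. The real model returns a probability
    for every token in its vocabulary, conditioned on all the tokens so far;
    here we return a tiny fixed distribution just to make the loop runnable."""
    return {"the": 0.4, "cat": 0.3, "sat": 0.2, ".": 0.1}

def generate(prompt_tokens, length=10):
    tokens = list(prompt_tokens)
    for _ in range(length):
        probabilities = predict_next_token(tokens)              # look at everything so far
        next_token = max(probabilities, key=probabilities.get)  # pick the likeliest token
        tokens.append(next_token)                               # generated text becomes new context
    return " ".join(tokens)

print(generate(["the", "cat"]))
```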
In the next post, I’ll look at some of the problems with GPTs and LLMs generally, and some of the specific issues with OpenAI’s approach to ChatGPT (and the underlying GPT-3 and -4 models).
Update:
As I was editing this post, OpenAI released GPT-4 to the public. OpenAI has chosen not to reveal the specifics of GPT-4’s neural network implementation, but has promised that it will be “safer and more useful” than previous versions. Rumors, which OpenAI’s founder has denied, suggest that the model is on the order of 1,000x bigger than GPT-3.