The latest version of ChatGPT (based on GPT-4) is about a week old, and already something funny is happening.
A Good Start
People are finding some pretty cool ways to use ChatGPT to help with their normal work.
One user gave ChatGPT a description of his company’s mission and a rough job description; GPT produced a polished, formal version, which was posted to a job board minutes later.
Another user had GPT interview him and write content based on the interview.
One friend had it draft a wealth management plan.
Another had it build the financial policy manual for his company, including lists of reports and deliverables, deadlines, and cost estimates.
People have figured out that they can get GPT to change its writing style by providing examples, and that they can even have it write legal documents.
Several people have had it draft marketing emails and web site copy to sell their products.
For the first few days, it seemed the sky was the limit with this amazing new tool.
And Some Frustration
In the last couple of days, I’ve increasingly been hearing about problems. Users are getting frustrated with the tool misbehaving and doing unexpected things.
In one case, a user who was having it create budgets and financial plans found that it started changing hourly rates and budget lines on its own. When confronted, it apologized profusely for the mistake.
Other users have been frustrated that it seems to change its mind if you ask it the same question more than once.
One guy was puzzled when he asked it to cut and paste some text into a new response and it wrote a new paragraph on the same topic instead.
And of course, some users have tricked it into doing nonsensical things – like writing a set of instructions for patching a hole in the wall of a house made entirely of cheese.
What’s Happening?
What’s happening here? Is GPT misbehaving? Making mistakes? Deliberately undermining its users?
Without getting into the technical explanation for each of the types of “errors” above (and there are relatively simple explanations based on how GPT works for each of them), I think there is a meta-error happening. And we, the users, are the ones making it.
We have been so impressed with our initial experiences that we are treating GPT as something like an experienced Executive Assistant and expecting that level of performance from it. It is performing well enough that when it misses the mark, we are getting frustrated with it in the same way we would if it were human.
The problem is, GPT isn’t built to be an Executive Assistant. It is built to be a language model. What that means is that it does one thing very well – it predicts which word should come next, given the previous couple thousand words, based on patterns learned from a corpus of about 500 billion words of written language (for more on how it does that, see my previous blog post here).
It’s worth repeating for emphasis: the only thing GPT does is generate convincing-looking text by statistically predicting which word should come next.
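To make that concrete, here is a minimal sketch of what “predicting the next word” looks like in code. ChatGPT’s own model isn’t something you can run yourself, so this uses GPT-2 (an earlier, openly available model in the same family) through the Hugging Face transformers library as a stand-in, and the prompt is just an illustrative example I made up.

```python
# A minimal sketch of next-token prediction, using GPT-2 as a stand-in
# for ChatGPT's model. Assumes the Hugging Face `transformers` library
# and PyTorch are installed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The budget for the marketing department next quarter is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # scores for every token at every position
next_token_logits = logits[0, -1]        # scores for whatever comes next
probs = torch.softmax(next_token_logits, dim=-1)

# The model's entire job: rank every possible next token by probability.
top = torch.topk(probs, k=5)
for token_id, p in zip(top.indices, top.values):
    print(f"{tokenizer.decode(token_id.item())!r}: {p.item():.3f}")
```

Everything ChatGPT produces, from job descriptions to financial plans, comes from repeating that single step over and over, one token at a time.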
What it doesn’t do is… everything else. It doesn’t build plans. It doesn’t apply logic. It doesn’t understand meaning. It doesn’t understand consequences. It doesn’t do rigorous research. It doesn’t formulate new ideas.
But GPT’s text generation model is good enough that it looks like it does those things. The training data includes many, many examples of those types of writing, so it produces output that is convincing and might even be reasonably accurate. But any accuracy or inaccuracy GPT provides is a side effect of what is in the training data.
So, if you ask GPT for a financial plan, it generates text that looks like a financial plan. If you ask it to explain the logic of Zeno’s Paradox, it generates text that looks like an essay on that topic. If you ask it to explain what it means to be alive, it generates text that looks like an answer to that question. If you ask it to provide references on a certain topic, it generates text that looks like a list of references (some of them are decent references, and some of them are rubbish).
What we are seeing is a disconnect between what we expect GPT to do and what it is actually capable of doing. This disconnect seems to be present in large part because GPT sounds so human that we expect it to behave like a human. GPT may be a victim of its own success. But it turns out that humans do a lot more than generate streams of words that sound human. We are expecting way too much from GPT.
It is important to understand what GPT does, and what it doesn’t do. The reactions of some pretty smart people experimenting with the tool suggest that together, we (the users and OpenAI) haven’t done well at communicating the limitations of GPT. I hope we become more educated users soon – before we start treating GPT as if it were an authority on any important topic.
Perhaps GPT has passed the Turing Test, at least well enough that we are beginning to treat it as if it were an underperforming co-worker. It will be interesting to see what happens over the next few months.