Why Most AI Writing Can’t Get Its Facts Straight

It’s been almost a year since OpenAI, the San Francisco lab co-founded by Elon Musk, released Generative Pre-trained Transformer 3 (GPT-3), the language model that can produce astoundingly coherent text with minimal human prompting. That is enough time to draw some conclusions on whether its brute-force approach to artificial intelligence can in time allow most writing to be delegated to machines. In my current job at Bloomberg News Automation, I’m in the business of such delegation, and I have my doubts that the trail blazed by GPT-3 leads in the right direction.

In these past months, lots of people have tested GPT-3, often with surprising results like these fake Neil Gaiman and Terry Pratchett stories or these “Dr. Seuss” poems about Elon Musk — or these perfectly readable newspaper columns, clearly published by editors both in awe of the new technology and relieved that AI wouldn’t be taking away their jobs any time soon.

It’s taken me a while to figure out what all these GPT-3 products resemble, and now I know: A monologue from the classic play by Nikolai Gogol, “The Inspector General.” The central character, a complete nonentity named Ivan Khlestakov, arrives in a provincial town and is taken by its elite for a high-ranking government inspector about to conduct a secret investigation into their shady affairs. Khlestakov, fired up by the red carpet treatment, the free-flowing champagne and the attentions of the town’s eligible ladies, lets loose a self-aggrandizing tirade (here in Arthur Sykes’ translation):

On one occasion I took charge of a Department. It was a funny story: the Director went off somewhere— nobody knew where. So, naturally, people began to ask how was his place to be taken? Who was to fill it? Any number of generals coveted the post and tried it, but they soon gave the thing up—too difficult for 'em! It looked easy enough, but, on closer inspection, it proved a devil of a business! There was nothing to be done, but come to me. In a twinkling the streets were choke-full of couriers, couriers after couriers. Just picture to yourselves thirty-five thousand couriers! How's that for a situation, I ask you?

Nobody could say Khlestakov’s boast was incoherent or grammatically incorrect; it was just buffoonishly untrue in every detail. I was reminded of this when testing the Bloomberg version of GPT-3, trained on a body of text produced by our thousands of writers. Fed the first sentence of this column by my Opinion colleague Anjani Trivedi, dealing with the semiconductor shortage, the model continued the narrative by citing a report by “Mark A. Rosenbaum, director of research and the president of Computer Strategies Consulting and author of The Chip Shortage”; to the best of my knowledge, a firm called Computer Strategies does exist, but both the person and the book are fictitious. The key to Rosenbaum’s report, according to GPT-3, “is something called the ‘R&D capitalization method,’ which he says allows you to ‘make an apples to apples look like apples to oranges comparison.’” The model went on:

While the details for the numbers are a little fuzzy, Rosenbaum has made several calculations, based on those numbers, that are all fairly conservative and suggest that the chip shortage is real and very real. What I want to take a moment to say here is that if you own shares of anyone you care about, you should read this analysis and be sure to discuss it with your clients. Because while Intel, Apple and Qualcomm are probably going to take a beating, they may end up having a good week.

The text generator’s political commentary is no less colorful and imaginative. Given the first line of this column by Zev Chafets about Palestinian Authority President Mahmoud Abbas, it proceeded to report the appointment of Hillary Clinton as the Palestinian Secretary of State “under a presidential nomination system that had been ruled unconstitutional more than a decade ago.” Abbas, GPT-3 added, “threatened to call Clinton’s father, former president Jimmy Carter, ‘one of those thugs who put the Jews in prison.’”

The AI model imbibed billions and billions of lines of text to mature as an artificial Khlestakov. Its capacity for invention — or let’s be tech optimists and call it imagination — appears to exceed that of many humans; the Abbas-Clinton-Carter connection is certainly beyond my modest imaginative powers. That’s why GPT-3 can be good at literary parody, a genre that requires a well-developed sense of the absurd. Nothing can develop that quality better than an inordinate amount of chaotic reading, which is the method used to train models such as GPT-3.

What the most spectacular GPT-3 products prove is that pure literary creativity, especially the derivative kind, is fungible. Surprisingly, the flight of fancy is the easiest part of writing to hand over to a machine; just train it on more obscure style and content examples than the work of Gaiman or Dr. Seuss, and few people will wince at its poetry published in literary journals or its paperback fantasy or science fiction — as long as these contributions are carefully edited for traces of bias that “stochastic parrots” like GPT-3 can inherit from the data used to train them.

I could even imagine some heir to GPT-3 being used by news organizations or, say, Substack writers to produce opinion columns. A lot of these (though none written by my Bloomberg Opinion colleagues) are relatively predictable: You more or less know in advance what a specific writer will say on any issue. So if a language model is trained on a specific columnist’s body of work, you might get a well-honed engine that can opine on anything in that writer’s voice given just the first line. Again, the output would need an edit to avoid reputation-killing errors. But if a columnist gets something wrong, hey, in the end it’s just an opinion and everybody’s got one. The ritual column, which readers scan to be stroked or triggered and the columnist writes to put in their obligatory two cents, is a clear use case.

Paradoxically, it’s the most technical, formulaic stories (those dealing with market signals, deal announcements, statistical releases) that a GPT-3-like engine can’t be trusted to handle. No matter how often we repeat that it’s a text engine, not a knowledge one, text is always only a means to an end: It always delivers a message, imparts knowledge, even if the engine is only trying to create coherent sentences based on a statistical model. In news automation, voice and style, which a well-trained model is demonstrably able to imitate, are not needed, but it’s important to rule out invention, minimize interpretation and stick to the data from which the story is built. People, and sometimes robots, make trading decisions based on these stories, and an error in a potentially market-moving story can be costly. We can’t use a “stochastic parrot,” an AI Khlestakov (or, to be more generous, a fount of derivative creativity) to produce this kind of text. As GPT-3’s developers from OpenAI have pointed out:

In the long term, as machine learning systems become more capable it will likely become increasingly difficult to ensure that they are behaving safely: the mistakes they make might be more difficult to spot, and the consequences will be more severe.

To minimize the potential for errors, the OpenAI team showed that excellent results can be achieved when the model is trained with human feedback: Human labelers rate the outputs to tell the models which ones are acceptable and which are not. The example used in the OpenAI paper was summarizing Reddit posts, but theoretically, it could be applied to factual, data-based stories, too. Yet the amount of human labor necessary to train the model so it never strays from the facts and draws safe and relevant conclusions from them is much greater than the amount of work it takes to write a simple program that would produce the text based on a set of rules. Brute-forcing the task also requires considerable computing resources and consumes a fair amount of energy. Replacing the human labor of coders writing simple story scripts with the human labor of labelers plus the necessary processing power may not be worth it.
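To make the comparison concrete, here is a minimal sketch of what "a simple program that would produce the text based on a set of rules" might look like. It is an illustration only, not a description of any system Bloomberg actually runs: the `EarningsRelease` fields, the tolerance value, the sentence template and the company in the usage note are all hypothetical. The point is that every word in the output is either fixed in advance or copied from the input data, so there is nothing for the program to invent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EarningsRelease:
    """Hypothetical structured input; the field names are illustrative only."""
    company: str
    quarter: str
    eps: float           # reported earnings per share, in dollars
    eps_estimate: float  # analyst consensus estimate, in dollars


def describe_surprise(actual: float, estimate: float, tolerance: float = 0.005) -> str:
    """Map the gap between the actual figure and the estimate to a fixed verb."""
    diff = actual - estimate
    if abs(diff) <= tolerance:
        return "matched"
    return "beat" if diff > 0 else "missed"


def write_story(release: EarningsRelease) -> str:
    """Fill a fixed sentence template using only values from the release."""
    verb = describe_surprise(release.eps, release.eps_estimate)
    return (
        f"{release.company} reported {release.quarter} earnings of "
        f"${release.eps:.2f} per share, which {verb} the analyst estimate of "
        f"${release.eps_estimate:.2f}."
    )
```

Calling `write_story(EarningsRelease("Example Corp", "first-quarter", 1.25, 1.20))` on this made-up release yields a single sentence saying the company beat the estimate. The sentence is dull by design: a script like this cannot cite a nonexistent analyst or appoint anyone Secretary of State, because it has no vocabulary beyond its template and its inputs.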

If AI is the future of writing, I certainly hope it’s not the kind of AI that needs to burn the equivalent of a coal mine as it ingests hundreds of gigabytes of data and then uses dozens of exhausted workers on minimum wage to label outputs to complete its training. Gogol’s play closes as a real government investigator arrives and Khlestakov’s moment of glory ends abruptly amid stunned silence; it’s unlikely he needs much training never to accept free drinks in a similar situation again. Humans are, in general, flexible and capable of learning from their mistakes; they can be held responsible for their errors, and those who like writing can sometimes produce truly original work, something today’s AI is unable, and not even really trying, to do. And humans who earn their living by writing aren’t begging to be replaced.

Let’s accept that even the discussion of a text-generating AI as a competitor to humans is an astounding development. The progress made in this area in recent years is undeniable. But whether humans can be outcompeted when it comes to writing remains an open question. At least with existing techniques, an AI victory in this race is unlikely.

(Bloomberg)