ChatGPT is (NOT) a Blurry JPEG of the Web


When it comes to describing the capabilities and limitations of large language models (LLMs) like ChatGPT, the internationally renowned science-fiction author Ted Chiang suggests that ChatGPT should be regarded as a “blurry JPEG of the Web”. In this post, we’ll dive into what this analogy means, how it can help us appreciate the nature of LLMs, and where the analogy reaches its limits.

To start, Ted Chiang (author of Stories of Your Life and Others) describes how some modern Xerox photocopiers worked back in 2013. Rather than relying on the physical xerographic process, these machines would scan an image and compress it in a lossy manner, identifying areas of the image that are substantially similar and storing only a single instance of them. While this compression process was efficient and effective for most purposes, it could sometimes lead to errors: the photocopier would identify similarities even when there were small textual differences, effectively corrupting the content of the copied document.
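To make the idea concrete, here is a toy sketch in Python of what “storing only a single instance of substantially similar areas” can do to a document. This is only an illustration of the principle, not the actual compression scheme those photocopiers used: with a loose similarity threshold, two different glyphs get collapsed into one, and the copy comes out looking crisp but wrong.

```python
import numpy as np

def compress(blocks, threshold=3):
    """Keep one representative per group of 'similar enough' blocks."""
    dictionary, indices = [], []
    for block in blocks:
        match = next((i for i, rep in enumerate(dictionary)
                      if np.count_nonzero(rep != block) <= threshold), None)
        if match is None:
            dictionary.append(block)
            match = len(dictionary) - 1
        indices.append(match)
    return dictionary, indices

def decompress(dictionary, indices):
    return [dictionary[i] for i in indices]

# Two tiny 3x3 "glyphs" that differ by a single pixel (think of a 6 and an 8).
six   = np.array([[1, 1, 1], [1, 1, 0], [1, 1, 1]])
eight = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]])

page = [six, eight, six]
dictionary, indices = compress(page, threshold=3)
restored = decompress(dictionary, indices)

# With a loose threshold, the 8 is silently replaced by a 6:
# the output looks sharp, but the content has been corrupted.
print(np.array_equal(restored[1], eight))  # False
```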

Fast-forward to today’s LLMs, which Chiang regards as the result of running a lossy compression algorithm over the whole Internet. Rather than storing every single piece of information found on the web, LLMs identify statistical regularities in text and store them in a neural network. This allows users to query the LLM and retrieve information that’s similar (but not identical) to what they would find online.
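As a rough illustration of what “storing statistical regularities instead of the text itself” might look like, here is a deliberately tiny bigram model. Real LLMs learn vastly richer regularities inside a neural network, but the spirit is similar: the word counts act as a compressed stand-in for the corpus, and the text they regenerate is similar to, but not identical with, the original.

```python
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# The "compressed" representation: only next-word counts, not the text itself.
model = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    model[current_word][next_word] += 1

def generate(start, length=8):
    word, output = start, [start]
    for _ in range(length):
        candidates = model.get(word)
        if not candidates:
            break
        words, counts = zip(*candidates.items())
        word = random.choices(words, weights=counts)[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. "the cat sat on the rug" -- similar, not identical
```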

According to Chiang, the resulting output is like a “blurry JPEG of all the text on the Web” — it may not be an exact replica, but it’s close enough to be useful. However, just as lossy compression can introduce compression artifacts into images, the textual output of LLMs may also feature some sort of compression artifacts: hallucinations that are plausible enough to be convincing, but not entirely accurate.

Yet, while the compression artifacts in a JPEG are easy to recognize, to the extent that they are visible to the eye, identifying the compression artifacts of LLMs like ChatGPT (i.e. their hallucinations) requires comparing their output against the originals, or against our own knowledge.

Ted Chiang compares the process of interpolation in LLMs to the process by which image editing software reconstructs the pixels lost during compression: “estimating what’s missing by looking at what’s on either side of the gap”. He concludes that ChatGPT can be analogized to a “blur tool” for text rather than photos, but should not be regarded as anything more than that.
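Chiang’s “blur tool” intuition can be pictured with the simplest possible form of interpolation: a missing value estimated from what sits on either side of the gap. This is only an analogy for how LLMs fill in text, not a description of their actual mechanics.

```python
import numpy as np

# A row of pixel values with one missing sample in the middle.
row = np.array([10.0, 20.0, np.nan, 40.0, 50.0])

gap = np.isnan(row)
# Estimate the gap from its known neighbours (linear interpolation).
row[gap] = np.interp(np.flatnonzero(gap), np.flatnonzero(~gap), row[~gap])

print(row)  # [10. 20. 30. 40. 50.] -- plausible, but a guess, not the original
```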

One of the most impressive features of ChatGPT is its ability to paraphrase information rather than simply regurgitating exact quotes. In particular, the fact that GPT doesn’t memorize information but can generate novel responses for every query could lead to the belief that it has acquired some degree of understanding of the information it has ingested, and of the underlying structure of language. Yet, when ChatGPT starts hallucinating (e.g. when asked to perform complex arithmetic operations), it becomes clear that it does not have a proper understanding of most of the things it says.

All in all, Chiang’s analogy provides a powerful framework for understanding the nature of LLMs like ChatGPT. By regarding these models as a lossy compression of the Internet, we can better appreciate their strengths (e.g., generating novel responses) and limitations (e.g., hallucinations). For instance, we wouldn’t want ChatGPT to serve as a reliable source of information if it’s prone to hallucinations and can’t accurately recall specific facts. Moreover, the blurry JPEG analogy also underlines the dangers of training LLMs on their own generated output (akin to repeatedly re-compressing a JPEG), which raises concerns about self-reinforcing biases and errors.
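The “copy of a copy” worry is easy to reproduce with an actual JPEG. The sketch below (using Pillow, with a placeholder filename) decodes and re-encodes an image over and over; after many generations the result is noticeably blurrier and blockier than the original, which is the visual counterpart of a model drifting as it trains on its own output.

```python
from io import BytesIO
from PIL import Image

# Placeholder filename -- any input image will do.
image = Image.open("photo.png").convert("RGB")

for generation in range(20):
    buffer = BytesIO()
    image.save(buffer, format="JPEG", quality=60)  # lossy encode
    buffer.seek(0)
    image = Image.open(buffer).convert("RGB")      # decode the degraded copy

# The twentieth-generation copy carries the accumulated compression artifacts.
image.save("photo_generation_20.jpg")
```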

However, by reducing LLMs to mere lossy compressions of data, Chiang neglects some of the underlying functionalities of these systems. In reality, when ingesting information, LLMs create a vast latent space of possibilities, which can manifest as an indefinite number of textual representations. Some of these might be accurate representations of the original content, while others might be mere hallucinations.

Yet, by considering all hallucinations to be bad, Chiang overlooks the creative potential of LLMs. Just as science-fiction authors must imagine alternative worlds, societies and characters in their minds before putting words on paper, LLMs can generate stories by hallucinating new contexts, personalities, and conversations through a creative re-interpretation of their training data. These creative hallucinations are neither an accurate nor an inaccurate description of the original content; rather, they are novel responses based on patterns learned during training. Hence, in a way, the hallucination capacity of an LLM can be compared to the imaginative capacity of humans: it is good when used in the right context (e.g. generating a compelling narrative), and bad when used in the wrong context (e.g. producing misinformation).

Ultimately, the capacity of LLMs to generate content that extends beyond the scope of their original training data is fundamentally different from mere lossy compression. Rather than being confined to regurgitating stored knowledge, LLMs can expand far beyond the information they have been fed, much like humans do when they use their imagination to create new stories or ideas. Therefore, instead of referring to LLMs as a “lossy compression” algorithm, perhaps a better analogy would be that of a “lossy expansion” algorithm.

by Primavera De Filippi