Do stochastic parrots understand what they recite?

Never before in the history of Artificial Intelligence (AI) has the field been this democratic and this celebrated. With the advent of deep learning, big data, and high-performance computing, things that seemed impossible just a decade ago can now be achieved with a few lines of code. The leap has been greatest in Natural Language Processing (NLP). This surge began with the arrival of the Transformer, proposed in the paper “Attention is all you need” by Vaswani et al. [1]. This article discusses some of the strengths and limitations of transformer-based language models.


Before we dig deeper, a quick word on how NLP arrived at its present form, often called Neural NLP. Natural Language Processing is the field of AI that deals with processing human language: understanding, analyzing, and generating text to get useful tasks done. Like other traditional AI systems, NLP began as expert systems with painstakingly designed rule sets and inference engines. Computer scientists soon realized that human languages are too complex to be represented effectively by hand-crafted rules. Since the 1990s, Statistical NLP has relied on statistical methods and machine learning algorithms to learn patterns, or rules, from selected linguistic features, which allowed these models to generalize to unseen data. However, the need for experts to design those features kept the models from moving beyond specific domains until deep learning (DL) came around.

DL models are neural networks that learn to extract the relevant features from raw text themselves, thereby replacing the role of the expert. As the name indicates, deep neural networks stack many hidden layers, from tens to hundreds, and each layer can have thousands of neurons. These models outperformed traditional models on a wide variety of tasks, and even humans on some benchmarks. Though this may come as a surprise to a casual user, mathematicians would know better: an arbitrary non-linear function with 175 billion learnable parameters, the size of GPT-3, can probably be fit to any training data. How such models generalize so well to unseen data has been a topic of study for ML researchers for quite some years.
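To make the remark about fitting arbitrary data concrete, here is a minimal sketch in plain Python. It is not how neural networks are trained. As a stand-in assumption, it uses a polynomial with as many coefficients as there are training points (Lagrange interpolation) and fits labels that are pure random noise. The function name `lagrange_predict` and the toy data are illustrative only.

```python
import random


def lagrange_predict(xs, ys, x):
    """Evaluate at x the unique degree-(n-1) polynomial through the n points (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


rng = random.Random(0)
train_x = [float(i) for i in range(8)]
# The labels are random noise: there is no pattern to learn here.
train_y = [rng.uniform(-1.0, 1.0) for _ in train_x]

# Largest gap between the fitted curve and the noisy training labels.
train_error = max(
    abs(lagrange_predict(train_x, train_y, x) - y)
    for x, y in zip(train_x, train_y)
)
print("max training error:", train_error)

# Away from the training points, nothing constrains the prediction.
print("prediction at x = 7.5:", lagrange_predict(train_x, train_y, 7.5))
```

The model reproduces every training label even though the labels carry no signal, so a perfect fit on training data says nothing by itself about generalization.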

Data for Large Language Models

DL models owe their success not only to the size of their neural architectures but also to the massive amounts of data they have been fed. BERT (Bidirectional Encoder Representations from Transformers) [3], one of the earliest and most influential of the transformer-based models ruling NLP today, was trained on about 2.5 billion words of English Wikipedia combined with 800 million words from the BooksCorpus, and its larger variant has about 340 million parameters. The GPT-3 model from OpenAI was trained on data drawn from about 45 terabytes of text from the internet and books. It has seen more text than any human will ever read in a lifetime. Its ability to mimic human language and style is impressive and scary at the same time. Consider the article published in The Guardian: “A robot wrote this entire article. Are you scared yet, human?”. It was written by the generative language model GPT-3 and has led to much debate in the NLP community.

The publishers claim that the article needed no more editing than human-authored content usually does. But did GPT-3 really mean all the things it wrote about humankind?

Stochastic Parrots and the Octopus test

Recently, many computational linguists have voiced their disagreement with the hype and media attention that Large Language Models (LLMs) have received in the past few years. An interesting example is “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data” by Emily Bender and Alexander Koller [4]. The authors, professors at the University of Washington and Saarland University respectively, are renowned researchers in Computational Linguistics. In a later paper [5], Bender and her co-authors describe these LLMs as stochastic parrots: the models have been trained just to respond plausibly in a given context, without actually “understanding” anything they generate.

The authors formulated an interesting thought experiment called “The Octopus Test” [4]. The test is meant to show that meaning cannot be learned from form alone. Form is the observable, physical part of a language, such as the text on a page, while meaning relates that form to things in the external world. The training corpus of these models contains only form, i.e., language text.

Say two people, A and B, are independently stranded on two different uninhabited islands. They discover an abandoned telegraph line and start communicating with each other over it. A highly “intelligent” octopus, O, taps into the wire and starts listening to their conversation. Though it does not know English and has never been on land, it is still capable of identifying statistical patterns in their communication. Soon it learns enough to predict B’s responses to A’s messages. One day, the octopus cuts the wire and inserts itself into the conversation, replying to A’s messages in B’s place. Will A be able to tell that it is O, not B, on the other end of the wire? This test is weaker than the original Turing test, as A has no reason to suspect that anything is amiss on the other end.

The answer depends on the kind of task the octopus has to handle. For example, suppose A builds a new device, a “coconut catapult”, and shares its details with O (impersonating B). Even though O has no idea what “coconut” or “rope” means, it will be able to respond, “Cool idea! Great job!”. O can identify the context (a new invention) from earlier conversations and treat coconuts like mangoes and ropes like nails, because those words appeared in similar contexts. A would take this as a meaningful reply, but only because A attributes meaning to these words, not because O understood what it said.

In a different scenario, A is chased by a bear and asks B for advice on building a weapon from some sticks. There is no way O can be of any help to A here. To give a meaningful reply, O would have to know how these objects relate to each other in the real world.

Are you wondering how GPT-2 [2] would have replied? Well, it said, “Take one stick and punch the bear, and then run faster to the store. Wait there until the bear is gone, and then give her all of the sticks. Then go back inside and get your gun.” That would not be very helpful to A, would it?
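The octopus's strategy in the thought experiment can be caricatured in a few lines of Python. This is a toy sketch under strong assumptions, not a language model: the hypothetical `Octopus` class simply remembers overheard message and reply pairs and answers a new message with the reply whose original message shares the most words with it. The conversation snippets are invented for illustration.

```python
import re


def tokens(text):
    """Lowercased set of words: the only thing the octopus ever observes is form."""
    return set(re.findall(r"[a-z]+", text.lower()))


class Octopus:
    """Replies by surface word overlap with overheard messages, with no notion of meaning."""

    def __init__(self):
        self.memory = []  # (word set of A's message, B's reply) pairs heard on the wire

    def listen(self, message, reply):
        self.memory.append((tokens(message), reply))

    def respond(self, message):
        words = tokens(message)
        # Reuse the reply to the overheard message that looks most similar.
        best_words, best_reply = max(self.memory, key=lambda pair: len(words & pair[0]))
        return best_reply


octopus = Octopus()
octopus.listen("I built a new fishing net from vines", "Cool idea! Great job!")
octopus.listen("Good morning, how are you", "Doing fine, how about you?")
octopus.listen("It rained all night here", "Same here, everything is soaked.")

catapult_reply = octopus.respond("I built a new coconut catapult from rope")
bear_reply = octopus.respond(
    "Help! A bear is chasing me, how do I make a weapon from these sticks?"
)
print(catapult_reply)
print(bear_reply)
```

On this toy data the same congratulation comes back in both cases. It reads as a sensible answer to the catapult message and as a useless one to the bear message, yet from the octopus's side nothing distinguishes the two.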

Limitations of LLMs

One might feel that the Octopus test is unfair to a machine, and one can argue that this human level of understanding is not necessary for many of the tasks these models are built to do. However, the concern the authors raise is different: there is no way a DL model can pick up something that is not in its training data in the first place. Can a DL model trained only on program code, with no input-output pairs, be made to predict the expected output of a sample program? The paper is an interesting read and strongly recommended.

The second concern is about the architecture itself. BERT and its cousins are based on the Transformer, which is all about attention. Using attention mechanisms, the models learn which words in the input to focus on. A recent study [6] showed that such a model dropped to chance-level performance on an argument comprehension task when the data was slightly altered without affecting its interpreted meaning. The authors argue that these alterations removed some of the linguistic cues in the data that had been helping the model perform. So were these models cheating all along by learning only the cues?
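For readers who have not met it, the attention operation from Vaswani et al. can be sketched in plain Python. This is a minimal, single-head version of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, without the learned projections, multiple heads, or masking that real models add. The toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        output = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[4.0, -4.0]]  # points toward the first key and away from the second

outputs, weights = attention(queries, keys, values)
print("attention weights:", weights[0])
print("output vector:", outputs[0])
```

The weights sum to one and concentrate on the key most aligned with the query. Nothing in the operation checks whether the word being attended to matters for the meaning, only whether it helped reduce the training loss.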

The third concern is about the evaluation itself. Are our tasks and benchmark datasets good enough to accurately evaluate these models’ understanding? LLMs could be picking up idiosyncratic patterns in the data for their tasks and merely reproducing the distribution of linguistic forms. As long as the distribution of the training data matches that of the test set, these models can show high accuracy. However, we cannot say the same about real-world language.
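The effect of matching and mismatching distributions can be illustrated with a deliberately naive classifier. The sketch below is a toy under stated assumptions, not a reproduction of any published experiment: the hypothetical `train_cue_classifier` picks the single word whose presence best predicts the label on the training set, and the tiny datasets (label 1 for a negative statement, 0 otherwise) are invented.

```python
def predict(cue, text):
    """Predict label 1 exactly when the cue word appears in the text."""
    return 1 if cue in text.split() else 0


def accuracy(cue, examples):
    correct = sum(predict(cue, text) == label for text, label in examples)
    return correct / len(examples)


def train_cue_classifier(examples):
    """Return the single word whose presence best predicts label 1 on the training set."""
    vocab = sorted({word for text, _ in examples for word in text.split()})
    return max(vocab, key=lambda word: accuracy(word, examples))


train = [
    ("the plan will not work", 1), ("the plan will work", 0),
    ("rain is not expected", 1), ("rain is expected", 0),
    ("this is not allowed", 1), ("this is allowed", 0),
    ("they did not agree", 1), ("they did agree", 0),
]
# Same distribution as training: the word "not" still lines up with the label.
matched_test = [
    ("we will not go", 1), ("we will go", 0),
    ("it is not open", 1), ("it is open", 0),
]
# Shifted distribution: the cue appears equally often with both labels.
shifted_test = [
    ("we will not go", 1), ("we will go", 0),
    ("the plan will fail", 1), ("the plan will not fail", 0),
]

cue = train_cue_classifier(train)
print("learned cue:", cue)
print("accuracy, matched test:", accuracy(cue, matched_test))
print("accuracy, shifted test:", accuracy(cue, shifted_test))
```

The classifier looks perfect while the test set shares the training set's quirk and falls to coin-flip accuracy once the cue stops correlating with the label, even though a human reader would find the shifted sentences just as easy.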

The innovation and excitement that LLMs have brought to the AI world are, beyond doubt, exceptional. I do not intend to downplay their contribution, but we NLP researchers need to be careful not to miss the forest for the trees. Language is much more than mere data.

References:

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.
[2] Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI blog 1.8 (2019): 9.
[3] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
[4] Bender, Emily M., and Alexander Koller. “Climbing towards NLU: On meaning, form, and understanding in the age of data.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
[5] Bender, Emily M., et al. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?🦜.” Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021.
[6] Niven, Timothy, and Hung-Yu Kao. “Probing neural network comprehension of natural language arguments.” arXiv preprint arXiv:1907.07355 (2019).

Do stochastic parrots understand what they recite? was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.
