For decades scientists, linguists, and technologists have been working to build artificial intelligence (AI) that is powerful enough to parse, interpret, and create language indistinguishable from the way humans naturally write and speak. In 2020, a major breakthrough in the field of natural language processing (NLP) promised to finally enable human-like output from machines.
The third iteration of OpenAI’s Generative Pre-Training (GPT) language generator, released earlier this year, “achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic,” according to an article co-authored by its scientists.
OpenAI’s success marks a pivotal moment in the evolution of NLP, which has been a focus of computer science ever since Alan Turing proposed using a real-time conversation between a human and a machine as the basis of a test of AI some 70 years ago.
With the advent of GPT-3, application developers now have access to an NLP system with near-human capabilities, and that in turn creates the possibility of widely available human-machine language interfaces.
The success of GPT-3 derives mainly from the adoption of a neural network architecture called a transformer, which is radically different from earlier techniques. Transformers provide a representation of context that addresses some of the most vexing problems in NLP, bringing machine language processing meaningfully closer to human-level performance.
But to understand the impact transformers will have on the world of NLP and the potential applications that will derive from them, it’s useful to look back on how we got here.
Linguistic approaches & early NLP
Over the past six decades, a number of different approaches to NLP have been tried. Language models have been used to develop question-answer systems, perform named-entity recognition, and make natural language inferences.
Three broad approaches to NLP that have been widely applied include linguistic models, statistical models, and, most recently, deep learning, which has been the most successful by far.
In the early days of NLP, computer scientists turned to linguists for frameworks they could use to process language. Linguists postulated rules, known as a grammar, that dictated how to organize words in a sentence and make the meaning accessible to a machine.
Early grammars had large numbers of rules hand-coded into parsers, but those parsers were limited in the scope of language they could process. One response to the unsolved problems of rule-based approaches was to try a more data-driven approach.
One early data-driven attempt to improve NLP used data to discover probability distributions and parse tree structures. This approach became known as statistical natural language processing. But like linguistic approaches, statistical NLP was constrained by the need for expert input.
A third approach, based on deep learning, avoids the need for handcrafted rules or annotated data sets. Instead, it is driven by examples of language as it is typically used. This approach emerged in the mid to late 2000s thanks in large part to advances in algorithms that could be used to train deep neural networks.
Representing words with numbers
Prior to their use in natural language processing, deep learning networks had been successfully applied to a wide range of AI problems. But their adoption within NLP applications has accelerated the rate at which machines can understand and create new language.
The first step toward enabling neural networks to understand language is to transform words into some kind of numerical representation that can be used by those networks. That representation would capture significant features of words, especially meaning. A commonly used NLP technique, called embedding, now maps words to a set of numbers, which is known as a vector.
In neural networks, the vectors are learned by analyzing words in the context of other words. For example, to find a representation for the word toy, embedding procedures analyze large numbers of example sentences with the word toy in them, such as “Alice gave a toy to the child.”
Two algorithms, Word2Vec and GloVe, are highly effective at creating embeddings that encode features of words. One limitation of these algorithms is that words with multiple meanings have a single numeric representation. The word “bank” can refer to a financial institution or the side of a river, but there is no way to tell from an embedding which meaning is appropriate for a particular context.
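A toy sketch illustrates what an embedding gives you. The four-dimensional vectors below are invented for illustration; real Word2Vec or GloVe vectors have hundreds of dimensions learned from large corpora:

```python
import math

# Made-up toy embeddings; real vectors are learned, not hand-written.
embeddings = {
    "toy":  [0.9, 0.1, 0.0, 0.2],
    "doll": [0.8, 0.2, 0.1, 0.3],
    "bank": [0.1, 0.9, 0.5, 0.0],  # one vector, whether the context
                                   # is money or rivers
}

def cosine_similarity(u, v):
    """Similarity of two word vectors: 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# Semantically related words end up with higher similarity.
print(cosine_similarity(embeddings["toy"], embeddings["doll"]) >
      cosine_similarity(embeddings["toy"], embeddings["bank"]))
```

Note that “bank” gets exactly one vector here regardless of context, which is the single-representation limitation described above.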
Language modeling with neural networks
In addition to mapping words to a numeric representation, neural networks must be designed to analyze sentences of arbitrary length. Recurrent neural networks (RNNs) are well-suited to this task. RNNs implement a memory mechanism with connections to previous states. This allows for sequential processing of each word in the sentence while capturing the context of previously processed words.
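That recurrence can be sketched minimally as follows, using scalar weights and one number per word for readability (real RNNs use weight matrices and vector-valued states):

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    # The new hidden state mixes the current input with the previous
    # state, squashed by tanh, so context flows forward word by word.
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0           # initial state carries no context
    states = []
    for x in inputs:  # one step per word, in sentence order
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Each "word" here is pre-encoded as a single number for simplicity.
sentence = [0.2, -0.4, 0.9]
states = run_rnn(sentence)
print(len(states) == len(sentence))  # one hidden state per word
```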
Simple RNNs can be difficult to train because of a problem known as the vanishing gradient. Neural networks are trained by adjusting weights in the network based on the error in network outputs. Every neuron that participated in predicting an incorrect output has its weights adjusted by a factor that gets smaller the farther back in the network one goes.
In the case of RNNs, that includes neurons all the way back to the start of the sequence. As the weight adjustment approaches zero, it becomes too small to effectively train the network.
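The effect is easy to see numerically: each backward step through the recurrence multiplies the gradient by a per-step factor (the recurrent weight times the activation’s derivative), and any factor below 1 shrinks it geometrically. The 0.8 below is an arbitrary illustrative value:

```python
# tanh's derivative is at most 1, so per-step factors below 1 are common.
per_step_factor = 0.8  # illustrative value, not from a real network

gradient = 1.0
for step in range(50):          # backpropagating 50 words into the past
    gradient *= per_step_factor

print(gradient < 1e-4)  # the signal is effectively zero after 50 steps
```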
The long short-term memory (LSTM) neural network architecture uses local memory to store prior state information. This helps avoid vanishing gradients, but the trade-off is long training cycles. Also, there is a limit to how far back LSTMs can maintain information about connections.
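A scalar sketch of an LSTM cell shows the gating that protects that stored state (the per-gate weights are set to 1.0 purely for illustration; real cells use learned weight matrices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    # w holds one scalar weight per gate, keyed by gate and input.
    f = sigmoid(w["f_x"] * x + w["f_h"] * h_prev)  # forget gate
    i = sigmoid(w["i_x"] * x + w["i_h"] * h_prev)  # input gate
    o = sigmoid(w["o_x"] * x + w["o_h"] * h_prev)  # output gate
    c_tilde = math.tanh(w["c_x"] * x + w["c_h"] * h_prev)
    c = f * c_prev + i * c_tilde  # cell state: the "local memory"
    h = o * math.tanh(c)          # hidden state passed to the next step
    return h, c

# Run the cell over a short sequence; c carries longer-range memory
# than the hidden state h alone.
w = {k: 1.0 for k in ("f_x", "f_h", "i_x", "i_h", "o_x", "o_h", "c_x", "c_h")}
h, c = 0.0, 0.0
for x in [0.2, -0.4, 0.9]:
    h, c = lstm_step(x, h, c, w)
print(-1.0 < h < 1.0)
```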
The limitation was overcome by the introduction of a new sequencing technique called a transformer, which significantly improved the processing ability of the networks and became the basis of NLP advances that led to the launch of GPT-3.
Rise of the transformers
RNNs and LSTMs advanced the state of the art in natural language processing. But it was a paper entitled “Attention Is All You Need” that introduced the concept of the transformer and enabled a step-change in the amount of language information machines could parse and understand.
Transformers are radically different from early approaches to NLP. As a result, they have effectively addressed some of the most vexing problems in natural language processing and the effectiveness of these advances is apparent in the language models now being created.
In transformers, all words are processed in parallel. This is called bi-directional processing, although non-directional may be a more appropriate description.
Transformers also implement a mechanism called self-attention, which computes relationships between words in a sentence. This provides the means to identify words in the sentence that are semantically related. For example, a pronoun is highly related to the noun or noun phrase it refers to and would have a high relationship weight.
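A toy version of self-attention can be written in a few lines. Here each word’s vector serves as its own query, key, and value; real transformers learn separate projection matrices for those three roles:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each word's output is a weighted mix of every word's vector,
    with weights given by scaled dot-product similarity."""
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score the query against every word in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)  # attention weights sum to 1
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors))
                 for j in range(d)]
        outputs.append(mixed)
    return outputs

# Three "words", already embedded as 2-d vectors; a real transformer
# computes all of these rows in parallel, here just looped for clarity.
sentence = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
out = self_attention(sentence)
print(len(out) == 3)
```

A strongly related pair, such as a pronoun and its antecedent, would produce a high dot-product score and hence a large attention weight in this computation.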
Transformers also pair encoders with decoders. As the encoder processes each word, self-attention weighs that word against every other word in the sentence; as the decoder generates output, attention concentrates on the words most relevant to the next piece of output.
Another significant advance comes from how transformers encode information about the position of words in sentences, which is important to understanding the syntactic structure of the sentence. The problem of the static, single representation of a word was addressed with the introduction of contextualized word representations.
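The original transformer paper encodes position with fixed sinusoids: even dimensions use sine, odd dimensions use cosine, at wavelengths that grow geometrically with the dimension index. A minimal sketch:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal position vector of length d_model, added to the
    word embedding so the same word at different positions produces
    different inputs to the network."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

print(positional_encoding(0, 4))  # position 0 → [0.0, 1.0, 0.0, 1.0]
```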
In the past, it was common to craft a language model for every processing task. But transformers have led to the creation of models that can be applied to multiple tasks. These are known as pre-trained models, which capture knowledge of language processing in a network that can then be easily adapted to a variety of tasks.
For example, an additional layer could be added to a pre-trained model to implement a document classifier. The augmented network could then be trained using supervised learning techniques to learn document categories without having to learn a language model.
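The following toy sketch illustrates the idea. The “pre-trained” word vectors below are hypothetical stand-ins for a real model’s frozen representations; only the small added layer is trained with supervised learning:

```python
import math

# Hypothetical frozen "pre-trained" embeddings standing in for the
# output of a large language model; these stay fixed during training.
pretrained = {
    "great": [0.9, 0.1], "awful": [0.1, 0.9],
    "movie": [0.5, 0.5], "boring": [0.2, 0.8],
}

def encode(doc):
    """Average the frozen word vectors into one document vector."""
    vecs = [pretrained[w] for w in doc.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The added layer: one weight per embedding dimension plus a bias,
# trained with plain gradient descent on a tiny labeled set.
weights, bias = [0.0, 0.0], 0.0
data = [("great movie", 1), ("awful boring movie", 0)]
for _ in range(200):
    for doc, label in data:
        x = encode(doc)
        pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        err = pred - label
        weights = [w - 0.5 * err * xi for w, xi in zip(weights, x)]
        bias -= 0.5 * err

def classify(doc):
    x = encode(doc)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) > 0.5
```

The network never relearns the language model itself; the classifier simply reuses its representations, which is what makes adapting pre-trained models so cheap.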
The end result is that in just a few short years, transformers have massively improved an AI’s ability to understand language and greatly expanded the number of applications that can leverage NLP.
GPT-3 & the future of NLP
Natural language processing has evolved from building abstract models of grammar to training deep neural networks that learn from data. The success of transformers is built on a combination of advances in computing and improvements in theory and algorithm design, enabling them to solve the problems of context and word relationships that are essential to modeling understanding.
That modeling is not just evolving, but accelerating through these new techniques. In just a few years, OpenAI’s GPT has gone through three iterations, with the latest, GPT-3, having 175 billion machine learning parameters. The GPT-3 language model now generates the most human-like natural language to date.
For example, The Guardian recently used GPT-3 to write an op-ed piece to persuade humans that AI is not an existential threat. In benchmark tests of language translation, GPT-3 achieved strong performance in both English-to-German and English-to-French translation.
Where does natural language processing go from here? GPT-3 and its successors will significantly change user interfaces, business intelligence, and language translation.
The next-generation model, GPT-4, will likely have trillions of parameters and go beyond language processing to tasks that require other forms of intelligence, such as vision. For example, GPT-4 may combine language with video processing. Transformers have enabled the most effective artificial implementation of Chomsky’s language organ.
Transformer technology will be the basis for conversational interfaces. Keyboards and pointing devices will certainly remain the dominant way of interacting with machines for some time, but developers will now have the option of creating conversational interfaces. This will be particularly useful in emerging applications, such as the use of robotics in elder care.
Business intelligence and analytics have been largely constrained to working with numeric data. The ability to analyze large volumes of text will accelerate research, particularly in biomedical research, which has tens of millions of scientific publications widely available. Automated translation will also significantly reduce the cost of conducting business in different languages.
Advances in NLP theory are not enough on their own, though: the models needed to capture human language are so large that they can only be built with vast computing resources and extensive corpora of written language. Looking ahead, the scalability of transformers will enable larger and more expansive models that go beyond just language processing to integrate multiple forms of intelligence in a single model.