As artificial intelligence (AI) virtual assistants capture both the imagination and the ire of society, it's important to understand that we're in the middle of the story. AI, at least in the form of virtual assistants built on machine learning models, is still evolving.
A recent parallel is wearable technology, like the watches many of us now wear every day. Back in 2014, when everyone was making fun of Google Glass and the Apple Watch wasn't even a thing, I said this in the NY Times.
“The future of wearables is in the awkward adolescent stage. I have a feeling we’ll be here for awhile,” wrote Ian O’Byrne, an assistant professor at the University of New Haven. “At some point, a product will come out that ‘just makes sense.’”
As hype and hysteria build over ChatGPT, I’m paying attention to the technology, barriers, and opportunities that exist behind the scenes. To make this a bit easier, let’s look at how GPT (Generative Pre-trained Transformer) has evolved over the last decade and try to figure out where it could/should go.
Keep in mind, this is one look at one learning model. I'm providing an overview of GPT from OpenAI because it's relatively easy to get a sense of what they're doing and why. This does not mean that OpenAI is transparent, releasing most of its code, or sharing the secret recipe to any extent. While OpenAI has released its algorithms to the public in the past, it has opted to keep GPT-3 locked away. There are also many other machine learning models, AI virtual assistants, tools, and platforms that will continue to evolve; this is just one that we're using to make sense of things.
Please note, I’ll continue to add to this post and revise it over time. I’ll leave time stamps below to indicate what changes (and when) are made.
Before GPT
Natural language processing (NLP) models were used for tasks like sentiment classification and textual entailment. Sentiment classification is the automated process of identifying opinions in text and labeling them as positive, negative, or neutral, based on the emotions the writer expresses. Textual entailment is an exercise in logic that attempts to discern whether one sentence can be inferred from another.
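To make those two tasks concrete, here is a tiny, hypothetical sketch in Python showing the shape of the data each one works with; the example sentences and labels are invented for illustration.

```python
# Sentiment classification: a single piece of text maps to an opinion label.
sentiment_examples = [
    ("I loved this laptop, the battery lasts all day.", "positive"),
    ("The screen cracked after a week.", "negative"),
    ("It arrived on Tuesday.", "neutral"),
]

# Textual entailment: given a premise and a hypothesis, decide whether the
# premise supports the hypothesis (entailment), contradicts it, or neither.
entailment_examples = [
    ("A man is playing a guitar on stage.", "Someone is making music.", "entailment"),
    ("A man is playing a guitar on stage.", "The stage is empty.", "contradiction"),
    ("A man is playing a guitar on stage.", "The concert is sold out.", "neutral"),
]

for text, label in sentiment_examples:
    print(f"{label:>9}: {text}")
```

A model trained only on the first kind of example knows nothing about the second, which is exactly the one-trick-pony problem described below.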
This was challenging because the large datasets needed to train NLP models for a specific task were not freely available. In addition, training a model to identify emotions in text didn't make it generalizable to other tasks in other situations. In other words, each model was a one-trick pony.
GPT-1
GPT-1 (Generative Pre-Training, model one) proposed training a generative language model on unlabeled data and then tuning it downstream by giving the model examples of classification, sentiment analysis, textual entailment, and other ways to look at the data. The paper for GPT-1 is available here.
The breakthrough in this work was the use of unsupervised learning as a pre-training objective for supervised fine-tuning in the development of the learning model. This is where the name Generative Pre-training comes from. This can get a bit confusing, but this great post helps. I share a brief overview below.
- Supervised learning involves a machine learning algorithm learning under the guidance of a supervisor. An example is a classroom teacher helping a student learn the difference between apples and oranges. The teacher uses a set of pictures of apples and oranges and points out how they differ. As the student comes across different fruits and vegetables, they return to that set of labeled pictures to recalibrate how they tell apples and oranges apart.
- Unsupervised learning involves the learner finding and identifying the similarities and differences in a dataset without the guidance of a supervisor. An example is a student who wants to identify the differences between types of apples. They go from store to store and buy as many apples as possible, then take them all home to examine, measure, weigh, carve, and eat. The student gradually becomes able to identify a specific apple, how it tastes, and how it differs from other apples just by looking at it. (A brief code sketch contrasting the two approaches follows this list.)
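Here is a minimal sketch of that distinction using scikit-learn; the fruit measurements and labels are made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up measurements: [weight in grams, redness score from 0 to 1]
fruit_features = [[150, 0.9], [170, 0.8], [140, 0.2], [160, 0.3]]

# Supervised learning: a "teacher" provides the answers up front.
labels = ["apple", "apple", "orange", "orange"]
clf = LogisticRegression().fit(fruit_features, labels)
print(clf.predict([[155, 0.85]]))  # the model guesses using the labeled examples

# Unsupervised learning: no labels at all; the algorithm groups the
# measurements by similarity, and it's up to us to interpret the clusters.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(fruit_features)
print(clusters)
```

The supervised model can only predict the labels it was taught, while the clustering step just groups similar measurements and leaves the naming to us.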
For GPT-1, the language model was first pre-trained with unsupervised learning on BooksCorpus, a dataset of over 7,000 unpublished books. That unsupervised pre-training was then followed by supervised fine-tuning for specific tasks. In other words, the machine learning algorithm was allowed to play and learn in the BooksCorpus dataset and work out patterns on its own. Supervision was then introduced to help clarify and extend what it learned from the dataset.
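As a rough sketch of that two-stage recipe, here is what it might look like with the Hugging Face transformers library, using the publicly available GPT-2 classes as a stand-in for GPT-1; the sentences and the sentiment label are toy examples.

```python
import torch
from transformers import GPT2ForSequenceClassification, GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

# Stage 1 - unsupervised pre-training objective: predict the next token on
# raw, unlabeled text (one sentence stands in here for all of BooksCorpus).
lm = GPT2LMHeadModel.from_pretrained("gpt2")
batch = tok("The orchard was heavy with fruit that autumn.", return_tensors="pt")
print(lm(**batch, labels=batch["input_ids"]).loss)  # language-modeling loss

# Stage 2 - supervised fine-tuning: reuse the pre-trained weights and train a
# small classification head on labeled examples (e.g., sentiment).
clf = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
clf.config.pad_token_id = tok.eos_token_id  # GPT-2 defines no pad token
example = tok("I really enjoyed this book.", return_tensors="pt")
print(clf(**example, labels=torch.tensor([1])).loss)  # 1 = "positive" (toy label)
```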
GPT-1 showed the promise of generative pre-training that could generalize to different tasks and datasets.
GPT-2
In 2019, one year after the GPT-1 paper, Language Models are Unsupervised Multitask Learners was released. It identified the use of larger datasets and more parameters (the learned weights inside the model that are adjusted during training) to develop a stronger language model.
For GPT-2, developers scraped Reddit and pulled data from the outbound links of highly upvoted posts. The resulting dataset, called WebText, consisted of over 8 million documents totaling 40 GB of text. All Wikipedia pages were removed from WebText, since Wikipedia is a common source for other datasets.
Two breakthroughs came from this work in GPT-2.
- Task conditioning aims at developing a single model that learns multiple tasks at the same time. The model is conditioned to produce different outputs for the same input depending on the task. In NLP models, task conditioning is done with natural-language instructions that tell the model which task to perform.
- Zero-shot learning and zero-shot task transfer. I'll have a follow-up post on this, but zero-shot learning is when the machine learning model recognizes something it hasn't encountered before. Zero-shot task transfer is when the model understands the task it is given without much explanation or many examples. An example would be switching from English to Spanish: the model learns to move across languages quickly and still complete the same tasks. (A brief sketch of prompting a model this way follows this list.)
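Here is a small sketch of what task conditioning looks like in practice, using the publicly released GPT-2 weights via Hugging Face transformers. The prompts are illustrative, and a model this small follows instructions far less reliably than its successors.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The same model, conditioned on different natural-language instructions,
# is asked to perform different tasks with no task-specific training.
prompts = [
    "Translate English to French: cheese =>",
    "Summarize: The meeting ran long because the agenda kept growing. Summary:",
    "Question: What is the capital of France? Answer:",
]
for p in prompts:
    out = generator(p, max_new_tokens=20, do_sample=False)
    print(out[0]["generated_text"])
```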
GPT-2 showed that giving a machine learning model a larger dataset and more parameters allowed it to understand many more tasks and complete them in zero-shot settings. In an NLP model, this means training on language pairs the model sees (Spanish and English, French and English) and then testing it on a pair it doesn't see (French and Spanish) to have it translate and complete tasks. GPT-2 was able to do this, and it suggested that more data and more training time would allow for even more complexity in the model.
GPT-3
Published in 2020, Language Models are Few-Shot Learners detailed the improvements that made GPT-3 one of the most powerful NLP models to date. GPT-3 was built by OpenAI using 175 billion parameters, roughly 100 times more than GPT-2.
GPT-3 was trained using five different datasets, each given a different weight in the training mix: Common Crawl, WebText2, Books1, Books2, and Wikipedia. Some of these datasets were given more than one epoch, or pass through the data, during training. It's tough to identify (in my humble research) exactly what is included in Books1 and Books2; they seem to be some mix of sources, possibly Project Gutenberg books or materials scanned by Google as it digitized books.
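As a sketch of what "different weights" means here, the mixture below uses the approximate sampling proportions reported in the GPT-3 paper; the sampling loop itself is a simplified illustration, not OpenAI's actual training code.

```python
import random

# Approximate dataset mixture weights from "Language Models are Few-Shot Learners".
mixture = {
    "Common Crawl (filtered)": 0.60,
    "WebText2": 0.22,
    "Books1": 0.08,
    "Books2": 0.08,
    "Wikipedia": 0.03,
}

# Each training batch is drawn from the datasets in proportion to these weights,
# so smaller, higher-quality sources are revisited (multiple epochs) while the
# huge Common Crawl portion is sampled for less than one full pass.
picks = random.choices(list(mixture), weights=list(mixture.values()), k=5)
print(picks)
```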
GPT-3, as seen in ChatGPT and elsewhere, has been shown to handle writing articles and carrying on a conversation while completing tasks on the fly. Because of the large dataset and the large number of parameters, the model is able to create high-quality text, but its longer responses tend to become repetitive. GPT-3 is, for the most part, limited to one task at a time. Once it is involved in that task, it struggles to regroup, reinterpret, and move to another task or complete bidirectional tasks. A bidirectional task is one where the model looks both backward and forward as it converses with you and completes a task. Think of Google Autocomplete: Google fills in search terms it thinks you might be looking for, which is looking forward and predicting. A bidirectional task would be suggesting a better start to the query, or indicating that your original query or task isn't really what you're looking for. 🙂
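One way to see the forward-only versus bidirectional difference is to compare a GPT-style model, which only continues text left to right, with a masked model like BERT, which fills a blank using context on both sides. The sketch below uses small public models from Hugging Face purely as an illustration; neither is GPT-3 itself.

```python
from transformers import pipeline

# Forward-only (autoregressive): continue the text from left to right.
forward = pipeline("text-generation", model="gpt2")
out = forward("The search query you probably meant was", max_new_tokens=10, do_sample=False)
print(out[0]["generated_text"])

# Bidirectional: fill in a blank using the context on both sides of it.
both_ways = pipeline("fill-mask", model="bert-base-uncased")
for guess in both_ways("The user typed a [MASK] query into the search box.")[:3]:
    print(guess["token_str"], round(guess["score"], 3))
```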
GPT-3.5
Two years after GPT-3, OpenAI rolled out what could be identified as GPT-3.5, though you most likely know it by a different name: ChatGPT. ChatGPT utilizes a fine-tuned version of GPT-3.5 that's essentially a general-purpose chatbot.
According to OpenAI, GPT-3.5 was trained on a blend of text and code published before the end of 2021. This means that it stopped training/learning at this point and it is not able to access or process more recent data, events, or information.
Like GPT-3 and other text-generating AI, GPT-3.5 learned the relationships between sentences, words, and parts of words by ingesting huge amounts of content from the web, including hundreds of thousands of Wikipedia entries, social media posts, and news articles.
OpenAI decided not to release all of GPT-3.5 for use. Instead, it used it to create several systems fine-tuned to achieve specific tasks.
One of these is text-davinci-003, which is better at both long-form and high-quality writing than models built on GPT-3. It also has fewer limitations and scores higher on human preference ratings than InstructGPT, a family of GPT-3-based models released by OpenAI earlier in 2022.
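As a minimal sketch of what calling text-davinci-003 looked like, assuming the v0.x openai Python client that was current at the time and a placeholder API key:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; set your own key

# Request a completion from the fine-tuned GPT-3.5 model text-davinci-003.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a short thank-you note to a colleague who covered my class.",
    max_tokens=120,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())
```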
Alongside GPT-3.5, OpenAI also provides the Playground, a web interface for experimenting with its models directly, in addition to the chat interface most people know as ChatGPT.
GPT-4
GPT-4 was announced in March of 2023. At launch, access was limited: it was available mainly to ChatGPT Plus subscribers and, for the API, through a waitlist for selected partners and researchers. GPT-4 is reported to be much larger, rumored at around 1 trillion parameters (OpenAI has not confirmed a figure), compared to GPT-3.5's 175 billion parameters. This means it can handle more complex tasks, generate more accurate responses, and be more aware of context.
GPT-4 was trained on a more extensive dataset of text and code published before the end of 2021, giving it a wider and deeper knowledge base than GPT-3.5 and an advantage in knowledge and understanding.
GPT-3.5 is faster in generating responses than GPT-4, because it has a smaller model size and less data to process. It is also a text-to-text model, meaning it only accepts text inputs and generates text outputs. It can still perform various natural language tasks but with less accuracy and reliability than GPT-4.
GPT-4 is a data-to-text model, meaning it can accept both text and image inputs and generate text outputs. This makes it a multimodal system that can showcase human-level performance on various professional and academic benchmarks.
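For text inputs, a GPT-4 request looks like the sketch below, again assuming the v0.x openai Python client and a placeholder API key; image inputs were limited to select partners at launch, so this example is text-only.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; set your own key

# Chat-style request to GPT-4 (requires API access to the gpt-4 model).
reply = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise writing assistant."},
        {"role": "user", "content": "Explain textual entailment in two sentences."},
    ],
)
print(reply["choices"][0]["message"]["content"])
```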
GPT-4 is better equipped to handle longer text passages, maintain coherence, and generate contextually relevant responses. It also makes fewer logic and reasoning errors with more complex prompts.
Until the next model
It's also important to note that the next step in the development of these tools and spaces will require large numbers of people providing data, tasks, information, queries, and errors in multiple languages and formats. This would help build task conditioning, zero-shot learning, and opportunities for bidirectional inquiry.
I also think the next iteration of this NLP will again focus on tools and products for specific purposes. We’ll see what comes next. I’ll continue to update this post as things change.
If you’d like to stay on top of areas like this, you should be reading my weekly newsletter. You can follow here or on Substack.
Cover photo by Suzanne D. Williams on Unsplash