[LLM #1] The Emergence of Conversational Programming
Moving from arduous Structured Instructions to fluid Natural Interactions
Traditional computer systems are driven by low-level instructions from human beings. About a decade back, we took a detour: we started building systems (components) not via explicit instructions, but by providing input-output examples (data) and tweaking knobs (parameters) to learn those input-output behaviors. Dozens of human-centric problems (text, vision, speech and more) that had gone unsolved for decades became amenable to this data-and-knobs paradigm.
A major transition happened again a few years back with the emergence of large language models (also called foundation models, GPT, ..) trained on next-word prediction at scale. The models, trained on Internet-scale data, could not only produce human-like text, but even learn to execute completely new tasks, without re-training, using only a few in-context examples. The euphoria continued. Next-word supervision was then augmented with Instructions and Conversations supervision.
Instructions (format: ‘do this task. <input> <output>’)
Conversations (format: ‘assistant: … user: …’)
The combination of large-scale pre-training, instruction supervision and conversation supervision turned out to be unbelievably powerful. It created a unified interface for multiple Natural Language Processing (NLP) tasks and enabled assistants that could take all sorts of instructions in natural language and deliver on them in a fraction of the time.
Opening up ChatGPT to the masses was a landmark event. It disrupted our collective mental model of how we should program and interact with computers: moving away from giving painstaking, precise, low-level instructions in structured languages (Python, C++, ..) to free-flowing, ambiguous, natural language instructions.
This was the birth of Conversational Programming - a new paradigm for programming, or rather, interacting with computers. Humans and computer programs can now get tasks done, plan and achieve complex goals, all by entering into an ordinary dialog with one or more LLM systems. Bots can now respond to customers fluently, mine long financial and legal documents for insights and even assist humans in their endeavors to improve creativity and productivity.
In this post, let us examine how this disruptive conversational programming paradigm came into being, how it is evolving and the path ahead.
Exposing LLMs to the Masses
OpenAI was at the forefront of creating and capitalizing on this disruptive change by providing access to its massive, cloud-based, Large Language Models (LLMs) via plain remote API calls.
Indeed, after building these powerful but siloed LLMs, the first question OpenAI needed to get right was this:
Now we have these general-purpose, instruction-amenable, powerful LLMs:
How do we expose them as intuitive APIs to the world of structured language programmers?
The real impact comes not only from allowing humans to talk to these LLMs, but from allowing existing computer programs to call them to get tasks done at scale.
How to expose them is a tricky question, and there are many valid ways to create an input-output API around an LLM:
input: instruction → output: answer,
input: context → output: next word,
input: context → output: many sentences.
OpenAI nailed this problem. Starting with a simple Completion API, they created a huge programming ecosystem around their massive LLMs, piece by piece. In fact, it was fun to watch how they cut through complexity and prioritized elegant simplicity at every stage of the API's evolution. The API design has also helped developers carve a mental model of the new, enigmatic entity that they must communicate with to get tasks done.
Let’s dive into the API design to unpack the elements of the new programming paradigm.
Evolution of OpenAI APIs
The first API version that exposed the LLM to external calls closely mimicked the process of next token generation. The input to the LLM is called a prompt and the model generates one or more tokens that complete the prompt.
openai.Completion.create(model, prompt='…', …)
As mentioned earlier, one could imagine exposing the LLM as an API in many different, often complex, ways. The Completion primitive, however, goes right to the heart of it. The API looks naive, but it is far from it: it allows developers to get arbitrary tasks done by:
providing task instructions and examples as part of the prompt, and,
forcing the model to generate the task output, by merely asking it to complete the prompt using its enormous, random-access memory.
Slowly it became apparent that one could piggyback arbitrary NLP tasks on top of the simple Completion API, provided we knew how to instruct (specify the prompt) correctly.
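To make this concrete, here is a minimal sketch of piggybacking one such task (sentiment classification) on the Completion API, written in the openai<1.0 Python SDK style of that era; the model name, prompt wording and examples are illustrative, not prescriptive.

import openai

# Few-shot prompt: the task instruction and examples are packed into the prompt,
# and the model is asked to simply complete it with the answer.
prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: The battery died within a week.\nSentiment: Negative\n"
    "Review: Setup took two minutes and it just works.\nSentiment:"
)

response = openai.Completion.create(
    model="text-davinci-003",  # a completion-style model of that period (illustrative)
    prompt=prompt,
    max_tokens=5,
    temperature=0,
)
print(response["choices"][0]["text"].strip())  # e.g. "Positive"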
Then came the Chat API.
openai.ChatCompletion.create(model, messages = […], …)
This involved a few subtle changes over the Completion API.
the prompt was replaced by a sequence of messages. The latter represents a conversation context or history.
the prompt text was upgraded to an (actor role, message) pair: instead of a single user prompting the LLM, this now enables multiple actors who send differentiated messages to the LLM.
Note that the prompt in the earlier Completion API implied a single-turn conversation: user sends a single instruction to the LLM and gets a response. The ChatCompletion API raised the level of discourse to a multi-turn conversation: the new instructions are interpreted and discharged in the context of the previous conversation and the response is continually added to the conversation context.
This multi-turn conversation API effectively materialized the mental model of multiple actors communicating together: the user actor messages the LLM actor to take the conversation forward. In essence, the new API enables a team of actors to communicate and work together, using the mechanism of multi-turn conversations to solve problems collaboratively. The team could consist of multiple humans with a single LLM, a single human with multiple LLM actors or, in the most general case, multiple human and LLM actors.
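Here is a minimal sketch of such a multi-turn exchange, again in the legacy openai<1.0 SDK style (the model name and messages are illustrative): each message pairs an actor role with text, and the growing list carries the conversation context from turn to turn.

import openai

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize why unit tests matter."},
]
reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
assistant_text = reply["choices"][0]["message"]["content"]

# Append the assistant's reply and the next user turn; the new instruction is
# interpreted in the context of everything said so far.
messages.append({"role": "assistant", "content": assistant_text})
messages.append({"role": "user", "content": "Now give a one-line example."})
reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)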
Then came Plugins. The plugins allowed OpenAI's ChatGPT to interact with the external world (APIs). These plugins (or tools) were available through the ChatGPT UI for some time but not via GPT-x API calls. The challenge here was to give users access to the world of powerful plugins, without drastic changes to the earlier API or disrupting the mental model.
From the programmatic point of view, all communication so far took place in a closed world, among users (programs) and the LLM assistant. To do harder tasks, user programs often rely on powerful external tools (or APIs, or agents) from the structured world. While the assistant could suggest which tools to use or how to use them, the suggestions were limited to tools that were common knowledge between the user and the LLM. Otherwise, the user first had to describe the tools to the LLM (with no fixed definition format), and only then could the LLM help (with no fixed output format either, so you had to build your own parser). Thus, invoking external tools seamlessly while conversing with the LLM was tricky.
Aside: A bunch of natural language planner programs, e.g., ReAct, were proposed earlier for discharging advanced tasks with LLMs. These planners can be used to instruct the LLM how to discharge a task by invoking one or more pre-specified tools. Again, the formats of tool specifications and responses are ad hoc, need task-specific tuning and often lead to unreliable behavior.
This hurdle was overcome by another elegant tweak to the ChatCompletion API: add a functions parameter to enable external functions/tools to work with the ChatCompletion API.
openai.ChatCompletion.create(model, functions=[…], messages=[…], …)
The functions parameter consists of a list of descriptions of external (user-defined) functions that the user may wish to call. The format may appear slightly unintuitive at first, but one soon begins to appreciate its utility. Each function description consists of the name of a user-defined function, what it does, the input arguments to the function, their types and semantic meaning. All of it in natural language. Here is a specification of a get_current_weather function (from the original release) with a single parameter of record type with location and unit fields.
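A sketch of that specification, written here as a Python dict in the JSON-Schema style the API expects (reconstructed from the original release example, so treat the exact wording as approximate):

get_current_weather_spec = {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {  # a single record-type parameter with location and unit fields
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA",
            },
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}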
Let us consider another example. Suppose that, while conversing with the LLM, it retrieved a list of revenues for a bunch of companies from its memory, and now we want to compute ratios or averages over these revenues. LLMs can do math, but it is best avoided. On the other hand, you have a robust average function, implemented in Python elsewhere, and all you want is for the LLM to help you call average by providing the correct inputs to it. You can then call average, get the results you wanted and trust that you have computed them correctly.
The overall idea is that, based on the conversation history, if the LLM can come up with the right inputs for the function, you (a structured program) can call the function and message the result back to the LLM. Now the LLM knows the correct answer and can proceed with the conversation without being accused of being bad at math.
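Here is a hedged sketch of that round trip in the legacy openai<1.0 SDK style; the average function, its schema, the model name and the revenue figures are all illustrative. The LLM proposes a call to average with structured arguments, the structured program executes it, and the numeric result is messaged back so the conversation can continue.

import json
import openai

def average(values):
    # The robust, trusted implementation lives in the structured world.
    return sum(values) / len(values)

functions = [{
    "name": "average",
    "description": "Compute the arithmetic mean of a list of numbers",
    "parameters": {
        "type": "object",
        "properties": {"values": {"type": "array", "items": {"type": "number"}}},
        "required": ["values"],
    },
}]

messages = [{"role": "user",
             "content": "The revenues are 120, 95 and 130 (in $M). What is their average?"}]
first = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                     messages=messages, functions=functions)
msg = first["choices"][0]["message"]

if msg.get("function_call"):
    args = json.loads(msg["function_call"]["arguments"])  # structured inputs chosen by the LLM
    result = average(args["values"])                       # the real computation happens here
    messages.append(msg)
    messages.append({"role": "function", "name": "average", "content": str(result)})
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    print(final["choices"][0]["message"]["content"])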
Aside: Not much is said about how the functions parameter is handled internally with LLMs. To make it work, we need to set up a prompt for the LLM, which instructs it to pick one of the function names from the functions list and provide the values for each of its arguments. I’d like to call it the Func prompt.
The Func prompt eliminates the need for an outer ReAct loop. Instead, it reduces decision making to picking the right function to call at every turn of the conversation.
Even though it may appear incongruous at first glance, adding the functions parameter to the ChatCompletion API was a masterstroke that simultaneously activated many capabilities.
Enables seamless interaction between the structured and natural language programming worlds.
Provides a simple standard for how structured and natural language programs should communicate.
Generates structured data, in the specified format, as the input to an imaginary function whose input schema we specify; there is no need to actually call the function (see the sketch after this list).
Interleaves prompts in the ongoing virtual conversation with the real actions (function calls) or tasks that you want to get done, seamlessly.
Allows executing planners (think AutoGPT) that invoke a sequence of existing tools via external APIs to get a task done.
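For instance, here is a sketch of the third capability: a fictitious record_company function whose schema exists only to force structured output. We never implement or call it; we just read the JSON arguments the model produces. The function name, fields and model are illustrative (legacy openai<1.0 style).

import json
import openai

functions = [{
    "name": "record_company",   # imaginary function: we only want its input schema filled
    "description": "Record structured facts about a company mentioned in the text",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "founded": {"type": "integer"},
            "headquarters": {"type": "string"},
        },
        "required": ["name"],
    },
}]

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Acme Robotics, founded in 2011, is based in Austin, Texas."}],
    functions=functions,
    function_call={"name": "record_company"},  # ask the model to 'call' this schema
)
data = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])
print(data)  # e.g. {"name": "Acme Robotics", "founded": 2011, "headquarters": "Austin, Texas"}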
In summary, through the evolution of the Completion API, we have witnessed, bit by bit, the birth of a new programming paradigm that seamlessly combines the power of traditional programming instructions with natural language instructions.
Conversational Programming: The Path Ahead
Before concluding, it is important to recall that the idea of Conversational Programming is not new and has come up in different forms earlier. Notably, Marvin Minsky’s 1967 article explores the idea that computers can do more than what they are programmed to do. Programs (or instructions) need not be perfect or precise; they only need to satisfy the programmer’s intent. The intent, in turn, isn’t a point goal but a distribution of potential outcomes, any one of which might satisfy the programmer. Consequently, it is fine for the programmer to provide an imprecise specification of their intent (in natural language) and gradually refine it as the conversation flows, an easy segue into conversational programming.
What we see now is a more complete and practical manifestation of the idea, which not only instructs computers via natural language conversations but also works in sync with traditional structured programming. It is worth noting that the current system architecture is nested in an outer-structured-and-inner-conversational format: conversational programs are embedded inside the outer structured programming paradigm. This might get reversed in the future, or the boundaries may be eliminated.
Are there other ways to mix natural language programming with structured programming? Of course. In fact, several approaches have been proposed by researchers: lmql, guidance, parsel, and more. More on these in a separate post.
Finally, we may further generalize the functions parameter (in the ChatCompletion API) to a setting where multiple agents work together with LLMs to reach a goal. Here is a potential mental model for an environment of communicating, collaborating things.
The conversation is a virtual artifact, like a voice in your head or a sub-conscious scratchpad.
The LLM is a virtual planner, which receives task requests from agents.
To execute the virtual plan, the LLM must provide structured data back to the agents, who will then execute the real actions.
You, a human agent, may ask an LLM for help to finish a task. The LLM responds “To finish, I need you to run an errand, please”. You oblige and so the cycle continues.