One of the key reasons the world is divided about the usefulness of GPT-style LLMs is their lack of reliability: they hallucinate, create fake links, can’t multiply big numbers, are confidently wrong, and so on.
Why seek reliability? Because otherwise the human has to check or verify everything that the LLM utters. Human time is precious.
To dig deeper into the human-must-verify problem, let me introduce the concept of an HL-verifiable response:
A response from an LLM is Human-Linearly (HL-) verifiable if a human can verify the correctness of the response in a single linear scan of the response.
OK? So if you scan the response once and are more or less certain that it is fine, we’re done: HL-verifiable! Otherwise, the task is hard, and could get very hard! I think anything more than a linear scan of the response is hard work for humans. For long responses, even a log-time scan would be preferable. Also, humans vary in expertise, so let’s say HL-verifiable on average, “in expectation”.
Verification times can vary from linear (in the size of the response) to super-exponential. For example, try verifying a (GPT-generated) program with a bunch of non-existent API calls: only a bit of effort. Try something harder: verify that a program implementing a non-standard ‘sorting’ algorithm is correct. You might spend a lot of time reading the program over and over, trying it on sample inputs, maybe even coming up with an intuitive proof. One could define harder levels than ‘linear’ verifiability, but let’s not go there right now.
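To make the gap concrete, here is a minimal sketch in Python (my own illustration; the function names and numbers are made up, not from the post). Checking whether the API calls in a generated program actually exist is a quick mechanical pass, while a home-grown sort can only be stress-tested: random inputs build confidence but never amount to a proof.

```python
import importlib
import random

def api_calls_exist(module_name, called_names):
    """Quick, roughly-linear check: does each called name exist in the module?"""
    mod = importlib.import_module(module_name)
    return {name: hasattr(mod, name) for name in called_names}

# Easy: a single pass over the names flags hallucinated APIs immediately.
print(api_calls_exist("math", ["sqrt", "fast_inverse_sqrt"]))
# {'sqrt': True, 'fast_inverse_sqrt': False}

def mystery_sort(xs):
    """Stand-in for an LLM-generated, non-standard 'sorting' routine."""
    return sorted(xs)  # pretend we cannot tell at a glance that this is correct

# Hard: random testing only builds confidence; it never proves correctness.
for _ in range(1000):
    xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
    assert mystery_sort(xs) == sorted(xs), f"counterexample: {xs}"
print("No counterexample found in 1000 trials (still not a proof).")
```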
Exploiting GPT's unreliability: HL-hardness
Now, why all these theoretical contortions? Is the idea of HL-verifiability of any use? I think it is.
See the attached snapshot below. Instead of assigning your student to write a program to solve a problem, give them a program (generated by an LLM) and ask them to debug and verify it! Certainly not HL-verifiable by the student!
So, we’ve used the hardness of the task (not HL-verifiable) to our supreme advantage. The teacher spends a small effort writing the prompt. The student spends a lot of effort verifying the output solution(s), and in the process learns from the huge space of solutions as opposed to a single solution. Win-win!
Hardness levels of verifiability: can bots help?
Going further, what about fake links? Checking the fakeness of ‘many’ links sounds like a time-consuming human task, definitely not HL-verifiable. What if we ask a bot to check those links? If it boils down to checking only for 404s, that’s easy for a bot!
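Here is a minimal sketch of such a link-checking bot, assuming Python and only the standard library; the URLs are placeholders, and a real checker would also need retries, redirect handling, and rate limiting.

```python
import urllib.request
import urllib.error

def link_status(url, timeout=5):
    """Return the HTTP status for a URL, or the error reason if it cannot be fetched."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:   # e.g. 404 for a fabricated link
        return e.code
    except urllib.error.URLError as e:    # e.g. a non-existent domain
        return f"unreachable: {e.reason}"

# Hypothetical links extracted from an LLM response (placeholder URLs).
for url in ["https://example.com/", "https://example.com/made-up-paper.pdf"]:
    print(url, "->", link_status(url))
```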
So it makes sense to extend the nomenclature to human+assistant-linearly (HAL-) verifiable. You ask:
Can the assistant verify parts (links / API calls / paragraphs) of the LLM response, in real time, within the time the human takes to finish the linear scan?
If the answer is yes, then it’s HAL-verifiable! Many tasks that are not HL-verifiable become HAL-verifiable. Again, we could think of different modes/levels in which the human and assistant collaborate to verify the response, without making it feel like hard work for either.
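A hedged sketch of what HAL-verification could look like: the assistant fans out checks on parts extracted from the response (links, API calls, claims) while the human does the linear scan. The `verify_part` helper and the extraction step are my own assumptions; in practice they could reuse the link and API checks sketched above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-part checker; a real one could call the link and API-call
# checks above, or a retrieval step for factual claims.
def verify_part(part):
    kind, payload = part
    if kind == "link":
        return payload, "reachable (status 200)"    # placeholder result
    if kind == "api_call":
        return payload, "exists in documented API"  # placeholder result
    return payload, "needs human judgment"

# Parts extracted from the LLM response (the extraction itself is another assumption).
parts = [("link", "https://example.com/cited-paper"),
         ("api_call", "pandas.DataFrame.merge"),
         ("claim", "This algorithm runs in O(n log n) time.")]

# The assistant verifies parts concurrently while the human does the linear scan;
# anything it cannot settle is surfaced for the human to judge.
with ThreadPoolExecutor() as pool:
    for payload, verdict in pool.map(verify_part, parts):
        print(f"{verdict:>28}  <-  {payload}")
```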
In summary:
The unreliability of LLMs is a bane to us because we need to verify their responses.
“Verify that the response of an LLM is correct!” The task is well-defined, but solving it can be easy or hard, depending on the response. There is no obvious solution!
We escape the predicament of unreliability by defining levels of the effort required for verification. In particular, Human-Linear (HL-) verifiability means easy for a human to verify. If not HL-verifiable, then hard! Easy with assistants = HAL-verifiable.
Insight: We can exploit the notion of not-HAL-verifiable responses to invent new hard tasks, potentially with an economic incentive.
Insight: To make the readers trust the LLM responses more, we can augment the responses with specialized artifacts (e.g., citations, explanations) that make the response HL-verifiable.
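As one hypothetical illustration of the last insight, the response could carry an explicit claim-to-evidence mapping that a reader (or a bot) can verify in a single pass; the schema below is invented for illustration, not an established format.

```python
# A hypothetical claim-to-evidence schema: each claim in the response points
# at the artifact that lets a reader verify it in one pass.
augmented_response = {
    "claims": [
        {"text": "Quicksort is O(n log n) on average.",
         "evidence": {"type": "citation", "ref": "CLRS, 3rd ed., Section 7.4"}},
        {"text": "The linked dataset has 10k rows.",
         "evidence": {"type": "link", "ref": "https://example.com/dataset"}},
    ]
}

for claim in augmented_response["claims"]:
    ev = claim["evidence"]
    print(f"- {claim['text']}  [verify via {ev['type']}: {ev['ref']}]")
```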
I haven't seen anyone talk about this before. Thoughts and comments welcome!
I think the notion of HL-verifiability is very useful — unless of course, the LLMs stop hallucinating completely!
What other ways can we use it to our advantage?