Hamel dives deep into how LLM frameworks like langchain, instructor, and guidance perform tasks like formatting the response as valid JSON output. He intercepts the API calls these Python libraries make to shed some light on how many calls (to OpenAI’s GPT services) they issue and what prompts they use. I’ve always been skeptical of the usefulness of many of the LLM “wrapper” libraries, especially for larger and more serious projects; they are fine for quick prototypes, though.
Hamel’s blog post makes it clear that you should not blindly trust any of the LLM libraries: some of them are just doing some (stupid) prompt “engineering” behind the scenes to give you good-looking output, and their performance is pretty much hit and miss (unless you’re able to view their prompts and verify they’re not doing anything silly).
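If you want to do this kind of inspection yourself, the general recipe (not necessarily Hamel’s exact setup) is to start mitmproxy or mitmweb locally and route the Python process’s HTTPS traffic through it. A minimal sketch, assuming mitmproxy is already running on its default port 8080 and its CA certificate sits at the default location:

import os

# Send HTTP(S) traffic from requests/httpx (and therefore openai, langchain, etc.)
# through the local mitmproxy instance. Assumes `mitmweb` is running on port 8080.
os.environ["HTTP_PROXY"] = "http://localhost:8080"
os.environ["HTTPS_PROXY"] = "http://localhost:8080"

# Trust mitmproxy's self-signed CA so the intercepted TLS connections are accepted.
# This is mitmproxy's default certificate path; adjust it if your setup differs.
cert = os.path.expanduser("~/.mitmproxy/mitmproxy-ca-cert.pem")
os.environ["SSL_CERT_FILE"] = cert       # honored by httpx (used by the openai v1 client)
os.environ["REQUESTS_CA_BUNDLE"] = cert  # honored by requests

# Any library imported and used after this point will show up in the mitmproxy UI.

Set these before the libraries create their HTTP clients, then every prompt and every extra round trip becomes visible in the proxy’s request log.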
An example from his post, investigating the API calls for guardrails:
from pydantic import BaseModel, Field
from guardrails import Guard
import openai

class Pet(BaseModel):
    pet_type: str = Field(description="Species of pet")
    name: str = Field(description="a unique pet name")

prompt = """
What kind of pet should I get and what should I name it?
${gr.complete_json_suffix_v2}"""

guard = Guard.from_pydantic(output_class=Pet, prompt=prompt)

validated_output, *rest = guard(
    llm_api=openai.completions.create,
    engine="gpt-3.5-turbo-instruct"
)

print(f"{validated_output}")

## {
##   "pet_type": "dog",
##   "name": "Buddy
Not a valid JSON output!
What is happening here? How is this structured output and validation working? Looking at the mitmproxy UI, I can see that the above code resulted in two LLM API calls, the first one with this prompt:
Followed by another call with this prompt:
Woof. That’s a whole lot of ceremony to get structured output!
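For contrast, the “no wrapper” version of the same task is a single, fully visible API call plus pydantic validation. This is my own sketch, not code from Hamel’s post; it assumes pydantic v2, the openai v1 client, and a chat model that supports JSON mode (e.g. gpt-3.5-turbo-1106 or later):

import json

from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError

class Pet(BaseModel):
    pet_type: str = Field(description="Species of pet")
    name: str = Field(description="a unique pet name")

client = OpenAI()

# One explicit API call; the prompt is exactly what you see here, nothing injected behind the scenes.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        {"role": "system", "content": "Answer with a JSON object with the keys 'pet_type' and 'name'."},
        {"role": "user", "content": "What kind of pet should I get and what should I name it?"},
    ],
)

try:
    pet = Pet.model_validate(json.loads(response.choices[0].message.content))
    print(pet)
except (json.JSONDecodeError, ValidationError) as err:
    print(f"Invalid output from the model: {err}")

You still have to handle the failure case yourself, but at least the prompt and the number of API calls are exactly what you wrote.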