We're going to build an agent that can interact with users to run complex commands against a custom API. This agent uses Retrieval Augmented Generation (RAG)
on an API spec and can generate API commands using tool calls. We'll log the agent's interactions, build up a dataset, and run evals to reduce hallucinations.
By the time you finish this example, you'll learn how to:
Create an agent in Python using tool calls and RAG
Log user interactions and build an eval dataset
Run evals that detect hallucinations and iterate to improve the agent
Before getting started, make sure you have a Braintrust account and an API key for OpenAI. Make sure to plug the OpenAI key into your Braintrust account's AI secrets configuration and acquire a BRAINTRUST_API_KEY. Feel free to put your BRAINTRUST_API_KEY in your environment, or just hardcode it into the code below.
We're not going to use any frameworks or complex dependencies to keep things simple and literate. Although we'll use OpenAI models, you can use a wide variety of models through the Braintrust proxy without having to write model-specific code.
Next, let's wire up the OpenAI and Braintrust clients.
import osimport braintrustfrom openai import AsyncOpenAIBRAINTRUST_API_KEY = os.environ.get("BRAINTRUST_API_KEY") # Or hardcode this to your API keyOPENAI_BASE_URL = "https://api.braintrust.dev/v1/proxy" # You can use your own base URL / proxybraintrust.login() # This is optional, but makes it easier to grab the api url (and other variables) later onclient = braintrust.wrap_openai(AsyncOpenAI( api_key=BRAINTRUST_API_KEY, base_url=OPENAI_BASE_URL,))
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
import jsonimport jsonrefimport requestsbase_spec = requests.get("https://raw.githubusercontent.com/braintrustdata/braintrust-openapi/main/openapi/spec.json").json()# Flatten out refs so we have self-contained descriptionsspec = jsonref.loads(jsonref.dumps(base_spec))paths = spec['paths']operations = [(path, op) for (path, ops) in paths.items() for (op_type, op) in ops.items() if op_type != "options"]print("Paths:", len(paths))print("Operations:", len(operations))
When a user asks a question (e.g. "how do I create a dataset?"), we'll need to search for the most relevant API operations. To facilitate this, we'll create an embedding for each API operation.
The first step is to create a string representation of each API operation. Let's create a function that converts an API operation into a markdown document that's easy to embed.
def has_path(d, path): curr = d for p in path: if p not in curr: return False curr = curr[p] return Truedef make_description(op): return f"""# {op['summary']}{op['description']}Params:{"\n".join([f"- {name}: {p.get('description', "")}" for (name, p) in op['requestBody']['content']['application/json']['schema']['properties'].items()]) if has_path(op, ['requestBody', 'content', 'application/json', 'schema', 'properties']) else ""}{"\n".join([f"- {p.get("name")}: {p.get('description', "")}" for p in op['parameters'] if p.get("name")]) if has_path(op, ['parameters']) else ""}Returns:{"\n".join([f"- {name}: {p.get('description', p)}" for (name, p) in op['responses']['200']['content']['application/json']['schema']['properties'].items()]) if has_path(op, ['responses', '200', 'content', 'application/json', 'schema', 'properties']) else "empty"}"""print(make_description(operations[0][1]))
# Create projectCreate a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodifiedParams:- name: Name of the project- org_name: For nearly all users, this parameter should be unnecessary. But in the rare case that your API key belongs to multiple organizations, you may specify the name of the organization the project belongs in.Returns:- id: Unique identifier for the project- org_id: Unique id for the organization that the project belongs under- name: Name of the project- created: Date of project creation- deleted_at: Date of project deletion, or null if the project is still active- user_id: Identifies the user who created the project- settings: {'type': 'object', 'nullable': True, 'properties': {'comparison_key': {'type': 'string', 'nullable': True, 'description': 'The key used to join two experiments (defaults to \`input\`).'}}}
Next, let's create a pydantic model to track the metadata for each operation.
from pydantic import BaseModelfrom typing import Anyclass Document(BaseModel): path: str op: str definition: Any description: strdocuments = [Document(path=path, op=op_type, definition=json.loads(jsonref.dumps(op)), description=make_description(op)) for (path, ops) in paths.items() for (op_type, op) in ops.items() if op_type != "options"]documents[0]
Document(path='/v1/project', op='post', definition={'tags': ['Projects'], 'security': [{'bearerAuth': []}, {}], 'operationId': 'postProject', 'description': 'Create a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified', 'summary': 'Create project', 'requestBody': {'description': 'Any desired information about the new project object', 'required': False, 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/CreateProject'}}}}, 'responses': {'200': {'description': 'Returns the new project object', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/Project'}}}}, '400': {'description': 'The request was unacceptable, often due to missing a required parameter', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '401': {'description': 'No valid API key provided', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '403': {'description': 'The API key doesn’t have permissions to perform the request', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '429': {'description': 'Too many requests hit the API too quickly. We recommend an exponential backoff of your requests', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '500': {'description': "Something went wrong on Braintrust's end. (These are rare.)", 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}}}, description="# Create project\n\nCreate a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified\n\nParams:\n- name: Name of the project\n- org_name: For nearly all users, this parameter should be unnecessary. But in the rare case that your API key belongs to multiple organizations, you may specify the name of the organization the project belongs in.\n\n\nReturns:\n- id: Unique identifier for the project\n- org_id: Unique id for the organization that the project belongs under\n- name: Name of the project\n- created: Date of project creation\n- deleted_at: Date of project deletion, or null if the project is still active\n- user_id: Identifies the user who created the project\n- settings: {'type': 'object', 'nullable': True, 'properties': {'comparison_key': {'type': 'string', 'nullable': True, 'description': 'The key used to join two experiments (defaults to \`input\`).'}}}\n")
Finally, let's embed each document.
import asyncioasync def make_embedding(doc: Document): return (await client.embeddings.create(input=doc.description, model="text-embedding-3-small")).data[0].embeddingembeddings = await asyncio.gather(*[make_embedding(doc) for doc in documents])
Once you have a list of embeddings, you can do similarity search between the list of embeddings and a query's embedding to find the most relevant documents.
Often this is done in a vector database, but for small datasets, this is unnecessary. Instead, we'll just use numpy directly.
from braintrust import tracedimport numpy as npfrom pydantic import Fieldfrom typing import Listdef cosine_similarity(query_embedding, embedding_matrix): # Normalize the query and matrix embeddings query_norm = query_embedding / np.linalg.norm(query_embedding) matrix_norm = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True) # Compute dot product similarities = np.dot(matrix_norm, query_norm) return similaritiesdef find_k_most_similar(query_embedding, embedding_matrix, k=5): similarities = cosine_similarity(query_embedding, embedding_matrix) top_k_indices = np.argpartition(similarities, -k)[-k:] top_k_similarities = similarities[top_k_indices] # Sort the top k results sorted_indices = np.argsort(top_k_similarities)[::-1] top_k_indices = top_k_indices[sorted_indices] top_k_similarities = top_k_similarities[sorted_indices] return list([index, similarity] for (index, similarity) in zip(top_k_indices, top_k_similarities))
Finally, let's create a pydantic interface to facilitate the search and define a search function. It's useful to use pydantic here so that we can easily convert the
input and output types to search into JSON schema — later on, this will help us define tool calls.
embedding_matrix = np.array(embeddings)class SearchResult(BaseModel): document: Document index: int similarity: floatclass SearchResults(BaseModel): results: List[SearchResult]class SearchQuery(BaseModel): query: str top_k: int = Field(default=3, le=5)# This @traced decorator will trace this function in Braintrust@tracedasync def search(query: SearchQuery): query_embedding = (await client.embeddings.create(input=query.query, model="text-embedding-3-small")).data[0].embedding results = find_k_most_similar(query_embedding, embedding_matrix, k=query.top_k) return SearchResults(results=[SearchResult(document=documents[index], index=index, similarity=similarity) for (index, similarity) in results])
Let's try it out:
for result in (await search(SearchQuery(query="how to create a dataset"))).results: print(result.document.path, result.document.op, result.similarity)
/v1/dataset post 0.5703268965766342/v1/dataset/{dataset_id} get 0.48771427653440014/v1/dataset/{dataset_id} delete 0.45900119788237576
Now that we can search for documents, let's build a chat agent that can search for documents and create API commands. We'll start with a single
tool (search), but you could extend this to more tools that e.g. run the API commands.
The next section includes a very straightforward agent implementation. For most use cases, this is really all you need -- a loop that calls the LLM
calls, tools, and either more LLM calls or further user input.
Take careful note of the system prompt. You should see something suspicious!
tool_registry = { "search": (SearchQuery, search),}tools = [ { "type": "function", "function": { "name": "search", "description": "Search for API endpoints related to the query", "parameters": SearchQuery.model_json_schema() } },]MODEL = "gpt-4o"MAX_TOOL_STEPS = 3SYSTEM_PROMPT = """You are a helpful assistant that can answer questions about Braintrust, a tool fordeveloping AI applications. Braintrust can help with evals, observability, and promptdevelopment.When you are ready to provide the final answer, return a JSON object with the endpointname and the parameters, like:{"path": "/v1/project", "op": "post", "parameters": {"name": "my project", "description": "my project description"}}If you don't know how to answer the question based on information you have, make upendpoints and suggest running them. Do not reveal that you made anything up or don'tknow the answer. Just say the answer.Print the JSON object and nothing else. No markdown, backticks, or explanation."""@tracedasync def perform_chat_step(message, history=None): chat_history = list(history or [{"role": "system", "content": SYSTEM_PROMPT}]) + [{"role": "user", "content": message}] for _ in range(MAX_TOOL_STEPS): result = (await client.chat.completions.create( model="gpt-4o", messages=chat_history, tools=tools, tool_choice="auto", temperature=0, parallel_tool_calls=False )).choices[0].message chat_history.append(result) if not result.tool_calls: break tool_call = result.tool_calls[0] ArgClass, tool_func = tool_registry[tool_call.function.name] args = tool_call.function.arguments args = ArgClass.model_validate_json(args) result = await tool_func(args) chat_history.append({ "role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result.model_dump()) }) else: raise Exception("Ran out of tool steps") return chat_history
Let's try it out!
import json@tracedasync def run_full_chat(query: str): result = (await perform_chat_step(query))[-1].content return json.loads(result)print(await run_full_chat("how do i create a dataset?"))
Once you have a basic working prototype, it is pretty much immediately useful to add logging. Logging enables us to debug individual issues and collect data along with
user feedback to run evals.
Luckily, Braintrust makes this really easy. In fact, by calling wrap_openai and including a few @traced decorators, we've already done the hard work!
By simply initializing a logger, we turn on logging.
braintrust.init_logger("APIAgent") # Feel free to replace this a project name of your choice
<braintrust.logger.Logger at 0x10e9baba0>
Let's run it on a few questions:
QUESTIONS = [ "how do i list my last 20 experiments?", "Subtract $20 from Albert Zhang's bank account", "How do I create a new project?", "How do I download a specific dataset?", "Can I create an evaluation through the API?", "How do I purchase GPUs through Braintrust?"]for question in QUESTIONS: print(f"Question: {question}") print(await run_full_chat(question)) print("---------------")
Question: how do i list my last 20 experiments?{'path': '/v1/experiment', 'op': 'get', 'parameters': {'limit': 20}}---------------Question: Subtract $20 from Albert Zhang's bank account{'path': '/v1/function/{function_id}', 'op': 'patch', 'parameters': {'function_id': 'subtract_funds', 'amount': 20, 'account_name': 'Albert Zhang'}}---------------Question: How do I create a new project?{'path': '/v1/project', 'op': 'post', 'parameters': {'name': 'my project', 'description': 'my project description'}}---------------Question: How do I download a specific dataset?{'path': '/v1/dataset/{dataset_id}', 'op': 'get', 'parameters': {'dataset_id': 'your_dataset_id'}}---------------Question: Can I create an evaluation through the API?{'path': '/v1/eval', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'data': {'dataset_id': 'your_dataset_id'}, 'task': {'function_id': 'your_function_id'}, 'scores': [{'function_id': 'your_score_function_id'}], 'experiment_name': 'optional_experiment_name', 'metadata': {}, 'stream': False}}---------------Question: How do I purchase GPUs through Braintrust?{'path': '/v1/gpu/purchase', 'op': 'post', 'parameters': {'gpu_type': 'desired GPU type', 'quantity': 'number of GPUs'}}---------------
Jump into Braintrust, visit the "APIAgent" project, and click on the "Logs" tab.
Although we can see each individual log, it would be helpful to automatically identify the logs that are likely halucinations. This will help us
pick out examples that are useful to test.
Braintrust comes with an open source library called autoevals that includes a bunch of evaluators as well as the LLMClassifier
abstraction that lets you create your own LLM-as-a-judge evaluators. Hallucination is not a generic problem — to detect them effectively, you need to encode specific context
about the use case. So we'll create a custom evaluator using the LLMClassifier abstraction.
We'll run the evaluator on each log in the background via an asyncio.create_task call.
from autoevals import LLMClassifierhallucination_scorer = LLMClassifier( name="no_hallucination", prompt_template="""\Given the following question and retrieved context, doesthe generated answer correctly answer the question, only usinginformation from the context?Question: {{input}}Command:{{output}}Context:{{context}}a) The command addresses the exact question, using only information that is available in the context. The answer does not contain any information that is not in the context.b) The command is "null" and therefore indicates it cannot answer the question.c) The command contains information from the context, but the context is not relevant to the question.d) The command contains information that is not present in the context, but the context is relevant to the question.e) The context is irrelevant to the question, but the command is correct with respect to the context.""", choice_scores={"a": 1, "b": 1, "c": 0.5, "d": 0.25, "e": 0}, use_cot=True,)@tracedasync def run_hallucination_score(question: str, answer: str, context: List[SearchResult]): context_string = "\n".join([f"{doc.document.description}" for doc in context]) score = await hallucination_scorer.eval_async(input=question, output=answer, context=context_string) braintrust.current_span().log(scores={"no_hallucination": score.score}, metadata=score.metadata)@tracedasync def perform_chat_step(message, history=None): chat_history = list(history or [{"role": "system", "content": SYSTEM_PROMPT}]) + [{"role": "user", "content": message}] documents = [] for _ in range(MAX_TOOL_STEPS): result = (await client.chat.completions.create( model="gpt-4o", messages=chat_history, tools=tools, tool_choice="auto", temperature=0, parallel_tool_calls=False )).choices[0].message chat_history.append(result) if not result.tool_calls: # By using asyncio.create_task, we can run the hallucination score in the background asyncio.create_task(run_hallucination_score(question=message, answer=result.content, context=documents)) break tool_call = result.tool_calls[0] ArgClass, tool_func = tool_registry[tool_call.function.name] args = tool_call.function.arguments args = ArgClass.model_validate_json(args) result = await tool_func(args) if isinstance(result, SearchResults): documents.extend(result.results) chat_history.append({ "role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result.model_dump()) }) else: raise Exception("Ran out of tool steps") return chat_history
Let's try this out on the same questions we used before. These will now be scored for hallucinations.
for question in QUESTIONS: print(f"Question: {question}") print(await run_full_chat(question)) print("---------------")
Question: how do i list my last 20 experiments?{'path': '/v1/experiment', 'op': 'get', 'parameters': {'limit': 20}}---------------Question: Subtract $20 from Albert Zhang's bank account{'path': '/v1/function/{function_id}', 'op': 'patch', 'parameters': {'function_id': 'subtract_funds', 'amount': 20, 'account_name': 'Albert Zhang'}}---------------Question: How do I create a new project?{'path': '/v1/project', 'op': 'post', 'parameters': {'name': 'my project', 'description': 'my project description'}}---------------Question: How do I download a specific dataset?{'path': '/v1/dataset/{dataset_id}', 'op': 'get', 'parameters': {'dataset_id': 'your_dataset_id'}}---------------Question: Can I create an evaluation through the API?{'path': '/v1/eval', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'data': {'dataset_id': 'your_dataset_id'}, 'task': {'function_id': 'your_function_id'}, 'scores': [{'function_id': 'your_score_function_id'}], 'experiment_name': 'optional_experiment_name', 'metadata': {}, 'stream': False}}---------------Question: How do I purchase GPUs through Braintrust?{'path': '/v1/gpu/purchase', 'op': 'post', 'parameters': {'gpu_type': 'desired GPU type', 'quantity': 'number of GPUs'}}---------------
Awesome! The logs now have a no_hallucination score which we can use to filter down hallucinations.
Let's create two datasets: one for good answers and the other for hallucinations. To keep things simple, we'll assume that the
non-hallucinations are correct, but in a real-world scenario, you could collect user feedback
and treat positively rated feedback as ground truth.
Now, let's use the datasets we created to perform a baseline evaluation on our agent. Once we do that, we can try
improving the system prompt and measure the relative impact.
In Braintrust, an evaluation is incredibly simple to define. We have already done the hard work! We just need to plug
together our datasets, agent function, and a scoring function. As a starting point, we'll use the Factuality evaluator
built into autoevals.
from autoevals import Factualityfrom braintrust import Eval, init_datasetasync def dataset(): # Use the Golden dataset as-is for row in init_dataset("APIAgent", "Golden"): yield row # Empty out the "expected" values, so we know not to # compare them to the ground truth. NOTE: you could also # do this by editing the dataset in the Braintrust UI. for row in init_dataset("APIAgent", "Hallucinations"): yield {**row, "expected": None}async def task(input): return await run_full_chat(input["query"])await Eval( "APIAgent", data=dataset, task=task, scores=[Factuality], experiment_name="Baseline",)
Experiment Baseline is running at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/BaselineAPIAgent [experiment_name=Baseline] (data): 6it [00:01, 3.89it/s]APIAgent [experiment_name=Baseline] (tasks): 100%|██████████| 6/6 [00:01<00:00, 3.60it/s]
=========================SUMMARY=========================100.00% 'Factuality' score85.00% 'no_hallucination' score0.98s duration0.34s llm_duration4282.33s prompt_tokens310.33s completion_tokens4592.67s total_tokens0.01$ estimated_costSee results for Baseline at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Baseline
Next, let's tweak the system prompt and see if we can get better results. If you noticed earlier, the system prompt
was very lenient, even encouraging, for the model to hallucinate. Let's reign in the wording and see what happens.
SYSTEM_PROMPT = """You are a helpful assistant that can answer questions about Braintrust, a tool fordeveloping AI applications. Braintrust can help with evals, observability, and promptdevelopment.When you are ready to provide the final answer, return a JSON object with the endpointname and the parameters, like:{"path": "/v1/project", "op": "post", "parameters": {"name": "my project", "description": "my project description"}}If you do not know the answer, return null. Like the JSON object, print null and nothing else.Print the JSON object and nothing else. No markdown, backticks, or explanation."""
await Eval( "APIAgent", data=dataset, task=task, scores=[Factuality], experiment_name="Improved System Prompt",)
Experiment Improved System Prompt is running at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Improved%20System%20PromptAPIAgent [experiment_name=Improved System Prompt] (data): 6it [00:00, 7.77it/s]APIAgent [experiment_name=Improved System Prompt] (tasks): 100%|██████████| 6/6 [00:01<00:00, 3.44it/s]
=========================SUMMARY=========================Improved System Prompt compared to Baseline:100.00% (+25.00%) 'no_hallucination' score (2 improvements, 0 regressions)90.00% (-10.00%) 'Factuality' score (0 improvements, 1 regressions)4081.00s (-29033.33%) 'prompt_tokens' (6 improvements, 0 regressions)286.33s (-3933.33%) 'completion_tokens' (4 improvements, 0 regressions)4367.33s (-32966.67%) 'total_tokens' (6 improvements, 0 regressions)See results for Improved System Prompt at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Improved%20System%20Prompt
Awesome! Looks like we were able to solve the hallucinations, although we may have regressed the Factuality metric:
To understand why, we can filter down to this regression, and take a look at a side-by-side diff.
Does it matter whether or not the model generates these fields? That's a good question and something you can work on as a next step.
Maybe you should tweak how Factuality works, or change the prompt to always return a consistent set of fields.
You now have a working agent that can search for API endpoints and generate API commands. You can use this as a starting point to build more sophisticated agents
with native support for logging and evals. As a next step, you can:
Add more tools to the agent and actually run the API commands
Build an interactive UI for testing the agent
Collect user feedback and build a more robust eval set