In this example, we'll build an app that automatically generates HTML components, evaluates them, and captures user feedback. We'll use the feedback and evaluations to build up a dataset
that we'll use as a basis for further improvements.
We'll start by using a very simple prompt to generate HTML components using gpt-3.5-turbo.
First, we'll initialize an OpenAI client and wrap it with Braintrust's wrapOpenAI helper. This is a no-op until we start using
the client within code that is instrumented by Braintrust.
import { OpenAI } from "openai";
import { wrapOpenAI } from "braintrust";

const openai = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.OPENAI_API_KEY || "Your OPENAI_API_KEY",
  })
);
This code generates a basic prompt:
import { ChatCompletionMessageParam } from "openai/resources";

function generateMessages(input: string): ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content: `You are a skilled design engineer
who can convert ambiguously worded ideas into beautiful, crisp HTML and CSS.
Your designs value simplicity, conciseness, clarity, and functionality over
complexity.

You generate pure HTML with inline CSS, so that your designs can be rendered
directly as plain HTML. Only generate components, not full HTML pages. Do not
create background colors.

Users will send you a description of a design, and you must reply with HTML,
and nothing else. Your reply will be directly copied and rendered into a browser,
so do not include any text. If you would like to explain your reasoning, feel free
to do so in HTML comments.`,
    },
    {
      role: "user",
      content: input,
    },
  ];
}

JSON.stringify(
  generateMessages("A login form for a B2B SaaS product."),
  null,
  2
);

[
  {
    "role": "system",
    "content": "You are a skilled design engineer\nwho can convert ambiguously worded ideas into beautiful, crisp HTML and CSS.\nYour designs value simplicity, conciseness, clarity, and functionality over\ncomplexity.\n\nYou generate pure HTML with inline CSS, so that your designs can be rendered\ndirectly as plain HTML. Only generate components, not full HTML pages. Do not\ncreate background colors.\n\nUsers will send you a description of a design, and you must reply with HTML,\nand nothing else. Your reply will be directly copied and rendered into a browser,\nso do not include any text. If you would like to explain your reasoning, feel free\nto do so in HTML comments."
  },
  {
    "role": "user",
    "content": "A login form for a B2B SaaS product."
  }
]
Now, let's run this using gpt-3.5-turbo. We'll also do two things that help us log and evaluate this function later:
- Wrap the execution in a traced call, which enables Braintrust to log the inputs and outputs of the function when we run it in production or in evals.
- Make its signature accept a single input value, which is what Braintrust's Eval function expects.
It looks like in a few of these examples, the model is generating a full HTML page, instead of a component as we requested. This is something we can evaluate, to ensure that it does not happen!
const containsHTML = (s) => /<(html|body)>/i.test(s);

containsHTML(
  await generateComponent(
    "Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode."
  )
);
true
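One caveat with this check: the pattern only matches `<html>` and `<body>` written without attributes, so a page that opens with `<html lang="en">` would slip through. The following is a minimal sketch of a more tolerant variant. The name `containsFullPage` and the extra `<!DOCTYPE` test are assumptions made for illustration; they are not part of the original example.

```typescript
// Hypothetical, more tolerant variant of containsHTML.
// It also matches <html ...> / <body ...> tags that carry attributes,
// and a <!DOCTYPE ...> declaration, all case-insensitively.
function containsFullPage(s: string): boolean {
  return /<!doctype\b|<(html|body)(\s[^>]*)?>/i.test(s);
}

// Illustrative inputs (made up for this sketch, not model outputs).
const pageWithAttributes = '<html lang="en"><body></body></html>';
const bareComponent = '<div style="padding: 8px"><form></form></div>';
const doctypeOnly = "<!DOCTYPE html><div></div>";

console.log(containsFullPage(pageWithAttributes)); // true
console.log(containsFullPage(bareComponent)); // false
console.log(containsFullPage(doctypeOnly)); // true
```

Like the original check, this is a heuristic: it would also fire on an HTML comment that happens to mention `<body>`, so treat it as a starting point rather than a full parser.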
Now, let's update our function to compute this score. Let's also keep track of requests and their ids, so that we can provide user feedback. Normally you would store these in a database, but for demo purposes, a global dictionary should suffice.
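import { traced } from "braintrust";

// Normally you would store these in a database, but for this demo we'll just use a global variable.
let requests: Record<string, string> = {};

async function generateComponent(input: string) {
  return traced(
    async (span) => {
      const response = await openai.chat.completions.create({
        model: "gpt-3.5-turbo",
        messages: generateMessages(input),
        seed: 101,
      });
      // `content` can be null, so fall back to an empty string.
      const output = response.choices[0].message.content ?? "";
      // Remember the span id so that feedback can be attached to this request later.
      requests[input] = span.id;
      span.log({
        input,
        output,
        // 1 if the output is a bare component, 0 if it contains <html> or <body>.
        scores: { isComponent: containsHTML(output) ? 0 : 1 },
      });
      return output;
    },
    {
      name: "generateComponent",
    }
  );
}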
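Keying the global dictionary by the input string means that sending the same prompt twice overwrites the earlier span id. For a demo that is fine, but a slightly more careful in-memory store is easy to sketch. Everything here (`RequestStore`, `record`, `latest`, `all`, and the `span-1`/`span-2` ids) is hypothetical and only illustrates the idea; a real application would still use a database.

```typescript
// Hypothetical in-memory store that keeps every span id seen for an input,
// instead of overwriting the previous one.
class RequestStore {
  private ids = new Map<string, string[]>();

  // Remember a span id for the given input, keeping earlier ids too.
  record(input: string, spanId: string): void {
    const existing = this.ids.get(input) ?? [];
    existing.push(spanId);
    this.ids.set(input, existing);
  }

  // Most recent span id for an input, or undefined if none was recorded.
  latest(input: string): string | undefined {
    const all = this.ids.get(input);
    return all === undefined ? undefined : all[all.length - 1];
  }

  // Copy of all span ids recorded for an input, oldest first.
  all(input: string): string[] {
    return [...(this.ids.get(input) ?? [])];
  }
}

const store = new RequestStore();
store.record("A login form for a B2B SaaS product.", "span-1");
store.record("A login form for a B2B SaaS product.", "span-2");

console.log(store.latest("A login form for a B2B SaaS product.")); // span-2
console.log(store.all("A login form for a B2B SaaS product.").length); // 2
```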
To enable logging to Braintrust, we just need to initialize a logger. By default, a logger is automatically marked as the current, global logger, and once initialized will be picked up by traced.
Now, we'll run the generateComponent function on a few examples, and see what the results look like in Braintrust.
const inputs = [
  "A login form for a B2B SaaS product.",
  "Create a profile page for a social network.",
  "Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode.",
];

for (const input of inputs) {
  await generateComponent(input);
}

console.log(`Logged ${inputs.length} requests to Braintrust.`);
Let's also track user ratings for these components. Separate from whether or not they're formatted as HTML, it'll be useful to track whether users like the design.
Once you create a human review score, you can evaluate results directly in the Braintrust UI, or capture end-user feedback. Here, we'll pretend to capture end-user feedback. Personally, I liked the login form and logs viewer, but not the profile page. Let's record feedback accordingly.
// Along with scores, you can optionally log user feedback as comments, for additional color.
logger.logFeedback({
  id: requests["A login form for a B2B SaaS product."],
  scores: { "User preference": 1 },
  comment: "Clean, simple",
});

logger.logFeedback({
  id: requests["Create a profile page for a social network."],
  scores: { "User preference": 0 },
});

logger.logFeedback({
  id: requests[
    "Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode."
  ],
  scores: { "User preference": 1 },
  comment:
    "No frills! Would have been nice to have borders around the entries.",
});
As users provide feedback, you'll see the updates they make in each log entry.
Now that we have a dataset, let's evaluate the generateComponent function on it and track its isComponent score. We'll use the Eval function, which takes a dataset, a task function, and a list of scorers, and runs the task on each example in the dataset.
import { Eval, initDataset } from "braintrust";

await Eval("Component generator", {
  data: async () => {
    const dataset = initDataset("Component generator", {
      dataset: "Interesting cases",
    });
    const records = [];
    for await (const { input } of dataset.fetch()) {
      records.push({ input });
    }
    return records;
  },
  task: generateComponent,
  // We do not need to add any additional scores, because our
  // generateComponent() function already computes `isComponent`
  scores: [],
});
Once the eval runs, you'll see a summary which includes a link to the experiment. As expected, only one of the three outputs is a bare component (the others contain <html> or <body> tags), so the isComponent score is 33.3%. Let's also label user preference for this experiment, so we can track aesthetic taste manually. For simplicity's sake, we'll use the same labeling as before.
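To make the 33.3% figure concrete, here is a small sketch of how per-record isComponent scores roll up into one number. It assumes the summary reports a plain mean over records; the helpers `isComponentScore` and `meanScore` and the three sample outputs are made up for illustration and are not the actual model outputs.

```typescript
// Same rule as in generateComponent: 0 if the output looks like a full page, 1 otherwise.
function isComponentScore(output: string): number {
  return /<(html|body)>/i.test(output) ? 0 : 1;
}

// Plain arithmetic mean; returns 0 for an empty list to avoid dividing by zero.
function meanScore(scores: number[]): number {
  if (scores.length === 0) return 0;
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Illustrative outputs: two full pages and one bare component.
const sampleOutputs: string[] = [
  "<html><body><form>Login</form></body></html>",
  "<body><div>Profile</div></body>",
  "<div>Logs</div>",
];

const sampleScores = sampleOutputs.map(isComponentScore);
const percent = (meanScore(sampleScores) * 100).toFixed(1) + "%";

console.log(sampleScores); // [ 0, 0, 1 ]
console.log(percent); // 33.3%
```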
Next, let's try to tweak the prompt to stop rendering full HTML pages.
function generateMessages(input: string): ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content: `You are a skilled design engineer
who can convert ambiguously worded ideas into beautiful, crisp HTML and CSS.
Your designs value simplicity, conciseness, clarity, and functionality over
complexity.

You generate pure HTML with inline CSS, so that your designs can be rendered
directly as plain HTML. Only generate components, not full HTML pages. If you
need to add CSS, you can use the "style" property of an HTML tag. You cannot use
global CSS in a <style> tag.

Users will send you a description of a design, and you must reply with HTML,
and nothing else. Your reply will be directly copied and rendered into a browser,
so do not include any text. If you would like to explain your reasoning, feel free
to do so in HTML comments.`,
    },
    {
      role: "user",
      content: input,
    },
  ];
}

JSON.stringify(
  generateMessages("A login form for a B2B SaaS product."),
  null,
  2
);

[
  {
    "role": "system",
    "content": "You are a skilled design engineer\nwho can convert ambiguously worded ideas into beautiful, crisp HTML and CSS.\nYour designs value simplicity, conciseness, clarity, and functionality over\ncomplexity.\n\nYou generate pure HTML with inline CSS, so that your designs can be rendered\ndirectly as plain HTML. Only generate components, not full HTML pages. If you\nneed to add CSS, you can use the \"style\" property of an HTML tag. You cannot use\nglobal CSS in a <style> tag.\n\nUsers will send you a description of a design, and you must reply with HTML,\nand nothing else. Your reply will be directly copied and rendered into a browser,\nso do not include any text. If you would like to explain your reasoning, feel free\nto do so in HTML comments."
  },
  {
    "role": "user",
    "content": "A login form for a B2B SaaS product."
  }
]
await displayComponent(
  "Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode."
);
Now that we've run another experiment, a good next step would be to rate the new components and make sure we did not suffer a serious aesthetic regression. You can also collect more user examples, add them to the dataset, and re-evaluate to better assess how well your application works. Happy evaluating!