Mabl increasingly relies on generative AI large language models (LLMs) behind the scenes to make test automation smarter and more efficient, but there are times when you want to integrate with an LLM directly as part of your end-to-end tests. Whether you want to test your own models, call a specific publicly available model to generate or analyze data in your test, or run some automated benchmarks, mabl makes it easy to integrate with services like Google Gemini, OpenAI ChatGPT, and Anthropic Claude with minimal configuration. Here, we’ll run through three examples where we integrate directly with these providers. Each example pairs one type of test with one model provider for simplicity, but the models, prompt types, and use cases are largely interchangeable.
Browser-based Image Validation with OpenAI ChatGPT
In this scenario, I have an image generation service and I want to automate the process of validating that the images are appropriate for my input. To accomplish this, I create a simple browser test for the front-end app and add an API step that passes the input and the output to ChatGPT for validation.
Looking at the test steps, you can see that, after logging into the app, I set a variable for the input (image_prompt) and enter the value of that variable into the image generation field. Using a variable here makes it easier for me to change the value in the future, test more scenarios with datatables, or even set it based on another API call (see below for more on that!).
I also capture the URL of the generated image as a variable - image_url. Using ChatGPT in this example is handy because it accepts image URLs, unlike Gemini and Claude, which both require you to send Base64-encoded images as part of the API call.
Finally, I pass both the image_prompt and image_url to the OpenAI API, which decides whether to pass or fail the test based on its assessment of the prompt and image. Let’s take a look at that API call.
As you can see, this is a POST call to the https://api.openai.com/v1/chat/completions API. You’ll need to set up an OpenAI platform account to access this API. You’ll also need to include a Content-Type: application/json header and pass your OpenAI API key as a Bearer token in the Authorization header.
Here’s my full request body. I won’t go into detail on the structure of the request, since OpenAI does a great job of it in their API reference docs, but we should talk about the content of the prompt. You’ll note that we’re taking advantage of the multi-modal capabilities of GPT-4 Turbo by passing both text and an image URL. The text provides the prompt (the image_prompt variable) and asks the model to evaluate whether the image (provided via image_url in the second content block) is relevant.
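The original screenshot isn’t reproduced here, but a minimal sketch of that request body looks something like this. The {{@...}} variable references, the exact prompt wording, and the max_tokens value are illustrative rather than copied from the original step:

```json
{
  "model": "gpt-4-turbo",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Is the following image relevant to this prompt? Reply TRUE or FALSE on the first line, then explain. Prompt: {{@image_prompt}}"
        },
        {
          "type": "image_url",
          "image_url": { "url": "{{@image_url}}" }
        }
      ]
    }
  ],
  "max_tokens": 300
}
```

Asking the model to lead its answer with TRUE or FALSE is what makes the downstream assertion so simple.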
Finally, we just need an assertion to determine whether to pass or fail the test. In this case, I use a simple assertion in the API step that passes the test if the response contains “TRUE”, but I could have included more sophisticated logic, both within the API step and via subsequent assertion steps based on data captured in variables defined in the API step.
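For reference, the assertion runs against a standard chat completions response, which looks roughly like this (trimmed, with an illustrative content string):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "gpt-4-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "TRUE\n\nThe image shows a black cat in mid-jump over a body of water..."
      },
      "finish_reason": "stop"
    }
  ]
}
```

If you want finer-grained checks, you could capture choices[0].message.content into a variable in the API step and assert on just that value in a later step.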
Bringing it all together, I first pass the prompt, “Black cat jumping over a river” and a screenshot from Adobe Express, which I’m using for image generation. You can see the screenshot and the response here:
"TRUE\n\nThe image shows a black cat in mid-jump over a body of water, which appears to be a river or stream. This directly corresponds to the description of a black cat jumping over a river, making the image highly relevant for someone looking for this specific scenario."
And that’s a wrap! Now we have a useful model for including LLM API calls in browser tests. Let’s take a look at a slightly different example.
Generating Data for Mobile Tests with Google Gemini
In the last example, we used an LLM to validate that a relevant image was generated by our web app based on a known input, but we can’t really predict what people use as the input. If only there was some type of AI that could generate input prompts for us…oh wait, this is another great use case for an LLM!
In this case, we use mabl’s native mobile testing capability, and we once again rely on the API request (API step) feature to interact with the generative AI APIs:
The structure of the test is similar to the browser example above, but this time we use the LLM to generate the input value. First, we declare the request in a variable with a value like, “Please give me a prompt that showcases the image generation power of large language models. Provide the prompt only, with no explanation,” and use that as the body of our API call to Gemini. We capture the response and use it as our prompt for the image generation service. Finally, we reuse the ChatGPT call from the first example to validate that the generated image is relevant to the generated prompt.
Let’s take a look at the new part - the API call to generate the image prompt using Gemini.
The call is straightforward, and most of the detail is in the URL. First, note that I’m using the Google AI Studio deployment of the models (generativelanguage.googleapis.com) because I couldn’t figure out a way to use a plain API key with Vertex AI. You’ll also notice that I’m using gemini-pro as my model because we’re only asking it to work with text; if I were asking it to analyze the image as well, I’d want the multi-modal gemini-pro-vision. Finally, I provide my API key for Google AI Studio via the link above.
Otherwise, beyond the Content-Type: application/json header, I have a really simple body that sends along the prompt-generation request. I also set the temperature to 0.8 so that I get some real variation between test runs. Finally, we have an assertion that the response is valid.
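Since the screenshot of the step isn’t reproduced here, a rough sketch of that body, assuming the request text lives in a variable I’ll call prompt_request (the name is mine), looks like this:

```json
{
  "contents": [
    {
      "parts": [
        { "text": "{{@prompt_request}}" }
      ]
    }
  ],
  "generationConfig": {
    "temperature": 0.8
  }
}
```

This is POSTed to https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent?key=YOUR_API_KEY, and the generated prompt comes back under candidates[0].content.parts[0].text, which is the value we capture for the image generation step.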
If you’re curious, Gemini generated a prompt something like this on the first run, “Imagine a vast, swirling vortex of iridescent colors and ethereal forms. In its center, a kaleidoscope of abstract patterns dances and transforms, creating a symphony of visuals that defy description.”
And Adobe's generative AI returns this – seems pretty relevant to me!
Highlighting the impact of a relatively high temperature, Gemini generated this prompt on the second run, “Generate an image of a golden retriever floating in a pool with a rubber duck on its head, wearing sunglasses, and sipping a margarita.”
And here's Adobe's (very suitable) generated image.
Voilà! For future runs, we could pass these prompts and images to the flow in the first example, and we’d have an end-to-end test that uses generative AI to generate a prompt, generate an image based on that prompt, and validate that the generated image is relevant to the prompt!
Full disclosure: In a real-world use case, I probably wouldn’t make a new call to Gemini every time I want a prompt for image generation. Instead, I would make a call manually to generate lots of interesting prompts in CSV format and upload that to mabl to use as part of data-driven tests.
Benchmarking Anthropic Claude’s Models with API Tests
So far, we’ve demonstrated how you can easily integrate calls to OpenAI ChatGPT and Google Gemini in end-to-end browser and mobile tests via API steps in mabl. Hopefully, you can find use for this in generating input data, validating non-deterministic behavior in your AI-driven systems, and more. But what if you’re testing or benchmarking the APIs themselves, or you want to integrate these models in API-driven transactions or flows? In this case, you may want to use mabl’s more comprehensive API testing feature rather than API steps. Let’s explore that, and we’ll use Anthropic Claude as our target.
For this exercise, my goal is to understand the performance and accuracy tradeoffs between the three Claude models: Opus, Sonnet, and Haiku.
I want to validate the same prompt for each of the three models, so I create a datatable with the three models. Next, I create a simple API test that will use this datatable to validate three scenarios in parallel, one for each of the Claude models (driven by the llm variable).
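The datatable itself is just three scenarios mapping the llm variable to a Claude model ID, something like the sketch below (the exact model IDs were current at the time of writing; check Anthropic’s model list for the latest):

| Scenario | llm |
| --- | --- |
| Opus | claude-3-opus-20240229 |
| Sonnet | claude-3-sonnet-20240229 |
| Haiku | claude-3-haiku-20240307 |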
Here are the Variables for the API call to Claude:
| Variable | Value |
| --- | --- |
| api.url | https://api.anthropic.com/v1/messages |
| llm | The model name, driven by the datatable above |
| imgBase64 | [A large string: the image in Base64-encoded format] |
| validationPrompt | [The prompt for Claude to validate the image] |
Now let's look at the request body.
We start by specifying the model, which will vary based on the datatable scenario.
The message simply passes the image and the text prompt.
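Here’s a sketch of what that body looks like, using the variables from the table above; the max_tokens value and the image/png media type are assumptions, not copied from the original step:

```json
{
  "model": "{{@llm}}",
  "max_tokens": 1024,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": "{{@imgBase64}}"
          }
        },
        {
          "type": "text",
          "text": "{{@validationPrompt}}"
        }
      ]
    }
  ]
}
```

One difference from the OpenAI call: the Messages API authenticates with an x-api-key header (plus an anthropic-version header) rather than a Bearer token.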
Again, we won’t go into detail on the other parameters here, but they’re explained thoroughly in Anthropic’s API documentation.
For our benchmarking use case, we want to understand the performance and reliability differences between these models. At a glance, I can review sample runs in mabl’s test results dashboard.
For deeper analysis, I’ll want to export the data to an analytics tool using mabl’s BigQuery integration or Results API. With this data, you can answer questions like:
- Will any of Claude’s models meet my needs?
- Is the fast and affordable Haiku sufficient for my use cases?
- Do I need the more expensive and (potentially) slower Opus to deliver the reliability that I need?
- Or is Sonnet the perfect blend of accuracy, price, and performance for me?
- How do the Claude models compare to Gemini and ChatGPT for my use cases?
Try it for Yourself
Hopefully this gives you a sense of why and how you might integrate popular generative AI APIs into your end-to-end tests using features that are fully available in mabl. You can sign up for a free trial of mabl to try it for yourself (of course, you’ll also need accounts with the target generative AI providers). And keep an eye on mabl’s release notes; we’re constantly enhancing our platform to make it easier for you to integrate AI into your testing.