
Contributed by

Neontribe is a digital agency working in user research, software development and design.

Use this Guide if you want to test generative AI to understand how you might be able to use it safely for your charity. This Guide will go through the process of setting up and carrying out an experiment, using CAST’s experiment canvas.

Steps to evaluating generative AI text tools using an experimental framework

Tools like ChatGPT, Claude, and Google Gemini are known as generative artificial intelligence (AI) but they’re not intelligent in the sense that a human being is intelligent. They don’t think or understand the way we do.

They use algorithms known as large language models (LLMs). Once a generative AI text program has been trained on a vast amount of writing, it can identify common patterns, search for information, and combine these to generate new text. It’s like a very sophisticated version of the predictive text on our phones.

There are lots of things about generative AI text tools which might be useful for charities, but there are risks as well. Running an experiment can help you decide if you’d like to use generative AI, and provide insights and data to share with your colleagues. CAST have developed a framework which you can use as a guide for your experiment; you can find a link to this in the 'Further Information' section at the end of this Guide.

Start by working out what you want to learn or achieve with your generative text AI experiment.

  • Do you want to find out if AI can help you with a specific task? Do you want to understand more generally what it can do?

  • Do you want to make a case for or against using AI in your charity? Are you open either way?

  • What are your hopes and concerns about what the results will show?

  • What will you do with the results? E.g. keep them private, share them internally, or publish them outside your charity.

Neontribe are a digital agency committed to tech for good.

Neontribe decided to test if they could add their own data to an AI tool and get accurate answers to specific questions about the data.

Their goal was to understand whether generative AI could be useful for their charity clients, and whether there were risks that they should make charities aware of.

There are many generative AI tools available, which can help you write content, create images, or even make music.

Many use an interface which feels like using a chat screen or a search engine: you can type in questions and the AI tool will give you a text answer. But there isn’t a real person on the other side of the screen responding to your messages, and the answers given might not be accurate.

When you’re choosing which tool to use, you may want to think about:

  • Functionality. What tasks do you want the AI to handle?

  • Ease of use. Some tools have user-friendly interfaces, while others require more technical knowledge.

  • Cost. Some tools are completely free, some are free with some limits, and others you need to pay for.

  • Security and privacy. Choose a tool with strong security measures and clear data protection policies.

Neontribe chose to use GPT-4 for their experiment. GPT-4 is available through ChatGPT Plus, and developers can also access models such as GPT-4o through an application programming interface (API) to build their own applications and services. An API is a way for two different software programs to communicate with each other.

GPT-4 uses one of the most advanced large language models available. It can be fed data you’ve stored in a spreadsheet to add to its knowledge. In addition, by using the GPT-4o model through the API, Neontribe could customise the tool easily, without needing to do lots of programming.
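To illustrate what ‘using the API’ means in practice, here is a minimal sketch in Python using only the standard library. The endpoint address and request shape follow OpenAI’s chat completions API; the question and the `ask` helper are invented for this example, and you would need your own API key to run it:

```python
import json
import os
import urllib.request

# OpenAI's chat completions endpoint.
API_URL = "https://api.openai.com/v1/chat/completions"

def build_payload(question: str, model: str = "gpt-4o") -> dict:
    """Package a question in the JSON shape the chat endpoint expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }

def ask(question: str) -> str:
    """Send the question to the API. Needs an OPENAI_API_KEY environment variable."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # The model's answer comes back inside the first "choice".
    return reply["choices"][0]["message"]["content"]
```

In practice most developers use an official client library rather than raw HTTP requests, but the exchange is the same: a small piece of JSON goes out, and the model’s text answer comes back.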

Plan out the steps of your experiment. The CAST experiment canvas could be a helpful way to think about what you’ll do. It’s available as a Miro board that you can make a copy of to edit online, or as a document that you can print.

Work through all the sections on the canvas and make notes for each one:

  • Description: A brief overview of what you want to do and why.

  • Hypothesis: What do you think will happen?

  • To test this we’ll: Describe what you’re going to do in more detail.

  • We’ll know if it works by measuring: How will you know if your hypothesis was right? What can you measure to show whether the experiment succeeded or failed?

  • Tools used: List the tools you’re using for your experiment.

  • Boundaries of the experiment: For example, you might say ‘we’ll only trial this for a week’, or ‘we’re only testing this with one team’.

  • Person in the loop: Who will do what?

  • Data and privacy: What data do you need to use or collect? How will you keep it safe, in line with your charity’s data policy?

  • Engagement: Have you involved people who might be affected by the experiment?

For their experiment Neontribe tried something that they thought charities might find useful.

They knew that when a charity is starting a digital project they will often shortlist a few different agencies that they might like to work with. They wanted to find out if an AI tool might help make this easier.

Dovetail is an online directory of digital agencies with experience working with nonprofits. With permission from Dovetail, Neontribe prepared a comma separated value (CSV) file containing data about all the digital agencies in the directory. They uploaded this file to the custom AI chat tool they had created using GPT-4o.
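Preparing a CSV file like this is something you can rehearse before involving an AI tool at all. Below is a small sketch of loading agency data and flagging rows with missing values (which, as Neontribe later found, can trip up an AI tool). The column names and sample rows are invented for illustration; the real Dovetail export is not reproduced here:

```python
import csv
import io

# Invented sample rows standing in for a directory export.
SAMPLE_CSV = """\
name,location,day_rate,specialisms
Agency A,Norwich,550,"user research, web apps"
Agency B,London,,"mobile apps"
"""

def load_agencies(csv_text: str) -> list[dict]:
    """Parse the CSV and flag any rows with missing values before uploading."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        # A row is 'complete' only if every field has a non-blank value.
        row["complete"] = all(value.strip() for value in row.values())
    return rows

agencies = load_agencies(SAMPLE_CSV)
print([(a["name"], a["complete"]) for a in agencies])
# → [('Agency A', True), ('Agency B', False)]
```

Checking your data for gaps and inconsistencies first makes it much easier to tell later whether an odd answer came from the AI tool or from the data itself.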

Their hypothesis was:

“We think that a custom GPT could help a charity quickly produce a reliable and useful shortlist of agencies that will be a good fit for their project.”

Neontribe set out what they would measure to test this idea. They deliberately chose data they were very familiar with, so they could easily check for any mistakes in the AI's responses.

Prepare your chosen generative AI tool with the necessary instructions. Start with a small-scale test before expanding.

  • Craft clear instructions or prompts for the AI

  • Try different prompts and approaches

  • Keep notes about the whole process

  • Document all results, including unexpected outcomes

Once Neontribe had uploaded their CSV file to the custom AI chat tool, they experimented with different prompts and instructions to explore what it could do. For example: ‘recommend three agencies which would be a good fit for X charity’s project to create a mobile app, with a budget of X per day’.
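If you are trying many variations of the same question, a reusable prompt template helps keep your tests consistent and easy to document. This is a generic sketch, not Neontribe’s actual prompt; the wording and placeholder names are invented:

```python
# A fill-in-the-blanks template, so each test run varies only the details
# you choose, and every prompt can be logged alongside its answer.
PROMPT_TEMPLATE = (
    "Using only the uploaded agency data, recommend {count} agencies that "
    "would be a good fit for {charity}'s project to {project}, "
    "with a budget of {budget} per day. "
    "If the data does not contain enough information, say so."
)

def build_prompt(count: int, charity: str, project: str, budget: str) -> str:
    """Fill the template with the details of one test run."""
    return PROMPT_TEMPLATE.format(
        count=count, charity=charity, project=project, budget=budget
    )

prompt = build_prompt(3, "Example Charity", "create a mobile app", "£600")
print(prompt)
```

Note the final sentence of the template: explicitly telling the tool to admit when it lacks information is one simple way to discourage made-up answers.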

They started with a small dataset and gradually increased it, documenting how the custom chat tool’s performance changed with larger datasets.

Review the output from your experiment. Look for patterns, strengths, and limitations in the generative AI's performance.

Look back at your hypothesis:

  • Compare results to your success/failure criteria

  • Check for mistakes or inaccurate information (sometimes known as "hallucinations")

  • Notice anything odd or unexpected in the output

  • If something happened that you didn’t expect - do you know why?

  • Think about the implications of your findings

  • What will you do next?

Make a note of all your findings and reflections.

Neontribe's experiment revealed several limitations of the AI custom chat tool:

  • It gave inconsistent responses

  • There were issues with how it handled missing values in the data

  • They encountered "hallucinations" where it provided inaccurate information

  • Performance decreased with larger datasets

However, they found that the custom chat tool returned more useful and accurate answers the more they ‘trained’ it with prompts and instructions. They also found that making prompts as specific and detailed as possible made the responses more consistent.

The results didn’t confirm Neontribe’s original hypothesis, but the findings have helped to inform their discussions about generative AI capabilities and potential uses for charities. They found they couldn’t rely on a generative AI such as ChatGPT to produce reliable answers to factual questions, but it was able to quickly produce a draft answer which they could check for accuracy.

  • Prepare a summary of your experiment and key findings

  • Discuss the implications with your team

  • Consider how this experiment informs your charity's approach to using generative AI tools

  • Share your learnings with other charities if appropriate

There are lots of potential benefits from using generative AI tools, but it’s important to share the challenges and problems too. For example, publishing content with mistakes in it could harm a charity’s reputation or even harm the people it aims to help.

Carrying out experiments and sharing findings can help the whole sector make informed choices about these new tools.

Neontribe documented their process and results using CAST’s experiment canvas, and shared their findings on public platforms like LinkedIn.