In this episode of Your AI Injection, host Deep and Bill tackle the intricate world of AI effectiveness and regulatory compliance. They dissect real-world AI mishaps, including a dealership's chatbot error leading to incorrect car pricing and an airline chatbot's false bereavement policy, to illustrate the importance of rigorous AI testing. Shifting to a technical discussion, they discuss AI vulnerabilities, underscoring the necessity for testing frameworks that assess functionality, ethical integrity, and security against adversarial inputs. The conversation turns to regulatory frameworks like the EU AI Act and NIST guidelines, advocating that adherence to such standards is vital for legal compliance.
xyonix solutions
Learn more about Xyonix's AI Testing, Compliance & Certification Solution, the best way to ensure your company is following the law with AI regulations, being ethical, and thoroughly testing and optimizing your AI systems. Learn more about Xyonix's Virtual Concierge Solution, the best way to enhance your customers' satisfaction.
Learn more about AI Risk & Compliance:
Or take our free risk assessment HERE!
Listen on your preferred platform here.
[Automated Transcript]
Deep: All right. so we're gonna talk to you all today about, some A. I. Testing compliance certification. We're gonna touch on a bunch of topics. But basically, the high level is there's a bunch of stuff happening, uh, in the news that I'm sure a lot of you all have heard about, um, whether it was the Chevy Tahoe where some folks at a dealership, um, you know, grabbed a couple of developers and, slapped them in front of the open AI API and built themselves a chat bot.
And some clever folks on the Internet quickly managed to convince it to sell them a brand new Chevy Tahoe for 1. Uh, or whether it's the Canadian Airlines, case where, I can't remember exactly what happened in the Canadian Airlines case. Do you remember, Bill?
CHECK OUT SOME OF OUR POPULAR PODCAST EPISODES:
Bill: No, I thought it was just like a heavily reduced fare somewhere. Ah, okay. And I think they had to, I think the idea was that they were trying to hold Canadian Airlines to, to that promise of getting a very cheap airfare.
Xyonix customers:
Deep: And the funny part is the, Yeah, so as everybody, on this show may or may not know, LLMs really love to please and, uh, anyone who's trying to get your bot to say things that it shouldn't, kind of takes advantage of that, uh, vulnerability.
Amongst some other obvious ones. And so I think the key in the in both cases is the executives try to weasel out of it, saying that, hey, whatever our bot says, that's not what we say. So we don't have to give you a Chevy Tahoe for a dollar or give you a cheap fare on the airlines. And the moral of the story is that.
It's not enough to just take some developers and throw them in front of the OpenAI API and then shove it in front of the world, it will do something, yes, but to be responsible, you really want to test your bots, you know, you want to measure what they say, you want to validate that they're being ethical, you want to secure them, and you know, with all of the regulations coming down from the EU, you AI act to, you know, um, we've got regulations that, uh, NIST has put out, the Biden administration's been vocal on this.
You want to make sure your bots actually follow the law. And that's both today and in the future. And so we're gonna talk about that. We've got a bunch of tools and capabilities in house where we help folks so that I think we're in a good position to like, um, offer some advice here. Bill, any openings to those here as we dig in a little bit?
Thanks
Bill: Well, I just, I love these conversations with you, Deep, because we've been at this business for so long and, it just seems, it's, it's fun to have these conversations because we can look at, see sort of what's transpired over the last just a couple of months, really. And, uh, I think the world changed when OpenAI released GTP4 because it really grabbed the public consciousness and everybody now is in a mad scramble, sort of surfing that wave to use these fascinating and wonderful tools.
To their advantage in business or some, some sort of really cool, creative way. And, of course, the, I think the natural next stage of that, the thing down the road is, is. Well, how are we going to, how are we going to talk about how effective these bots actually are and are they, can we corral them and how do we measure when they're outside of the corral wandering through the forest and we want to come back because as a couple of, you know, you cited a couple of incidents where some promises were made and I'm sure there are savvy.
people out there who are trying to get these bots to say and do things that, that either get them, uh, a free airfare or a cheap truck, or maybe just embarrass the company or just to say something embarrassing. Right? That's not new, right? I mean, if you look back at the release of these chatbots, say, for example, in time, there's always a group of people out there that want, want that bot to say something terribly racist or ageist or sexist or, you know, You know, just something that
Deep: trick it into being a Nazi sympathizer.
Bill: Right. if you're a CEO of a really big company and you're looking at this technology, I'm sure you're just chomping at the bit to use it. Because you can see, wow, it can really be able to leverage this for customer service.
I can leverage this for, you know, lots of different stuff. But I think the problem that is coming down the pipe is, you know, one of being able to control these bots. And, and how do we actually measure some sense of efficacy? And, uh, I think that's what we're going to talk a little bit about today.
Deep: I think that the term that I keep hearing a lot is guardrails.
Like, how do we put guardrails up around these bots? Because this generation of AI is generative. As we all know, these bots are incredibly powerful, but they can also just say things that they're not supposed to say. So how do you get those guardrails up? You know, I was, Just chatting with a friend, you know, who's, pretty high up in a, in a major, you know, international, food service, company that everybody listening to this episode has been to multiple times.
And, you know, one of the things that she was telling me is like, Hey, look, we've got, I can't even tell you how many teams in our company are trying to deploy some kind of, you know, customer experience, customer service, or even internal facing chat bot thing. All, based on these large LLMs, whether it's Gemini or.
You know, or, or open AI or, or whoever, um, and all of them are basically stuck in this scenario where it's like, how do we know that? Like, how do we get the guardrails on? And they've got, teams of consultants, like, homing this large, Fortune 500 company, coming from, you know, the, the usual sort of big tech suspects to, put up these guardrails.
But the real point of this conversation that we're trying to have is like, well, what are the contours of, you know, what those guardrails actually look like? And what are the sort of no brainer, obvious things that when, when we hear about the Chevy Tahoe thing, we just cracked up because it's so unbelievably obvious to us that this was going to happen.
And we thought it'd be nice for our, you know, and that's coming from folks who've been building these machine learning and AI systems for the last three decades. what are some of those things? And how should you be thinking about it? So one of the things that I kind of want to start with is like this regulatory compliance and sort of risk identification question, which is like, Hey, what are even the risk points?
You know, as we know, these models have bias. Um, as we know, the models, the data for the models comes from places that bias and they might be coming from very sensitive user data. They might be coming from very public sources that have inherent biases. So, I think part of it is just, you know, if you have a project, knowing what the potential risks are.
Like, if the Chevy Tahoe folks or the Canadian Airlines ones had just simply asked anyone in the know, like, what could go wrong, that list of risks, you know, it's, it's like, easy to generate that list of risks if you know what you're doing and have been doing this for a while, and it's something that you wouldn't think of doing otherwise.
Bill: Yeah. Well, I think we spoke a long time ago when we started talking about GTP3 world, this idea of societal bias. Now, when we say that, we're talking about, we're saying things that are racist or sexist or, you know, all ageist, all along those, the is category that a large group of people might be feel discriminated against or, and so forth.
And I do want to make a comment about that because certainly the guardrails, as we've spoken about, have been put up by. Companies such as open AI. Now, what does that mean exactly? That means in some cases, they're going back to the data sources and they're not choosing like maybe every single sub Reddit group to include in a discussion because that can get into some pretty nasty material.
they've also built models on top of the base LLM model to basically instruct it not to steer into conversations that I would say a majority of people would find offensive. So I guess a kudos and sort of shout out to all those big companies that are certainly doing their due diligence to try to put up these guardrails.
However, and this is the however, is as you mentioned before, the base LLM model is this giant, huge black box that's been trained on essentially the world's worth of data. And its job is, you know, they've kind of been kind of trained to with these subsequent models to please human beings, right? If they, if they weren't, if they always were argumentative or battling you and everything you asked and that nobody would want to use it, then it would just be glitter.
Deep: Right. Or whatever it is.
Bill: But if you notice, and, you know, kudos to them for, for building, you know, ultimately a product that is so congenial and able to do some incredible things. But at the end of the day, though, they can only really control it so much. And I want to speak to another point is, is not everybody's going to be using open AI products.
They might be using Google, so they might be using someone else, or they might even be really importantly, you know, due to cost, maybe due to scalability. They might even be farming out some of these responses to users, say, in a chatbot to a lesser model that is much cheaper to operate. And those models, of course, don't necessarily have all the guardrails that the bigger models do.
Deep: I think the point I want to. Kind of make here is even if the big models have tons of guardrails, it's not necessarily the rails that you want or need. So like, like Gemini, um, Google's latest take, you know, they've been getting a lot of heat this past week. you know, by the anti woke agenda community.
So conservatives are all pissed off about this scenario where I think if you asked it to, I think somebody asked it to like generate an image of, of the founding fathers and the model put out, you know, a diverse crowd of founding fathers. And so they were like, what would the founders and fathers look like?
Or something like that. The model puts out this diverse thing and their argument is like, See, you know, the liberals have taken over tech companies, you know, and then on the other side of the fence, you know, you can imagine on the liberal side, like if somebody says, Hey, you know, give me a picture of a, of a fireman.
And if it always shows, you know, a white male, um, you know, that's 32, then the liberals would be like, it's age biased, it's racist, you know, and all that stuff. and if you think back to like the Chevy Tahoe example, I think like the way they, they broke that one is they convinced it that its job was not only to sell them a vehicle, but to please them at any cost and that it was just a game, right?
So that's like a different kind of guardrail. Like it's not about, you know, Nazis and racism and all this other stuff. It's about. It's about a guardrail of don't offer prices that don't directly come from my database, you know.
Bill: Yeah, it's easy for us to think of sort of really nasty things that probably most people wouldn't want to have their bot discussing.
Then there's the other level is make sure that if you're having a chatbot that's designated to my company, which is a furniture store that you represent sort of the inventory that we have. In and not promise something that we don't have in your inventory, or certainly don't promise a value that we could sell a piece of furniture to that.
We do have at an incredibly low price that we would never support. It's all of these sort of ways in which the information is being transmitted. We want to make sure that it's pleasing to the user. we, we have to talk about how, how does one measure that? How does somebody control that? And you mentioned, you know, some standards being put up at the European Union and, and NIST, I'm sure is coming down the pike.
Deep: NIST has it, MITRE has a really good standard around AI, um, regulations and ethics. So I kind of break it down into like five sort of buckets of service. that I think anyone that's going to help you build an, a reasonable ethical bot will sort of, hone in on.
So one of them is just sort of like regulatory compliance and risk identification, like what are, what are, so these, these, if anyone hasn't read these, um, I mean, I don't know if I'd read them directly, but maybe just, You know, shove them into an LLM and talk about and ask questions. We have a bot like this that that you can use.
Um, but basically, you know, it's like, it's a combination of stuff. So a good chunk of it is like, just really good hygiene stuff that you should do. So from a security standpoint, it's like, what are the, what's the data that you have? what potential biases are in that data then there's a whole bunch of stuff that standard security stuff like what kind of data are you collecting that?
How? How might it be vulnerable? So there's like a whole kind of hacking y kind of dimension, which it looks a lot like traditional security, um, protocols. Then there's, there's stuff all around like monitoring and understanding, like what are you doing when you start talking about mitigation of these risks?
so models have drift, you know, where once you release them, they might change and start behaving differently in different contexts. So what are you doing to like kind of keep track and monitor that over time? Um, And so there's like a lot of these kind of big bucket items that fall under that. So the next, the next arena that I kind of chalk out is just efficacy testing.
So this is like, hey, Given that, you have some prior dialogue, let's stick with the Chevy Tahoe thing, somebody should have created a ground truth entry that said something along the lines of, I want you to, um, I'm not asking you anymore about what they had intended, which was like, when's the dealer open, what time does it close, what's on the lot, like, how fast does the Chevy Tahoe go, I don't know, the antelope, all that kind of stuff is what they intended the conversation to go.
Where they didn't intend it was for somebody to reprompt it. I'm like, yes, hijack the prompt. So that's an example of something that you would want to have covered in your efficacy testing. You would also want to have stuff covered about how well it's answering. So maybe, you know, you have some sales guidelines.
So like, Hey, don't immediately, come out with the cost of a vehicle, like just ask them more questions. So, so that could all be like represented. In let's call it like perfect responses that you know that are human validated that you use every time your team releases a new set of prompts or a new variations on the bot, it runs through that battery a test.
So that's the 2nd category. So risk identification, efficacy testing, and then related to efficacy testing is sort of the ground truth generation and enhancement. Like, how do you actually get those perfect responses? How do you do that in an efficient way? And in a way that sort of hones in on the weaknesses of the bot or the areas that you're concerned about and then there's there's this idea around knowledge based editing so as most folks know about by now like one of the easiest ways to keep these bots not commenting in areas they don't know about is to like force them to like look it up in a in a corpus that you control. So, you know, if you're a medical startup, you know, you might have strict medical reviews around any documents that you're allowed to reason about, that your bots allowed to talk about and answer questions on.
So let's say it's a patient question, answering thing, not allowed to give any medical advice, all that's strictly curated in the knowledge base. So those are like four categories. And then if we kind of borrow the security analogy a little bit further, one of the things that security teams have done, you know, for decades now is like red teaming, where, you know, you really try to just kind of break the system out and get it to do some, some wrong stuff.
So I think.
Bill: Is that called white teaming or red?
Deep: I've heard it called, white hats. Stuff. White hats. Okay. Called red teaming. I've, I, I, I've probably heard white hat, I don't know, like, whatever it is. A bunch of good guys trying to break it. Break the thing. Yeah. Yeah. And those are all services that, you know, that like we at Xyonix regularly use for the bots that we help our clients build.
And even for the cases where, if, you know, if somebody listening, you know, has their own bots and they just wanna lean on us to like. Help them think through this testing and get an outside opinion. That's stuff that we do regularly, but maybe I was thinking that maybe we could dig into a couple of these a little bit further.
Like, maybe let's start with the ground truth and efficacy testing. I know, Bill, that you've been, uh, actively, like, helping us build out some cool tooling here at Xyonix that we've been using. Do you want Maybe talk us through, um, the scenario and context and the kind of way you're thinking about it a little bit.
Bill: Yeah, so I just think that, you know, us as machine learners and AI, experts have been really used to models that we can control in the past. You know, we, we build a model and then when we release a model, we talk about, you know, how much it's improved. So we talk about, its accuracy going up a certain percentage.
We use things like precision and recall as, as a means of doing so. We have. A little test set that we run the model on and then we can get some results on that test set and we can make some, some assessments of that when we step into the LLM world, which all these chatbots are, we kind of almost are stepping into a totally new world here.
We certainly can, however, generate sort of, as you, as you might imagine, pre canned prompts that we want to the bot to respond to. Um, Um, these are might be like questions or things that we want to sort of poke around on the different topics that the bot is supposed to have a lot of knowledge around and then and then see how they respond and then we can look at that response and see how it measures up not only in how factual it might be, does it, align to the knowledge base that that that bot has been given, for example, but also just in the sort of the communication, modality, like is it, is it being respectful or empathetic?
Is it being a professional? And you, you have to think about this from an almost like a security standpoint. When we test these bots for those types of things, we want to be very adversarial. You can even imagine us and we do develop sort of adversarial, bots to communicate with, with these standard bots to try to do things to make it mess up.
How does your bot respond to when you've corrected it, or is it good at remaining on topic, or other things. And those all sort of complement on top of, you know, is it having societal bias, uh, in, in terms of its responses. Here's here's kind of the point.
We can look at things on an individual response level. We can assess a bunch of things about those individual responses. And with a collection of those responses, we can look at statistics and we can break them up into categories. We can talk about means the distributions and those things and so forth.
Deep: I would even take it up a level from that, like with collections of collections of responses, we could have a whole set that's just just kind of addressing racial bias. That's a racial bias collection of questions let's get a score on that. Let's get a score around let's call them just security stuff like prompt breaking Like someone trying to trying to like re prompt the bot You know, so you can imagine you have these blocks and then of course you have your application specific blocks where you might be like in the case of the Chevy Tahoe stuff, you know, they might have had a line of questioning that's all around the details of vehicles and like what's available and what's not available.
Um, they might have had a line of questioning all around like billing and financing and all of that. All of that sort of should be thought through. One of the points I wanted to bring up is. In traditional machine learning, you know, you usually have data scientists who've been trained in all of this for, for forever.
And we, this is, we're used to dealing with uncertainty. Um, we have statistically meaningful metrics and methodologies for dealing with it. And in order for us, I think you were kind of talking about this to earlier, like in order for us traditionally to build models, we had to build these robust data sets.
Um, so that and we would always just grab, you know, in traditional ML, you grab a chunk and you hold it aside and use that for your testing or your validation. Whereas in the new world, you don't actually have to know anything about data science to get a bot to seem pretty darn impressive, right?
Like you can just be a developer. Who's not used to thinking statistically, you know, who's used to thinking in a more deterministic like fashion and they can get something working and you could get an overzealous, project manager or executive sponsor. That's like, this is great. This works great.
Let's launch it. And I'm sure that's probably what happened at, Air Canada and, you know, in the, in the Chevy, in the Chevy dealership that, that launched this thing.
Bill: But the real, the real weakness there is that there's an assumption that your user. is not a deviant and not somebody who's going to be adversarial and not going to be very tricky I think most people probably wouldn't think the standard user who's going to interface with a chevy dealer uh is going to try to change really the communication strategy of the bot to now serve the user First, before
Deep: I would generalize that comment even a bit more, which is everybody who looks at this thing and is not used to thinking statistically.
Will formulate an opinion on how well it does based on their like bias subsample of things. They asked it, but that's not really what we're after. And, you know, in traditional data science, we're after a representative conversational set that's representative of the true body of users. And if you presume they're all good actors, then it's not going to be representative.
If you presume that it's all questions about. You know the details of the vehicles, but nothing around billing. It's not going to be you know So one of the things that we do is we want to track the real questions that are coming in We want to understand them. We want to make sure we have representation in that ground truth So that so that we can test it and one of the points I was making before though is like traditional ml You couldn't get a model together without building the data set.
So you had the data set and you can sacrifice 20 percent of it or 30 percent of it to go test with. And I think one of the things that's going on right now is people don't have the motivation or the even knowledge that they should create these data sets or that they have to, because they don't need to, to get something off the ground.
Bill: to that point, to be very specific. The world's worth of data was used to build this. And, but it wasn't the individual people who are leveraging this technology didn't have to procure that data. And actually that makes it a bit scary at the same time, right?
Because you don't have knowledge really, uh, about what data was used to train that particular model. I mean, we, we know sort of what they've leaked in the press, but we also know the New York times now is suing, for example, open AI for the use of their data I think there is an assumption because it's used the world's worth of information and data that everything's probably going to be okay.
And I think that's a bit foolhardy actually. And it's even a bit sneakier than that. You can take these models and you could do something very tricky that I saw was done by a hacker. Okay. Admittedly, these are AI researchers, but they got one very, very, very intelligent, let's say LLM model to reveal underlying Training data on personal information.
So I think they revealed that like a doctor's address and phone number. This is the real
Deep: person. But
Bill: how did they do it? They did it by asking the bot to repeat a word like bird like a hundred times or something like that. I mean something that's just completely off the charts like no one would ever do that.
Anyway, they would be able to expose something with the purpose of trying to reveal, uh, information that otherwise shouldn't be revealed. That's a security issue, right? That's a real security risk. So I, it kind of comes back to this thing though, Deep, that I think we're sort of talking about.
The idea that so you can sort of get away with not doing this sort of efficacy testing and compliance testing. Is is a thing that's going to happen in the future. People are going to have to really start thinking about that. And we, you know, it's ionics are doing this and part of part of the issue that I think we can really help people with is being able to bootstrap.
you know, in a lot of areas, this type of questions and data questions we should be asking of the bots and then measuring the efficacy as a result of that we can, we can sort of help bootstrap that process, but like in security, it's a never ending process. It's not like we're going to do it perfectly the first time, but there has to be this feedback where you sort of continuously monitor the outputs of these bots.
Both for the conversation, uh, societal bias comments, whether they're adhering to regulations, et cetera, and that's just not, not going to end. In fact, I think. It's probably going to be the most important thing in the future of these bots in the future because I think that people really, really need to feel like they have control over them, one, and that they're telling the truth,
Deep: right, which is I mean, this all falls under this, this kind of this all encompassing term of responsible AI, right?
Like, it's right to be a responsible AI player, you need to do these other things. And at the same time, there's like a narrative, like, you know, the media narrative arc, maybe for the past year, has been these things are amazing, they do all this amazing stuff. No doubt. I don't think anybody's really debating that anymore.
Like, we know these things are amazing, they're doing amazing stuff. And now, what concerns me is the media arc has switched. To the other polar extreme. These things are dangerous. Everything they do is going to screw up. And because the, that's, you know, like at the end of the day, the media makes their money by, clicks and, uh, eyeballs, you know, and like amazing and horrible, uh, both sell.
but the reality is that, that there's a whole other realm, like there's certain scenarios that are really quite safe. So for example, uh, If you have some kind of, like, query language that interacts with your data, right? Like, maybe you have a website with a bunch of advanced filters. Like, think maybe Expedia or Travelocity or somebody.
Like, at the end of the day, whoever's interacting with your system, um, They're going to try to formulate a query like plane tickets or hotels or whatever on this date range blah blah blah blah blah So if you use the LLM for the natural language part and its job is not to talk to the user But its job is to formulate that query so that it can hit their database There's, there, the only risk there is that the bot comes back with a wrong answer.
There's no, there's no risk that it's going to be racist or sell a ticket for a dollar or anything, because the action is to translate the narrative into a query that hits an existing data system. That's an example of a pretty safe scenario. A little bit less safe, but still fairly, um, okay, is when you have a human in the loop.
Okay. you're building an app. You've got a human that's. That you're just trying to make really efficient. You know, like maybe it's a customer service representative. First obvious place to deploy your, you know, smart customer service bot is to help them. And now you've got a human eyes on everything.
It still might screw up, they might be bored because the bot's always right and then once in a while it's wrong. But it's generally safer. And then the third category where it gets trickier is when you go straight to end users and the risk of saying something goes wrong. Um, wrong is high and the cost of something wrong being said is high.
That's where, you know, you need to like employ a lot more capabilities and testing and stuff.
Bill: And well, I think you're starting to talk about the introduction of other types of Of things like you could do, you could think of some sort of moderator, like if you would have a human beings that sitting there and moderate moderating every single conversation that was held between a real human and a bot and they see things are going awry, you know, they'll pull the plug or intervene and step in and so forth.
It's not inconceivable to build independent. Moderator bots actually as well to monitor and those, you know, those moderator bots could be used for detection of when the conversation is going south. When we need to realign the conversation elsewhere. It doesn't necessarily have to be like an LLM driven thing, it could be back towards more conventional classification models that we use to detect.
Uh, when a user is being belligerent or or at risk or so forth,
Deep: let's dig in a little bit. I know you've got some some UX stuff to share. Why don't why don't you maybe. show us some stuff. So we get a sense of like what some of this tooling, you know, looks like.
Bill: Yeah. So we have some tools that we use at Xionix to sort of help, configure and design different types of bots. some of them are, we can sort of start off with the development of like a chat bot. Which we have, here. Uh, so let me just talk about, we have a, in our, we have a demo designer, that we use as sort of demo to customers, the types of bots that we can create they have sort of two main sections right now.
We we're developing more for the future, but one is sort of just general bot design and another one is towards efficacy design. Like how do we, how do we create these bots and then how do we test them? So, just briefly, we have, quite a few bots that we've already developed, uh, Let's pull up one like this maritime rope expert.
For example, we'll load that up into the scheme and here we sort of have the pieces that frankly, you could go up to open any I do something like that. We have sort of a version of what's available on opening eye for creating an assistant there. But we have things like, you know, what is it short name?
You know, which is almost like a project name. And tell me a little bit about this spot, you know, what's its areas of expertise and just describe it in general, you know, this is your go to
Deep: everything on the left side here is basically like a custom GPT or something that you can, but yeah, and it lets us quickly build out a bot so that we can get on to the testing and, you know, on the right side.
This is sort of interesting. Maybe maybe we should even just show a little bit of a demo of a vision bot. But one of the things that you can't yet do in the in the public API is actually reason about imagery very easily. Um, and that's something that we've we've kind of been building out. Um, I might just maybe we'll just demo something here really quick to give people a sense.
And then we'll go up. We'll go back cycle back and show how we might test it. Does that sound? Yeah, sure.
Bill: Sure.
Deep: can you see my screen here? Yes. Okay. So here's an example of a, of a system where, you know, it's just, it's sort of intended to highlight the power of, of image analysis and kind of co mingling that with some dialogue.
So here you see like there's six, products and I can ask it something like, Hey, how much? Uh, yeah, nine products. Sorry. How much is there in cheez its? do we want to know? I don't know. Probably not. Either way,
Bill: we're going to know. I really love cheez
Deep: its. I'll just put that up there. Yeah, you probably don't want to know.
So, so, so here we see like, okay, well, there's eight grams of total fat. So again, this is not based on the general knowledge of GPT. This is based on this image and the ability to automatically analyze this image. So then I can, you know, so I might ask something comparatively like, so hey, um, what is healthier, uh, Cheez Its?
Or Doritos. And, uh, and here again, it's reasoning, but based on information in the, um, in the actual imagery,
Bill: So what we've done, I think here, maybe it's important to say is that in the screen that we showed you before, we had the ability to sort of describe bot would do when given such an image like a cheeses image where you have a label and you have a Doritos label and so forth.
And, uh, we've, we've gone through and we've had this vision bot. Are these images and come up with content that we're now discussing? What's really compelling about that is that we can have a discussion not only about that particular product, but about this whole range of products that that that are part of the same conversation.
It's really cool. So what is the answer? What is healthy
Deep: Cheez Its or Doritos? It's saying Cheez It has 150 calories, eight grams of fat. So it's list, so it knows to list off the key points of what's healthy and what's not. It's basically saying both have similar nutrition profiles, so moderation is key.
Um, what about ingredients? Are there any ingredients of concern in either? Let's see if it can figure that out. So it has to know that I'm still talking about Cheez Its and Doritos. It's got to know, about the, uh, safety profiles a little bit, and now we're kind of honing in on ingredients. And it says Cheez It Originals main ingredients include rich flour, soybean, and palm oil, cheese and salad.
It contains annatto extract colorant soy lecithin. Doritos includes blah, blah, blah, and artificial colors. It risks a few. Yellow 5 and red 40 are artificial colors linked to hyperactivity in children and potential allergic reactions. Annatto can also cause allergic allergies in sensitive individuals. So pretty great.
Little conversational profile based on image and you can imagine, you know, we can do we have debt. We have stuff that we've done with, like, you know, maritime ropes and, you know, very different imagery. But, you know, Being able to, like, kind of, extract information from imagery is important, whether you're on a manufacturing line or not.
And then kind of when you commingle it with reasoning about, um, and through dialogue, you can get pretty far. So that was kind of all I wanted to share. Maybe you can jump back into the efficacy side, sure.
So then the question is, like, how do we how do we assess how effective this is?
So, you know, an obvious way to do this would be to go in. Read all these ingredients, write them down, have a, and know for a fact that there's eight grams of total fat in Cheez Its and that you've got it off the label or whatever, um, and then, and then you want to actually measure it. I think that's probably where you're going to go.
Bill: So I'm actually loading up the product label analysis that we use. so here's the chat bot definition, which kind of controlled your conversation. Here's the vision bot, which sort of controlled how it is that we're assessing each one of those images. And we're going to step into the efficacy design.
So, uh, the efficacy design is really like, okay, for the responses that you got, let's, let's talk about some basic categories surrounding that. So we can sort of measure those responses. One is, is content restrictions. So we can think about things when we run an efficacy experiment, when the bot speaks to us about its results.
We want to make sure that we have control over things that we don't want it to say, right? We never want to say, like, eating, you know, terrible food. Is great for you. Something like that. Something silly, you know? But you can think of things that are not very silly when it comes
Deep: to Yeah. Or just, or giving like hard health, health advice would, might be something.
Yeah. like, we don't want it endorsing, you know, 10 boxes of Frosted Flakes every day. Right.
Bill: so I could type in something that I do not give heart health advice as well. The idea, let's just talk about this one thing. 'cause this idea here is that we, we ask.
Just like you did, Deep, you asked some questions. Instead, we have a bot ask it lots of questions, covering a wide range of categories. And that bot may be prompted from real questions that a human has procured, or it could even generate a comprehensive list of topics and questions to ask that bot. When we collect all of those responses, we can then look for things like, hey, did you ever mention anything about, you know, giving straight health advice?
Because that would be a no no. So this is an example of just one small area where we really can probe the bot in many, many ways. And look at its responses and, and see how well it's done.
Deep: So I think, I think the point here is once you have examples of what the dialogue state was, what the bot should have said and what the bot actually said, you can now go back and check if what it said violates any of the band topics and statements.
Bill: That's absolutely correct. And you can think about things not only in terms of like content restrictions, which is very important. This would be very helpful, for example, for Canadian, you know, airlines and Chevy, where we're talking about things you shouldn't discuss but also you can talk about things that are maybe a little bit more subtle, like the interaction style, like, for example, if you're talking to, you know, say, a group of students or something like that, and you're worried about being empathetic towards them, you might have a series of characteristics, of being, you know, empathetic or casual, humorous, you know, what type of discourse style did the bot use?
And we can measure things like that. That's pretty cool to know, right? We don't want to talk casually to people who are very professional and maybe I may
Deep: or maybe you're making a funny bot like you certainly don't want to get boring professional tone in that
Bill: case where exactly some some people you might even think about point of view like you might maybe want the bot to always just bottom third person first person or second person and things like that emotional tone you want to be positive neutral or maybe you want the bot to be negative you know so forth but the point is is this allows you to sort of gauge and develop the things that you would like to measure in the bot that you thought you have produced, right?
Is the interaction style the way you like it, the content the way you like it, and some more technical aspects? I think
Deep: a key here point that, you know, is maybe the subtext of what you're saying is that it's essential that you're very thoughtful. That you're actually thinking about what you want the bot to say and how you want it to say it.
And that thoughtfulness applies to how you know what exactly you're going to measure like response length is a number. It's maybe not as fuzzy as the content of the message. You can you can just. You know, measure it. But do you want it? Because this is a problem like GPT for out of the box for the longest time, we just prattle on endlessly.
It still does quite a bit. And, you know, it's very like over giving of advice. And, you know, that's that's the kind of stuff you can really start to measure and control.
Bill: Yeah. And then I get I think a good point is, is, is. So if you probe the bot in all these different ways, from individual responses to even sort of conversational aspects, you know, the flow of the conversation, how it responds to error recovery when you tell the bot it's wrong, is it relevant?
Does it have good contextual understanding? So those have to do with sort of larger conversational things. But here's the thing is, Once you've measured all this stuff, we can quantify these in some way, and we're sort of migrating back to the old days when we could produce quantified analysis of how well our model is performing.
And this is quantified, a way to quantify how well the bot is performing. And you can, you may think, well, you don't have room to wiggle because if I'm hitting open AI, you know, what can I do? Well, if it's something simple, like the response links are way too, too talky, uh, like it's coming back with responses that are 500 characters long, typically you can actually go and re engineer and re prompt your bot to, to say, Hey, deeper responses to around 300 characters.
That's very, very simple. So you can do things like prompt engineering to help chorus it. Back into those guardrails you were talking about
Deep: earlier, not to mention, you know, a lot of times, you know, you just like people are using very elaborate prompting. Now it's it's quite dynamic. You know, it's getting prompted based on what got pulled out of a few databases and like combined and you have software engineering teams working on stuff.
Any anyone could make an unintended change that ripples through a prompt that winds up causing something to go awry if you don't have a rigorous test around it. You might never know right until something bad happens out, you know,
Bill: I think that's very, very well put as people are moving away from just the single guy, you know, or gal who can create their own bot right to, to working with teams.
We all know that working with teams is, you know, it's great, but It only takes one person to mess up. Yeah,
Deep: I think that sort of complexity increases, right? Like, you know, both of us, we've worked on cases where we've got, you know, hundreds of thousands of permutations in prompting, right? It's not just the case where.
You know, you're, you're going and coming up with one prompt and having a dialogue about it. Yeah.
Bill: So this is a, this is what we're showing here is sort of a limited view of the types of things that we can, that we can sort of assess another thing that we show here. You'll see that we have a level of importance for each one of these things.
You can specify, well, you know, if you're checking on things like racism, ageism, having to do with religion, et cetera, you find things that are, that are offensive. Let's make that a critical importance. So when we talk about sort of reporting, right, we At the end of the day, we want to generate a report.
We can bubble up things based on their level of
Deep: importance. Why don't we talk a little bit about the actual bot that does the testing then?
Bill: yeah. So one of the things is, uh, when we define our bots, we talk about a description, like generally what it's, what its purpose is and so forth. Um, but we actually go and do something behind the scenes where we generate, given this definition.
A bunch of relevant topics and of those topics, we auto generate a bunch of relevant questions on those topics and we don't necessarily generate questions that are all positive. They could be negative in nature, right as well to kind of test these different areas. And then we also, we also have a sort of a designation where we can say randomly select one of those questions from one of those topics and we can say how, how many follow up questions we want to have.
Related to this, and so you can you can imagine with that scenario, the ability to sort of generate and bootstrap these questions that will probe the bot in different ways, and not only that, but have follow up conversations. Well, now we have a base set of questions that we can have a conversations around, and we use that base set as sort of our starting point when we don't have anything else, and we we let the bot sort of interact with With the with your body that you define.
We call it TV mode. We basically develop a user, that that is interacting with your bot and is throwing questions that that have been generated across a wide variety of topics. And seeing how that bot responds.
Deep: Sorry, Bill, do you have any other UX stuff you want to show, or? No, that's it.
Yeah, why don't you go ahead and kill screen then.
yeah, so I think, what we're saying here is, a, You know, how you build your bot is one thing, and we're showing, you know, the way that we do it, but of course people are going to use many different ways, their own techniques, uh, you know, other APIs, whatever.
Once you have a bot, then you, you know, you want to kind of design the rubric or the, the set of things that define what's important. to assess in, in that, and then once you've, once you've done that, now you need to actually, define a bot that can talk to your bot, and that's like what you're calling the user bot, right?
Yes. And so, and then there's lots of techniques we've got our own that make it sort of fairly automated or fairly straightforward to do. But then once you have a user bot, you have to also teach it to know what to talk about so that it can simulate a good user and cover an array of, of, of topics. And then once, you've got that, now you've got a good user simulator.
Now you have to actually pit it against the bot and have it talk to it. And as it talks to it, you now, you get results that you can assess. against your rubric, and now you, you, you get sort of fairly far that way, and you're still only partway there at that point, and we'll talk about this in a future episode where we dig in because we have a bunch of work here too, but now you, now you have cases where The user bot had a dialogue with your bot and you need to go back in there and grab the cases where it may be screwed up or wasn't such an impressive answer and give and fill in like a perfect answer or your ideal answer there and now you have something that you can compare future generations of your bot and the user bot to, um, to these perfect Answers and now you can get scores across those different areas across like, you know, ethical areas or racial bias areas or your specific application areas.
You know, like, how well you're selling the trailblazers or whatever
Bill: when I think you hit on something very important. That is there is no 1 user. Right. There are groups of users that should absolutely be represented. And I think some of these, some of these folks got into trouble by assuming they had one type of user.
at the end of the day, you need to cover your bases. And so, you know, purposely creating, what I call, say, adversarial user bots is a very good thing. It's like the guys, like you were saying before, that are, they're hired to go break things. And I think, without that adversarial component, there, the, the room for improvement will come when the company's embarrassed rather than.
prior to that, so that could cost them money, even that could cost them money
Deep: I think we've covered a lot of stuff. We'll probably wrap up now, but I'll kind of leave it with one final question. Like, let's rewind and imagine we were the, Chevy trailblazer folks or the Canadian airlines folks.
Um, given everything we've said, What would we have done different when somebody came to us and said, Hey, we're going to launch the spot. Like, yeah, you can definitely an array of stuff. We wouldn't have said no, cause we want innovation. We want to move the needle, but what would we have said?
Bill: Well, I think that where we could come in there is to do this sort of, I don't know if it's on their forefront of their minds to do this type of testing.
Now it is, of course it is. Right. But we could come in with, these sort of adversarial bots to be able to test and make sure that our bot is not responding in the areas that we don't want it to respond to. It's not offering, advice for health. It's not giving out offers that we can't commit to, et cetera.
in addition to all the other you know, communication traits and so forth, I think that if we could roll back the clock, we could integrate one of our systems as a tester prior to release. that would at least give them a heads up. About some problem areas. And then also, I think we could offer advice on how to fix those problems as well, which is another another level.
Deep: Yeah, I mean, I, I would sort of, you know, just kind of cover the 4, 5 big chunks of of area. I would say, like, if they came to me, the 1st thing I would have done was run a risk identification. And I would have said, okay, here's your risk points. You, you've got wide open exposed GPT for. Very problematic.
you don't want that. You want to, like, so then I would have said, well, what are the, what are the guardrails, that you need, what could happen? Like, what is it? What are your worst case scenarios? I think you're getting this is like, kind of part of the, like, what is the thing that really freaks you out?
I'm sure somebody would have thought of giving away cars, but they might have thought of other things, right? So that's the first thing is just like enumerating those risks. Right. And then setting up the, band statements was like something that we covered. So what is it that your bot should never talk about?
I think, you know, anything to do with giving a price for a particular vehicle that didn't come straight from a database is probably something you don't want the bot to talk about. So that would have been an example, um, defining the, uh, efficacy testing upfront, like, Hey, let's. Let's think of an array of stuff that could be there.
Let's get it in, um, defining that ground truth, like how is it that we're going to define it? Who are we going to, how are we going to evolve that ground truth? Do we start with, you know, the world or do we start in a trusted, more sensitive scenario? Like I probably would have said who answers these questions today?
Okay. Let's feed them with an assist capability so that there's still a human in the loop. That probably would have come back as advice. And then there's the smart, the knowledge base piece where I think if they say now we definitely want to go straight to user, then say, okay, well, let's put the thing in a bigger box.
Let's take. Anything it says has to have directly come from existing documents or existing, knowledge base, like, you know, their inventory of their cars and everything, and it's not allowed to ever reason outside of that box. That would be the safest thing to do or some stricter controls around what it can reason outside of that, which would be a little more, you know, that, that,
Bill: that, that last, that's a great last point.
I think when you have a knowledge base, people might feel they're safe, right? Hey, I loaded up my PDF where we have the rules and regulations corresponding to how we do business. The thing is, is that there is a fuzzy line that's actually drawn between having LLM sort of creative conversation and whether to use the knowledge from that, from that base.
Some, some things are very, sort of straightforward. You're going to pull that directly from the knowledge base. That makes sense. But you can certainly imagine where it's not quite clear whether I should respond to this in a creative way. Or I should hit the knowledge based that's another area of testing.
And that's a fuzzy area of testing because it's, it's one where, you know, ultimately you do not want to unhappy users and it's possible that in that fuzzy area, you could get unhappy users. So even though I'm not saying that's an easy thing necessarily to fix. It's certainly an area that you should be testing for
Deep: well, thanks a ton.
I feel like, I feel like if these guys called us up at a minimum, they would have, we could have said I told you. So just kidding. Yeah, maybe. Well, you know, minimum we would have, they would have gone into it with eyes wide open and had a plan. but most likely, you know, we would have found an innovation lane where they could really, prove their muscle and take steps and keep innovating, but, you know, not wind up on the front page of, like, the tech press for two weeks.
I would say, I would actually Or owe somebody a Chevy Trailblazer for a buck.
Bill: I would even say something more positively is kudos to them for embracing this technology and trying to leverage it. It is early days. I, I frankly don't think they should be embarrassed at all. They're probably going to get a bigger bang for their buck in positive ways.
Um, and maybe any press is good. Press, you know,
Deep: I think that's a good point. Like, don't you think that on some level? We went from like, hey, like nobody blames Google when the 10 search results come back and they're all just bad stuff Maybe like you and I might blame them for having it not aligned with the query But we're not gonna we don't hold them responsible for like if somebody I don't know queried like bad lemon vehicles and they showed up with Ten cars or bad lemons, but maybe one wasn't they get blink less because they're showing 10 things, you know, and you could scroll but with AI, you're going to one answer and like when you go to one response, like the, the, the bar gets raised.
And I think people are just forgetting that this. Yes, this is amazing technology. We went from like, looking for exceptions. where it works well to looking for far off exceptions where it fails. At the end of the day, this is like new stuff. Like these guys, I agree, these guys should be patted on the back for like pushing the envelope and like making stuff happen.
At the same time, you know, I think it's also part of responsible AI is thinking these things through.
Bill: Yeah, totally true. And again, I think it's important to reiterate that, you know, the big boys who are playing, in this field, Google open AI, Microsoft, et cetera. You know, they, they always going to be putting out some really, really cool, amazing models in the future.
That there's no doubt about that. There is a cost problem with that and scalability problem that might, other people in smaller businesses might want to use something that's a bit more cost effective, smaller models and so forth. So these smaller models have all of these problems we're talking about, that's not going to go away anytime soon.
It's always going to be
Deep: relevant. All right. Well, thanks so much, everybody. That's a wrap for this episode of your AI Injection. If you. want to find out more, go to xionix. com, X Y O N I X dot com. and you can always go, we have a, you can go to Google and just Google Xionix testing and compliance and we've got all kinds of info for you there to help you out.