GenAI and Reliability: On the road to building Trust
The advances in AI/GenAI (awaiting AGI) over the past year never fail to surprise! Pleasantly, most of the time…
Just like in Black Mirror, we’re entering a new tech era with endless possibilities. Established products are incorporating open-source AI or deriving their own LLMs to help their customer base move faster, with benefits estimated at 100x the pace of 2021. Imagine this: one fine day at an all-hands, management announces, “The product is getting AI integration!” On one project, we were informed over just a Slack message.
As exciting as AI and its evident benefits may be, it is important to apply a Zero Trust approach to AI/GenAI when rolling out a product. The reason: just as AI is now super handy for us (some seem to have it on speed dial for every situation), it can be equally handy for hackers, giving them access to sensitive information like entry points and personal data. This can upset the user base and threaten the organisation’s security.
Remember - a product may be free of the obvious issues (bugs) and still hinder users when it comes to bias, fairness, accuracy, cybersecurity, behaviour patterns, or outright hacks.
Let us look at some examples of how skipping quality checks can misfire. I am not talking about tricking an LLM into doing things it is not meant for. You need to use AI/GenAI responsibly: know its limitations and give it the instructions it was designed for. Even when you understand how it works and how it was designed, it can still fail.
Here’s a glimpse of what apps with GenAI can do:
App Context - A language translation tool embedded with GenAI. At the start, the GenAI responded with impressive accuracy and fluency, so the PMs, developers, and testers, heavily reliant on the AI’s capabilities, skipped thorough testing of the “bias” and “fairness” aspects of its translations.
You might wonder - what could go wrong for a language translation app with respect to bias or fairness? After all, it is just translating language, right?
Here’s what went wrong after the release: users discovered the translated text carried unexpected biases. For example, translations involving women’s names defaulted to professions like “nurse” or “secretary,” while men’s names translated to careers like “doctor” or “executive”. Such biased translations sparked outrage, raising ethical concerns and damaging the software’s reputation in unanticipated ways.
Why GenAI behaved like this: GenAI models are trained on huge sets of data. The data fed in here carried biases, which the model reproduced in its translations. Artificial intelligence is statistics; it is neither socially aware nor sensitive.
Lessons Learned for the Team:
- Don’t blindly trust GenAI: while powerful, GenAI is limited by its training data. Critical evaluation of its output remains crucial. If possible, review and filter the data being fed in.
- Focus on diverse testing: expand QA beyond functionality and include checks for potential bias, fairness, and unintended consequences.
- Transparency and explainability: understand how the GenAI works and be transparent with users about its limitations.
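Here is one way a check like this could have caught the problem early: a quick counterfactual test that swaps names and compares the translations. This is a minimal sketch only; `translate()` is a hypothetical wrapper around whatever model your product embeds, and the name pairs and watch words are illustrative, not a real bias benchmark.

```python
# Minimal counterfactual bias check for a translation feature.
# translate() is a placeholder for your embedded model; name pairs and
# watch words below are illustrative examples, not an exhaustive list.

FEMALE_NAMES = ["Maria", "Priya", "Aisha"]
MALE_NAMES = ["John", "Rahul", "Omar"]
WATCH_WORDS = ("nurse", "secretary", "doctor", "executive")

def translate(text: str, target_lang: str = "de") -> str:
    raise NotImplementedError("Call your embedded translation model here")

def check_name_swap(template: str):
    """Translate the same sentence with a female and a male name and flag
    outputs that differ beyond the name itself, or that pick up gendered
    profession words on only one side. (String replace is a crude heuristic.)"""
    findings = []
    for female, male in zip(FEMALE_NAMES, MALE_NAMES):
        out_f = translate(template.format(name=female))
        out_m = translate(template.format(name=male))
        if out_f.replace(female, "X") != out_m.replace(male, "X"):
            findings.append((female, male, out_f, out_m))
        for word in WATCH_WORDS:
            if (word in out_f.lower()) != (word in out_m.lower()):
                findings.append((female, male, f"only one output mentions '{word}'"))
    return findings
```

A handful of such name-swap cases in the regression suite can surface patterns like the “nurse vs. doctor” one well before release.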
Let us take another example, from healthcare, where GenAI is used to analyse data to detect heart disease and prescribe accordingly:
App Context: A GenAI-powered diagnostic tool for detecting heart disease. The tool analyses patient data and provides highly accurate diagnoses, reducing the time and error of traditional methods.
What went wrong: Because it saved a lot of time and was mostly error-free, the software was adopted quickly. After a while, several patients experienced unexpected side effects from medications recommended by the AI.
Why GenAI behaved like this: The investigation revealed that the AI, trained on historical data, over-recommended a specific medication known to have rare but severe side effects in certain patient subgroups, which had not been adequately considered during quality checks.
A possible solution: Recognising the limitations of the initial quality check, the AI can be trained to analyse not only diagnostic accuracy but also potential side effects based on individual patient profiles. Anything flagged can then be reviewed by clinicians before they hit that ‘SEND’ button (a rough sketch follows the lessons below).
Lessons Learned for the team:
- Go beyond core functionalities: quality checks for AI-powered healthcare solutions must go beyond primary functions and consider potential downstream effects like side effects or misdiagnosis in specific patient populations. (Not all red flags are a stroke.)
- Embrace the potential of AI for comprehensive checks: utilise the capabilities of generative AI for broader and more nuanced testing, including analysis of potential unintended consequences.
- Focus on patient safety: maintain a critical yet patient-centric approach throughout the development and implementation of AI-powered healthcare solutions.
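To make the “flag it before the SEND button” idea a bit more concrete, here is a rough sketch of a guardrail that cross-checks the AI’s recommendation against known high-risk combinations for a patient’s profile. The `Patient` fields and the risk rules are made-up placeholders; a real system would use clinically validated contraindication data.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    age: int
    kidney_impairment: bool
    pregnant: bool

# Hypothetical contraindication rules: medication -> predicate on the patient.
# In practice this table would come from validated clinical sources.
RISK_RULES = {
    "drug_a": lambda p: p.kidney_impairment,
    "drug_b": lambda p: p.pregnant or p.age > 75,
}

def review_required(recommendation: str, patient: Patient) -> bool:
    """Return True if the AI-recommended medication hits a known risk rule
    for this patient and must be held for clinician review."""
    rule = RISK_RULES.get(recommendation)
    return bool(rule and rule(patient))

if __name__ == "__main__":
    patient = Patient(age=80, kidney_impairment=False, pregnant=False)
    print(review_required("drug_b", patient))  # True -> hold for clinician review
```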
Here’s another example of a Firewall breach:
App context: A security company releases a revolutionary AI-powered firewall, promising impenetrable defense against cyber threats. The company, confident in the AI’s capabilities, implements only basic quality checks focused on functionality.
What went wrong: A sophisticated hacking group discovers a critical vulnerability in the AI’s decision-making process, allowing them to bypass the firewall under certain specific conditions and access sensitive data. The breach leads to significant financial losses and reputational damage for the security company. The incident reveals the inadequacy of the initial quality checks by the stakeholders and decision makers, which focused solely on the AI’s ability to detect known threats and failed to consider its vulnerability to novel attack vectors.
Learning from the Breach: The company re-evaluated its approach and implemented a comprehensive quality check strategy for future GenAI-embedded security solutions. They leveraged the power of generative AI to simulate diverse attack scenarios, pushing the AI’s defences beyond their initial training data. This “adversarial training” helped identify and address potential vulnerabilities before real attackers could exploit them.
Lessons Learned:
- Quality checks beyond functionality: cybersecurity solutions with embedded GenAI require rigorous quality checks that extend beyond core functionality. Testing should also cover potential vulnerabilities and the AI’s ability to adapt to novel threats.
- Embrace adversarial training: utilise generative AI to simulate diverse attack scenarios and continuously test the AI’s defences against evolving threats.
- Proactive security posture: maintain a proactive approach to security by continuously improving and refining AI-powered defences through ongoing testing and adaptation.
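Here is a rough sketch of what that adversarial-style testing could look like in practice: take payloads the firewall already blocks, mutate them, and check whether the verdict survives. `firewall_verdict()` is a hypothetical hook into the model under test, and the payloads and mutations are deliberately simple examples.

```python
import base64
import random
import urllib.parse

# Known-bad payloads the firewall is expected to block; illustrative only.
KNOWN_BAD = ["' OR 1=1 --", "<script>alert(1)</script>"]

def mutations(payload: str):
    """Yield simple obfuscated variants of a payload."""
    yield payload.upper()
    yield urllib.parse.quote(payload)                  # URL-encoded
    yield base64.b64encode(payload.encode()).decode()  # base64-wrapped
    yield "".join(c + ("/**/" if random.random() < 0.3 else "") for c in payload)

def firewall_verdict(request_body: str) -> str:
    raise NotImplementedError("Call the AI firewall under test here")

def find_bypasses() -> list:
    """Return every mutated payload that the firewall fails to block."""
    bypasses = []
    for payload in KNOWN_BAD:
        for variant in mutations(payload):
            if firewall_verdict(variant) != "block":
                bypasses.append(variant)
    return bypasses
```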
As an aside, here is a typical fear with self-driving cars: what if the car overlooks a stop sign or a child running across the road?
All these incidents give us a glimpse of how blindly reliant we can become on GenAI, impressed by its initial accuracy and fluency. It may detect known threats, but we need to make sure it also catches novel attack vectors, side effects, bias, fairness issues, and inconsistent accuracy.
What GenAI is - GenAI can respond to you in a human-like way. It responds in the context you prompt it with and creates a new response that suits your needs, often grounded with retrieval (RAG). We are making AI work like humans. GenAI is advancing impressively with each passing day. With all the digital products that accompany us 24x7, AI is going to influence, or has already started influencing, our health, finances, and lives altogether.
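For anyone unfamiliar with the RAG idea mentioned above, here is a bare-bones illustration of the flow: retrieve the most relevant snippets from your own documents and hand them to the model together with the question. Real systems use embeddings and a vector store; the keyword-overlap retrieval and the `ask_llm()` placeholder here are only to show the shape.

```python
# Toy document store; a real system would index far more content.
DOCUMENTS = [
    "Refunds are processed within 5 business days.",
    "Premium users get 24/7 chat support.",
]

def retrieve(question: str, k: int = 1) -> list:
    """Rank documents by crude keyword overlap with the question."""
    scored = sorted(
        DOCUMENTS,
        key=lambda doc: len(set(question.lower().split()) & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str) -> str:
    """Stuff the retrieved context into the prompt alongside the question."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Call your LLM of choice here")
```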
There are mainly three ways of using GenAI:
- You develop your own models and GenAI platforms.
- You use open-source LLMs and platforms, either embedding them in your application or using them standalone.
- You use GenAI agents (with multi-agent setups on the way) and systems that function independently or in combination with LLMs.
While GenAI is flexible enough to work for you, the stakes are high! While data scientists, LLM researchers, and developers work towards making GenAI more flexible and safer to use, stakeholders need to exploit the knowledge they already have and explore further to understand how the LLM works and what shapes its responses.
High-level strategy checks to keep in mind when you test anything related to GenAI before launch:
- Understand the model being embedded in the application: its limitations, design, API details, and the kind of data it was trained on. Without this information, you will not be able to put the right checks around it. To start with, check all the field inputs the model takes from the product feature.
- Understand the “Max Tokens” value to know how long the responses can be (see the parameter-sweep sketch after this list).
- A few of the checks we apply to non-AI testing apply here too.
- Did you give it a try with test data different from what it was trained on?
- The data is the main concern: data accuracy is of utmost importance. Ensure its integrity remains intact and the dataset includes adequate diversity. (Recall the language tool example we discussed earlier.)
- In how many places is this model used in the application? Check whether those different points can influence the way the AI generates an answer at any other checkpoint.
- Get familiar with the attack surface of the product, and with the customers, so you can determine the risks an AI response could create.
- Does it give false negatives/positives that may lose you clients? E.g. in cybersecurity, too many false positives and your client may not like it; a false negative (a missed threat) may lead to a system breach. (A rate-tracking sketch follows this list.)
- Can you convince it to learn the wrong things, e.g. 10 + 10 = 30?
- Does it behave differently depending on different accents?
- Does it behave fairly when it deals with the health data of different races? And, again, how about different accents?
- Does updating the open-source tools change the behaviour in your application, in terms of answers, notes, speed, etc.?
- Do you have automation ready to check the model after auto-training or a version upgrade of the LLMs? (See the regression sketch after this list.)
- Are you able to train the model just by asking different questions or by feeding in different data, as in prompt engineering?
- How much time does it take to train with a team of 20 testers?
- Does it work with common logic? Try comparing answers/reports with another LLM and see which one makes more sense or is right for the user. Then do a retrospective (“backpropagation”) on the findings.
- Are you able to reverse-engineer the model?
- How do you differentiate a response between expected, hallucination, confident guess, and prediction?
- How much fine-tuning is needed?
- If it’s scanning software, is it missing some data/files/emails to scan?
- Does it give away extra information that was not asked for, like IPs, secrets, or private data? Does your model compromise the information?
- Does it stop working after multiple similar questions? (I have dealt with this.)
- How does a “temperature” change in the API make your product behave? (See the parameter-sweep sketch after this list.)
- Is your model programmed to block certain things, and does that work as desired? (I am not talking about a “give me the steps to make malware” prompt.)
- If it’s a chatbot, does it understand natural language? Does it get influenced by wrong information? Does it handle spelling mistakes by itself, without human assistance?
- If it’s a summary-generating application, does it generate a different summary, or the right summary, for the same data with different types of prompts (positive, negative), or when tried multiple times with the same data? (The regression sketch after this list covers this too.)
- If it’s a social media app and, out of a team of 20, 10 users close an ad while 10 view it, how does your model behave?
- If it’s a cybersecurity app: a. Is it missing a known vulnerability? (See the false positive/false negative scenario we discussed earlier.) b. Is it flagging the same requests multiple times? c. Is the data being manipulated? Etc.
- Does it work better than what your competitors already offer in the market? Do you have solid reasons why clients will stick with your version of GenAI?
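For the “Max Tokens” and “temperature” points above, here is a small parameter-sweep sketch. The parameter names follow common OpenAI-style chat APIs; your provider may name them differently, and `call_model()` is a placeholder for however your product actually invokes the model.

```python
def call_model(prompt: str, temperature: float, max_tokens: int) -> str:
    raise NotImplementedError("Invoke the model exactly the way your product does")

def probe_generation_params(prompt: str) -> list:
    """Sweep temperature and max_tokens and record how the answers change."""
    results = []
    for temperature in (0.0, 0.7, 1.2):
        for max_tokens in (64, 256):
            answer = call_model(prompt, temperature=temperature, max_tokens=max_tokens)
            results.append({
                "temperature": temperature,
                "max_tokens": max_tokens,
                "length_words": len(answer.split()),  # crude signal for truncation
                "answer": answer,
            })
    return results
```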
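For the false positive/false negative point, the basic rate tracking is simple enough to automate. This assumes you maintain a labelled corpus of benign and malicious requests; the two samples below are only placeholders.

```python
# Each entry records the ground truth and what the model decided.
labelled_samples = [
    {"request": "GET /index.html", "is_attack": False, "model_flagged": True},
    {"request": "GET /search?q=' OR 1=1 --", "is_attack": True, "model_flagged": False},
]

false_positives = sum(1 for s in labelled_samples if s["model_flagged"] and not s["is_attack"])
false_negatives = sum(1 for s in labelled_samples if not s["model_flagged"] and s["is_attack"])
total_benign = sum(1 for s in labelled_samples if not s["is_attack"])
total_attacks = sum(1 for s in labelled_samples if s["is_attack"])

print(f"False positive rate: {false_positives / max(total_benign, 1):.2%}")
print(f"False negative rate: {false_negatives / max(total_attacks, 1):.2%}")
```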
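And for the version-upgrade and summary-consistency points, a minimal regression harness could look like this. `generate()` stands in for the upgraded pipeline, and the golden cases are whatever your team curates; with temperature pinned to 0 you would expect identical answers across repeats.

```python
# Curated golden prompts with keywords the answer must contain; illustrative only.
GOLDEN_CASES = [
    {"prompt": "Summarise: the invoice was paid twice.", "must_contain": ["paid twice"]},
]

def generate(prompt: str) -> str:
    raise NotImplementedError("Call the newly upgraded model/pipeline here")

def run_regression(repeats: int = 3) -> list:
    """Run each golden prompt several times; flag missing keywords and
    any run-to-run variation."""
    failures = []
    for case in GOLDEN_CASES:
        answers = [generate(case["prompt"]) for _ in range(repeats)]
        for answer in answers:
            missing = [kw for kw in case["must_contain"] if kw not in answer.lower()]
            if missing:
                failures.append(f"{case['prompt']!r}: missing {missing}")
        if len(set(answers)) > 1:
            failures.append(f"{case['prompt']!r}: {len(set(answers))} different answers across runs")
    return failures
```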
How to strategize:
- The strategy, the classification of the risks involved, and the adaptability of the models will differ based on the combination of the domain and the points above.
- Know the model, your product, and its usage in the real world. Understanding the underlying risks of the industry you are working in is necessary. Know the industry and your users, and you are halfway there!
- Have a plan to deal with information overload. (This will be a never-ending task, though.)
- Start small. Classify. Know where and to what you are applying it.
We are always super confident in what we have built. We treat it as our baby, but overlooking even the smallest vulnerability or risk can lead to disastrous consequences. Advances in GenAI, upsets with GenAI, and attacks using GenAI will surprise you every day. The least you can do is be proactive and prepared by having the necessary quality checks in place.
Cheers!
Pic credits - unsplash.com