Caramel

Exploring dynamic insights from data using AI agents

With the rise of prompting interfaces enabled by Large Language Models (LLMs), we wondered how feasible it is to leverage AI agents to enable more dynamic decision-making through plain-language interaction with data.

Participants:

Damian Calderon

Ignacio Orlando

The Challenge

Companies produce massive amounts of data that can enhance decision-making processes. With Business Intelligence (BI) tools, they have some flexibility in exploring and analyzing this data through dynamic plots and dashboards. But these tools fall short for those who seek deeper, unscripted exploration. Getting new information and insights beyond the initial parameters remains challenging and time-consuming. Like a bulky retro submarine that users must prepare, load, and pilot each time they want to explore the unknown, these tools hinder agile exploration.

— Bring those spreadsheets, sailor.

Typically, getting to a new “insight” requires collaboration with a Data Analyst to understand the new request thoroughly, followed by a team effort to analyze multiple data tables, integrate them, gather relevant statistics, and determine the best way to present this information within the application. Decision-making tools are therefore quite rigid platforms, in which user interaction is reduced to consuming what’s available on a screen. This lack of flexibility restricts their usability to a handful of very specific scenarios.

How might we make user interaction with business intelligence tools more dynamic?

The hypothesis

The recent development of AI agents supported by Large Language Models (LLMs) offers a promising new way of interacting with data. Given a predefined set of actions, AI agents take a request written in plain English and, using language as a proxy for reasoning, devise a step-by-step plan to craft an answer from those actions. With experiments on toy benchmarks showing quite surprising results, we cannot help but wonder: can we craft new business intelligence applications in which users retrieve new insights from a prompt box right next to the dashboard? We assume that by allowing users to ask direct questions of the data in natural language, these tools (and their associated data) will become more useful and usable.

Can we improve flexibility in business intelligence tools with AI-based assistants?

Natural language allows you to move faster and leaner: you’re in much finer control of your movement.
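
To make the pattern concrete, here is a minimal sketch of such an agent over a toy table, using LangChain’s experimental pandas agent (one of the frameworks we later evaluated). The data and question are illustrative, and exact imports and flags vary across library versions.

```python
# Illustrative sketch: a LangChain pandas agent answering a plain-English
# question over a toy DataFrame. Exact imports/flags vary across versions.
import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent

df = pd.DataFrame({
    "client": ["Acme", "Globex", "Initech"],      # hypothetical data
    "revenue_usd": [120_000, 95_000, 40_000],
})

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
agent = create_pandas_dataframe_agent(
    llm, df, verbose=True,
    allow_dangerous_code=True,  # required by recent versions of the package
)

# The agent devises a step-by-step plan: inspect the table, write pandas
# code, execute it, and phrase the result as an answer.
result = agent.invoke({"input": "Which client brought in the most revenue?"})
print(result["output"])
```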

Potential use cases

For every experiment we conduct, we dedicate time to thinking about how it can be applied beyond the initial scope. Here are the top use cases we came up with for Caramel:

More informed decision-making

  • Dynamic decision-making by asking questions to retrieve new insights directly from business dashboards.

  • Deeper data exploration, to discover missing data points, errors, and opportunities.

Financial analysis

  • Automated generation of financial reports, e.g. by simply dictating what you want to see.

  • Identify potential risks and financial anomalies not covered by the existing tool.

Strategic planning

  • Enable thinking partners for “what-if” questions about potential future outcomes and trends.

  • Generation and monitoring of new key performance indicators (KPIs).

The experiment setup

The input data

We performed our experiment using Carmelo, an internal business intelligence tool that allows our leadership teams to monitor the overall status of the company at a single glance. We used financial and strategic data automatically extracted from this tool, obfuscated to ensure security and prevent leaks of sensitive information.
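
As an illustration of the obfuscation step (the actual fields and method differ), here is a sketch that pseudonymizes identifying names and rescales sensitive figures before the data reaches any model; all names here are hypothetical.

```python
# Hypothetical obfuscation helper: stable pseudonyms for names, constant
# rescaling for money columns so ratios and trends stay analyzable.
import hashlib
import pandas as pd

def obfuscate(df, name_cols, money_cols, scale=0.73):
    out = df.copy()
    for col in name_cols:
        out[col] = out[col].map(
            lambda v: "entity_" + hashlib.sha256(str(v).encode()).hexdigest()[:8]
        )
    for col in money_cols:
        out[col] = out[col] * scale  # keeps relative comparisons intact
    return out

raw = pd.DataFrame({"client": ["Acme", "Globex"], "revenue_usd": [120_000, 95_000]})
print(obfuscate(raw, name_cols=["client"], money_cols=["revenue_usd"]))
```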

The experiment

Before implementing our AI-powered prototype, we conducted interviews with the main stakeholders and users of Carmelo to survey use cases for a free-text interaction tool and to map a wishlist of “futuristic” features. This helped us collect common questions they’d like to ask of the data and map them to potential use case scenarios.

We then used this information to implement a fully functional prototype. While doing so, we took the opportunity to evaluate the maturity of the existing frameworks for implementing AI agents, going beyond their toy-set benchmarks and exercising them on our own tables. All observations and decisions were documented on a Miro board, including an in-depth mind map of successes and failures, and ideas for further development and improvements.

After the experiment, we held a few live user-testing sessions in which we let stakeholders interact with the implemented tool and surveyed common interaction patterns, likes and dislikes, and potential improvements.

The prototype

To understand the technology and the use case scenarios in a functional setting, one of our Data Scientists implemented a fully functional yet scrappy PoC that takes tabular data as input and enables users to ask questions about it, including retrieving statistics and generating plots on the fly. The whole development process took 10 hours of coding and testing.

A screenshot of the live PoC with company names redacted.

The tech stack:

  • A web framework for rapid UI development.

  • PandasAI and LangChain CSV Agents as the AI agent framework.

  • GPT-3.5 as the LLM.

  • Pandas for data management.

  • A plotting library for data visualization.
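
For reference, here is a minimal sketch of the core of such a PoC, assuming PandasAI’s SmartDataframe API (version-dependent); the UI layer is omitted, and the data and questions are illustrative.

```python
# Minimal PoC core: free-text questions over a table via PandasAI.
import os
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm import OpenAI

llm = OpenAI(api_token=os.environ["OPENAI_API_KEY"])  # GPT-3.5-turbo by default
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],               # hypothetical data
    "revenue_usd": [110_000, 98_000, 130_000],
})
sdf = SmartDataframe(df, config={"llm": llm})

# Statistics on demand...
print(sdf.chat("What was the average monthly revenue?"))
# ...and plots generated on the fly.
sdf.chat("Plot revenue by month as a bar chart.")
```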

The outcomes

Users challenged the tool

Our users did not take answers for granted. Instead, they challenged the tool by asking rephrased versions of the same question, in an effort to assess whether the tool was being accurate or hallucinating. While this might be a natural consequence of a testing phase, we believe this behavior should be considered when designing a real-life application, e.g. by automatically presenting similar questions to the user and reducing the time spent typing them.
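
If we were to build that “similar questions” affordance, a minimal sketch could look like the following, assuming the OpenAI chat completions API; the helper name and prompt are hypothetical.

```python
# Hypothetical helper: generate "similar questions" to offer as one-click
# suggestions, so users can cross-check answers without retyping.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def suggest_rephrasings(question, n=3):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Rewrite this question {n} ways, one per line, "
                       f"preserving its exact meaning:\n{question}",
        }],
    )
    return resp.choices[0].message.content.strip().splitlines()[:n]

print(suggest_rephrasings("What was our Q2 revenue per client?"))
```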

Agents need to know about business & data

Most of the observed interactions assumed that the agent knew the slang of the business. While some of these were actually resolved by the agent thanks to concepts the LLM had already learned during training, others were misinterpreted, leading to poor or completely inaccurate answers. Tools like PandasAI offer the possibility of customizing agents with predefined instructions, e.g. explaining what a specific word means, or what the ideal way to solve a specific request is. To better exploit these features, it is important to identify that slang early in the design process, before development of the tool begins.

Without knowing the nature of each table and its fields, the AI agent also relies on its own knowledge to answer questions. We observed improvements in accuracy when adding descriptions to each of the fields, so mapping data with very detailed descriptions seems to be important as well.
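
Here is a lightweight, framework-agnostic sketch of this idea, with hypothetical glossary entries and field descriptions; PandasAI’s own customization options serve the same purpose natively.

```python
# Hypothetical glossary and field descriptions injected into every request,
# so the agent doesn't have to guess the business slang.
GLOSSARY = {"bench": "employees not currently assigned to a billable project"}
FIELDS = {"rev_rec": "revenue recognized in the period, in USD"}

def enrich(question):
    terms = "\n".join(f"- '{k}' means: {v}" for k, v in GLOSSARY.items())
    fields = "\n".join(f"- column '{k}': {v}" for k, v in FIELDS.items())
    return (f"Business terms:\n{terms}\n\nTable fields:\n{fields}\n\n"
            f"Question: {question}")

print(enrich("How many people are on the bench this month?"))
```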

Asking questions in the wild is challenging

Questions with the exact same target but posed in different ways might trigger different action plans and produce different outcomes. This is expected in prompting interfaces, yet it is not a justification for mistakes. We believe future implementations might benefit from incorporating a third-party LLM, either to rephrase the question and ask multiple variants in one shot, and/or as a critic that evaluates the final response and decides whether it is correct. This might also help provide additional feedback to the user, e.g. through (un)certainty cues.
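
A minimal sketch of the critic idea, assuming the OpenAI chat completions API; the reviewer prompt and label set are illustrative.

```python
# Sketch of a critic pass: a second LLM call labels the agent's answer
# with a certainty cue that the UI can surface to the user.
from openai import OpenAI

client = OpenAI()

def critique(question, answer):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": ("You are a strict reviewer. Given the question and "
                        "answer below, reply with exactly one word: "
                        "CONFIDENT, UNSURE, or WRONG.\n\n"
                        f"Question: {question}\nAnswer: {answer}"),
        }],
    )
    return resp.choices[0].message.content.strip()

print(critique("Total revenue in Q2?", "USD 1.2M, summed over 'revenue'"))
```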

Mind the cost before it’s too late

AI agents use LLMs to produce action plans, meaning that, although inputs and outputs might be short, they are talking (a lot) in the background. This generally raises the cost of these kinds of tools and requires an in-depth ROI analysis before implementation. In our case, we used GPT-3.5-turbo for all experiments, whose cost remains low (USD 0.50 per million input tokens). Using more powerful models like GPT-4o might have produced much better results and less need for use-case-specific prompting, yet their cost is significantly higher.
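
A back-of-the-envelope cost model makes the point; the input price matches the GPT-3.5-turbo figure above, while the output price and token counts are assumptions.

```python
# Back-of-the-envelope cost check. Input price matches the GPT-3.5-turbo
# figure above; output price and token counts are assumptions.
PRICE_PER_M = {"input": 0.50, "output": 1.50}  # USD per 1M tokens

def run_cost(input_tokens, output_tokens):
    return (input_tokens * PRICE_PER_M["input"]
            + output_tokens * PRICE_PER_M["output"]) / 1_000_000

# One agent answer can burn thousands of hidden tokens (schema dumps,
# intermediate plans, retries), far beyond the visible prompt.
print(f"1 question  ~ ${run_cost(6_000, 1_000):.4f}")
print(f"1,000 q/mo  ~ ${1_000 * run_cost(6_000, 1_000):.2f}")
```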

Users want more than a single answer

In our interviews, users told us they appreciated receiving, apart from the answer itself, a full explanation of the process used to retrieve it, plus any relevant visual output, such as plots and graphics. To help users obtain this without having to spell it out in the prompt, we could leverage LLMs that prompt the agents themselves.
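
One inexpensive way to approximate this, sketched below with an illustrative template: expand every question before it reaches the agent, so the response always carries the answer, the reasoning, and a plot when one makes sense.

```python
# Illustrative template that expands every user question so the agent
# always returns answer + reasoning + a plot when one makes sense.
TEMPLATE = (
    "{question}\n\n"
    "In your response: (1) give the answer, (2) explain step by step how "
    "you computed it from the table, and (3) if the result is a series or "
    "a comparison, also produce a plot."
)

def expand(question):
    return TEMPLATE.format(question=question)

print(expand("Which quarter had the highest margin?"))
```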

(Not yet) a plug-and-play technology

We observed that existing tools like the CSV Agents provided within LangChain, or the PandasAI toolkit, are not yet mature enough to be plug-and-play within a few hours. While PandasAI is likely the best one so far, there are still edge cases in which the tool might either crash or produce a bad response. Dedicating more time to those cases might overcome these limitations.
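
Until the toolkits mature, a thin guardrail around the agent call helps the tool fail gracefully instead of crashing. A minimal sketch, where agent_ask stands for any question-answering callable:

```python
# Guardrail sketch: catch crashes and empty results so the tool fails
# gracefully. `agent_ask` is any question -> answer callable
# (e.g. SmartDataframe.chat).
def safe_ask(agent_ask, question):
    try:
        answer = agent_ask(question)
    except Exception as exc:  # the toolkits still crash on edge cases
        return f"I couldn't answer that ({type(exc).__name__}). Try rephrasing."
    if answer is None or not str(answer).strip():
        return "I ran the analysis but got no usable result. Try rephrasing."
    return str(answer)

# Stubbed usage so the sketch runs standalone:
print(safe_ask(lambda q: 1 / 0, "Break, please"))
```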

Our key learnings

  • Integrating hard data with an LLM will never be a plug-and-play activity. There are mental models, system concepts, and ambiguities we will need to take care of through refinement.

  • Users won’t trust an AI-based answer the way they trust a dashboard, at least for now. This means we need to make an extra effort to increase users’ trust, both in the ML layer and in the UI. Expanding answers with detailed supporting data is key for this.

  • Failing gracefully is much more important than it seems initially. Every time the system provided a wrong answer because of the lack of refinement, user trust was heavily impacted.

  • We shouldn’t expect users to know what to ask. The prompting experience is new, and they aren’t yet sure exactly what it covers. Some users will be more explorative, but adding actionable prompt examples and taking care of onboarding is a good way of shaping the initial experience.