AI Agents Demystified: From Core Concepts to Web-Navigation Use Case

Heka.ai
13 min read · Jun 4, 2024


Introduction

Large Language Model (LLM) development has changed the AI landscape at a pace rarely seen before. ChatGPT was released at the end of 2022, opening the public's eyes to AI advances, and now, less than two years later, language models have improved across many dimensions, unlocking a multitude of new applications. Retrieval-Augmented Generation (RAG) is becoming common practice in companies, and multiple initiatives have been launched to improve existing processes with recent advances in AI.

Meanwhile, among the multitude of GitHub repositories, papers on arXiv and other pioneering AI initiatives, one notion is attracting increasing attention: AI agents.

This article explores the definition of AI agents, from philosophy to technology. A practical use case is then presented: Auto-Crawler, an AI agent that accelerates the development of web scrapers. We conclude by considering the ethical implications of the rise of AI agents.

Figure 1 — “Figure 01” A humanoid robot powered by OpenAI

AI Agents, what are they?

AI agent definition

The notion of agent first appeared in philosophy. According to the Stanford Encyclopedia of Philosophy: “an agent is a being with the capacity to act, and ‘agency’ denotes the exercise or manifestation of this capacity”. An AI agent would then be a computational entity with the capacity to decide and act on its environment given a sensory input.

LLMs bring novelty to what an AI agent can be. They have shown a remarkable aptitude for reasoning and decision-making and are very good candidates to serve as the brain of AI agents. They can make decisions based on their inputs without human help. Through their attention mechanism, they can combine different pieces of information and make logical deductions from them.

LLM agents are then computational entities composed of an LLM brain, a perception layer and several actuators.

Figure 2 — Illustration of the different bricks making up an AI agent [1]

Why are multimodal LLMs changing what an AI agent can be?

Before LLMs, AI capacities were limited, showed no real sign of autonomy and could hardly make rational, explainable decisions. AI agents were heavily constrained by a set of rules in a very well-defined environment. The input was not very diversified and was handled by “traditional” algorithms (machine learning, genetic algorithms and sometimes deep learning).

LLMs, on the other hand, are highly adaptable to almost any kind of input (even images now) and have the capacity to reason and plan. The best LLMs can give you a precise plan of action given a set of rules as input. Those actions can then be executed, as LLMs are able to interact with APIs and write code. Agents can thus see most input types, think, decide, act and observe the result of their actions to correct their behaviour.

LLMs are taking AI agents to the next level. As Andrew Ng (professor at Stanford and co-founder of Coursera) said in a recent tweet: “I think AI agentic workflows will drive AI progress this year — perhaps even more than the next generation of foundation models. This is an important trend, and I urge everyone who works in AI to pay attention to it.”

Notable initiatives and papers

Among the many initiatives around AI, some of them show how far LLMs can go today. Some speak of “sparks of AGI” (Artificial General Intelligence) with LLM agents gaining full autonomy in their reasoning and actions.

AutoGen

AutoGen is an open-source library published by Microsoft. It is a framework in which you can create agents that can interact with each other and with their environment.

For example, a software developer agent can be created alongside a user proxy agent. The developer agent writes code based on your prompt; the user proxy agent then runs that code and sends any errors back to the developer agent, which returns modified code. The loop repeats, with the user proxy testing each revision, until the code finally works.
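The write-run-fix loop described above can be sketched in a few lines of plain Python. This is an illustrative mock, not the actual AutoGen API: the "LLM" is a stub returning canned answers, where AutoGen would drive the same loop with its assistant and user-proxy agent objects.

```python
from typing import Optional

def mock_developer_llm(task: str, error: Optional[str]) -> str:
    """Stub standing in for the software developer agent's LLM."""
    if error is None:
        return "result = 1 / 0"   # first attempt: buggy code
    return "result = 42"          # revised code after seeing the error

def user_proxy_run(code: str) -> Optional[str]:
    """User proxy agent: execute the code and report any error."""
    scope: dict = {}
    try:
        exec(code, scope)
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"

def agent_loop(task: str, max_rounds: int = 3) -> str:
    error = None
    for _ in range(max_rounds):
        code = mock_developer_llm(task, error)
        error = user_proxy_run(code)
        if error is None:
            return code           # the code finally works
    raise RuntimeError("agents failed to converge")

print(agent_loop("compute a number"))  # → result = 42
```

Here the first attempt raises a ZeroDivisionError, the error message is fed back, and the second attempt passes; AutoGen generalizes exactly this feedback cycle to arbitrary agents and tools.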

Much more complex interactions are also possible with many more types of agents. A “group chat manager” can be implemented to decide who receives the information and who speaks next. This concept is referred to as an “Agent Society”[1].

Figure 3 — Illustration of AutoGen Framework

AutoGPT

AutoGPT is another form of framework with a notable capacity to plan its own tasks. It came to light by showing that an LLM could take a complex task as input and split it into smaller tasks, building what is called a tree of thoughts[2].

All the reasoning is done by the agent itself, and it comes up with its own ideas, such as making a web search about a certain topic or writing an action plan. AutoGPT shows the capacity of LLMs to plan and act[3].
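The recursive task splitting behind the tree of thoughts can be sketched as follows. The planner here is a hand-written stub with hypothetical tasks; AutoGPT asks the LLM itself to propose the sub-tasks at each node.

```python
# Minimal sketch of AutoGPT-style task decomposition: a complex task is
# recursively split into sub-tasks, forming a tree of thoughts.

def stub_planner(task: str) -> list:
    """Stand-in for the LLM planner; returns sub-tasks for a given task."""
    plans = {
        "write a market report": ["research the topic", "draft the report"],
        "research the topic": ["web search", "summarise sources"],
    }
    return plans.get(task, [])  # leaf tasks have no sub-tasks

def build_tree(task: str) -> dict:
    """Recursively expand a task into a tree of sub-tasks."""
    return {"task": task,
            "subtasks": [build_tree(t) for t in stub_planner(task)]}

tree = build_tree("write a market report")
```

An agent then executes the leaves of this tree in order, feeding results back up, which is where the "plan and act" capacity comes from.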

ChatDev

ChatDev defines itself as a virtual software company. Much like a real software company, it has programmers, reviewers, a CTO and even a CEO. All of them are LLM-based agents interacting with each other to develop the game requested by the user. The interactions follow the waterfall model to properly organize the tasks and the flow of information[4]. ChatDev then releases simple but working games.

Figure 4 — Create customized software with Natural Language Ideas

Other recent projects

“Figure 01” is a humanoid robot whose brain is powered by OpenAI. Its sheer versatility and capacity for real-world interaction make it an obvious example of how far AI agents can go today in the physical world[5].

Devin, by Cognition, is an AI agent meant to be a software developer[6]. Devin reaches 13.86% on SWE-bench (a dataset of GitHub issues), where GPT-4 alone only reaches 1.74%. Devin shows how agents can improve the performance of LLM applications. It also shows that they still have limitations and cannot address all problems yet.

What it means for companies

The ability of large language models to handle complex and varied tasks has opened up opportunities for automation of activities that previously required humans. AI agents will reveal other types of possibilities with the automation of decisions and actions.

Companies should be on the lookout for these new possibilities and threats (for example, a GPT-4 agent can automatically hack some websites through SQL injection [7]).

Ultimately, a business could invest resources to create AI agents tailored for particular scenarios, which could then be made available for use by other organizations. Some companies might then start to sell AaaS (Agent as a Service)[1].

Limitations of LLM powered agents

Although LLM-powered agents are a promising technology, they are obviously limited by their computational cost. An LLM agent with a high refresh rate and millisecond reaction times is not possible with today's technology.

Moreover, since LLM agents necessitate numerous interactions with an LLM, this can hinder scalability. As agents become more capable and generate sophisticated responses, their computational demands will increase. Consequently, their scalability will be less extensive than simpler chat applications.

Our very own autonomous agent: Auto-Crawler

Auto-Crawler is an LLM-based agent that automates portions of web scraping development, significantly accelerating the process. It is a good example of the kinds of interactions an LLM can have with a website today.

Auto-Crawler demonstration

The business problem

Web scraping consists of turning public unstructured data into a structured format. Sia Partners builds web scrapers for its own needs or for those of its clients.

To develop such a scraper, three steps are usually necessary:

  1. Mapping — listing the existing buttons and selecting which ones to click,
  2. Crawling — writing the script that clicks the buttons,
  3. Parsing — extracting the data from the HTML document.

The first two steps require a developer to manually establish an exhaustive list of all the available buttons, and they are time-consuming. Parsing is then handled by a separate component.

Auto-Crawler is an autonomous agent that aims to automate the mapping and crawling steps, meaning the developer would ideally only need to supervise the agent and make sure it correctly fills the form to get the right data.

Today, Auto-Crawler is still in development and works in collaboration with a human developer to ensure that 100% of the data is scraped. It can already save a lot of time by developing the simplest parts of the crawling scripts. The exact amount of time saved is yet to be determined, but based on our first experiments it could easily reach 50%.

How it works

Auto-Crawler is a set of three sub-agents placed one after the other. The mapping process is split into two parts (a mapping agent and a grounding agent), and the crawling phase is handled by the developer agent.

In this first version, every sub-agent interacts with a human to improve reliability, so that it can be used on actual missions.

Figure 5 — The different bricks of auto-crawler

Mapping agent

The mapping phase uses a multimodal LLM (GPT-4 Vision) to map all the buttons present on the form. It uses Playwright to take screenshots of the website, then feeds the screenshots to GPT-4 Vision. Using images instead of raw HTML at this stage is interesting because websites are designed to be displayed visually, so the information is well organised[8].

The LLM is then instructed to list all the buttons visible in the screenshots, using a few-shot approach with pictures of other websites. The few-shot approach made an important difference to performance. The detected elements are then collected in an Excel spreadsheet.
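A few-shot vision request of this kind could be assembled as below. This is a sketch, not Auto-Crawler's actual prompt: the instruction text and example labels are placeholders, and the message layout follows the OpenAI chat format with base64-encoded images.

```python
import base64

def image_part(png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes as an OpenAI-style image content part."""
    b64 = base64.b64encode(png_bytes).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

def build_messages(target_png: bytes, few_shots: list) -> list:
    """Build a few-shot vision prompt: worked examples, then the target."""
    content = [{"type": "text",
                "text": "List every button visible in the screenshot."}]
    for shot_png, expected in few_shots:   # screenshots of other websites
        content.append(image_part(shot_png))
        content.append({"type": "text", "text": f"Expected answer: {expected}"})
    content.append(image_part(target_png))  # the screenshot to map
    return [{"role": "user", "content": content}]

# Placeholder bytes stand in for real Playwright screenshots.
msgs = build_messages(b"\x89PNG...", [(b"\x89PNG...", "Search, Reset")])
```

Interleaving example screenshots with their expected answers before the target image is what gives the model concrete patterns to imitate, which matches the performance gain reported above.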

This spreadsheet can be reviewed by a human to add the missing buttons and other elements not found by the agent. On benchmarked websites, 77% of the buttons were correctly detected.

Grounding agent

The role of the grounding agent is to find the HTML code associated with the buttons detected by the mapping agent.

This agent first selects a subset of relevant HTML nodes and reranks them so the most likely elements come first[9]. Batches of five elements are then sent to GPT-4, which is prompted to extract the right element if it is present. The reranker reduces the number of GPT-4 API calls by a factor of 10.
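The rerank-and-batch step can be sketched as follows. The scorer here is a toy keyword overlap standing in for the real reranker model, and the node strings are hypothetical; the point is how reranking shrinks the candidate pool before the expensive LLM calls.

```python
def rerank(button_label: str, nodes: list) -> list:
    """Order candidate HTML nodes by a toy relevance score (stub reranker)."""
    def score(node: str) -> int:
        return sum(word in node.lower() for word in button_label.lower().split())
    return sorted(nodes, key=score, reverse=True)

def batches(items: list, size: int = 5) -> list:
    """Split a list into batches of `size` (one LLM call per batch)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 49 candidate nodes, only one of which matches the target button.
nodes = [f'<div id="n{i}">' for i in range(48)] + ['<button>Submit query</button>']
ranked = rerank("submit query", nodes)
top = ranked[:5]          # keep only the most likely elements
calls = batches(top)      # 1 batch instead of 10 for all 49 nodes
```

Sending only the top-ranked handful to GPT-4 is what yields the factor-of-10 reduction in API calls mentioned above.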

At the end of this stage, the mapping spreadsheet is pre-filled and needs to be checked and completed by the developer. The element is correctly identified 86% of the time.

Developer agent

The completed mapping spreadsheet is then passed to a traditional LLM, which transforms each row into a few lines of code to crawl the website. Here, the LLM is given a pre-prompt with a dozen example lines of code.
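To make the row-to-code step concrete, here is a deterministic sketch of the kind of transformation the developer agent performs. The row format and the Playwright-style calls are illustrative assumptions; the real agent generates the code with an LLM guided by the example lines in its pre-prompt.

```python
# Hypothetical mapping-row format: {"action": ..., "selector": ..., "value": ...}
TEMPLATES = {
    "click":  'page.click("{selector}")',
    "fill":   'page.fill("{selector}", "{value}")',
    "select": 'page.select_option("{selector}", "{value}")',
}

def row_to_code(row: dict) -> str:
    """Turn one mapping row into a line of crawling code."""
    return TEMPLATES[row["action"]].format(**row)

line = row_to_code({"action": "click", "selector": "#search-btn"})
```

Using an LLM instead of fixed templates lets the agent handle rows that do not fit a known pattern, at the cost of needing the human review described below.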

The agent generates a file called actions.py, which is inserted into boilerplate code to complete the scraper.

As with the previous agents, the code is checked by a developer to ensure it behaves as intended. The success rate on this part is 90%.

Benchmark results

The solution was benchmarked on four websites that were not seen in the pre-prompt examples; the percentages given above (77%, 86% and 90%) are the share of elements correctly handled by each agent. Each agent was given a perfect set of input data, meaning errors did not propagate from one sub-agent to another.

If these separate sub-agents were combined into a single agent, the success rate would be approximately 60% (the product of the three percentages). In practice, however, the elements missed by the first agent are also the ones that fail in the other agents, so the performance of the complete agent lies somewhere between 60% and 77%.
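The 60% lower bound is simply the product of the three per-agent success rates measured in isolation:

```python
# Per-agent success rates measured in isolation.
mapping, grounding, developer = 0.77, 0.86, 0.90

# Lower bound for the full pipeline: assume failures are independent.
combined = mapping * grounding * developer
print(f"{combined:.1%}")  # → 59.6%
```

Because failures overlap in practice (hard elements fail everywhere), this independence assumption is pessimistic, which is why the true figure sits between 60% and the 77% mapping ceiling.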

By handling a significant share of the work, Auto-Crawler could have a significant impact on the development time of web scrapers. For developers, this change in workflow is already meaningful: it takes away the repetitive tasks and lets them focus on more complex tasks where they add more value.

Next steps

Auto-Crawler will be put to the test on the field to measure how much time it can save when developing a web scraper.

This agent is in its first version, and many improvements remain possible. The main step toward full autonomy is interacting with the website in a dynamic way.

In its second iteration, Auto-Crawler will interact with the website to reveal all the hidden pop-up and dropdown menus, and it will test the scraper code to see whether the desired page loads. If it fails, it will use the failed results to iterate and improve.

This development phase will aim to take the previous percentages as close to 100% as possible to achieve full autonomy of the agent.

Key takeaways from Auto-Crawler

Auto-Crawler is a good example of what an LLM agent can do with relatively small development efforts.

It also shows that LLMs cannot really tackle complex problems in one shot. Thus, collaboration between a human and an AI agent is what works best for now: the AI agent handles the repetitive, simple tasks while the human focuses on the more complex ones.

Many improvements can still be achieved, the biggest one being interacting with the website and the scraper code to boost the performance of the agent.

Although Auto-Crawler's actions have a very limited impact, the capabilities of AI agents will grow, leading them to take increasingly important actions.

Ethical considerations around AI agents

When working with machine learning algorithms, explainability quickly emerges as a fundamental aspect of the project. Business owners tend to prefer a simple algorithm with high explainability to a complex algorithm with low explainability.

AI agents will face the same constraints, which is a good thing.

Giving AI the capacity to act is a step into the unknown

Traditional LLM applications involve searching, summarization, chatbots, content creation, content extraction, etc. These applications have their own risks, as they can lead humans into erroneous decisions, but the human remains ultimately responsible for their actions, even if fooled by an AI system.

AI agents, on the other hand, can in theory make decisions and act without human help, and those actions can be irreversible. Wrongly biased LLMs could then produce unfair decisions and treatment.

As LLM agent actions grow, so will the impact of their biases. Hence the increased need for mechanisms to monitor decisions and biases.

AI agents will be trusted or won’t be

Trust is a fundamental aspect for AI agents. To earn it, AI will have to build a set of values on which humans will be able to count.

Honesty and transparency

Explainable AI systems are a must-have, especially when those systems make important decisions with real-life consequences. For AI to be trusted, it needs to be honest and transparent, and these values must be integrated at every level.

Some simple prompt engineering can ensure that the LLM explains its reasoning and bases its decisions on it. Building RAG applications that cite their sources (such as SiaGPT) is also a good example of honesty and transparency[10].

Adversarial robustness

When faced with noise in their input data, LLMs tend to output different results than expected, which is a problem for agents, as they need to be reliable.

There are also some forms of attacks like dataset poisoning, backdoor attacks and prompt-specific attacks that can force the LLM to output wrong answers and potentially take dangerous decisions and actions.

It is therefore of utmost importance for agents to be resilient to any form of prompt hacking or any other technique that might alter their judgement.

Risks and threats of AI agents

Many new risks and threats are being created by AI. One notable concern is unemployment. Adapting education so that individuals acquire valuable skills and knowledge will be crucial in an AI-powered world. Appropriate policies should also be put in place to ensure safety nets during this transition[11].

The Pandora's box of AI is rapidly being opened, with the ability to do both good and evil, and multiple threats will emerge from this technology. Stanford University released a paper in 2022 on the opportunities and risks of foundation models (including GPT-3), describing many possible threats[12].

Finally, AI technologies are not dangerous per se; the use we make of them is. The responsibility therefore rests in everyone's hands to ensure that AI has a positive impact on society. To quote the famous words of François Rabelais: “Science without conscience is but the ruin of the soul”.

Conclusion

Agents are a new way of approaching AI applications. By interacting with their environment, they can achieve more than traditional LLM applications. Such technologies exist today and are an interesting angle for companies to find new AI use cases.

Auto-Crawler is a good example of an LLM-based agent. By using multimodal LLMs, it can tackle complex tasks that previously required human intervention. Today it makes development more efficient; tomorrow it could become fully autonomous.

Ultimately, the advancement of AI agents demonstrates the potential of artificial intelligence and raises important ethical considerations. Initiatives must be taken so that AI systems have the right set of values to benefit society.


Heka.ai

We design, deploy and manage AI-powered applications