Beyond Rankings: Measuring Website Visibility in the Age of LLMs


In the good old days of search engine optimization (SEO), calculating a webpage's visibility in search engines was easy: you simply computed its mean position for the keywords relevant to the page. That changed drastically with the advent of large language models and their incorporation into search experiences commonly referred to as LLM-powered Search Engines (LSEs), such as ChatGPT, Copilot, and Google AI Overview. Traditional impression metrics can no longer tell you how far a site's reach extends within these LSEs.

Unlike traditional search engines, which deliver ranked lists of links, LLM-powered search engines transform the search journey by synthesizing information from multiple sources into a single view. Users no longer need to navigate through several web pages to find what they are looking for.

This creates a unique challenge: accurately measuring the visibility of a website referenced within AI summaries. References can appear in different formats, and user behavior within these environments is not yet fully understood.

This article proposes a novel framework for analyzing website visibility within LSEs. We introduce the "AIM Score" for different types of prompts (akin to keywords in SEO), which combines several key metrics, weighted by importance, to reflect a website's true exposure within LSE responses. These metrics can be broadly categorized into:

1. Probability of Invoking AI Summaries (PIA)

At one point, around 85% of search queries triggered an AI summary. That trend has since reversed sharply: according to current estimates, around 14% of searches trigger an AI summary.

It is also worth noting that queries from different domains have different likelihoods of triggering an AI overview: for some domains, such as law or medicine, it can be as low as 1%, while for others it can reach 28%.

Experts suggest that Google is deliberately rolling this feature out slowly amid the backlash over some AI-generated results. Slow rollout or fast, however, the future belongs to AI summaries, and the likelihoods cited above are bound to change soon. We are keeping a close eye on this metric and adjusting its weight in our final score accordingly.
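To make this concrete, here is a minimal sketch of how PIA could be estimated by sampling a fixed set of prompts per domain each day, in line with the tracking approach described in the FAQ below. The `fetch_serp` and `has_ai_summary` helpers are hypothetical placeholders for whatever retrieval and detection mechanism you use.

```python
# A minimal sketch of PIA estimation, assuming a fixed set of prompts is
# sampled per domain each day. `fetch_serp` and `has_ai_summary` are
# hypothetical helpers for retrieving a search result page and detecting
# whether it contains an AI summary.
def estimate_pia(prompts_by_domain, fetch_serp, has_ai_summary):
    """Return, per domain, the share of prompts that triggered an AI summary."""
    pia = {}
    for domain, prompts in prompts_by_domain.items():
        triggered = sum(1 for p in prompts if has_ai_summary(fetch_serp(p)))
        pia[domain] = triggered / len(prompts) if prompts else 0.0
    return pia

# Example: estimate_pia({"law": law_prompts, "retail": retail_prompts}, fetch, detect)
```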

2. Brand Sentiment Score (BSS)

This metric measures the sentiment behind brand mentions in the LSE response, assigning a floating-point score between -1.0 (negative) and +1.0 (positive).
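As a rough illustration, the sketch below computes a BSS using NLTK's VADER analyzer, whose compound score conveniently falls in the same [-1.0, +1.0] range; any sentiment model with a comparable output scale could be swapped in, and this is not necessarily the model used in our framework.

```python
# A rough sketch of a BSS computation using NLTK's VADER analyzer, whose
# compound score already falls in the required [-1.0, +1.0] range.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
nltk.download("punkt", quiet=True)          # sentence tokenizer model

def brand_sentiment_score(response_text: str, brand: str) -> float:
    """Average VADER compound score over sentences that mention the brand."""
    analyzer = SentimentIntensityAnalyzer()
    mentions = [s for s in nltk.sent_tokenize(response_text)
                if brand.lower() in s.lower()]
    if not mentions:
        return 0.0  # treat "no mention" as neutral
    return sum(analyzer.polarity_scores(s)["compound"] for s in mentions) / len(mentions)
```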

3. Word Count Position Adjusted Score (WCPAS)

Some researchers recommend calculating this metric by counting the words in sentences that mention a site while adjusting for their placement within the LSE response, as outlined here. Higher word counts coupled with early placement indicate greater user exposure.
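The exact weighting from that recommendation is not reproduced here, but the following sketch shows one plausible formulation, assuming a simple linear positional decay so that earlier sentences count more.

```python
# One plausible WCPAS formulation, assuming a linear positional decay.
# The exact weighting in the cited recommendation may differ.
import re

def wcpas(response_text: str, site: str) -> float:
    """Sum of position-weighted word counts for sentences mentioning the site."""
    sentences = re.split(r"(?<=[.!?])\s+", response_text)
    n = len(sentences)
    score = 0.0
    for i, sentence in enumerate(sentences):
        if site.lower() in sentence.lower():
            word_count = len(sentence.split())
            position_weight = (n - i) / n  # 1.0 for the first sentence, decaying linearly
            score += word_count * position_weight
    return score
```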

That said, analyzing anonymized usage and click data from our own RAG solution and partner platforms made it clear that real user behavior does not match the recommendation above. Because that data is confidential, however, it cannot be shared here.

4. Visible Bibliographic Reference Rank (VBRR)

This metric measures how prominent a website is by tracking whether it has been explicitly referenced within the responses generated by AI search engines.

5. Expandable Bibliographic Reference Rank (EBRR)

This metric goes beyond VBRR to cover instances where a website is referenced but the full citation or link is hidden behind a "click-to-expand" control. Such links are naturally less valuable than those counted under VBRR.
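The sketch below illustrates one way to score VBRR and EBRR together, assuming the citations have already been parsed from the response into an ordered list. The reciprocal-rank scoring and the 0.5 discount for expandable references are illustrative choices, not the framework's actual weighting.

```python
# A combined VBRR/EBRR sketch. Parsing citations out of a real LSE response
# is product-specific; here `citations` is assumed to be an ordered list of
# (url, is_expandable) tuples already extracted from the response.
def reference_rank(citations, site: str, expand_discount: float = 0.5) -> float:
    """Reciprocal-rank score for the site; expandable references are discounted."""
    for rank, (url, is_expandable) in enumerate(citations, start=1):
        if site in url:
            base = 1.0 / rank
            return base * expand_discount if is_expandable else base
    return 0.0  # site not cited at all

# Example: reference_rank([("https://a.com/x", False), ("https://b.com/y", True)], "b.com")
```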

6. Nature of Query

Different kinds of user queries, such as informational or transactional ones, can influence how relevant a referenced website is within the LSE response.
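For illustration, here is a naive keyword heuristic for classifying query intent. The keyword lists are placeholders; production systems would normally use a trained intent classifier.

```python
# A naive keyword heuristic for query-intent classification, purely to
# illustrate the idea. The keyword lists are illustrative placeholders.
def classify_query(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("buy", "price", "discount", "order")):
        return "transactional"
    if any(w in q for w in ("login", "homepage", "official site")):
        return "navigational"
    return "informational"
```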

7. Cumulative User Behaviour (CUB)

While the factors mentioned so far provide a solid foundation, they do not capture an individual user's subjective perception of a website mentioned in the summaries. We therefore recommend including a metric that analyzes user perception of the links and citations provided in LLM-powered summaries.

G-Eval is currently considered a state-of-the-art LLM evaluation tool known for its high correlation with human judgment in subjective tasks. The original paper can be accessed here.

The factors that influence user perception of a website in a summary include:

  1. Relevance: How well does it quote the material suited to the user's query?
  2. Influence: To what extent does the LSE response rely on the cited website?
  3. Uniqueness: Does the citation offer fresh information that is not readily found elsewhere?
  4. Subjective Position & Count: These metrics go beyond simple word count to capture how prominently a website is presented within the response (position) and how much content is perceived to come from it (count).
  5. Click Probability: This metric estimates the likelihood of a user clicking on the website reference within the LSE response.
  6. Material Diversity: This metric captures how varied the material drawn from the website is within the LSE response.
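As a sketch of how these factors could be turned into a G-Eval-style evaluation, the rubric prompt below asks an LLM to rate each factor on a 1-to-5 scale. The prompt wording is illustrative, and you would plug the rendered prompt into whichever LLM client you use.

```python
# A G-Eval-style rubric prompt covering the CUB factors above. The wording
# is illustrative; send the rendered prompt to your LLM client of choice
# and parse the scores from its reply.
CUB_RUBRIC = """You are evaluating how a cited website is perceived inside
an AI-generated search summary. Rate each criterion from 1 to 5:
- Relevance: how well the cited material suits the user's query
- Influence: how much the summary relies on the cited website
- Uniqueness: whether the citation adds information not readily found elsewhere
- Subjective Position & Count: how prominently the website is presented
- Click Probability: how likely a user is to click the reference
- Material Diversity: how varied the material drawn from the website is

Query: {query}
Summary: {summary}
Cited website: {site}

Return one line per criterion in the form "name: score"."""

def cub_prompt(query: str, summary: str, site: str) -> str:
    """Render the rubric for a specific query, summary, and cited site."""
    return CUB_RUBRIC.format(query=query, summary=summary, site=site)
```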

Conclusion

AIM Score is the most advanced and comprehensive rating system that helps digital marketers evaluate how visible their web content is in response to various user prompts (similar to keywords in SEO). This score combines several essential metrics, weighted based on their significance, to provide a holistic view of a site's visibility in AI summaries created by LLM-powered search engines.
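For readers who want a concrete picture, here is a minimal sketch of how the component metrics might be combined into a single score. The weights below are illustrative placeholders only, not the proprietary weighting used by AI Monitor, and each metric is assumed to be normalized to [0, 1] beforehand.

```python
# A minimal sketch of the final combination step. The weights are
# illustrative placeholders, and every metric is assumed to be
# normalized to [0, 1] before weighting.
DEFAULT_WEIGHTS = {
    "pia": 0.20, "bss": 0.15, "wcpas": 0.15, "vbrr": 0.15,
    "ebrr": 0.10, "query_nature": 0.10, "cub": 0.15,
}

def aim_score(metrics: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of normalized component metrics; missing metrics count as 0."""
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights)

# Example: aim_score({"pia": 0.14, "bss": 0.8, "vbrr": 0.5})
```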

By leveraging the metrics offered by AI Monitor, content creators can now anticipate how well AI summaries will receive their content. This tool helps creators optimize their content for modern search engines, providing deep insights into their content's performance and effectiveness in the age of LLM-powered search.


Avinash

I'm a lawyer and foodie who loves tech and AI 🤖! For the past 10 years, I have been making law & tech play nice with each other 🤝

Frequently Asked Questions

How have LLMs changed the search experience, and why are traditional impression metrics no longer enough?

The advent of LLMs has given rise to LSE experiences such as ChatGPT, Copilot, and Google AI Overview. These platforms integrate information from various sources into a single view, eliminating the need for users to browse through multiple web pages. Consequently, conventional impression metrics are less effective, and fresh ways of assessing a website's reach are needed.

What is the AIM Score?

The AIM Score is a new framework for measuring website visibility on LLM-powered search engines. It combines several vital metrics, each given a weighting based on its importance, to provide an accurate picture of how much exposure a website gets in LSE responses. These include the probability of invoking AI summaries, brand sentiment score, word count position adjusted score, visible and expandable bibliographic reference ranks, nature of the query, and cumulative user behavior.

How is the Probability of Invoking AI Summaries (PIA) tracked?

The probability of a user prompt invoking an AI summary is tracked by searching a fixed number of prompts daily on various LLM-powered AI search engines. The results are then evaluated to determine the percentage of search queries that invoke AI summaries. The metric varies across domains, such as law or medicine, with some showing much higher probabilities than others. The PIA metric is continually adjusted to reflect changes in user behavior and AI summary trends.

What is the Brand Sentiment Score (BSS), and why does it matter?

The Brand Sentiment Score (BSS) assesses the sentiment of brand mentions in LSE responses on a scale from -1.0 (negative) to +1.0 (positive). It is essential because it reflects a brand's voice and perception across AI-generated summaries, influencing the brand's reputation in users' minds.

How is the Word Count Position Adjusted Score (WCPAS) calculated?

WCPAS counts the words in sentences that mention a site and adjusts for their placement relative to other sentences within the LSE response. Higher word counts appearing in early positions signal more user exposure and therefore greater visibility. This helps determine how prominently a website is featured in an AI summary.

What is the difference between VBRR and EBRR?

VBRR gauges how prominently a website is explicitly referenced within AI-generated responses, while EBRR accounts for references hidden behind "click-to-expand" functionality. Together, these metrics indicate how visible or accessible a website's references are: VBRR reflects direct visibility, whereas EBRR reflects references a user must expand to see.