The Wild Wild West of GEO

About the author

Paul is course leader for the AMEC Strategic Communication Planning and Measurement Diploma. He is a founding partner of CommsClarity consulting, helping PR professionals use measurement, analysis and evaluation to make the right data-driven decisions.

In the revisionist western Unforgiven, director Clint Eastwood systematically deconstructs the myths of the wild west.  One of the primary lenses we see this through is W.W. Beauchamp, a dime novelist who writes romanticised accounts of outlaws like ‘English Bob’ to sell to the middle classes in the cities of the US east coast.  Beauchamp is gradually exposed to the messy realities of frontier life, revealing a widening gap between the narrative in his novels and what is really happening.

We are seeing a new frontier develop as AI technology evolves and is adopted more widely.  In PR and communications, there has been significant interest in how AI chatbots and AI search are increasingly influencing audiences, under labels such as ‘Generative Engine Optimisation’ or ‘Answer Engine Optimisation’.

Like a gold rush in a frontier town, many organisations are jumping in to offer advice and sell data to help marketers and communicators influence these tools.  But like W.W. Beauchamp, are they providing a mythologised version of how AI tools are being used, one that may differ from the underlying reality?

In the recently published book ‘AI For Public Relations’, co-editor Stephen Waddington personally wrote the chapter on Generative Engine Optimisation (GEO), describing it as “one of the most contested areas in the book…advocates argue that earned and owned media have a significant impact on GEO results, offering public relations practitioners a valuable opportunity to enhance brand visibility.  Critics caution that such claims are naïve: the outputs of generative engines are too complex, too opaque and too autonomous to be reliably attributed to earned media”.

There are also very real ethical concerns about how brands could manipulate AI models and the audiences that increasingly trust them to provide information.  Following the infamous Radio 4 debate between the PRCA’s Sarah Waddington and Sir Martin Sorrell, where Sorrell advocated for brands to “flood the internet with content”, the PRCA launched a counter ‘don’t flood the internet’ campaign – arguing that this approach “doesn’t create trust, reputation, or lifetime value – it just generates noise.  PR …is about judgement over volume, credibility over clicks and impact over impressions”.

Is AI Visibility the new AVE?

But increasingly there are also concerns about the way GEO is measured.  At this year’s AMEC Global Summit in Dublin, a working group led by James Crawford outlined key principles for measuring GEO alongside a published Practitioners guide to AI Measurement.  The group argued that an over-focus on AI visibility alone could be the new AVE – a new ‘vanity’ metric that is conflated with, but potentially disconnected from, communication value.

Best practice should be aligned with the broader ‘Barcelona Principles’, where the outputs in AI answers are understood as part of the broader communications flow of information.  This means looking upstream at what activity and content is influencing AI answers, and further downstream at how this is affecting the relevant stakeholder audience outcomes – and how this aligns with communication objectives, from raising awareness and changing perceptions to motivating behaviour.

Even focussing on AI visibility itself, there are concerns around how this is reliably and accurately measured.  Methodologies are often opaque, leading to a GEO data black box sitting on top of an AI model black box – not a good approach for credible and trustworthy information.  The original Barcelona Principles stated that “transparency and replicability are paramount to sound measurement”…but what if AI systems and the GEO methodologies that measure them are neither transparent nor replicable?

Critics may argue that AI chatbots are ‘transparent’ because you can see an AI-generated answer.  But this is misleading if you can’t see how the answer is generated or if the answer changes for different people – or even for the same person asking the same question multiple times.

What is the answer?

Earlier this year, SparkToro published research showing just how inconsistent AI answers can be.  Based on 600 real human users asking a defined set of prompts multiple times, the resulting data set showed that the probability of the same set of brands appearing in different responses to the same specific prompt was less than one in 100.  This dropped to less than one in 1,000 for the brands to appear in the same order.
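
As a rough illustration of what such a consistency check involves, the sketch below compares every pair of answers to the same prompt and reports how often they name the same set of brands, and how often they do so in the same order.  This is not SparkToro's method: the brand names and responses are invented, and `agreement` is a hypothetical helper.

```python
from itertools import combinations

def agreement(responses):
    """Return (same-set rate, same-order rate) across all pairs of responses."""
    pairs = list(combinations(responses, 2))
    # Same brands mentioned, regardless of the order they appear in.
    same_set = sum(set(a) == set(b) for a, b in pairs)
    # Same brands mentioned in exactly the same order.
    same_order = sum(a == b for a, b in pairs)
    return same_set / len(pairs), same_order / len(pairs)

# Invented data: brands named in four answers to one identical prompt.
responses = [
    ["Acme", "Bolt"],
    ["Bolt", "Acme"],
    ["Acme", "Bolt"],
    ["Acme", "Crest"],
]

same_set_rate, same_order_rate = agreement(responses)
print(f"Same set of brands: {same_set_rate:.0%} of pairs")
print(f"Same set and order: {same_order_rate:.0%} of pairs")
```

In this toy data, three of the six pairs share the same set of brands but only one pair matches in order.  Real answers list more brands and vary far more, which is why the observed rates fall so low.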

The variability in AI answers seems to be so significant that experiments should be repeated thousands of times, with the measurement reporting both averages and variability (for example mean and standard deviation, or a confidence interval).  This has long been common practice in market research (a poll of 1,000 people, for example, has a margin of error of about ±3% at a 95% confidence level), so similar approaches should be expected in AI visibility.
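
A minimal sketch of what reporting both the average and the variability could look like is shown below.  It assumes a simple normal approximation for a proportion at a 95% confidence level.  The run data is invented, and `margin_of_error` and `appearance_rates` are hypothetical helpers rather than any vendor's methodology.

```python
import math
from collections import Counter

def margin_of_error(p, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

def appearance_rates(runs):
    """Share of runs in which each brand appears, with its 95% margin of error."""
    n = len(runs)
    # Count each brand at most once per run.
    counts = Counter(brand for run in runs for brand in set(run))
    return {brand: (c / n, margin_of_error(c / n, n)) for brand, c in counts.items()}

# Invented data: brands named in five repeated answers to one prompt.
runs = [
    ["Acme", "Bolt", "Crest"],
    ["Bolt", "Acme"],
    ["Acme", "Dune"],
    ["Crest", "Acme", "Bolt"],
    ["Dune", "Bolt"],
]

for brand, (rate, moe) in sorted(appearance_rates(runs).items()):
    print(f"{brand}: appears in {rate:.0%} of runs, +/- {moe:.0%}")

# Worst case (p = 0.5) for a poll of 1,000 people, the market research benchmark.
poll_moe = margin_of_error(0.5, 1000)
print(f"Poll of 1,000 people: +/- {poll_moe:.1%}")
```

With only five runs the margin of error is so wide that the visibility figure is close to meaningless, and the normal approximation itself is unreliable at that sample size.  The margin only narrows to poll-like levels with hundreds or thousands of repetitions.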

So now what is the question?

Another key issue is how representative AI prompts are of what real users are asking.  The challenge is that the AI companies themselves do not systematically release this data.  This leaves GEO methodologies with three approaches:

First, make it up – or more specifically use AI to synthetically generate prompts, which has its obvious problems.

Second, use real search keyword data as a seed to generate AI prompts.  At least this is grounded in real world usage, but people use a Google search differently from an AI conversation.  Similarweb data suggests that ChatGPT prompts average 60 words compared to just three for a Google search, and the longer the prompt, the more variability there is in how the model responds.

The third is to use what is euphemistically termed ‘user panels’.  While this may sound similar to a good old-fashioned market research panel where people know that they are expected to fill in surveys, in reality this is likely to involve the controversial supply chain of ‘clickstream’ data.  This often involves users downloading browser plug-ins – for example a free VPN or ad-blocker – but in doing so they may unwittingly agree to small print that allows the software to extract their activity while using a browser.

This was already becoming an ethical concern in tracking Google searches or website traffic, but it is dialled up to 11 if users are having long and deeply personal conversations around health and mental well-being, career and financial advice or relationships (which according to Anthropic are the main topics that people seek guidance on).

Beyond these ethical concerns, clickstream data is unlikely to be representative of real user demographics and psychographics, being skewed to tech-orientated males.

When a metric becomes a target, it stops being a good metric.

Because of this lack of real-user information, the temptation is to either knowingly or unknowingly game the system.  GEO methodologies often focus on prompts asking about product advice, comparisons, or recommendations.  But where we do have real user data, conversations about products appear to represent a very small proportion of how these tools are actually being used.

For example, in a data set of 2.6m real user conversations with ChatGPT analysed by academics at Harvard, just 2% of conversations were about purchasable products.  A more recent analysis of Claude data from Anthropic this year showed a similar result – less than 4% of people asking for guidance and advice were asking about consumer products (and it was less than 0.2% of the broader data set of all conversations).

BBC journalist Thomas Germain has shown how easy it is to game the system by writing an article on his personal blog about winning a hotdog eating competition.  He then demonstrated how this fake content fed through into a Gemini response to his question ‘who are the best hotdog eating tech journalists’.  Not only does this illustrate the ‘flood the internet with content’ issue, it also shows the dangers of being over-specific with prompts to prove AI visibility.

Goodhart’s law states that when a metric becomes a target, it ceases to be a useful metric.  If AI visibility becomes a vanity metric, it will be tempting for comms teams to be over-specific with the prompts that they use in order to maximise this number, divorcing themselves from the reality of what their real audience is doing.

I will not reveal my sources!

The upstream analysis of understanding the sources that contribute to AI answers can also be problematic.  The typical method is to use the citations mentioned in the answers themselves.  But as with the content of AI answers, there is wide variation in source citations, which can change significantly across different prompts, different models, different users and over time.  In its 2025 AI Visibility Index study, Semrush showed just a 32% overlap in the top 100 sources between ChatGPT and Google’s AI Mode.  Meanwhile, data from the Ahrefs AI Search Benchmark Report suggests that almost half of the cited sources in Google’s AI Overviews change each time in response to the same prompt.
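
To make figures like these concrete, here is a minimal sketch of how overlap and churn between lists of cited sources could be computed.  It is not the method used by Semrush or Ahrefs: the domains are placeholders, and `overlap_share` and `churn` are hypothetical helpers.

```python
def overlap_share(sources_a, sources_b):
    """Share of the sources in list A that also appear in list B."""
    a, b = set(sources_a), set(sources_b)
    return len(a & b) / len(a) if a else 0.0

def churn(previous, current):
    """Share of previously cited sources that are missing from the current run."""
    return 1.0 - overlap_share(previous, current)

# Placeholder data: top cited domains for the same prompt on two different models.
model_x = ["a.example", "b.example", "c.example", "d.example"]
model_y = ["c.example", "d.example", "e.example", "f.example"]

# Placeholder data: domains cited by one model on two consecutive runs.
run_1 = ["a.example", "b.example", "c.example"]
run_2 = ["a.example", "d.example", "e.example"]

print(f"Overlap between models: {overlap_share(model_x, model_y):.0%}")
print(f"Churn between runs: {churn(run_1, run_2):.0%}")
```

Any single snapshot of citations is one draw from a moving target, so overlap and churn need to be tracked across models and repeated runs before a source list can be treated as stable.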

Cited sources can often be incorrect or hallucinated.  Analysis from the Tow Center for Digital Journalism at Columbia University suggested that 60% of news citations are inaccurate, while Nature has published research showing that medical responses from AI chatbots either contradicted, or were not supported by, between 50% and 90% of the sources cited.  Meanwhile, researchers at the University of Amsterdam found that 57% of citations were not faithful to the source and were instead ‘post-rationalised’.

Much of the content in AI answers does not come from cited sources at all.  More general queries often rely on the pre-training data of the underlying Large Language Model, where sources are not referenced.  When an AI chatbot does a live web search, the system converts the larger question into a series of ‘fan-out’ sub-queries that then follow a traditional search process.  The results are chunked up and submitted back into the LLM to generate an answer.  Much of the content that goes through this process may not appear as a citation even though it contributes to generating the answer.

In its research paper ‘The Attribution Crisis in LLM search results’, the Social Science Research Council showed big gaps between the content in answers and the sources being cited, even when AI models were explicitly in web search mode.  25% of OpenAI’s GPT-4o responses and 90% of Google’s Gemini responses did not cite any sources, and Perplexity’s Sonar, despite showing more citations, didn’t show links to half of the websites that it sourced.

Mythbusting the narrative.

Alongside upgrading its user interface to integrate traditional and AI search more seamlessly, Google has released a guide on optimising for AI search.  This ‘mythbusts’ some of the emerging narratives and re-emphasises the importance of established SEO best practices and of focussing on content that audiences will value:

“As generative AI search evolves, so have the theories and practices – and sometimes, the misconceptions – surrounding it. While terms like Answer Engine Optimization (AEO) or Generative Engine Optimization (GEO) are common online, many suggested ‘hacks’ aren’t effective or supported by how Google Search actually works.”

Or, as leading marketing influencer Tom Goodwin commented, “like I’ve been saying for 2 years, it’s basically just focus on great content and solid SEO and ignore most AI-specific ‘hacks’ – there’s a LOT of snake oil in this space!”

Marketers and comms professionals should not have to make do with a W.W. Beauchamp-style mythologised narrative of how GEO can be influenced.  As with any new frontier, there is an emerging need for stronger governance and a code of good practice.

The new GEO principles, alongside the related Data Quality Initiative, are an important and much-needed step in this direction – there is a new sheriff in town!

Follow the Oregon Trail: some common sense guidance to help navigate through the wilderness

  • Don’t over-index on GEO just because it has become popular – think of AI chatbots as part of a broader ecosystem of channels and influence.
  • Understand your target audiences first – what are they interested in, what questions do they ask and what sources of information do they use and trust.
  • Triangulate different data sources and information to guide decision making – don’t over-focus on individual vanity metrics that may not represent audience outcomes or communication goals.
  • If in doubt, keep going with basic PR principles – be clear and consistent with your messaging across a variety of channels that influence and engage your target audience.
  • Think long term. Yes, there is data to suggest that AI chatbots have a recency bias – for example Muck Rack’s Generative Pulse suggests that slightly more than half of journalism citations are up to 12 months old – but that leaves around half of citations that are older and, as mentioned above, this ignores the content that is not cited.  Reputation compounds over time, consistency matters and the internet is the opposite of today’s news being tomorrow’s fish and chip paper.
  • …but prepare to act quickly. Effective comms measurement often works best at two paces – thinking fast and slow.  The slow is the tracking of trends over time with quarterly and annual reviews to support strategic planning.  The fast is the real time monitoring to spot opportunities and threats to your organisation, your sector and the broader landscape.  Reputations take years to build but hours to destroy and effective measurement should help to manage both.
