On May 5th, an email revealed a significant leak of internal API documentation from Google’s Search division. The documents were shared by an anonymous source, later identified as Erfan Azimi, and allegedly expose various internal practices and techniques Google uses to rank search results. The leak has caused concern within the tech community, as it potentially provides valuable insight into Google’s search algorithms and ranking factors. Some experts have cautioned, however, that the documents do not necessarily signify any wrongdoing on Google’s part; they may simply reflect standard industry practice. As scrutiny of the leaked documents continues, many are watching to see how Google will respond, and businesses and SEO practitioners are eager to learn whether the revelations will change how they approach search engine optimization. The full fallout from the leak remains to be seen, as the tech community speculates on the potential implications for Google and the wider industry.
The leaked documents challenge many of Google’s public statements about their ranking methods and data collection practices. They point to the use of full clickstream data and a system known as “NavBoost.” This system significantly utilizes user interactions to improve search results.
This information suggests Google’s deep reliance on user data, including click history and engagement metrics. The revelations also point towards the use of whitelists during significant events like the COVID-19 pandemic and democratic elections to control search result visibility.
Key Takeaways
- The leak challenges Google’s public search practices.
- User data like clicks and engagement play a substantial role in rankings.
- Whitelists were used during major events for search result control.
Is this API Leak Authentic? Can We Trust It?

Verifying the authenticity of the leaked API Content Warehouse documents was a crucial step. The verification process began with consulting several former Google employees.
Three ex-Googlers responded, providing their insights on the legitimacy of these documents. One of them stated they were not comfortable reviewing or commenting on the documents. The other two, however, provided more concrete feedback.
One mentioned that although they did not have access to the code during their tenure at Google, the documents appeared genuine. They noted that the documents exhibited characteristics typical of an internal Google API. Another emphasized the use of Java and the adherence to Google’s internal documentation and naming standards. They remarked that the structure and approach aligned with what they remembered of Google’s internal documents.
A technical SEO expert was consulted to further examine the intricate naming conventions and technical details. Mike King, the founder of iPullRank, reviewed the leaked documents.
During a detailed phone call, King confirmed that the documents seemed legitimate and contained substantial information about Google’s internal operations.
| Source | Insight |
| --- | --- |
| Ex-Googler 1 | Not comfortable reviewing or commenting on the documents |
| Ex-Googler 2 | Did not have access to the code during their tenure, but said the documents appeared genuine and exhibited characteristics of an internal Google API |
| Ex-Googler 3 | Emphasized the use of Java and adherence to Google’s internal documentation and naming standards |
These expert reviews collectively point towards the legitimacy of the leaked documents. Their insights underline the distinctive internal features of the documents, backed by thorough technical analysis from industry experts.
Qualifications and Motivations for this Post
The individual behind this post possesses extensive experience and deep-seated motivations to address the current situation regarding the Google document leak. Although they are no longer actively working in the SEO field, their background and insights remain highly relevant. Let’s explore their qualifications and motivations.
Extensive Experience in SEO
- Career Beginnings: Their journey in the SEO world started in 2001, providing services to small businesses around Seattle. By 2003, they co-founded an SEO consultancy that later became the well-known company Moz.
- Leadership and Influence: Over a span of 15 years, they emerged as a leading figure in search marketing. They wrote “Lost and Founder: A Painfully Honest Field Guide to the Startup World” and co-authored “The Art of SEO” and “Inbound Marketing and SEO”. Their expertise was frequently quoted by prominent publications such as the Wall Street Journal, Inc., and Forbes.
- Whiteboard Friday: For a decade, they led the popular weekly video series, Whiteboard Friday, which became a staple in the SEO community.
- Moz’s Growth: Under their leadership, Moz grew to serve over 35,000 paying customers, achieved revenues exceeding $50 million, and expanded its team to approximately 200 employees. Moz was eventually sold to a private equity firm in 2021. They departed from Moz in 2018 to pursue new ventures.
- Academic Achievements: Although they dropped out of the University of Washington, their work has been cited by major organizations, including the United States Congress and the US Federal Trade Commission, and by well-known media like the New York Times and John Oliver’s Last Week Tonight.
- Patents and Innovations: They created several patents related to web-scale link indexing, work behind the widely used Domain Authority metric. These contributions have significantly impacted the digital marketing world.
Current Ventures
After leaving Moz, they went on to establish two new enterprises:
- SparkToro: This company focuses on developing audience research software. Rand Fishkin continues to demonstrate his expertise and commitment to the field through SparkToro.
- Snackbar Studio: An indie video game developer established in 2023.
Their extensive background and ongoing engagement in the industry make them a credible voice on the nuances of Google’s search mechanisms.
Motivation for Sharing the Information
When approached with information regarding the Google document leak, they were initially sceptical. The individual conveying the information, however, proved credible, thoughtful, and knowledgeable, and the opportunity aligned with their long-standing goal: holding Google accountable for public statements that conflict with private conversations and leaked documents. Their objective is to ensure greater transparency in search marketing.
Personal and Professional Interests
Despite having moved on from SEO as a profession, they remain deeply connected to the community through SparkToro and their extensive network. They feel a responsibility to share information about Google’s inner workings, aiming to shed light on a topic that Google might prefer to keep hidden.
Referred Trusted Sources
Years ago, Danny Sullivan, who is now Google’s Search Liaison, would have been an excellent candidate for sharing such groundbreaking news. Known for his fair, knowledgeable, and balanced approach, Sullivan would have presented this information effectively to the public. With Sullivan’s shift to a role at Google, the task of disseminating this information falls to others.
Importance of Accountability
The aim remains to hold Google accountable and ensure transparency. This responsibility requires someone deeply embedded in the world of SEO, with the ability to reach out to influential figures and platforms. This helps maintain a check on tech giants and their policies regarding search transparency.
Expert Support
The input of SEO experts like Michael King, and insights from ex-Google employees, further validate the authenticity and importance of the leaked documents. This collective effort underscores the significance of these revelations for the SEO community.
What is the Google API Content Warehouse?

The Google API Content Warehouse is an extensive collection of internal API documentation created by Google. This documentation provides detailed information about various attributes and components used within Google’s systems. It serves as a guide for employees working on projects related to Google’s search engine, helping them understand the available data elements and how to use them effectively.
Purpose and Use
This repository collects instructions and explanations about the different API attributes and modules. Each Google team produces similar documentation to help team members familiarise themselves with the data and components they will work with. It’s essentially a detailed inventory, like a library catalog, listing available resources and how to access them.
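To make the “library catalog” idea concrete, an entry in such a catalog can be imagined as a small structured record describing one attribute, its type, and whether it has been deprecated. This is a minimal sketch only; the module and field names below are illustrative assumptions, not the leaked documentation’s actual format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttributeDoc:
    """Hypothetical model of one documented API attribute."""
    module: str                             # module the attribute belongs to
    name: str                               # attribute name as it appears in the docs
    attr_type: str                          # declared data type
    description: str                        # human-readable explanation
    deprecated_note: Optional[str] = None   # set when the docs mark the field as deprecated

# Illustrative entry; the identifier is invented, only the description text echoes the leak
example = AttributeDoc(
    module="PerDocData",
    name="displayName",
    attr_type="string",
    description="Domain-level display name of the website.",
    deprecated_note="Deprecated as of August 2023.",
)

print(example.name, "-", example.deprecated_note or "active")
```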
History and Exposure
The recent exposure of these documents occurred due to a leak on GitHub, where the documentation was briefly made public between March and May of 2024. During this time, the information spread to platforms like Hexdocs, which index public GitHub repositories. This inadvertent exposure allowed the outside world a rare glimpse into Google’s otherwise tightly guarded search mechanisms.
Technical Composition
The documents in the Content Warehouse mirror other public and internal Google documentation, sharing similar notation styles, formatting, and terminology. This consistency helps ensure that Google’s internal communication remains coherent and efficient across different teams and projects.
Importance in Google’s Ecosystem
Given the secretive nature of Google’s search algorithms, the API Content Warehouse plays a critical role in maintaining the efficiency and effectiveness of Google’s search development teams. It serves as a centralised resource for understanding how different modules and features interconnect and function. This is crucial for maintaining the high standards and operational effectiveness expected from one of the world’s leading search engines.
Impact of the Leak
The leak of these documents is unprecedented in the history of Google’s search division, marking a significant event. Such detailed insights into the internal workings of Google Search have never been publicly disclosed at this scale. This has not only piqued the interest of those in the SEO community but also highlighted potential vulnerabilities in how sensitive information is managed within large tech companies.
How certain can we be that Google’s search engine uses everything detailed in these API docs?
Many of the details in the leaked API documents suggest that Google may have used some of the techniques at one point, but certainty remains elusive. Some features in the documents are marked as deprecated, indicating they are no longer in use. Others lack such labels, implying possible continued use as of the leak in March 2024. For instance, notes indicate that the “domain-level display name of the website” field was deprecated as of August 2023.
A reader could reasonably assume the documentation was current as of summer 2023, reflecting changes up to that point. It is also possible that parts of the document have been retired, kept for internal use, or designed for testing.
Key factors to consider include:
- Algorithm Usage: The documents don’t necessarily prove that all listed features are part of the active algorithm. While they offer a deeper look into Google’s possible methods, the absence of certain recent updates like the AI Overviews suggests some data might be outdated.
- Ranking Algorithm: Aspects related to the ranking algorithm can be inferred, but not definitively linked to current practices since the documents don’t display exact weights or usage.
- Deprecated Features: The presence of deprecated features points to evolving practices; deprecated listings imply Google’s shifting focus away from certain methods.
- User Intent and Search Algorithms: User intent is a major factor in Google’s search algorithms. The docs give hints but cannot confirm which specific algorithms are currently used to match user queries with search results.
Google’s rapid changes and updates in search technology, such as the introduction of Gemini AI, are not reflected in the leaked materials. This suggests that the algorithms and practices mentioned could be partially outdated. New additions to Google’s search systems might not be represented in the leaked data, making it challenging to assert their current use.
What can we learn from the Data Warehouse Leak?

The recent leak of Google documents has revealed significant insights into the workings of Google Search, particularly through its AI-driven systems and content ranking features. For instance, the leaked documentation about the Google Search API underscores its extensive use in evaluating and prioritizing web pages on the search engine.
Subdomains and exact match domains are highlighted as ranking factors, suggesting that the naming and structure of a website can impact its visibility on search results.
One clear takeaway is Google’s emphasis on site authority. The documents imply that sites with robust reputations, both online and offline, are likely to be ranked higher. This brings to light the intricate algorithms involved, such as twiddlers and re-ranking functions, which adjust rankings based on various signals like page titles and content relevance.
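The documents do not disclose the twiddlers’ actual logic or weights, but the general idea of a re-ranking pass can be sketched as follows. The signal names, boost values, and thresholds here are invented purely for illustration.

```python
from typing import Dict, List

def apply_twiddler(results: List[Dict], boost_if_title_match: float = 1.2,
                   demote_if_low_authority: float = 0.8) -> List[Dict]:
    """Nudge already-computed scores up or down based on simple signals,
    then re-sort. Multipliers and the authority cutoff are assumptions."""
    for r in results:
        if r.get("title_matches_query"):
            r["score"] *= boost_if_title_match
        if r.get("site_authority", 0) < 0.3:
            r["score"] *= demote_if_low_authority
    return sorted(results, key=lambda r: r["score"], reverse=True)

results = [
    {"url": "https://example.com/a", "score": 0.9, "title_matches_query": False, "site_authority": 0.2},
    {"url": "https://example.com/b", "score": 0.8, "title_matches_query": True, "site_authority": 0.7},
]
print([r["url"] for r in apply_twiddler(results)])
```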
Moreover, the exposure offers a rare look at elements like change history and entities, shedding light on how page titles and even font size can influence search visibility. The role of Google’s Content AI is apparent in analyzing and storing data to improve search accuracy.
Furthermore, the leak has highlighted the importance of ranking features that determine how content gets ranked. It also touches on API documentation, emphasizing the detailed processes followed within Google Search to ensure web results are relevant.
#1: Navboost and the Use of Clicks, CTR, Long vs. Short Clicks, and User Data
Navboost, a tool within Google’s ranking system, plays a crucial role in determining the relevance of search results. It was introduced around 2005 and has been updated significantly over time.
Navboost uses various data signals from user interactions to rank web results on the Search Engine Results Page (SERP). These interactions include different types of clicks: goodClicks, badClicks, and lastLongestClicks.
goodClicks are considered successful clicks, where users find the desired information and spend significant time on the page. These clicks indicate that the content is valuable and relevant to the user’s query.
In contrast, badClicks occur when users quickly return to the search results after clicking on a link, suggesting the content did not meet their needs. This behavior, known as pogo-sticking, signals lower quality.
Another key concept is lastLongestClicks, which refers to the most recent substantial interactions where users stayed on a page for an extended period. These clicks are weighted more heavily in determining the page’s relevance.
Click Types Table
| Click Type | Description |
| --- | --- |
| goodClicks | Successful interactions with high relevance |
| badClicks | Fast return to the SERP, indicating low satisfaction |
| lastLongestClicks | Recent clicks with long user engagement |
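The leak names these click attributes but does not disclose the thresholds behind them, so the dwell-time cutoff in this sketch is an assumption made purely to illustrate how such a classification could work.

```python
def classify_click(dwell_seconds: float, returned_to_serp: bool,
                   is_last_click_in_session: bool) -> str:
    """Classify one click event into the categories named in the leak.
    The 30-second threshold is an assumption, not a documented value."""
    if returned_to_serp and dwell_seconds < 30:
        return "badClick"          # quick pogo-stick back to the results page
    if is_last_click_in_session and dwell_seconds >= 30:
        return "lastLongestClick"  # the session ended on this result
    if dwell_seconds >= 30:
        return "goodClick"         # user stayed and presumably found what they needed
    return "unclassified"

print(classify_click(4, True, False))    # badClick
print(classify_click(180, False, True))  # lastLongestClick
```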
Navboost doesn’t work alone. It is complemented by Glue, which accounts for all the features on the search results page that are not web results, such as images and videos. Together, Navboost and Glue analyse various signals to rank content effectively.
Google also employs different methods to sift through click data. These include filtering out unimportant clicks and focusing on true signals of quality.
This involves monitoring the duration of clicks (long vs. short clicks) and impressions. Long clicks suggest a positive user experience, while short clicks often indicate poor content relevance.
Furthermore, Google’s algorithms consider unsquashed and squashed click data. Squashed clicks might be normalized or adjusted clicks, while unsquashed clicks remain unaltered and raw.
While many have discussed Google’s use of these click-centric user signals, it’s clear that they have named and meticulously detailed their approach to these metrics. This comprehensive measurement helps in fine-tuning search results, ensuring users receive the most relevant information promptly.
Key Concepts:
- Impressions: The frequency at which a page appears in search results.
- Squashed Clicks: Adjusted or normalized click data.
- Unsquashed Clicks: Raw, unmodified click data.
- Unicorn Clicks: Rare, extremely positive signals challenging the norm.
Using these metrics, Navboost and Glue provide a framework for evaluating and improving content visibility. They ensure that the most relevant and high-quality pages rise to the top of search results, enhancing user satisfaction with Google’s search engine.
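The documents do not define the squashing function itself. A common way to dampen the influence of very large raw counts is a logarithmic transform, so the sketch below shows one plausible interpretation of the squashed vs. unsquashed distinction rather than Google’s actual formula.

```python
import math

def squash_clicks(raw_clicks: int) -> float:
    """Dampen large raw click counts so a handful of very popular pages
    cannot dominate the signal. The log1p transform is an assumption."""
    return math.log1p(raw_clicks)

for raw in (1, 10, 1_000, 1_000_000):
    print(raw, "->", round(squash_clicks(raw), 2))
```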
#2: Use of Chrome Browser Clickstreams to Power Google Search

Google’s acquisition of data from Chrome browser users allows them to gather extensive clickstream information. As far back as 2005, Google aimed to capture the full clickstream of internet users, and with the launch of Chrome in 2008, they achieved that goal.
The API documents indicate that Google calculates multiple metrics based on Chrome views. These metrics span individual pages and entire domains, giving significant insight into user interactions on the web.
A notable example is the topUrl call, which lists the most popular URLs by measuring chrome_trans_clicks.
This metric indicates that Google tracks the number of clicks on pages viewed in Chrome browsers to identify the most significant URLs. These URLs contribute to the calculation for Google’s Sitelinks feature, which displays important links directly under search results.
Example:
- Pricing Page
- Blog Page
- Login Page
Such pages are typically the most-visited and are identified through Chrome’s clickstream data. This tracking allows Google to include these significant pages in Sitelinks, enhancing user navigation.
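A simplified way to picture this is to rank a site’s pages by their Chrome click counts and surface the top few as Sitelinks candidates. In the sketch below, only the chrome_trans_clicks metric name comes from the leak; the selection rule and everything else are assumptions.

```python
from typing import Dict, List

def top_urls_for_sitelinks(chrome_trans_clicks: Dict[str, int], limit: int = 3) -> List[str]:
    """Pick the most-clicked URLs (as measured in Chrome) as Sitelinks candidates.
    The fixed cutoff of three candidates is an illustrative assumption."""
    ranked = sorted(chrome_trans_clicks.items(), key=lambda kv: kv[1], reverse=True)
    return [url for url, _clicks in ranked[:limit]]

clicks = {
    "https://example.com/pricing": 12_400,
    "https://example.com/blog": 9_800,
    "https://example.com/login": 7_100,
    "https://example.com/careers": 450,
}
print(top_urls_for_sitelinks(clicks))
```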
Google’s data modules like the Quality NSR Data module, Video Content Search module, and the Quality Sitemap module benefit from Chrome clickstreams by improving search precision and user experience. This integration showcases the powerful role of clickstream data in refining search algorithms.
#3: Whitelists in Travel, Covid, and Politics
Google’s algorithms often adjust which websites appear in search results based on several factors, including whitelists.
For instance, in the travel sector, Google’s module on “Good Quality Travel Sites” implies a list exists for preferred sites. It’s unclear if this applies only to Google’s “Travel” tab or more broadly to web searches.
In the context of the Covid-19 pandemic and democratic elections, Google flags certain domains as “isCovidLocalAuthority” and “isElectionAuthority.” These flags suggest that Google prioritizes specific websites for sensitive queries about public health and elections.
This approach helps ensure only reliable and authoritative information appears in search results, which is crucial during a pandemic or political unrest.
Following the 2020 U.S. presidential election, claims of election fraud led to significant social unrest. Whitelists can help filter out misinformation and prevent further tension.
This filtering serves as a safeguard against the spread of false information that could disrupt democratic processes or incite violence.
Moreover, whitelists aren’t just limited to text searches. Other elements like Quality NSR Data Attributes, Assistant API Settings for Music Filters, and Video Content Search Query Features further optimize how content is displayed across different platforms and search types.
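The documents expose the flag names but not how they are applied. One way to picture their effect is a simple prioritization step for sensitive query categories, sketched below; the flag names come from the leak, while the topic mapping and ordering logic are invented for illustration.

```python
from typing import Dict, List

def prioritize_authorities(results: List[Dict], query_topic: str) -> List[Dict]:
    """Move flagged authority domains to the front for sensitive topics.
    How Google actually applies these flags is not documented in the leak."""
    flag_for_topic = {
        "covid": "isCovidLocalAuthority",
        "elections": "isElectionAuthority",
    }
    flag = flag_for_topic.get(query_topic)
    if flag is None:
        return results  # no whitelist handling for ordinary topics
    return sorted(results, key=lambda r: r.get(flag, False), reverse=True)

results = [
    {"url": "https://random-blog.example", "isCovidLocalAuthority": False},
    {"url": "https://health.gov.example", "isCovidLocalAuthority": True},
]
print([r["url"] for r in prioritize_authorities(results, "covid")])
```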
#4: Employing Quality Rater Feedback
Google’s search system incorporates input from human quality raters through a platform known as EWOK. These raters play a significant role in evaluating websites. Their impact on search results is increasingly evident.
One of the key factors is the “per document relevance rating,” which stems from quality evaluations performed via EWOK. This rating helps determine how relevant a document is within its context.
Impact on User Experience:
Quality rater feedback is crucial in enhancing user experience. Websites rated highly by these human evaluators often enjoy better visibility, leading to a more trustworthy and useful web for users.
Assessment Areas:
- Product Reviews:
- Quality raters review product-related content, ensuring it is useful and unbiased. High-quality reviews often score better in search rankings.
- Authorship:
- Raters evaluate the credibility and authority of the content’s authors. Articles with clear, reputable authorship are likely to rank higher.
- Freshness:
- Pages with up-to-date information can receive better quality scores. Regular updates and timely content contribute to higher rankings.
- Page Quality Scores:
- These scores, derived from rater evaluations, encompass several factors including the relevance and reliability of the content.
Modules Involved:
- Webref Mention Ratings:
- Evaluates the frequency and context of mentions within web references.
- Webref Task Data:
- Assesses data collected through specific web tasks assigned to raters.
- Document Level Relevance Module:
- Aggregates the individual document relevance ratings to create an overall relevance metric.
- Webref per Doc Relevance Rating:
- Combines ratings for each document to form a comprehensive relevance score.
- Webref Entity Join:
- Links entities evaluated by raters to ensure coherent and relevant search results.
Compressed Quality Signals:
Google utilizes compressed quality signals that summarise the feedback from quality raters. This allows for efficient processing and integration into the search algorithms. The high-level feedback from EWOK raters can effectively shape the quality and relevance of search results, maintaining a balance between automation and human input.
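How that compression works is not spelled out in the documents. A minimal sketch, assuming a simple average of rater scores rolled up into a compact per-document signal, might look like this; both the aggregation and the bucket thresholds are assumptions.

```python
from statistics import mean
from typing import Dict, List

def compress_quality_signal(rater_scores: List[float]) -> Dict[str, float]:
    """Collapse many individual rater scores into a compact per-document signal.
    The mean aggregation and bucket cutoffs are illustrative assumptions."""
    avg = mean(rater_scores)
    bucket = 2.0 if avg >= 0.8 else 1.0 if avg >= 0.5 else 0.0
    return {"avg_rating": round(avg, 3), "quality_bucket": bucket}

print(compress_quality_signal([0.9, 0.85, 0.7]))  # high-quality document
print(compress_quality_signal([0.2, 0.4, 0.3]))   # low-quality document
```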
#5: Google Uses Click Data to Determine Link Weighting in Rankings
Google employs an intricate system to classify links into three quality tiers: low, medium, and high. This classification affects whether or not a link will influence a site’s ranking. The key factor in this process is click data.
For instance, if a link on Forbes.com/Cats/ receives no clicks, it is placed in the low-quality tier. Such links are disregarded in the ranking process. In contrast, if a link on Forbes.com/Dogs/ accumulates many clicks from verified sources, it is categorized as high quality, contributing to the site’s ranking signals.
High-tier links are termed “trusted” and therefore capable of passing PageRank and anchor text. Links in the low-quality tier neither benefit nor harm a site’s ranking—they are simply ignored.
By examining click data, Google can determine the importance and authenticity of links. This ensures that website authority scores accurately reflect actual usage and trust. This system demonstrates how Google uses real-world user behavior to refine search results, prioritizing useful and reliable information.
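The documents describe the three tiers but not the exact cutoffs. The sketch below uses invented thresholds to illustrate the core idea that only clicked, “trusted” links pass PageRank and anchor text, while ignored links pass nothing.

```python
from typing import Dict

def classify_link(verified_clicks: int) -> Dict[str, object]:
    """Place a link into a quality tier based on click data.
    The numeric cutoffs are assumptions; only the tier concept comes from the leak."""
    if verified_clicks == 0:
        tier = "low"       # ignored: passes neither PageRank nor anchor text
    elif verified_clicks < 100:
        tier = "medium"
    else:
        tier = "high"      # trusted: passes PageRank and anchor text
    return {"tier": tier, "passes_signals": tier == "high"}

print(classify_link(0))     # {'tier': 'low', 'passes_signals': False}
print(classify_link(2500))  # {'tier': 'high', 'passes_signals': True}
```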
Key Insights for Marketers Focusing on Organic Search Traffic

Building a notable and well-recognized brand is crucial for achieving high organic search rankings. Google has advanced methods to identify and rank brands, favoring large, influential entities over smaller, independent sites. This trend is evident from multiple Google updates and confirmed by data from various studies. For marketers, the key takeaway is to focus on brand-building activities that enhance brand recognition and popularity beyond just Google search.
Influence of E-E-A-T on Rankings
The importance of Experience, Expertise, Authoritativeness, and Trustworthiness (E-E-A-T) remains somewhat ambiguous. Although the leak mentions the author’s reputation and contributions, there is limited direct evidence of E-E-A-T’s substantial impact on rankings. It appears that having a recognized author can benefit rankings, but the specific weight of E-E-A-T elements within Google’s ranking systems is not entirely clear. Some well-ranked brands lack significant E-E-A-T indicators, casting doubt on their overall influence.
User Intent and Navigation Patterns
User intent and navigation patterns play a substantial role in Google’s ranking algorithms. When users demonstrate specific behaviors—like searching for a term and consistently clicking a particular result—Google adjusts rankings to prioritize those results. This phenomenon, often driven by geographical trends or user demographics, suggests that generating genuine demand and engagement for your site can be more impactful than traditional SEO techniques.
Changing Dynamics of Classic Ranking Factors
Traditional ranking factors like PageRank, anchor text, and text matching have seen declining importance over the years. While PageRank still influences rankings, it has evolved significantly. The leaked documents indicate multiple variations of PageRank have been tested and modified over time. Despite these changes, the significance of Page Titles remains strong. Properly optimized page titles still play a crucial role in how content is ranked and indexed.
Challenges for Small and Medium Businesses
For small and medium businesses (SMBs), the path to effective SEO is more challenging. Big brands dominate Google’s search results due to their established credibility and navigational demand. Newer businesses need to focus on building a loyal audience and creating significant demand for their content. Without this, traditional SEO efforts are likely to yield minimal returns, particularly in competitive sectors dominated by well-known brands.
Future Directions for the Search Industry
The recent leak of internal Google Search documentation has opened a potential goldmine of insights for the search industry. Professionals with up-to-date experience and technical know-how are keen to explore these documents to uncover new details about Google’s ranking mechanisms.
This presents a unique opportunity for those in the SEO community to cross-reference this leaked info with existing public documents, statements, and past ranking experiments, and share their findings.
Historically, many SEO publishers and industry commentators have taken Google’s public statements at face value. They often promote Google’s claims uncritically, leading to headlines such as “Google says XYZ is true” rather than questioning those assertions with “Google Claims XYZ; Evidence Suggests Otherwise.” The aspiration is for this scenario to change. The leak and current DOJ trial provide momentum for a shift towards a more scrutinized approach to Google’s statements.
When newcomers to SEO read through resources like Search Engine Roundtable, Search Engine Land, or Search Engine Journal, they may not always discern how rigorously to evaluate Google’s official communications.
Journalists and writers in the search industry should not assume that their readers are aware of the historical inaccuracies in Google’s public comments. Instead, they have a duty to rigorously investigate and verify Google’s statements before propagating them.
The responsibility of maintaining this critical approach extends beyond the search industry. Google, being a powerful entity with significant global influence on the dissemination of information and commerce, should be held accountable by both governments and the media.
The work done by journalists and SEO bloggers carries significant weight, influencing public opinion, policymakers, and even Google employees themselves. This responsibility is crucial for ensuring that information shared by such a dominant force is accurate and reliable.
Moreover, the community is grateful for the input from individuals like Mike King and Amanda Natividad who have helped bring this story to light. As more people access and scrutinize these leaked documents, new updates and discoveries are expected.
It’s encouraged that anyone finding new supporting or contradictory evidence shares these insights. It’s a collective effort aimed at enhancing the accuracy and reliability of information in the search marketing field.
In moving forward, the search industry must adopt a more investigative approach, analyze Google’s statements with a critical eye, and provide the community with well-verified, accurate information. This will not only benefit the industry but also contribute to the broader goal of maintaining truthful and reliable information dissemination on a global scale.