Technical SEO is no longer just about ranking on Google. Your site also needs to be legible to AI engines like ChatGPT, Perplexity, Gemini, and Claude.

This article is a comprehensive guide to technical SEO in 2026, covering everything from crawling and indexing to Core Web Vitals, JavaScript rendering, and the tools that help you diagnose issues.

1. Introduction to Technical SEO in 2026

Definition and scope: what is technical SEO?

Technical SEO involves optimizing a website’s infrastructure so that search engines can crawl, render, and index it—and, in the case of AI systems, also interpret the content correctly.

In contrast to content SEO, which focuses on “what is said,” and link building, which focuses on “who links to you,” technical SEO focuses on how the site works from the inside out (like looking under the hood of a car).

In 2026, the landscape extends well beyond Googlebot. The scope has expanded: we no longer focus solely on search engines but also on generative response engines such as ChatGPT, Perplexity, Google AI Overviews and AI Mode, Gemini, and Claude.

Pillars of technical SEO in 2026

We no longer view “technical aspects” as merely essential for appearing in rankings; rather, we must also consider them a necessary condition for being cited, since AI-generated responses synthesize answers using various sources.

Therefore, we can identify several key pillars within technical SEO to build upon, serving as the foundation for everything else:

  • Crawlability
  • Indexability
  • Performance
  • Rendering
  • Security
  • Architecture
  • Readability by AI “machines.”

For the latter, we’ll need to pay special attention to ensuring HTML isn’t dependent on JavaScript, optimizing semantic content, and utilizing structured data that provides context, among other things.

The impact of AI on web infrastructure

One established fact is that content discovery no longer takes place solely on Google; there are other vertical platforms, such as YouTube or TikTok, as well as platforms based on response engines, like those mentioned earlier (ChatGPT, Perplexity, etc.).

Despite the debate within the SEO community over what to call this new reality, GEO (Generative Engine Optimization) has emerged as a discipline that some defend tooth and nail, while others simply accept it but view it as a natural evolution of SEO in today’s world.

The vast majority of the steps involved in a project aimed at appearing in AI-generated responses draw on best practices that have been refined over decades through SEO strategies developed by top professionals.

The name itself is of little importance here, although perhaps in another article we could discuss at length whether these naming wars actually lead anywhere, and whether the narrative sometimes stems from vested interests that influence project owners.

Personally, I believe that more than 80% of the SEO we’ve been doing for so long still forms the foundation for appearing effectively in both search engines and answer engines. We can set aside specific aspects for GEO (or whatever we choose to call it), but the reality is that having a presence on Google is a solid and reliable path to appearing in the answers.

AI has raised the importance of certain areas, because traditional search engines and answer engines work differently: Google has a vast index of sites built up over decades and continuously updated, in addition to massive technical capacity for rendering.

Response engines, by contrast, do not have an information retrieval system like the ones we’ve been working with since the dawn of search engines.

Another area that has been significantly impacted is digital PR strategies or traditional link building. It is now very feasible to optimize content to be cited by AIs, and in many cases, adding statistics, citable sources, and certain layers of clarity and fluency can be key to increasing visibility.

However, we are in a dynamic phase where there are still no established models regarding answer engines; therefore, we must continue to stay abreast of the advancements, changes, and improvements they implement. Will we one day see them develop their own index or invest in rendering infrastructure? That remains to be seen.

What is certain is that we already have a solid technical foundation in SEO based on the topics we’ll cover here; however, we’ll need to stay on top of major innovations or updates from these search engines so we can incorporate them into our SEO projects.

2. Crawling

Crawling is the first step in our technical SEO strategy. If a bot cannot access a page, that page loses visibility both in traditional search and in AI-powered search engines.

With the rise of AI-powered search engines, the landscape of crawlers has diversified, and beyond Googlebot and Bingbot, many other AI bots with different purposes have emerged.

HTTP status codes

The first signal a crawler receives when requesting a URL is the HTTP status code. Let’s look at what each code means:

  • 200 OK: The page exists and the content is available. This is the expected response for any page that should be crawled and indexed.
  • 301 (Permanent Redirect): Indicates that the URL has permanently changed. It transfers most of the authority to the destination.
  • 302 (Temporary Redirect): Indicates a temporary change. Google may choose to index the original URL or the destination URL, or even keep both indexed.
  • 404 (Not Found): The page does not exist; this may be intentional or accidental. Google will gradually remove these URLs from its index as it crawls them.
  • 410 (Gone): Indicates that the removal is likely intentional, and Google will proceed to remove the URL from its index.
  • 503 (Service Unavailable): Indicates a temporary server issue. Googlebot will retry later, understanding that “maintenance” is underway.

Important: If a site relies on client-side JavaScript to display content on pages with 404 or 5xx errors, Googlebot may never see that content.
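
As a rough illustration of the list above, the following Python sketch maps a status code to the crawler reaction described in this section. The function name `crawler_reaction` and the returned labels are invented for this example; real crawlers apply more nuanced logic than a simple lookup.

```python
def crawler_reaction(status: int) -> str:
    """Return a short label for how a search crawler typically reacts.

    The labels paraphrase the list above; they are illustrative only.
    """
    if status == 200:
        return "crawl and evaluate for indexing"
    if status == 301:
        return "follow redirect and consolidate signals on the destination"
    if status == 302:
        return "follow redirect; the original URL may stay indexed"
    if status in (404, 410):
        return "remove from the index over time"
    if status == 503:
        return "retry later"
    if 500 <= status < 600:
        return "retry later; crawl rate may drop if errors persist"
    return "not covered by this sketch"


for code in (200, 301, 302, 404, 410, 503):
    print(code, "->", crawler_reaction(code))
```

A lookup like this can be handy when summarizing a crawl export or server logs: grouping URLs by reaction makes it easier to see which responses deserve attention first.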

Robots.txt

The robots.txt file remains a key component of any SEO strategy aimed at improving the efficiency and control of the technical infrastructure related to crawling. Here are some considerations that should always be kept in mind:

  1. Crawl Management: The robots.txt file manages access for the many different bots that crawl the website. These bots can include search engine crawlers, such as Googlebot; tools like Ahrefsbot; AI bots like GPTBot; and many others. Using the Disallow and Allow directives, access instructions are specified by URLs, directories, or patterns, thereby preventing the waste of resources or crawl time on pages that have no value.
  2. Does not handle indexing: It is important to remember that this tool does not handle page indexing; therefore, if you want to prevent a page from being indexed, you must use the appropriate directives (noindex).
  3. Resource accessibility: It is essential not to block CSS or JavaScript files, as search engines like Google require them to render the page correctly and fully understand the content.
  4. Declare the sitemap: The file can also link to the sitemap file to provide access to the priority pages listed there.
  5. Meet the specific requirements:
    • The file is located in the domain’s root directory
    • Each file applies only to the subdomain and protocol it is served from
    • The file must be named exactly robots.txt (case-sensitive)
    • UTF-8 encoded text file
    • The maximum size is 500 KiB
  6. Be aware of certain conflicts that could affect crawling:
    • If you add a global block (user-agent: *), keep in mind that a bot ignores those rules when a block addressed to it specifically (user-agent: Googlebot) also exists. If the global rules should still apply to that bot, repeat them inside its specific block.
    • When you want the same policy or rules to apply to all user agents, there’s no need to create a block for each bot. Use a specific user agent only when you’re certain you want rules tailored to a particular bot.
    • Don’t be redundant or overly granular with the rules and sections you want to block. Listing individual URLs like this unnecessarily inflates the size of your robots.txt file:
      Disallow: /category/shoes/color/red/
      Disallow: /category/shoes/color/blue/
      Disallow: /category/shoes/color/green/

      Instead, block by prefix, or use accepted patterns such as * (any sequence of characters) or $ (end of the URL):
      Disallow: /category/shoes/color/

  7. Finally, the response codes in robots.txt and their potential impacts:
    • It may have a redirect, in which case the final content would be considered valid for robots, but if there are too many redirects, this can lead to inefficiencies.
    • It may return a 404 or 410 response, which is treated as if there were no robots.txt file: no rules block anything, so everything can be crawled.
    • A 403 (Forbidden) or 401 (Unauthorized) response means access to the file is denied. Google’s documentation groups these with the other 4xx codes (except 429) and assumes no crawl restrictions, while other crawlers may interpret a denied file as a complete block. For that reason, these codes should not be relied on to stop crawling.
    • If the response is 429 (Too Many Requests), crawling is paused and retried later.
    • For server errors (5xx), crawling is paused and retried later. If this behavior continues, Google gradually stops crawling.

For more information, check out Google’s robots.txt documentation.
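
To make the user-agent conflict from point 6 concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example.com` domain, and the paths are all hypothetical. Note that this parser is not Google's: it does not implement the `*` and `$` patterns, so the example sticks to plain prefixes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a global group plus a Googlebot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /category/shoes/color/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Check whether `agent` may fetch `path` on the hypothetical site."""
    return parser.can_fetch(agent, "https://example.com" + path)


# Googlebot matches its own group, so the prefix rule applies to it.
print(allowed("Googlebot", "/category/shoes/color/red/"))  # False
# The global /private/ rule is NOT applied to Googlebot: its specific group wins.
print(allowed("Googlebot", "/private/report"))  # True
# Bots without a specific group fall back to the global group.
print(allowed("SomeOtherBot", "/private/report"))  # False
```

The second result is the one that tends to surprise people: if `/private/` should also be off-limits to Googlebot, the rule has to be repeated inside the Googlebot group.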

AI crawlers

Beyond traditional search engines, we also need to consider the rise of platforms like ChatGPT, Claude, and Perplexity in terms of crawling and robots.txt.

Why do these tools matter? AI systems crawl website content and contribute to the growth of the ecosystem of bots that access and consume web resources. Therefore, these points are worth considering:

  1. There are crawlers designed for training and others for search. The former collect large amounts of data to train models, while the latter crawl the web in real time to find sources that they can then cite in their responses.
  2. Blocking out of fear that your content will be stolen. This is a legitimate and common approach, as long as you realize that blocking these bots means you won’t gain visibility through their responses. If you block them, they won’t use your content, but you also won’t be recommended. Allowing access is the first step toward gaining visibility. This is a major strategic dilemma, and the decision is up to you.
  3. Not all bots are required to follow the rules in robots.txt, which reduces its reliability as a control mechanism. And no, llms.txt is of no use in this regard.

| Company | User-Agent | Type | Impact |
| --- | --- | --- | --- |
| OpenAI | GPTBot | Training | It’s the same as “providing your content to train models” |
| OpenAI | OAI-SearchBot | Search / Indexing for SearchGPT | It’s the equivalent of “Googlebot within the ChatGPT ecosystem” |
| OpenAI | ChatGPT-User | Real-time access | It is a browser agent |
| Perplexity | PerplexityBot | Indexing for search | It’s more of a reader than an indexer |
| Perplexity | Perplexity-User | Real-time access | It is a browser agent |
| Anthropic | ClaudeBot | Training | It’s the same as “providing your content to train models” |
| Anthropic | Claude-SearchBot | Search / Indexing | It’s the equivalent of “Googlebot inside Claude” |
| Anthropic | Claude-User | Real-time access | It is a browser agent |
| Google | Googlebot | Search (but also AI, such as AI Overviews) | It’s the equivalent of “having an online presence” |
| Google | Google-Extended | Training | It’s the same as “providing your content to train models” |
| Google | GoogleOther | Supporting crawls (not Search) | It is the equivalent of “Google’s internal crawling outside of Search” |

Tip: Understanding how each crawler behaves opens up interesting possibilities for SEO projects. Whenever a company uses bots for different purposes, we have more strategic options: we could appear in AI responses without sharing our data to train the AI.
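
One way to sketch that strategy is to generate a robots.txt that blocks only the training crawlers while leaving search and real-time agents untouched. The helper name `build_robots_txt` and the `example.com` URL are hypothetical, the list of training agents is taken from the crawler overview in this section, and (as noted earlier) not every bot honors robots.txt.

```python
from urllib.robotparser import RobotFileParser

# User agents that this section labels as "Training".
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"]


def build_robots_txt(blocked_agents: list[str]) -> str:
    """Build a robots.txt that blocks the given agents and allows everyone else."""
    lines = [f"User-agent: {agent}" for agent in blocked_agents]
    lines += ["Disallow: /", "", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(TRAINING_BOTS)
print(robots_txt)

# Sanity check with the standard-library parser (not any vendor's own parser).
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())
for agent in ("GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "Googlebot"):
    print(agent, "allowed:", checker.can_fetch(agent, "https://example.com/"))
```

Grouping several `User-agent` lines above a single `Disallow` keeps the file short, which fits the earlier advice about avoiding redundant rules.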

You can learn more about their exact behavior and details on the official websites:

  • OpenAI bots
  • Perplexity crawlers
  • Anthropic crawler
  • Google common crawlers

XML sitemaps

XML sitemaps remain essential for facilitating the discovery of URLs, especially on large sites. Best practices include:

  • Include only canonical URLs with a 200 status code.
  • Keep <lastmod> dates accurately updated, as Google, Bing, and AI bots all use them as a freshness signal.
  • Segment sitemaps by content type (products, articles, categories) to facilitate prioritization.
  • Remove URLs that return 404 errors, are blocked by robots.txt, or are duplicates.
  • List the sitemap in robots.txt, though it can also be submitted directly via Google Search Console and Bing Webmaster Tools.

Automatically updating sitemaps when publishing or modifying content is becoming increasingly important in an environment where content freshness is a critical signal for both crawlers and large language models (LLMs).
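
A minimal sketch of that kind of automation, assuming the CMS can hand us each page's URL, status code, canonical URL, and last modification date. The `Page` structure, its field names, and the sample URLs are hypothetical; the XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Page:
    """Hypothetical record exported from a CMS or crawl."""
    url: str
    status: int
    canonical: str
    last_modified: date
    blocked_by_robots: bool = False


def build_sitemap(pages: list[Page]) -> str:
    """Return sitemap XML containing only canonical, crawlable 200 URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen: set[str] = set()
    for page in pages:
        eligible = (
            page.status == 200
            and page.canonical == page.url
            and not page.blocked_by_robots
        )
        if not eligible or page.url in seen:
            continue  # skip errors, non-canonical URLs, blocked URLs, duplicates
        seen.add(page.url)
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page.url
        ET.SubElement(entry, "lastmod").text = page.last_modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


pages = [
    Page("https://example.com/a", 200, "https://example.com/a", date(2026, 1, 15)),
    Page("https://example.com/a?ref=x", 200, "https://example.com/a", date(2026, 1, 15)),
    Page("https://example.com/gone", 404, "https://example.com/gone", date(2025, 6, 1)),
]
print(build_sitemap(pages))
```

Running a function like this on every publish or update keeps `<lastmod>` tied to real changes instead of to the time the sitemap happened to be regenerated.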

Tip: Consider analyzing data from both Google Search Console and Bing Webmaster Tools. Since these two tools have different approaches and features, you’ll gain additional technical insights that can help you refine your analysis.

Crawl budget

Having examined several of the technical aspects involved in crawling, it is worth noting that the concept of a crawl budget helps us better understand the impact of a sound technical SEO strategy.

The crawl budget refers to the number of URLs a search engine crawls from a site within a specific period. It depends primarily on two factors:

  • Server crawling capacity
  • Crawling demand

Therefore, Google assesses how much of a site it can crawl without overloading the server, but it also determines how relevant the content is to it, based on popularity, recency, or backlinks.

This concept is particularly relevant for large or complex websites.

The main risk associated with this concept comes from situations such as duplicate content, an excess of parameterized URLs, or technical errors. All of this can cause bots to waste time on URLs that are of no value or not intended for ranking. This reduces crawl frequency and could even affect subsequent indexing, at the very least delaying it.

Other factors that help with crawling—and thus improve crawl budget management—include optimizing your site architecture and internal linking, along with properly configuring the technical aspects mentioned in this guide (robots.txt, noindex directives, canonical tags, sitemaps, etc.).

Finally, having a fast and reliable server also helps Google increase crawl frequency, so slow response times are not a good option for our crawl budget and frequency.

Cases that complicate crawling

In many situations, certain challenges can lead to crawling issues:

  • Filters in e-commerce: If the store’s configuration allows filters to be combined in multiple ways, this can generate vast numbers of URLs with similar or duplicate content, which could potentially waste crawling resources.
  • Infinite scroll or “load more” pagination: Since bots do not scroll and typically do not interact with content without links, this could compromise the ability to crawl parts of a website configured in this way.
  • Content that depends on JavaScript execution: In certain situations, websites that rely on JavaScript require specific configurations to ensure that content is properly accessible after rendering. These cases must be carefully understood and addressed to configure the project correctly and prevent crawling-related issues.
  • Redirect chains: If a chain involves 3, 4, 5, or more hops, we may be negatively affecting how bots crawl the site. Google has always applied its own limit (its documentation states that Googlebot follows up to 10 redirect hops), but in an era where traditional information retrieval, algorithms, and other systems used by Google incorporate AI in one way or another, it is advisable to monitor redirect chains specifically.
  • Intermittent server errors: Another major issue is the server’s inconsistent availability; when it’s down, our crawl frequency can suffer.
  • URLs with parameters: Sometimes parameters are added to URLs for various purposes: measuring campaign traffic, pagination, filters, feature concatenation, etc. Some of these situations can lead to duplication, and with large volumes of URLs, they result in wasted crawl resources.
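
To illustrate the last point, the sketch below collapses parameter variants of a URL into a single form by dropping tracking parameters and sorting the rest. The list of tracking parameters and the sample URLs are assumptions for this example; which parameters are safe to ignore depends on each site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of parameters that never change the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}


def normalize_url(url: str) -> str:
    """Drop tracking parameters and the fragment, and sort what remains."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(kept))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))


variants = [
    "https://example.com/shoes?color=red&size=42",
    "https://example.com/shoes?size=42&color=red",
    "https://example.com/shoes?utm_source=newsletter&color=red&size=42",
]
# All three variants collapse to the same normalized URL.
print({normalize_url(u) for u in variants})
```

Applied to a crawl export or server log, the ratio of raw URLs to normalized URLs gives a quick estimate of how much crawling goes to duplicates.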

3. Indexing

A page may be crawled but not indexed. Indexing is the process by which a search engine decides to store a page in its index and make it eligible to appear in search results.

Optimizing indexing ensures that important pages make it into the index and that irrelevant ones don’t generate noise.

HTTP status codes and content indexing

During crawling, HTTP status codes determine whether a bot can access the content; during indexing, they determine what the search engine does with the information it obtained.

  • A 200 indicates that the page should be evaluated for inclusion in the index.
  • A 301 indicates that authority should be transferred to the destination.
  • A 404/410 indicates that the page should be removed from the index.
  • A soft 404 (a page that returns a 200 status code but displays “not found” content) is particularly problematic because it confuses Google and wastes crawl budget.
  • A 500 (server error) is initially treated by Google as a temporary issue, and it will retry later. If the error persists for an extended period, the page may be removed from the index in a manner similar to a 404. Additionally, frequent 5xx errors cause Google to reduce the crawl frequency for the entire site. During scheduled maintenance, the correct code is 503, which explicitly indicates that the unavailability is temporary.
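
Soft 404s are easy to miss because the status code looks healthy. The following heuristic is a sketch, not how Google detects them: it flags pages that return 200 but have a short body containing a typical "not found" phrase. The phrase list and the length threshold are assumptions that would need tuning per site.

```python
# Assumed phrases; adapt them to the language and templates of the site.
NOT_FOUND_PHRASES = ("page not found", "no longer available", "nothing was found")


def looks_like_soft_404(status: int, body_text: str, max_length: int = 500) -> bool:
    """Flag 200 responses whose short body reads like an error page."""
    if status != 200:
        return False  # real error codes are not *soft* 404s
    text = " ".join(body_text.lower().split())  # collapse whitespace
    return len(text) <= max_length and any(p in text for p in NOT_FOUND_PHRASES)


print(looks_like_soft_404(200, "Sorry, page not found."))
print(looks_like_soft_404(404, "Sorry, page not found."))
print(looks_like_soft_404(200, "Our guide to running shoes. " * 50))
```

Pages flagged this way are candidates for a real 404 or 410 response, or for a redirect when an equivalent page exists.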

Meta robots

The meta robots tag is the primary tool for controlling indexing at the page level. The most important directives are:

  • noindex: tells the search engine not to index the page. This is the most reliable way to prevent a page from appearing in search results.
  • nofollow: tells the search engine not to follow the links on the page (does not pass link equity).
  • nosnippet: prevents Google from displaying a snippet of the page in search results.
  • max-snippet:[n]: limits the length of the snippet displayed.

It’s crucial to remember that blocking in robots.txt does not prevent indexing (if there are external links pointing to the URL, Google may still index it, displaying only the title). To reliably prevent indexing, you must use noindex, and the page must remain crawlable so that the search engine can actually read the directive.
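
For auditing purposes, the directives can be read straight from the HTML. This is a small sketch using Python's standard-library `html.parser`; the class and function names are invented for the example, and it only looks at `<meta name="robots">`, ignoring bot-specific meta tags and the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set[str]:
    """Return the set of meta robots directives in an HTML document."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


sample = '<html><head><meta name="robots" content="noindex, max-snippet:50"></head></html>'
found = robots_directives(sample)
print(sorted(found))
print("indexable:", "noindex" not in found)
```

Combining this check with a robots.txt check catches the common mistake of a noindex page that is also blocked from crawling, where the directive can never be read.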

Tip: Anything that isn’t indexed is potentially invisible in AI responses, since most generative systems (ChatGPT, Perplexity, Google AI Overviews) rely on traditional search indexes, such as those from Google or Bing, to retrieve sources in real time. Without indexing, there is no retrieval; and without retrieval, there is no citation.

Canonical tags

The rel="canonical" tag is the signal that tells search engines which is the primary or original version, in cases where there are multiple identical or partially duplicated pages.

The purpose of this instruction is to spare Google the task of determining which URL version is correct; it achieves this by consolidating indexing and ranking signals into a single URL, preventing those signals from being fragmented.

It’s important to note that this tag is a hint rather than a directive: Google treats it as a suggestion and may choose a different canonical URL when other signals point elsewhere.

Some recommendations:

  • It is good practice to use self-referencing canonical tags when there is no duplicate content across URLs.
  • It is recommended to use absolute URLs.
  • The canonical attribute should always point to indexable pages (never to 404 pages, redirects, or pages with the “noindex” tag).
  • In specific cases such as pagination, it is preferable not to use this method to point to the first page (it can have negative consequences).

Tip: Poor canonicalization not only has negative implications for traditional search engines, but also increases the risk of an incorrect or unwanted URL being cited on platforms like ChatGPT.
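
The recommendations above translate into a few simple checks. The function below is a sketch that assumes the status code and noindex state of the canonical target have already been collected by a crawler; the function name, its parameters, and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def canonical_issues(canonical: str, target_status: int, target_noindex: bool) -> list[str]:
    """List problems with a canonical URL, following the recommendations above."""
    issues = []
    parts = urlsplit(canonical)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        issues.append("canonical is not an absolute URL")
    if target_status != 200:
        issues.append(f"canonical target returns {target_status}")
    if target_noindex:
        issues.append("canonical target is noindex")
    return issues


print(canonical_issues("https://example.com/shoes", 200, False))
print(canonical_issues("/shoes", 301, False))
```

An empty list means the canonical passes these basic checks; it does not guarantee that Google will accept it, since the tag remains a suggestion.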

Depth levels

Depth is an aspect of site structure and internal linking that we can work on to improve the website.

We define depth as the number of clicks a user or a bot needs to reach a specific URL from the homepage.

From a technical perspective, this can affect both the crawling and indexing layers, especially if the site is large.

If there are too many levels, we could be making it harder for users or bots to find content, with the resulting consequences: a poor browsing experience for users or a risk of reduced crawling frequency for bots.

It could also impact the crawl budget if important pages are buried too deep in the site, as this increases the time it takes to discover them.

When it comes to architecture, the key is to focus on internal linking strategies that connect pages within structures that aren’t overly complex.

This also promotes a better distribution of link equity, reinforcing authority signals toward the main URLs. Here are some ideas:

  • Do not link to the same destination URL more than once
  • Use elements such as breadcrumbs to provide semantic and structural context
  • Add blocks of related links, always keeping the user experience in mind
  • Add horizontal, vertical, and cross-site link blocks based on the site’s theme

In summary, deep pages carry signals of low structural hierarchy, which means they may be crawled less frequently and may even be interpreted as less important.
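
Click depth as defined in this section can be computed with a breadth-first search over the internal link graph. The sketch below assumes the graph is available as a dictionary mapping each page to the pages it links to (for example, from a crawl export); the site structure shown is made up.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Return the minimum number of clicks from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical internal link graph.
site = {
    "/": ["/shoes/", "/blog/"],
    "/shoes/": ["/shoes/running/"],
    "/shoes/running/": ["/shoes/running/model-x/"],
    "/blog/": [],
    "/old-landing/": ["/"],
}

depths = click_depths(site, "/")
orphans = set(site) - set(depths)  # known pages that no path from home reaches
print(depths)
print("orphans:", orphans)
```

Pages with a high depth value are candidates for extra internal links, and pages that never appear in the result are orphaned from the homepage's point of view.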

Mobile-first indexing

Here is one of the key factors for understanding modern technical SEO—one that doesn’t always get the attention it deserves. Google uses the mobile version of a website to crawl, index, and rank content.

Therefore, differences between the two versions in terms of content, internal links, semantic markup, and other tagging can lead to a loss of visibility if the approach is incorrect or if this concept is overlooked.

We need to ensure a certain level of consistency across versions and always prioritize the mobile version; the most important elements must be present in that version.

Therefore, if the mobile version is slower, incomplete, or has usability or UX issues, it could negatively impact crawling. We can also add other factors, such as loading speed, resource size, and JavaScript loading, among other things, which become key technical elements that affect performance.

If we apply this concept to the context of AI, we must keep the mobile version in optimal condition to prevent misinterpretations by AI systems and, if we serve as a source for responses, to minimize errors.

Issues that complicate indexing

The most common issues that can cause problems or prevent a page from being indexed are:

  1. Content quality. If the content has little value, is superficial, or does not address the search intent, it may be completely ignored by Google and other search engines, making it less likely to be indexed.
  2. Duplicate content. Identical or very similar content generates negative signals and dilutes authority, to the point that important pages could be left out. We are already seeing international websites being deindexed for using the same content across different markets, without providing any differentiation or added value.
  3. Orphaned or deeply nested pages. Pages that are unlinked or located at very deep levels are harder to discover and may be considered unimportant.
  4. JavaScript. If the main content depends on rendering and the site configuration is not appropriate, this can increase the risk of non-indexing.
  5. HTTP errors and redirect chains. If URLs have persistent server errors or mistakenly return a 404 response, it will be difficult for them to be indexed or remain indexed. On the other hand, if there are multiple redirects that end at a URL containing a noindex tag or returning a 404 response, this will also contribute to pages being deindexed.
  6. Dynamic loading in pagination. Systems such as infinite scrolling or “load more” buttons can make it difficult to discover content further down the page. If that content isn’t linked to from other parts of the site, it can lead to indexing issues.

Important: Content that isn’t indexed won’t be visible in search engines, nor in ChatGPT and other AIs that rely on search engine results from Google and Bing when they can’t generate a response without accessing the internet.

4. Performance and Core Web Vitals

Website performance affects both Google rankings and the user experience, and with the rise of zero-click results, websites that do receive clicks need to deliver an exceptional experience right away.

Critical user experience metrics

  • LCP (Largest Contentful Paint): This is the time it takes for the main element of the page—whether an image, a block of text, etc.—to appear. Ideally, it should load in less than 2.5 seconds. This metric impacts the user’s initial perception.
  • INP (Interaction to Next Paint): This metric replaced First Input Delay (FID) as a Core Web Vital in March 2024 and evaluates the website’s responsiveness to user interactions such as clicks, taps, or keyboard input. The target is a value below 200 ms, which indicates a smooth, frictionless experience.
  • CLS (Cumulative Layout Shift): This metric measures visual stability during loading; therefore, it aims to ensure that elements do not move unexpectedly, to prevent interaction and navigation errors. For this, the ideal value would be less than 0.1.
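
As a small worked example, the targets above can be turned into a pass/fail check. The dictionary keys and the sample measurements are invented for this sketch, and treating a value exactly on the threshold as passing is an assumption.

```python
# Targets taken from the list above: LCP in seconds, INP in milliseconds, CLS unitless.
GOOD_THRESHOLDS = {"lcp_s": 2.5, "inp_ms": 200, "cls": 0.1}


def failing_metrics(measured: dict[str, float]) -> list[str]:
    """Return the names of the measured metrics that exceed their target."""
    return [
        name
        for name, limit in GOOD_THRESHOLDS.items()
        if name in measured and measured[name] > limit
    ]


# Hypothetical field measurements for one page.
page_metrics = {"lcp_s": 3.1, "inp_ms": 180, "cls": 0.05}
print(failing_metrics(page_metrics))
```

For this sample page only LCP misses its target, so image weight and server response time would be the first places to look.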

Load speed optimization

  • Image compression (WebP/AVIF): Modern formats compress more efficiently, which drastically reduces image file sizes with little or no visible loss of quality. This improves the LCP and reduces data usage, especially on mobile devices. You can visit the caniuse website to check which modern formats are supported by each browser version.
  • Code minification: Minification removes spaces, comments, and unnecessary characters from HTML, CSS, JavaScript, and other code. This reduces file size and speeds up download and execution in the browser.
  • Use of a CDN: A content delivery network (CDN) uses a global network of servers to reduce latency and loading times by serving resources from the location closest to the user.