SEO Fundamentals: How Search Engines Actually Work

How search engines actually work

A search engine does three things: it crawls the web to discover pages, it indexes the pages it finds, and it ranks the indexed pages in response to user queries. Each of these stages is a separate engineering problem with its own failure modes, and SEO that ignores any of the three will fail. A page that cannot be crawled will not be indexed, no matter how good its content is. A page that is crawled but not indexed cannot rank. A page that is indexed but poorly ranked might as well not exist.

Crawling is the process of following links from known pages to discover new ones. The crawler (Googlebot in Google's case) starts with a list of known URLs, fetches each one, extracts the links, and adds the new URLs to its queue. The crawler respects robots.txt, which tells it which parts of the site it may not visit. Some crawlers also honor the non-standard crawl-delay directive, but Googlebot ignores it. The crawler also responds to HTTP status codes: a 301 redirect updates the URL in the index, a 404 eventually removes the URL, and a 503 tells the crawler to come back later. Pages that are not linked from any other page (called orphan pages) are unlikely to be discovered unless submitted directly via a sitemap.
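The discovery loop described above can be sketched in a few lines. This is a minimal illustration, not how Googlebot is implemented: the site graph, the robots.txt contents, and the ExampleBot user agent are all made up, and the "fetch" is a dictionary lookup rather than an HTTP request.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical site: each URL maps to the links found on that page.
LINKS = {
    "https://example.com/": ["https://example.com/blog", "https://example.com/admin"],
    "https://example.com/blog": ["https://example.com/blog/post-1", "https://example.com/"],
    "https://example.com/blog/post-1": [],
    "https://example.com/admin": ["https://example.com/admin/users"],
    "https://example.com/admin/users": [],
    "https://example.com/orphan": [],  # no page links here
}

# Hypothetical robots.txt for the site above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
"""


def crawl(seeds, links, robots_txt, user_agent="ExampleBot"):
    """Breadth-first discovery that skips URLs disallowed by robots.txt."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    queue = deque(seeds)
    seen = set(seeds)
    fetched = []
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # disallowed: never fetched, so its links are never seen
        fetched.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


crawled = crawl(["https://example.com/"], LINKS, ROBOTS_TXT)
```

In this sketch the orphan page is never reached because nothing links to it, and the pages under /admin are skipped because robots.txt disallows them.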

Indexing is the process of parsing a crawled page, extracting its content and metadata, and storing it in a form that can be searched. The indexer extracts the title, the meta description, the headings, the body text, the images, the links, and structured data. It identifies the canonical URL when multiple URLs serve the same content. It detects the language and the locale. It identifies entities (people, places, organizations) mentioned in the text. The indexed representation is what the ranking algorithm works with, not the raw HTML; if your content is not in the index, it cannot rank.
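As a rough illustration of the extraction step, the sketch below pulls a few of the fields named above out of raw HTML with Python's standard html.parser module. The PageFields class and the sample page are hypothetical, and a real indexer does far more (rendering, language detection, entity recognition).

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collects the title, meta description, canonical URL, and headings."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": None, "canonical": None, "headings": []}
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1", "h2", "h3"):
            self._capture = tag
            self._buffer = []
        elif tag == "meta" and attrs.get("name") == "description":
            self.fields["description"] = attrs.get("content")
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.fields["canonical"] = attrs.get("href")

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = "".join(self._buffer).strip()
            if tag == "title":
                self.fields["title"] = text
            else:
                self.fields["headings"].append((tag, text))
            self._capture = None


# Hypothetical page used only to exercise the parser.
HTML = """<html><head>
<title>How to Brew Pour-Over Coffee</title>
<meta name="description" content="A step-by-step pour-over guide.">
<link rel="canonical" href="https://example.com/pour-over">
</head><body><h1>Pour-Over Coffee</h1><h2>Equipment</h2><p>Text</p></body></html>"""

parser = PageFields()
parser.feed(HTML)
record = parser.fields
```

The resulting record, not the raw markup, stands in for the "indexed representation" that later stages work with.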

Crawling, indexing, and ranking

Ranking is the process of ordering indexed pages in response to a query. Google's ranking algorithm uses hundreds of signals, weighted and combined in ways the company does not fully disclose. The well-known signals include the relevance of the page to the query (does the page contain the words, in what positions, in what context), the authority of the page (how many other pages link to it, and how authoritative those pages are), the freshness of the content (when was it last updated, is the topic time-sensitive), the user's location and language (a query for pizza returns different results in New York and Naples), and the user's search history.

The original Google algorithm, described in the 1998 paper by Larry Page and Sergey Brin, was based primarily on PageRank, a measure of link authority. PageRank counted the number and quality of incoming links to a page, treating each link as a vote. Each page's score was computed from the scores of the pages linking to it. The publicly visible Toolbar PageRank, discontinued in 2016, showed a simplified version of that score on a scale from 0 to 10. Modern Google uses many more signals, but link authority remains a fundamental input, which is why link building is still a core SEO activity.
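The "links as votes" idea can be made concrete with a small power-iteration sketch. It uses the normalized textbook form of PageRank, in which scores sum to 1, with a damping factor of 0.85; the four-page graph is made up, and nothing here reflects how Google computes link signals today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute each page's score across its outgoing links.

    `links` maps every page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "c" is linked from three pages, "d" from none.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(GRAPH)
```

In this graph, page "c" ends up with the highest score because it collects votes from three pages, while "d", which nothing links to, keeps only the baseline.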

What has changed since 1998 is the rise of machine learning in ranking. Google's RankBrain, introduced in 2015, used machine learning to interpret queries it had not seen before. BERT, integrated in 2019, used natural language processing to understand the context of words in a query. MUM, announced in 2021, was multimodal and could understand images and text together. These systems mean that keyword stuffing (the old practice of repeating a target word dozens of times) is not just ineffective but counterproductive: the algorithms can often tell when content is written for engines rather than humans, and they demote it.

On-page fundamentals that still matter

The title tag is widely regarded as the most important on-page element for SEO. It appears as the clickable headline in search results and as the browser tab title. Keep it under about 60 characters (Google truncates titles by display width, which works out to roughly that length), put the primary keyword near the start, and write it as a human would read it. "How to Brew Pour-Over Coffee: A Step-by-Step Guide" is better than "Pour Over Coffee | Brewing Guide | Coffee Tips", because the first reads naturally and the second looks like keyword stuffing.

The meta description is the snippet of text below the title in search results. It does not directly affect ranking, but it affects click-through rate, which determines how much traffic a given ranking actually delivers. Write it as ad copy: under 160 characters, with the primary keyword, ending with an action verb or a benefit. Google often rewrites the meta description to match the user's query (third-party studies put the figure at roughly 60 to 70 percent of the time), but a well-written original is still worth providing because it is the baseline Google starts from.
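The length and keyword guidelines for titles and meta descriptions are easy to check mechanically. The sketch below assumes the character limits quoted above (60 and 160) as simple cutoffs; the function name and the messages are illustrative, and a character count only approximates how Google truncates text.

```python
TITLE_LIMIT = 60         # characters, per the guideline above
DESCRIPTION_LIMIT = 160  # characters, per the guideline above


def check_snippet(title, description, keyword):
    """Return a list of guideline violations for a title/description pair."""
    problems = []
    if len(title) > TITLE_LIMIT:
        problems.append(f"title is {len(title)} characters (limit {TITLE_LIMIT})")
    if keyword.lower() not in title.lower():
        problems.append("title does not contain the keyword")
    if len(description) > DESCRIPTION_LIMIT:
        problems.append(
            f"description is {len(description)} characters (limit {DESCRIPTION_LIMIT})"
        )
    if keyword.lower() not in description.lower():
        problems.append("description does not contain the keyword")
    return problems


issues = check_snippet(
    "How to Brew Pour-Over Coffee: A Step-by-Step Guide",
    "Learn to brew pour-over coffee at home with simple gear. Follow the steps.",
    "pour-over coffee",
)
```

A check like this catches mechanical problems only; whether the title reads naturally still needs a human eye.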

Headings (H1, H2, H3) structure the page for both readers and crawlers. Use one H1 per page, matching or closely related to the title tag. Use H2s for major sections and H3s for sub-sections. The hierarchy should be logical and nested, not chosen for visual size. Search engines use headings to understand the topic structure of the page and to extract featured snippets, the boxed answers that appear at the top of some search results. A page with clear, descriptive headings is more likely to earn a featured snippet than a page with vague or skipped headings.
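The two structural rules above (exactly one H1, no skipped levels) can be audited with a short script. This is a sketch built on Python's standard html.parser; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Records the level of every h1-h6 element in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_problems(html):
    """Return a list of heading-structure problems found in the HTML."""
    outline = HeadingOutline()
    outline.feed(html)
    problems = []
    h1_count = outline.levels.count(1)
    if h1_count != 1:
        problems.append(f"expected one h1, found {h1_count}")
    # Moving deeper by more than one level at a time skips a heading level.
    for previous, current in zip(outline.levels, outline.levels[1:]):
        if current > previous + 1:
            problems.append(f"h{previous} is followed by h{current}, skipping a level")
    return problems
```

For example, heading_problems("<h1>Guide</h1><h3>Grind size</h3>") reports that the H1 is followed directly by an H3.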

The URL should be short, descriptive, and stable. Use hyphens to separate words, not underscores or spaces. Avoid query parameters when possible, because they can produce several URLs that serve the same content, which search engines may treat as duplicates. Once a URL is published and indexed, do not change it; if you must, set up a 301 redirect from the old URL to the new one, so the link authority is preserved. Changing URLs without redirecting is one of the most common ways established sites lose their search traffic overnight.
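A sketch of both halves of this advice: generating a hyphenated slug from a page title, and answering requests for a retired URL with a 301. The slug rules, the redirect table, and the paths are all hypothetical; in practice redirects usually live in the web server or CMS configuration.

```python
import re
import unicodedata


def slugify(title):
    """Lowercase the title, strip accents, and join words with hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


# Hypothetical table of retired paths and their replacements.
REDIRECTS = {"/blog/pour_over_guide": "/blog/pour-over-guide"}


def resolve(path):
    """Return (status, location): a 301 for retired paths, otherwise 200."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]
    return 200, path


slug = slugify("How to Brew Pour-Over Coffee: A Step-by-Step Guide")
```

Keeping the redirect table alongside the content makes it harder to rename a page without also preserving its old address.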

Technical SEO: speed, structure, schema

Page speed is a confirmed ranking factor on both desktop and mobile. Google's Core Web Vitals are the specific metrics that matter: Largest Contentful Paint (LCP) of 2.5 seconds or less, Interaction to Next Paint (INP) of 200 milliseconds or less, and Cumulative Layout Shift (CLS) of 0.1 or less. These are not arbitrary targets; they are derived from user research on perceived performance. A site that meets all three feels fast to most users. A site that misses them feels slow, and Google may rank it lower, especially on mobile.
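The three targets translate directly into a pass/fail check. The sketch below treats a value equal to the limit as passing, and it summarizes samples at the 75th percentile, which is how Google's documentation describes assessing field data; the nearest-rank percentile method, the dictionary keys, and the sample numbers are assumptions made for illustration.

```python
import math

# "Good" limits for each metric, as quoted above.
THRESHOLDS = {"lcp_seconds": 2.5, "inp_ms": 200, "cls": 0.1}


def percentile_75(samples):
    """75th percentile by the nearest-rank method (one simple choice)."""
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def failing_vitals(measurements):
    """Return the names of the metrics that exceed their limit."""
    return [name for name, limit in THRESHOLDS.items() if measurements[name] > limit]


# Hypothetical field samples for one page.
page = {
    "lcp_seconds": percentile_75([1.8, 2.1, 2.4, 3.9]),
    "inp_ms": percentile_75([120, 180, 250, 400]),
    "cls": percentile_75([0.01, 0.02, 0.05, 0.30]),
}
failures = failing_vitals(page)
```

Here the page passes LCP and CLS at the 75th percentile but fails INP, even though one slow LCP sample and one large layout shift appear in the raw data.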

The most common causes of poor Core Web Vitals are large uncompressed images, render-blocking JavaScript, layout shifts from late-loading ads or images without dimensions, and slow server response times. The fixes are well-known: serve images in modern formats (WebP, AVIF) at the correct size and resolution, lazy-load images below the fold, defer or async-load non-critical JavaScript, set explicit width and height on every image and iframe, and use a content delivery network (CDN) to reduce server response time.
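Two of these fixes, explicit dimensions and deferred scripts, can be audited from the markup alone. The sketch below is a simplification: it flags every external script without defer or async, even though module scripts are deferred by default, and it does not look at image size or format. The class name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class MediaAudit(HTMLParser):
    """Flags images and iframes without dimensions, and blocking scripts."""

    def __init__(self):
        super().__init__()
        self.missing_dimensions = []
        self.blocking_scripts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "iframe") and not ("width" in attrs and "height" in attrs):
            self.missing_dimensions.append(attrs.get("src", "(no src)"))
        # Simplification: any external script without defer/async is flagged.
        if tag == "script" and "src" in attrs:
            if "defer" not in attrs and "async" not in attrs:
                self.blocking_scripts.append(attrs["src"])


# Hypothetical markup used only to exercise the audit.
SAMPLE = """
<script src="/app.js"></script>
<script src="/analytics.js" defer></script>
<img src="/hero.webp" width="1200" height="600">
<img src="/chart.png">
<iframe src="/embed"></iframe>
"""

audit = MediaAudit()
audit.feed(SAMPLE)
```

After feeding the sample, audit.missing_dimensions lists the image and iframe that could cause layout shifts, and audit.blocking_scripts lists the script worth deferring.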

Site structure matters for crawling and for user experience. A flat structure, where every page is reachable in three or four clicks from the home page, is better than a deep structure with many levels. Internal links should use descriptive anchor text (not "click here"), and they should connect related pages so that link authority flows from the home page to the deepest content. XML sitemaps help search engines discover pages, especially on large sites; they are not a substitute for good internal linking, but they are a useful safety net.
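Both ideas in this paragraph are straightforward to compute from a map of internal links. The sketch below measures click depth from the home page with a breadth-first search and builds a minimal XML sitemap using the sitemaps.org namespace; the link map and URLs are made up, and optional sitemap fields such as lastmod are left out.

```python
import xml.etree.ElementTree as ET
from collections import deque

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def click_depths(links, home):
    """Return the minimum number of clicks from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def build_sitemap(urls):
    """Serialize a list of absolute URLs as a UTF-8 XML sitemap (bytes)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical internal link map, keyed by path.
SITE = {
    "/": ["/guides", "/about"],
    "/guides": ["/guides/brewing"],
    "/guides/brewing": ["/guides/brewing/pour-over"],
    "/guides/brewing/pour-over": [],
    "/about": [],
}

depths = click_depths(SITE, "/")
too_deep = [page for page, depth in depths.items() if depth > 3]
sitemap = build_sitemap(["https://example.com" + page for page in SITE])
```

Pages that appear in the link map but not in the depths result are unreachable from the home page, which makes them good candidates for new internal links as well as sitemap entries.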

Structured data, also called schema markup, is a way to tell search engines what a page is about in a machine-readable format. The Schema.org vocabulary, launched in 2011 by Google, Microsoft, and Yahoo, with Yandex joining later that year, defines types for articles, products, events, recipes, reviews, organizations, and many other things. Adding JSON-LD structured data to a page does not directly improve its ranking, but it can make the page eligible for rich results (the starred reviews, the recipe cards, the event listings) that increase click-through rate. Use the Schema.org validator and Google's Rich Results Test to verify your markup.
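For a sense of what JSON-LD looks like in practice, the sketch below builds a small Article object and wraps it in the script tag that goes into the page. The function name, the author, and the date are placeholders; which properties a given rich result requires varies by type, so the output should still be checked with the validators mentioned above.

```python
import json


def article_json_ld(headline, author_name, date_published):
    """Return a script tag containing Schema.org Article markup as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": date_published,
    }
    # Escape "</" so a value can never close the script tag early;
    # "\/" is a valid JSON escape for "/".
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values for illustration only.
tag = article_json_ld("How to Brew Pour-Over Coffee", "Jane Doe", "2024-01-15")
```

Generating the markup from the same data that renders the page helps keep the structured data and the visible content from drifting apart.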

Content and links

Content is what ranks. Every other SEO tactic is in service of getting good content in front of searchers. Good content means content that thoroughly answers the searcher's question, is more useful than the alternatives already ranking, and is presented in a way that is easy to read. Length matters only to the extent that the topic requires it; a 500-word article that fully answers the question can outrank a 2,000-word article padded with filler. The right test is not "is this long enough?" but "would a searcher who landed here leave satisfied?"

Keyword research is the practice of finding what searchers actually search for, in what volume, and with what intent. Tools like Ahrefs, Semrush, and Google's own Keyword Planner provide search volume and difficulty estimates for millions of keywords. The goal is to find keywords with reasonable search volume, manageable competition, and clear intent that your content can serve. Long-tail keywords (specific, lower-volume queries like "how to fix a leaky Moen kitchen faucet") are often easier to rank for than head keywords ("plumbing") and convert better because the intent is clearer.

Links remain a major ranking signal. Internal links (within your own site) distribute authority and help crawlers discover content. External links (from other sites to yours) are votes of confidence that pass authority. The quality of the linking site matters more than the quantity; one link from a major news site is worth more than a hundred from low-quality directories. Link building through genuine value (original research, useful tools, distinctive writing) is more durable than link buying or link exchange, both of which Google penalizes when detected.

What to measure, and what to ignore

The four metrics that matter are organic traffic, rankings for target keywords, click-through rate from search results, and conversions from organic traffic. Organic traffic is the bottom line: are more people finding your site through search? Rankings are a leading indicator: if your target keywords are moving up, traffic will follow. Click-through rate tells you whether your titles and meta descriptions are compelling; if a page ranks third but has a 1 percent CTR, the title or description is the problem. Conversions tell you whether the traffic is doing what you want it to do.
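The click-through diagnosis described above is simple arithmetic: CTR is clicks divided by impressions. The sketch below flags pages that rank well but attract few clicks; the rows, the position cutoff, and the 2 percent CTR floor are invented for illustration, and sensible thresholds depend on the query and the layout of the results page.

```python
def click_through_rate(clicks, impressions):
    """Clicks divided by impressions, or 0.0 when there were no impressions."""
    return clicks / impressions if impressions else 0.0


def underperforming(rows, max_position=3.0, min_ctr=0.02):
    """Return pages ranking at or above `max_position` with CTR below `min_ctr`."""
    return [
        row["page"]
        for row in rows
        if row["position"] <= max_position
        and click_through_rate(row["clicks"], row["impressions"]) < min_ctr
    ]


# Hypothetical rows shaped like a search performance export.
ROWS = [
    {"page": "/pour-over", "position": 3.0, "clicks": 10, "impressions": 1000},
    {"page": "/french-press", "position": 2.0, "clicks": 150, "impressions": 1000},
    {"page": "/espresso", "position": 8.0, "clicks": 5, "impressions": 1000},
]

flagged = underperforming(ROWS)
```

Only the first page is flagged: it ranks third with a 1 percent CTR, so its title or description is the likely problem. The third page also has a low CTR, but at position eight that is expected rather than a sign of a weak snippet.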

The metrics that do not matter much, despite frequent obsession, are Domain Authority and Page Authority (proprietary Moz scores that correlate with rankings but are not used by Google), Toolbar PageRank (discontinued in 2016), and the raw number of backlinks (quality matters more than quantity). Alexa Rank, once a popular traffic estimate, was discontinued in 2022. Keyword density (the percentage of words on a page that are the target keyword) is a relic of the 2000s; modern algorithms understand synonyms and context, so density is irrelevant.

The tools that matter are Google Search Console (free, direct from Google, shows impressions, clicks, and average position for every query that returned your site), Google Analytics 4 (free, shows traffic and behavior on your site), and a third-party rank tracker like Ahrefs or Semrush (paid, tracks your rankings over time and analyzes competitors). Bing Webmaster Tools is worth setting up as well, since Bing has a non-trivial market share in some markets. Set these up before you start doing SEO, so you have a baseline to measure against.