
How to automate an internal link audit at scale

Internal linking is the most mechanical high-value job in SEO, and the worst one to do by hand across a content library. This is how we run an internal link audit at scale so the results stay verifiable instead of fabricated, written by someone who runs the system in production across client sites. The trick isn't the AI. It's refusing to trust the AI on anything it can't prove against the live site.

What an internal link audit actually checks

An internal link audit finds the links a site should have between its own pages but doesn't, and flags the ones it has that are broken, redirected, or pointing the wrong way. At scale the job is matching: for every page, which other pages on the site are the right targets, and what anchor text describes them honestly. That's it. It's a matching problem dressed up as a creative one.

People treat internal linking as a craft, and at the level of a single article it is. You read the piece, you remember a related post, you drop a link. Fine. Now do that across 800 articles where every new post should link back to the right four older ones, every old post should pick up a link from the new one, and each anchor has to name a real concept the target page covers. The craft framing falls apart. What's left is a problem with a clear input (the page graph plus the content), a clear output (a list of link placements), and a rule for what counts as correct. That shape is exactly what an agent is good at, and exactly what a human is slow and inconsistent at.

Google's own documentation is blunt about why the anchor side matters: anchor text should be descriptive and relevant, and you should "resist the urge to cram every keyword" into it (Google Search Central, 2026). That single rule, applied honestly across hundreds of pages, is half of the audit. The other half is reachability: every page you care about should have a link from at least one other page on the site.

Why doing it by hand breaks at scale

Manual internal linking doesn't scale because the work grows with the square of the content, not in a line. Each new page is a potential link target for every existing page and a potential source for every future one. A 50-post blog already has 50 × 49 = 2,450 possible source-to-target links to reason about. No one holds that graph in their head, so links get dropped, duplicated, or pointed at the wrong page.
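To put numbers on that growth, here is a minimal sketch. It assumes every page could in principle link to every other page and counts ordered source-to-target pairs. The function name is illustrative, not part of any tool described here.

```python
def possible_internal_links(page_count: int) -> int:
    """Count ordered (source, target) pairs between distinct pages."""
    return page_count * (page_count - 1)


for pages in (50, 200, 800):
    # 50 -> 2450, 200 -> 39800, 800 -> 639200
    print(pages, possible_internal_links(pages))
```

Going from 50 posts to 800 is 16 times the content and roughly 260 times the pairs to consider.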

The failure is quiet, which is what makes it expensive. Nothing errors. The site just slowly accumulates orphan pages no other page links to, money pages that never get the internal links they need to rank, and anchor text that's vague because the person placing it couldn't remember the exact angle of the target post. The cost isn't the time spent placing links. It's the ranking left on the table by the links that never got placed. Ahrefs puts the mechanism plainly: an orphan page passes no PageRank through internal links, because there are none, and Google may never find it in the first place (Ahrefs, 2026). A page can be live, indexed, and still invisible to its own site's link graph.

Across client sites the pattern was always the same. The team knew internal linking mattered, they just couldn't keep up with it, so it became the thing that got done in batches every few months when someone had a free afternoon. Batched, occasional, and inconsistent is the worst way to run a job that compounds.

The verifiable way to automate it

The safe way to automate internal linking is to make every step checkable against the actual site, never against the model's memory. The agent reads the real page graph, proposes links, then each proposed link gets validated: does the target page exist, does it return 200, does the anchor text appear verbatim in the source page, does the anchor name a concept the target actually covers. A link that fails any check is dropped, not shipped.

This is the distinction that matters, and it's where most AI SEO tools get it wrong. There are two ways an AI can place an internal link. It can reason from what it "knows" about your site and assert that page A should link to page B with anchor C, which is fast and produces confident, plausible, and sometimes completely fabricated links to pages that don't exist. Or it can be wired to call tools that read the real site, propose against real data, and verify every claim before it counts. The first mode hallucinates. The second can't, because nothing is trusted on the model's word. Every link is checked against ground truth.

That's the whole design principle behind how we built it. The internal-linking work runs as an agent that calls tools, the same agent-plus-tools shape covered in our Claude Agent SDK guide, and the tools enforce the rules: verbatim anchors only, a hard cap on links per page so a post never gets stuffed, anchors that have to name a concept the target page genuinely covers, and no near-duplicate anchors competing inside the same article. The model proposes. The tooling disposes. If a proposed link can't be verified against the live site, it never reaches the output. Our internal linker runs at 100 percent anchor validity for exactly this reason: validity isn't a quality target it aims for, it's a gate every link has to pass before it counts.
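One of those rules, no near-duplicate anchors inside the same article, is easy to picture in code. The sketch below is an assumption about how such a check could look, not the production tooling: it uses Python's standard difflib to compare a proposed anchor with the anchors already in the post, and the 0.85 threshold is a made-up starting point you would tune.

```python
from difflib import SequenceMatcher
from typing import Iterable


def normalize(anchor: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(anchor.lower().split())


def is_near_duplicate(candidate: str, existing: Iterable[str],
                      threshold: float = 0.85) -> bool:
    """True if the candidate is too similar to any anchor already in the post.

    The threshold is a hypothetical default, not a value from the article.
    """
    cand = normalize(candidate)
    return any(
        SequenceMatcher(None, cand, normalize(anchor)).ratio() >= threshold
        for anchor in existing
    )


in_post = ["internal link audit"]
print(is_near_duplicate("Internal link audits", in_post))   # True
print(is_near_duplicate("email deliverability", in_post))   # False
```

Character-level similarity is crude, but it is deterministic, which is the point: the rule is enforced by the tool, not left to the model's discretion.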

[Figure: The verify-every-link pipeline. 1. Crawl (real page graph), 2. Propose (match + anchor), 3. Validate (gate vs live site), 4. Approve (human final call). A link that fails a check is dropped, never shipped.]
The validate stage (green) is the gate: a proposed link that fails any check is dropped, not shipped.

The audit pipeline, step by step

The pipeline is four stages: crawl the real site to build the page graph, generate candidate links by matching each page against the others, validate every candidate against the rules and the live URLs, then output a placement list a human can approve. Each stage has a clear input and output, which is what lets the whole thing run unattended and still stay correct.

Here's what each stage does in practice.

Crawl and map

The agent pulls the real list of pages, their content, and their existing internal links. This is the ground truth everything else checks against. No assumptions about what's on the site, just what the crawl actually returns. Orphan pages and broken existing links fall out of this stage for free, because they show up the moment you have the real graph.
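As a rough illustration of how orphans and broken links fall out of the graph, here is a minimal sketch. It assumes the crawl has already been reduced to a mapping from each page URL to the set of internal URLs it links to, and that the crawl covers every live page, so a target missing from the mapping is treated as broken. The function name and the sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def audit_graph(links: Dict[str, Set[str]],
                home: str = "/") -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (orphan pages, broken links) from a crawled link graph.

    `links` maps every crawled page to the internal URLs it links to.
    """
    pages = set(links)
    inbound = {page: 0 for page in pages}
    broken: List[Tuple[str, str]] = []

    for source, targets in links.items():
        for target in targets:
            if target == source:
                continue  # a self-link does not make a page reachable
            if target in pages:
                inbound[target] += 1
            else:
                broken.append((source, target))  # target was never crawled

    # The home page is the entry point, so it is not reported as an orphan.
    orphans = sorted(p for p, count in inbound.items() if count == 0 and p != home)
    return orphans, sorted(broken)


crawl = {
    "/": {"/blog/a", "/blog/b"},
    "/blog/a": {"/blog/b", "/blog/old"},
    "/blog/b": {"/"},
    "/blog/c": {"/blog/a"},
}
orphans, broken = audit_graph(crawl)
print(orphans)  # ['/blog/c']
print(broken)   # [('/blog/a', '/blog/old')]
```

A real crawl would also record status codes and redirects per URL. This version only shows why the orphan list costs nothing extra once the graph exists.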

Match and propose

For each page, the agent finds the pages that are genuinely related and would make honest link targets, and proposes an anchor that names a concept the target covers. This is the part that looks like judgment and is actually pattern matching over the content, which is why a model does it well at volume and a tired human does it inconsistently.
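To show the shape of the matching step, here is a deliberately crude stand-in for the model: it ranks candidate targets by word overlap (Jaccard similarity) instead of by any real understanding of the content. Everything in it, including the stopword list and the default of four targets, is an assumption for illustration. The agent described here does this with a model, not with this function.

```python
import re
from typing import Dict, List, Set, Tuple

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for", "is", "how", "with"}


def tokens(text: str) -> Set[str]:
    """Lowercased words from the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def rank_targets(source: str, pages: Dict[str, str],
                 top_k: int = 4) -> List[Tuple[str, float]]:
    """Rank other pages by Jaccard word overlap with the source page."""
    src = tokens(pages[source])
    scored: List[Tuple[str, float]] = []
    for url, text in pages.items():
        if url == source:
            continue
        tgt = tokens(text)
        union = src | tgt
        score = len(src & tgt) / len(union) if union else 0.0
        if score > 0:
            scored.append((url, score))
    # Highest score first; URL breaks ties so the output is stable.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


pages = {
    "/a": "internal link audit for orphan pages",
    "/b": "orphan pages and internal link gaps",
    "/c": "email marketing subject lines",
}
print(rank_targets("/a", pages))
```

Whatever produces the candidates, the output has the same shape: a short ranked list per source page, handed to validation rather than trusted.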

Validate

Every proposed link runs the gauntlet: target exists and returns 200, anchor text appears verbatim in the source page or can be placed cleanly, anchor names a real concept on the target, no duplicate or near-duplicate anchor already lives in that post, and the link count stays under the per-page cap. Anything that fails is dropped with a reason. This stage is the difference between an automated audit you can trust and a pile of plausible guesses.
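The gate can be sketched as a function that returns the first failed check as a reason, or nothing if the link passes. This is a simplified illustration under stated assumptions, not our production validator: status codes and page bodies come from a crawl that has already run, duplicates are matched case-insensitively rather than fuzzily, the cap of 5 is a made-up number, and the "anchor names a real concept on the target" check is left out because it needs more than string matching.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

MAX_LINKS_PER_PAGE = 5  # hypothetical cap; the article does not state a number


@dataclass(frozen=True)
class Candidate:
    source: str
    target: str
    anchor: str


def first_failure(c: Candidate, status: Dict[str, int], body: Dict[str, str],
                  anchors: Dict[str, List[str]], counts: Dict[str, int]) -> Optional[str]:
    """Return the reason a candidate fails, or None if it passes every check."""
    if status.get(c.target) != 200:
        return "target missing or not 200"
    if c.anchor not in body.get(c.source, ""):
        return "anchor not verbatim in source"
    if c.anchor.lower() in {a.lower() for a in anchors.get(c.source, [])}:
        return "duplicate anchor"
    if counts.get(c.source, 0) >= MAX_LINKS_PER_PAGE:
        return "per-page cap reached"
    return None


def validate(candidates: List[Candidate], status: Dict[str, int], body: Dict[str, str],
             existing_anchors: Dict[str, List[str]], link_count: Dict[str, int],
             ) -> Tuple[List[Candidate], List[Tuple[Candidate, str]]]:
    """Split candidates into kept links and (dropped link, reason) pairs."""
    anchors = {page: list(found) for page, found in existing_anchors.items()}
    counts = dict(link_count)
    kept: List[Candidate] = []
    dropped: List[Tuple[Candidate, str]] = []
    for c in candidates:
        reason = first_failure(c, status, body, anchors, counts)
        if reason is not None:
            dropped.append((c, reason))
            continue
        kept.append(c)
        # Accepted links count toward the cap and the duplicate check.
        anchors.setdefault(c.source, []).append(c.anchor)
        counts[c.source] = counts.get(c.source, 0) + 1
    return kept, dropped


status = {"/b": 200, "/gone": 404}
body = {"/a": "Read our internal link audit guide."}
proposed = [
    Candidate("/a", "/b", "internal link audit"),
    Candidate("/a", "/gone", "guide"),
]
kept, dropped = validate(proposed, status, body, {}, {})
print(kept)
print(dropped)
```

Returning a reason for every drop matters as much as the drop itself: it is what lets a reviewer see why a plausible-looking link never made the list.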

Output and approve

The result is a placement list: source page, target page, anchor text, where it goes. A person reviews it, the same way a person reviews everything before it ships, and approves or trims. The agent did the thousand-item matching job. The human does the final call, which now takes minutes instead of days.
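The placement list itself can be as plain as a CSV a reviewer opens in a spreadsheet. A minimal sketch, assuming four columns that mirror the fields named above; the column names and the sample row are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable

FIELDS = ["source", "target", "anchor", "location"]


def placements_to_csv(rows: Iterable[Dict[str, str]]) -> str:
    """Serialize validated placements to CSV text for human review."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


print(placements_to_csv([
    {"source": "/a", "target": "/b",
     "anchor": "internal link audit", "location": "paragraph 3"},
]))
```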

The rule: the validate stage owns correctness, the human owns priority. If a link can't be proven against the live site, it doesn't make the list. If it can, a person still decides whether it serves this quarter's plan.

What you still keep a human for

A person still owns the strategy and the final approval. The agent decides which links are valid; it doesn't decide that a particular money page is the quarter's priority and should get extra internal links pointed at it. That's a business call. The audit surfaces the options and proves they're valid. A human picks which ones serve the plan.

I'll be blunt about the boundary, because the honest version is the useful one. The agent is better than a person at the mechanical parts: holding the whole graph, matching at volume, never getting bored, never shipping a link to a 404. A person is better at the parts that need context the site can't tell you: which pages matter most to the business this month, when a technically valid link is strategically pointless, when the anchor that passes every check still reads slightly off for the brand. Automate the matching and the verification. Keep the judgment and the final read. That split is the same one that makes the rest of our AI agents for SEO work in production: the agent does the operational grind, the specialist does the thinking.

Where to start

Start by crawling your own site and listing every page with no meaningful inbound internal links. That orphan list is the fastest win in internal linking, and you don't need full automation to act on it. Once you've seen how much was sitting there unlinked, the case for automating the ongoing audit makes itself.

You don't have to build the whole pipeline to get value on day one. Run a crawl, find the orphans, find the money pages starved of internal links, and fix those by hand first. That alone usually surfaces more opportunity than a team expects. The automation earns its place when you realize the audit isn't a one-time cleanup, it's a job that has to run every time content changes, forever, and stay correct each time. That's the part no one wants to do by hand and the part a verifiable agent does well.

If you'd rather not build it, this is one of the jobs we run as a done-for-you audit on client sites. Send one domain and we'll map the missing internal links, with every anchor validated against the live pages, so the gap map you get back is a list of placements that are already proven, not a list of guesses to go check. Get a free audit on one client site - one domain, every finding checkable against the live site, no contract.


Pavle Lazic is the founder of Scalably, where he builds and runs multi-tenant Claude agent platforms in production for real businesses. He writes about the Claude Agent SDK, MCP servers, and what it actually takes to put AI agents to work on SEO.