scalably.io the work

How website classification works

Drop in a list of websites and get each one back labelled with its niche: what the business actually is, decided by reading the site rather than skimming for keywords. Useful when you have a thousand domains and no idea what's in the pile.

A look under the hood: what it reads, the one rule that keeps it from being fooled, and why every domain comes back labelled, none dropped.


The short version

You hand it a list of domains and, if you like, your own set of categories. For each site it reads the homepage and the about page, works out in plain words what the business actually does, and from that gives it a primary niche, a secondary one, and a confidence level. The result is your list back with a niche column filled in, every domain labelled. It doesn't remove anything, rank anything, or look for contact details. It answers one question: what is each of these sites?

That's the whole job. The rest of this page is how it reads a site, and the rule that stops it from being fooled.

[Figure: It reads the site, then names the niche. Steps, in order: Read the site (home + about page), Name the business (in a few words first), Tag the niche (primary + secondary), Mark confidence (high, med, or low).]

The green steps are the judgment. Notice the order: it names the business in plain words first, and only then picks a niche. That sequence is the whole reason it's accurate.

It reads the site, it doesn't skim for words

For each domain it pulls the two pages that tell you what a business is, the homepage and the about page, and works from what they actually say. The classification is made from the real content of the site, not from its name or a guess.
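To make "pulls the two pages" concrete, here is a minimal sketch of a fetch that gives up rather than hangs. Everything in it is an assumption for illustration: the function names, the `/about` path, and the time and size limits are hypothetical, not the tool's real internals.

```python
import http.client
import time
import urllib.request
from typing import BinaryIO, Optional


def read_capped(stream: BinaryIO, max_bytes: int, deadline: float,
                chunk: int = 8192) -> bytes:
    """Read until EOF, the byte cap, or the deadline, whichever comes first."""
    buf = bytearray()
    while len(buf) < max_bytes and time.monotonic() < deadline:
        piece = stream.read(min(chunk, max_bytes - len(buf)))
        if not piece:  # end of stream
            break
        buf.extend(piece)
    return bytes(buf)


def fetch_page(url: str, timeout: float = 5.0, budget: float = 10.0,
               max_bytes: int = 200_000) -> Optional[str]:
    """Return page text, or None if the page can't be read.

    `timeout` limits each socket operation; `budget` is checked between
    reads, so a slow-drip server is cut off soon after the budget passes.
    """
    deadline = time.monotonic() + budget
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            raw = read_capped(resp, max_bytes, deadline)
    except (OSError, ValueError, http.client.HTTPException):
        return None  # unreachable, malformed URL, or broken response
    return raw.decode("utf-8", errors="replace")


def fetch_site(domain: str) -> Optional[str]:
    """Homepage plus about page (the '/about' path is an assumption)."""
    pages = [fetch_page(f"https://{domain}/"),
             fetch_page(f"https://{domain}/about")]
    text = "\n".join(p for p in pages if p)
    return text or None  # None means "could not be read at all"
```

A `None` result is the signal that the site could not be read, which is the case the low-confidence fallback exists for.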

The order of operations is the trick, and it's deliberate. Before it's allowed to choose a category, the agent has to first state, in a few plain words, what this business actually does. Only then does it pick the niche that matches that description. Forcing the plain-English answer first is what stops it reaching for a label that merely sounds right. If a site can't be read at all, it's classified from its domain name alone and clearly marked low confidence, so you always know which labels to trust.
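The describe-first sequence can be sketched in a few lines. This is an illustration under stated assumptions, not the agent's actual code: `classify_site`, `Label`, and the three stand-in functions are hypothetical names, and the stubs below replace the real reading and judgment steps with trivial placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Label:
    domain: str
    description: str          # plain-words statement of what the business does
    primary: str
    secondary: Optional[str]
    confidence: str           # "high", "med", or "low"


def classify_site(
    domain: str,
    fetch: Callable[[str], Optional[str]],
    describe: Callable[[str], str],
    pick_niche: Callable[[str], Tuple[str, Optional[str], str]],
) -> Label:
    text = fetch(domain)
    if not text:
        # Unreadable site: classify from the domain name alone and
        # force low confidence, whatever the picker would have said.
        primary, secondary, _ = pick_niche(domain)
        return Label(domain, "unreadable; judged from domain name",
                     primary, secondary, "low")
    description = describe(text)  # step 1: say what the business does
    # Step 2: the picker sees only the description, never the raw page,
    # so stray vocabulary on the page can't pull the label around.
    primary, secondary, confidence = pick_niche(description)
    return Label(domain, description, primary, secondary, confidence)


# Hypothetical stand-ins so the sketch runs end to end.
PAGES = {"alphafund.example": "We manage money for institutions using machine learning."}

def fake_fetch(domain):
    return PAGES.get(domain)

def fake_describe(text):
    return "an investment fund that manages money for institutions"

def fake_pick(description):
    if "fund" in description:
        return ("Finance", "Technology", "high")
    return ("Business", None, "med")

readable = classify_site("alphafund.example", fake_fetch, fake_describe, fake_pick)
unreadable = classify_site("deadsite.example", fake_fetch, fake_describe, fake_pick)
```

The design point is the narrow hand-off: the niche picker only ever receives the short description, which is what forces the plain-words answer to come first.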

The one rule: what it is, not what it mentions

The classic mistake in this kind of work is tagging a site by a word that appears on it, rather than by what the business actually is. The whole method is built to refuse that. A company is classified by what it does, not by what it talks about.

Take a hedge fund whose homepage is covered in talk of machine learning and AI. The lazy call, and the one a keyword-counter makes, is to tag it "AI". But it isn't an AI company; it's a finance company that uses AI as a tool. So it's Finance. The same logic runs everywhere: a marketing agency that writes about crypto is still Marketing; a law firm with a page about software is still legal services. The category follows the business, not the vocabulary.

[Figure: What it is, not what it mentions. A fund that uses AI (AI all over the page). Tempting: tag it “AI”? (the keyword trap). Correct: no, it’s Finance (that’s the business).]

One example, the whole philosophy. When the keyword and the business disagree, the business wins. And when nothing fits well, it says so with low confidence rather than forcing a label.
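The hedge fund example is easy to reproduce with a toy. The sketch below is purely illustrative: the keyword lists, the page text, and the `BUSINESS_TYPES` lookup are invented for the demonstration, and the hand-written description stands in for the agent's plain-words step.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical vocabularies a naive keyword counter might use.
KEYWORDS = {
    "AI": {"ai", "machine", "learning"},
    "Finance": {"fund", "investors", "capital"},
}

def keyword_label(text: str) -> str:
    """The trap: pick whichever niche's words appear most often."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter({niche: sum(w in vocab for w in words)
                      for niche, vocab in KEYWORDS.items()})
    return counts.most_common(1)[0][0]

# Hypothetical lookup keyed on what the business *is*.
BUSINESS_TYPES = {
    "hedge fund": "Finance",
    "marketing agency": "Marketing",
    "law firm": "Legal services",
}

def label_from_description(description: str) -> Optional[str]:
    """Label by business type; None means nothing fits well."""
    lowered = description.lower()
    for phrase, niche in BUSINESS_TYPES.items():
        if phrase in lowered:
            return niche
    return None

page = ("We are a hedge fund. Our AI models use machine learning; "
        "AI drives every trade, and our AI research team publishes "
        "on machine learning.")

trap = keyword_label(page)
correct = label_from_description(
    "a hedge fund that uses machine learning to pick trades")
```

On this page the counter sees seven AI words against one finance word and picks "AI", while the description-based label lands on "Finance". A real description step is far more capable than a phrase lookup; the toy only shows why the two approaches disagree.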

Your categories, or sensible defaults

You can hand it your own list of niches and it'll classify against exactly those. If you don't, it uses a sensible default set that covers most of the web, and it confirms the list with you before it starts, so you're never surprised by the buckets it used.

The built-in categories cover the common ground: SaaS, Finance, Marketing, iGaming, Crypto, E-commerce, Business, Technology, AI, Health, Education, Travel, Lifestyle, Home and Decor, Productivity, and Personal Development. But the more useful mode is your own. If you care about a distinction the defaults don't draw, say "split fintech from general finance," or "I only care about these six," give it your categories and it builds its understanding around them. Either way, before a single domain is tagged, it shows you the list it's about to use.
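As a rough sketch of that choice, assuming hypothetical function names and a simple clean-up rule of my own (trim whitespace, drop blanks and duplicates), the category step might look like this:

```python
from typing import List, Optional

# The default niches exactly as listed above.
DEFAULT_NICHES = [
    "SaaS", "Finance", "Marketing", "iGaming", "Crypto", "E-commerce",
    "Business", "Technology", "AI", "Health", "Education", "Travel",
    "Lifestyle", "Home and Decor", "Productivity", "Personal Development",
]

def resolve_categories(custom: Optional[List[str]] = None) -> List[str]:
    """Use the caller's niches if any were given, else the defaults."""
    cleaned: List[str] = []
    for name in custom or []:
        name = name.strip()
        if name and name not in cleaned:  # drop blanks and duplicates
            cleaned.append(name)
    return cleaned or list(DEFAULT_NICHES)

def confirmation_message(categories: List[str]) -> str:
    """The list shown to the user before any domain is tagged."""
    return (f"About to classify against {len(categories)} niches: "
            + ", ".join(categories))

mine = resolve_categories([" Fintech ", "General finance", "Fintech", ""])
defaults = resolve_categories()
```

Whichever list comes out of `resolve_categories` is the one the confirmation step would display, so the buckets are known before the run starts.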

Built to handle a big pile, without dropping any of it

A list of a few thousand domains is normal, so the work is split into batches that run in parallel, and a big list comes back in minutes, not hours. The guarantee that matters: the number of rows that come out equals the number you put in.
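Here is one way the batching and the rows-in-equals-rows-out check could be written. It is a sketch under assumptions: the batch size, the worker count, and the placeholder `classify_batch` are hypothetical, and a thread pool is just one reasonable way to run batches side by side.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def classify_batch(batch: List[str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in for the real per-batch classification."""
    return [{"domain": d, "niche": "Business", "confidence": "low"}
            for d in batch]

def classify_all(domains: List[str], batch_size: int = 50,
                 workers: int = 8) -> List[Dict[str, str]]:
    batches = chunked(domains, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so rows line up
        # with the original list even though batches run concurrently.
        results = list(pool.map(classify_batch, batches))
    rows = [row for batch_rows in results for row in batch_rows]
    if len(rows) != len(domains):
        raise RuntimeError(
            f"row count mismatch: {len(domains)} in, {len(rows)} out")
    return rows

domains = [f"site{i}.example" for i in range(1234)]
rows = classify_all(domains)
```

The final length check is the guarantee in code form: if any batch lost or gained a row, the run fails loudly instead of handing back a shorter list.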

Two things keep a large run honest. First, the reading is done by a bounded fetcher that can't get stuck, even on sites that fight back, so one slow or hostile page never freezes the batch. Second, nothing is allowed to silently vanish. If a chunk of the list comes back malformed, it's retried; if it still can't be resolved, those domains are marked "unclassified" rather than dropped. So you can always reconcile the output against your input, one to one. At the end you get a short summary too: how many landed in each niche, how many were high confidence, and how many need a second look.
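The retry-then-mark rule and the closing summary can be sketched as follows. The details are assumptions: the number of attempts, the definition of "malformed", giving unclassified rows a low confidence flag, and counting anything below high confidence as needing a second look are all choices made for this illustration.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, str]

def is_well_formed(rows, batch: List[str]) -> bool:
    """One dict per input domain, in order, each with a niche."""
    return (isinstance(rows, list)
            and len(rows) == len(batch)
            and all(isinstance(r, dict) and r.get("domain") == d
                    and bool(r.get("niche"))
                    for r, d in zip(rows, batch)))

def run_batch_safely(batch: List[str],
                     classify_batch: Callable[[List[str]], object],
                     max_attempts: int = 2) -> List[Row]:
    for _ in range(max_attempts):
        try:
            rows = classify_batch(batch)
        except Exception:
            continue  # treat a crash like a malformed result and retry
        if is_well_formed(rows, batch):
            return rows
    # Still unresolved: keep every domain, marked rather than dropped.
    return [{"domain": d, "niche": "unclassified", "confidence": "low"}
            for d in batch]

def summarize(rows: List[Row]) -> dict:
    return {
        "per_niche": dict(Counter(r["niche"] for r in rows)),
        "high_confidence": sum(r["confidence"] == "high" for r in rows),
        "needs_review": sum(r["confidence"] != "high" for r in rows),
    }

# Hypothetical classifiers: one fails once, one always fails.
calls = {"n": 0}
def flaky(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        return batch[:-1]  # malformed: wrong shape and a row short
    return [{"domain": d, "niche": "Finance", "confidence": "high"}
            for d in batch]

def broken(batch):
    raise ValueError("garbled response")

rows = (run_batch_safely(["a.example", "b.example"], flaky)
        + run_batch_safely(["c.example"], broken))
report = summarize(rows)
```

Three domains go in and three rows come out: two labelled on the second attempt, one marked "unclassified". That is what makes a one-to-one reconciliation against the input possible.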

What it isn't

This labels. It doesn't judge. It won't tell you a site is good or bad, remove the ones you don't want, or find anyone's email. Sorting a list down to the keepers, or pulling contact details, are separate jobs with their own tools. Keeping classification to the single question "what is this?" is what makes it fast and dependable.

What you get, and what you do with it

You get your own list back, unchanged except for a clean niche label on every row, plus a confidence flag so you know which ones to trust at a glance. It's the raw material for whatever you do next: filtering, prioritising, reporting, or just finally knowing what's in a list you'd lost track of.

That narrowness is the value. By doing only the labelling, and doing it by reading rather than keyword-matching, it gives you a layer of truth you can build on. The machine does the patient reading across hundreds or thousands of sites and the careful "what is this, really" call on each. You decide what the labels are for.
