How to Find All Current and Archived URLs on a Website

There are plenty of reasons you might want to find all the URLs on a website, but your exact goal will determine what you’re searching for. For instance, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Uncover all 404 URLs to recover from post-migration errors
In each scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and hard to extract data from.

In this post, I’ll walk you through some tools to build your URL list before deduplicating the data with a spreadsheet or Jupyter Notebook, depending on your website’s size.

Old sitemaps and crawl exports
If you’re looking for URLs that recently disappeared from the live site, there’s a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven’t already, search for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get so lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. When you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To get around the lack of an export button, use a browser scraping plugin like Dataminer.io, or query the Wayback Machine programmatically, as in the sketch below. Still, these limitations mean Archive.org may not offer a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
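
If you prefer the programmatic route, the Wayback Machine exposes a CDX API that returns captured URLs for a domain. Here’s a minimal sketch in Python; the domain, limit, and filters are illustrative:

    import requests

    # Query the Wayback Machine CDX API for URLs captured on a domain.
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": "example.com",
            "matchType": "domain",  # include subdomains
            "output": "json",
            "fl": "original",       # return only the originally captured URL
            "collapse": "urlkey",   # one row per unique URL
            "limit": 10000,
        },
        timeout=120,
    )
    resp.raise_for_status()

    rows = resp.json()
    urls = [row[0] for row in rows[1:]]  # the first row is the header
    print(f"{len(urls)} unique URLs captured")

Expect plenty of noise in the output (redirects, resource files, malformed URLs), so plan to filter the results before merging them with other sources.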

Moz Pro
While you might normally use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re dealing with a huge website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.

It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most websites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.
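
If you take the API route, a minimal sketch might look like the following. Treat the endpoint, request fields, and response shape here as assumptions to verify against Moz’s current Links API documentation; the credentials are placeholders:

    import requests

    # Assumed Moz Links API v2 endpoint and Basic-auth credentials;
    # verify both against Moz's current documentation before use.
    ENDPOINT = "https://lsapi.seomoz.com/v2/links"
    AUTH = ("YOUR_ACCESS_ID", "YOUR_SECRET_KEY")  # placeholder credentials

    payload = {
        "target": "example.com",  # the site whose inbound links you want
        "limit": 50,              # page through results for large sites
    }
    resp = requests.post(ENDPOINT, json=payload, auth=AUTH, timeout=60)
    resp.raise_for_status()

    # Collect the linked-to URLs on your own site from each record;
    # the "results" and "target" field names are assumptions as well.
    target_urls = {link.get("target") for link in resp.json().get("results", [])}
    print(sorted(target_urls))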

Google Search Console
Google Search Console provides several useful resources for building your list of URLs.

Links reports:

Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you may need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:

This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets; a sketch follows below. There are also free Google Sheets plugins that simplify pulling more extensive data.
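
As a minimal sketch of the API route, assuming a Google Cloud service account that has been granted access to your Search Console property, you can page through the Search Analytics query endpoint to collect every page with impressions:

    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    # Assumes a service-account JSON key with read access to the property.
    creds = service_account.Credentials.from_service_account_file(
        "service-account.json",
        scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
    )
    service = build("searchconsole", "v1", credentials=creds)

    pages, start_row = set(), 0
    while True:
        resp = service.searchanalytics().query(
            siteUrl="https://example.com/",  # or "sc-domain:example.com"
            body={
                "startDate": "2024-01-01",
                "endDate": "2024-03-31",
                "dimensions": ["page"],
                "rowLimit": 25000,  # API maximum per request
                "startRow": start_row,
            },
        ).execute()
        rows = resp.get("rows", [])
        pages.update(row["keys"][0] for row in rows)
        if len(rows) < 25000:
            break
        start_row += 25000

    print(f"{len(pages)} pages with impressions")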

Indexing → Pages report:

This section offers exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps (a programmatic alternative using the GA4 Data API is sketched after the steps):

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”

Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they still provide useful insights.
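
Here is a minimal sketch of that programmatic alternative using the GA4 Data API via the google-analytics-data Python package; the property ID and date range are placeholders:

    from google.analytics.data_v1beta import BetaAnalyticsDataClient
    from google.analytics.data_v1beta.types import (
        DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
    )

    # Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key
    # with access to the GA4 property.
    client = BetaAnalyticsDataClient()

    request = RunReportRequest(
        property="properties/123456789",  # placeholder GA4 property ID
        dimensions=[Dimension(name="pagePath")],
        metrics=[Metric(name="screenPageViews")],
        date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
        # Equivalent of the /blog/ segment from the steps above.
        dimension_filter=FilterExpression(
            filter=Filter(
                field_name="pagePath",
                string_filter=Filter.StringFilter(
                    match_type=Filter.StringFilter.MatchType.CONTAINS,
                    value="/blog/",
                ),
            )
        ),
        limit=100000,
    )

    response = client.run_report(request)
    urls = [row.dimension_values[0].value for row in response.rows]
    print(f"{len(urls)} blog paths")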

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be huge, and many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process; a minimal parsing sketch follows this list.
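
As a starting point, here’s a simple sketch that extracts the requested paths from a log in the common Apache combined format; adjust the regex to match your server’s or CDN’s actual log format:

    import re
    from urllib.parse import urlsplit

    # Matches the request portion of a combined-format log line,
    # e.g.: "GET /blog/post-1?utm=x HTTP/1.1"
    REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

    paths = set()
    with open("access.log", encoding="utf-8", errors="replace") as f:
        for line in f:
            match = REQUEST_RE.search(line)
            if match:
                # Keep only the path; drop the query string for deduplication.
                paths.add(urlsplit(match.group(1)).path)

    print(f"{len(paths)} unique paths requested")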
Combine, and good luck
Once you’ve collected URLs from all these sources, it’s time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
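
In a notebook, a simple normalize-and-deduplicate pass might look like this; the normalization rules (lowercasing hosts, stripping fragments and trailing slashes) are one reasonable choice, not the only one:

    from urllib.parse import urlsplit, urlunsplit

    def normalize(url: str) -> str:
        """Normalize a URL so trivial variants collapse into one entry."""
        parts = urlsplit(url.strip())
        scheme = (parts.scheme or "https").lower()
        host = parts.netloc.lower()
        path = parts.path.rstrip("/") or "/"
        # Drop the fragment; keep the query, which may distinguish real pages.
        return urlunsplit((scheme, host, path, parts.query, ""))

    # all_urls would hold the combined exports from every source above.
    all_urls = [
        "https://Example.com/blog/post-1/",
        "https://example.com/blog/post-1",
        "https://example.com/blog/post-1#comments",
    ]
    deduped = sorted({normalize(u) for u in all_urls})
    print(deduped)  # -> ['https://example.com/blog/post-1']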

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
