SiteOne Crawler — website analyzer and cloner

Ján Regeš
8 min read · Dec 9, 2023

Greetings to all web developers, QA engineers, DevOps engineers, SEO specialists, website owners and online consultants.

I would like to introduce a free, open-source tool that I believe you will quickly come to love and keep using in the long run. Its goal is to help improve the quality of websites worldwide.

It analyzes your entire website, every single file it finds, gives you a clear report, and offers additional functions such as a complete export of your website to an offline version that you can browse from a local disk or USB stick.

This tool can be used as a desktop application (for Windows/macOS/Linux) or as a command-line tool with clear and detailed console output, which also makes it usable in CI/CD pipelines. Note: In the next few days we will set up Apple and Microsoft developer accounts so that we can properly sign the desktop apps and make the installation trusted. We also plan to get the applications into the official App Store and Microsoft Store.

If you don’t like reading, scroll to the videos at the end of the article, which show practical examples.

SiteOne Crawler — https://crawler.siteone.io/

SiteOne Crawler — screenshots from desktop application, command-line tool and HTML report

Main functionalities

For developers and QA engineers

  • No one is perfect, and I don’t know of a single developer or company that runs a truly flawless website, even with several levels of testing and checklists. A website is usually not just the homepage but a large set of different pages, which makes it difficult to check the entire site for SEO, security, performance, accessibility, semantics, content quality, etc. This tool crawls every single page and every URL found anywhere in the content, including JS, CSS, images, fonts and documents. Depending on the type of content, it performs various analyses and reports imperfections.
  • Works well for development versions of websites on localhost and specific ports, and for sites that require an HTTP proxy or HTTP authentication.
  • It can also generate a (usually) fully functional, browsable offline static version of the website, even when dynamic query parameters are used in URLs. The exception is some modern JS frameworks that use JS modules, which browsers block under CORS rules on the local file:// protocol.
  • It can generate sitemap.xml and sitemap.txt files listing the URLs of all existing pages.
  • It can also serve as a stress-test tool, as it allows you to set the max number of parallel requests and the max number of requests per second. But please do not abuse the tool for DoS attacks.
  • It is really thorough in finding and crawling URLs: for example, it downloads all images listed in srcset attributes and in CSS url(), and for Next.js websites it even detects the build-manifest and derives from it the URLs of all JS chunks, which it then downloads.
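
To make the URL discovery mentioned in the last point more concrete, here is a minimal Python sketch of how URLs can be pulled out of a srcset attribute and out of CSS url() references. It is only an illustration under simplifying assumptions, not the crawler’s actual implementation: the function names are hypothetical, and the srcset value is split on commas, which ignores the rare case of URLs that contain commas themselves.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the target.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def urls_from_srcset(srcset: str, base_url: str) -> list[str]:
    """Return absolute URLs of all image candidates in a srcset value."""
    urls = []
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if parts:
            # The first token is the URL, the optional second one is "2x", "640w", ...
            urls.append(urljoin(base_url, parts[0]))
    return urls


def urls_from_css(css: str, base_url: str) -> list[str]:
    """Return absolute URLs referenced via url(...) in a stylesheet."""
    urls = []
    for match in CSS_URL_RE.finditer(css):
        target = match.group(2).strip()
        # Inline data: URIs carry their own content, there is nothing to download.
        if target and not target.startswith("data:"):
            urls.append(urljoin(base_url, target))
    return urls


if __name__ == "__main__":
    print(urls_from_srcset("img/a.jpg 1x, img/a@2x.jpg 2x", "https://example.com/blog/"))
    print(urls_from_css('body{background:url("/bg.png")}', "https://example.com/css/site.css"))
```

Relative URLs are resolved against the address of the document that contains them, which is why the stylesheet’s own URL is passed as the base for url() references.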

The crawler performs the following analyses and reports any deficiencies it finds:

  • reports, for each URL, the HTTP status code, content type, response time and size, title, description, DOM element count, etc.;
  • checks inline SVGs and warns when there are large inline SVGs in the HTML, or a lot of duplication, and it would be better to insert them as an extra *.svg file that may be cached;
  • checks the validity of the SVG from an XML perspective (very often manual editing of SVGs will break the syntax and not all browsers can fix this with their autocorrect);
  • checks for missing quotes in HTML text attributes (can cause problems if values are not escaped correctly);
  • checks the max depth of DOM elements and warns if the depth exceeds 30;
  • checks the semantic structure of headings, the existence of just one <h1>, warns about details;
  • checks that phone numbers contained in the HTML are correctly wrapped in a link with href="tel:", so that they can be clicked on to make a phone call;
  • checks the uniqueness of titles and meta descriptions — it will alert you very quickly if you don’t add the page number to the title, or the name of a filtered category, etc.;
  • checks the use of modern Brotli compression for the most efficient data transfer;
  • checks the use of modern WebP and AVIF image formats;
  • checks for accessibility and that important HTML elements have aria attributes, images have alt attributes, etc.;
  • checks HTTP headers of all responses and warns about the absence of important security headers and generates statistics of all HTTP headers and their unique values;
  • checks cookie settings and warns about missing Secure flags for HTTPS, HttpOnly or SameSite;
  • checks OpenGraph metadata on all pages and displays their values in the report;
  • checks and reports all 404 pages, including the URLs of the pages that link to them (also monitors links to external domains);
  • checks and reports all 301/302 redirects, including the URLs of the pages where the redirected link appears (also monitors links to external domains);
  • checks and reports DNS settings (IP address(es) to which the domain is resolved, including visualization of possible CNAME chain);
  • checks and reports SSL/TLS settings — reports the validity of the certificate from-to, warns about support of unsafe SSL/TLS protocols, or recommends the use of newer ones;
  • if enabled, downloads all linked assets from other domains (JS, CSS, images, fonts, etc.);
  • downloads robots.txt on every domain it browses and respects the prohibition of crawling on pages forbidden in robots.txt;
  • displays all unique images found on the website in the Image Gallery report;
  • displays statistics of the fastest and slowest pages, which are best to optimize, add cache, etc.;
  • displays statistics on the number, size and speed of downloads of each content type and then a larger breakdown by mime-type (Content-Type header);
  • displays a summary of all findings, sorted by severity;
  • allows you to include selected response HTTP headers as extra columns in the URL listing (in the console and HTML report) via the ‘--extra-columns’ setting, typically e.g. ‘X-Cache’;
  • has dozens of useful settings that can be used to influence the behavior of crawling, parsing, caching, reporting, output, etc.;
  • in the future we want to implement a lot of other controls and analyses that will make sense within the user community — the goal is to create a free tool that will be very useful and versatile.
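
As an illustration of what two of the checks above involve (the maximum DOM depth of 30 and the existence of exactly one <h1>), here is a small Python sketch built on the standard html.parser module. It is an assumption-laden approximation, not the crawler’s own code: the names StructureChecker and check_structure are made up for this example, and elements whose closing tag is omitted (such as an unclosed <p> or <li>) are counted as still open, so the depth can be overestimated on sloppy markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay "open".
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class StructureChecker(HTMLParser):
    """Tracks the deepest nesting level and the number of <h1> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.h1_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.h1_count += 1
        if tag in VOID_ELEMENTS:
            # A void element sits one level below its parent but opens nothing.
            self.max_depth = max(self.max_depth, self.depth + 1)
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.depth > 0:
            self.depth -= 1


def check_structure(html: str, max_allowed_depth: int = 30) -> list[str]:
    """Return human-readable warnings for the two structural checks."""
    checker = StructureChecker()
    checker.feed(html)
    checker.close()
    warnings = []
    if checker.max_depth > max_allowed_depth:
        warnings.append(f"DOM depth {checker.max_depth} exceeds {max_allowed_depth}")
    if checker.h1_count != 1:
        warnings.append(f"expected exactly one <h1>, found {checker.h1_count}")
    return warnings


if __name__ == "__main__":
    print(check_structure("<html><body><h1>A</h1><h1>B</h1></body></html>"))
```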

For DevOps

  • Especially for Linux users, the command-line part of SiteOne Crawler is very easy to use, without having to install any dependencies. The runtime binary for x64/arm64 and the crawler source code are included. Just git clone the repository, or unpack the tar.gz archive wherever you need it. By default, the crawler saves files in its ‘tmp’ folder, but any paths for caching or reports/exports can be set with a CLI switch. In the coming weeks we will also prepare public Docker images so that the crawler can be used in CI/CD environments with Docker or Kubernetes.
  • Very useful is the possibility to have the whole website crawled during some pre-release phase in CI/CD. Using CLI switches you can have the resulting HTML report sent to one or more email addresses via your SMTP server.
  • Crawler allows you to configure the use of your HTTP proxy, set up HTTP authentication or crawl the website on a special port, e.g. http://localhost:3000/
  • By setting the number of parallel workers or max requests per second, you can test your DoS protections, or perform a stress test to see how much traffic the target server(s) can handle.
  • You can use CLI switches to turn off the downloading of JS, CSS, images, fonts or documents. This lets you use the crawler to quickly warm up the cache after a new release, which usually includes flushing the cache of the previous version.
  • In addition to the HTML and TXT report (output as in the console), the crawler also generates output to a JSON file, which contains all the findings and data in a structured, machine-readable form. So you can integrate the output from the crawler further, according to your needs.
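
The combination of a parallel-worker limit and a requests-per-second limit mentioned above can be sketched in a few lines of Python with asyncio. This is only a sketch of the general technique, assuming a caller-supplied `fetch` coroutine; the function name `crawl_with_limits` and its parameters are hypothetical and do not correspond to the crawler’s CLI switches or internals.

```python
import asyncio
import time


async def crawl_with_limits(urls, fetch, workers=3, max_reqs_per_sec=10.0):
    """Call `fetch(url)` for every URL with at most `workers` calls in flight
    and with request starts spaced about 1/max_reqs_per_sec seconds apart."""
    semaphore = asyncio.Semaphore(workers)   # caps concurrency
    pace_lock = asyncio.Lock()               # serializes the pacing decision
    min_interval = 1.0 / max_reqs_per_sec
    next_start = time.monotonic()

    async def fetch_one(url):
        nonlocal next_start
        async with semaphore:
            async with pace_lock:
                # Wait until the next allowed start time, then reserve the slot after it.
                delay = next_start - time.monotonic()
                if delay > 0:
                    await asyncio.sleep(delay)
                next_start = max(next_start, time.monotonic()) + min_interval
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_one(url) for url in urls))


async def fake_fetch(url):
    """Stand-in for a real HTTP request, used only for this demonstration."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


if __name__ == "__main__":
    pages = [f"http://localhost:3000/page-{i}" for i in range(5)]
    print(asyncio.run(crawl_with_limits(pages, fake_fetch, workers=2, max_reqs_per_sec=20)))
```

Keeping the two limits separate matters in practice: the worker limit bounds how many slow responses can pile up at once, while the rate limit bounds how hard a fast server is hit.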

For website owners and consultants

  • General quality audit of the website: website owners should know where their website has room for improvement. Some improvements are not trivial and can be quite costly to implement. Others, however, take only tens of minutes to implement and can have a large impact on quality.
  • Audit of on-page SEO factors: checks all titles, descriptions and headings on all pages, pointing out a lack of uniqueness, missing <h1> headings or an incorrect semantic structure. Most of the findings can usually be corrected by the website owner themselves through the CMS.
  • Link Functionality Audit — goes through every single link in the content on all pages and alerts you to broken links or unnecessary redirects (typically due to missing slashes at the end of the URL).
  • Audit of various UX details, such as whether all phone numbers found in the HTML are wrapped in an active link with href="tel:" so that a visitor can click on them and start the call without having to retype or copy the number.
  • Overview of all images on the website — the HTML output report contains a viewable gallery of absolutely all images found on the website. You may notice, for example, low-quality or unwanted images.
  • Overview of page generation speed: the website owner should strive to have all pages generated ideally in tens, at most hundreds, of milliseconds, because slow sites discourage visitors and tend to have lower conversions. In practice, often only the homepage is measured and optimized by the developers, while the other pages are neglected in this respect. If the website is slow and optimization would be expensive, it is often possible to move it to more powerful hosting at a slightly higher price. SiteOne Crawler stores all reports on your hard drive, so you can use them to measure and compare the website before and after optimizations or a move to faster hosting.
  • You can tell the crawler which other domains it may also fully crawl, typically subdomains or other top-level domains with different language versions, such as *.mysite.tld or *.mysite.*.
  • Crawler offers the possibility to have the entire website, including all images or documents, exported to offline form. The site is then fully, or almost fully functional in terms of browsing and crawling even from a local disk, without the need for the Internet. Great functionality for easy archiving of web content at any given time. It can also help in a situation where some institution requires you to keep an archive of your website on different days, for legal purposes.
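
The wildcard patterns mentioned above, such as *.mysite.tld or *.mysite.*, can be understood as shell-style patterns matched against the hostname of each discovered URL. The Python sketch below shows that idea using the standard fnmatch module. It is an assumption about how such matching could work, not a description of the crawler’s real rules, and the function name `is_allowed` is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_allowed(url: str, initial_host: str, allowed_patterns: list[str]) -> bool:
    """Decide whether a discovered URL may be fully crawled.

    The host the crawl started on is always allowed; any other host must
    match one of the shell-style wildcard patterns.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host == initial_host.lower():
        return True
    return any(fnmatchcase(host, pattern.lower()) for pattern in allowed_patterns)


if __name__ == "__main__":
    patterns = ["*.mysite.tld", "*.mysite.*"]
    for candidate in (
        "https://blog.mysite.tld/post",
        "https://www.mysite.de/",
        "https://cdn.other.com/lib.js",
    ):
        print(candidate, is_allowed(candidate, "www.mysite.tld", patterns))
```

Two caveats of this simplified approach: a pattern like *.mysite.tld does not match the bare domain mysite.tld, and because * also matches dots, *.mysite.* would accept a host such as www.mysite.example.com, so broad patterns should be used with care.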

Feedback is welcome

We would be very happy if you try our tool and give us your feedback. Any ideas for improvement are also very welcome. The tool is certainly not perfect today, but our goal is to make it perfect in the coming months.

And, of course, we would be happy if you shared this article or the website crawler.siteone.io with colleagues or friends whom the tool could help. On the homepage you will find sharing buttons.

Thank you for your attention. We believe that our tool will help you improve the quality of your website.

Videos

Desktop application

Command-line tool

Output HTML report

Written by Ján Regeš

Head Of Development & Infrastructure at SiteOne, s.r.o.
