Efficient Web Crawling with Sitemap Indexes

Introduction

Web crawling is essential for extracting data from websites, but it can quickly become inefficient if not properly structured. A common approach is to navigate through category pages, paginate through product listings, and retrieve data. However, this method can be resource-intensive and time-consuming. A more efficient alternative is leveraging sitemap indexes and sitemaps to directly access relevant content, such as product pages and their last update timestamps.

The Problem with Traditional Crawling

Traditional web crawlers typically follow this sequence:

  1. Extract main categories from the homepage.
  2. Loop through each category to get subcategories.
  3. Paginate through product listings in each subcategory.
  4. Extract product details.

While this method ensures comprehensive data collection, it has several drawbacks:

  • High resource consumption: Crawling unnecessary pages increases server load and bandwidth usage.
  • Slower processing: Fetching and parsing each page takes time, especially for large e-commerce sites.
  • Increased risk of blocking: Websites may impose rate limits or block IPs that generate excessive requests.

Leveraging Sitemap Indexes for Efficient Crawling

Many websites provide an XML sitemap index, which references multiple sitemap files. These sitemaps often contain direct links to important pages, such as product listings, blog posts, or other structured content. Using them for crawling offers significant advantages:

1. Faster Access to Product Pages

Instead of crawling category pages and paginating through results, the crawler can directly extract product URLs from the sitemap. This eliminates unnecessary requests and speeds up data retrieval.
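
For example, here is a minimal sketch of this direct approach using requests and BeautifulSoup, the same libraries as the full example later in this article. The sitemap URL and the .html filter are placeholder assumptions rather than details of any real site:

import requests
from bs4 import BeautifulSoup

def get_product_urls(sitemap_url: str) -> list[str]:
    """Return the product page URLs listed in a single product sitemap."""
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "xml")  # the "xml" parser requires lxml
    # Each <url> entry describes one page via its <loc> child.
    locations = [url.loc.text.strip() for url in soup.find_all("url") if url.loc]
    # Assumption: product pages end in ".html", as in the full example below.
    return [loc for loc in locations if loc.endswith(".html")]

product_urls = get_product_urls("https://www.example.com/sitemap_products.xml")
print(f"Found {len(product_urls)} product URLs")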

2. Tracking Content Updates

Sitemaps often include the <lastmod> tag, indicating when a page was last updated (see the sketch after this list). This allows the crawler to:

  • Fetch only newly updated products instead of re-crawling the entire website.
  • Optimize API calls by focusing on fresh data.
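
Here is a minimal sketch of this incremental filter. It assumes <lastmod> values follow the W3C datetime format used by sitemaps (a plain date or a full timestamp) and that the last crawl time is stored elsewhere, for example in a database:

from datetime import datetime, timezone

def parse_lastmod(value: str) -> datetime:
    """Parse a <lastmod> value (date or W3C datetime) into a timezone-aware datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Date-only values carry no timezone; treat them as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed

def filter_updated(entries: list[tuple[str, str]], last_crawl: datetime) -> list[str]:
    """Keep only the URLs whose <lastmod> is newer than the previous crawl."""
    return [url for url, lastmod in entries if parse_lastmod(lastmod) > last_crawl]

# Example: only the second URL would be re-crawled.
last_crawl = datetime(2024, 6, 1, tzinfo=timezone.utc)
entries = [
    ("https://www.example.com/product-a.html", "2024-05-20"),
    ("https://www.example.com/product-b.html", "2024-06-15T08:30:00+00:00"),
]
print(filter_updated(entries, last_crawl))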

3. Reduced Risk of Detection

Because retrieving sitemaps mirrors how search engine crawlers behave and generates far fewer requests than exhaustive pagination, this approach is less likely to be flagged as bot activity or blocked by the website.

Implementation Strategy

  1. Retrieve the Sitemap Index: Identify the root sitemap (e.g., https://example.com/sitemap_index.xml), which is often referenced in the site's robots.txt.
  2. Extract Individual Sitemaps: Parse the sitemap index to get links to product sitemaps (e.g., https://example.com/sitemap_products.xml), as shown in the sketch after this list.
  3. Process Product URLs: Loop through each sitemap, extract product links, and check the <lastmod> date.
  4. Fetch Only Updated Products: Compare the update date with the last crawl timestamp and process only new or modified items.
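
Steps 1 and 2 amount to fetching the index and reading the <loc> of each referenced sitemap. The example in the next section calls a get_sitemaps_from_index helper whose implementation isn't shown; here is a minimal sketch of what such a helper might look like, written as a standalone function and assuming the index follows the standard sitemap protocol:

import requests
from bs4 import BeautifulSoup

def get_sitemaps_from_index(index_url: str) -> list[str]:
    """Return the URLs of the individual sitemaps referenced by a sitemap index."""
    response = requests.get(index_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "xml")  # the "xml" parser requires lxml
    # A sitemap index lists each child sitemap as <sitemap><loc>...</loc></sitemap>.
    return [entry.loc.text.strip() for entry in soup.find_all("sitemap") if entry.loc]

sitemaps = get_sitemaps_from_index("https://www.example.com/sitemap.xml")
print(f"Found {len(sitemaps)} sitemaps")

Steps 3 and 4 are what the full example in the next section implements.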

Example: Parsing a Sitemap in Python

Here’s a Python example that uses requests and BeautifulSoup to fetch and parse sitemaps efficiently. The scrape_products_links method scrapes product links from a website’s XML sitemaps, extracts product details, and inserts or updates them in a database. It first retrieves the list of sitemaps from the sitemap index and logs how many were found. For each sitemap, it fetches and parses the content, extracts the <url> elements, and keeps only those whose <loc> ends in .html, which are assumed to be product pages. URLs already in the database are only marked for update when their <lastmod> has changed, while previously unseen URLs are stored as new products and returned for crawling. Helper methods such as b4s_parse, create_hash, and get_clean_url are not shown; see the notes after the listing.

def scrape_products_links(self) -> list[str]:
    sitemaps = self.get_sitemaps_from_index("https://www.example.com/sitemap.xml")
    logger.info(f"Found {len(sitemaps)} sitemaps.")
    links_crawled = []
    for sitemap in sitemaps:
        logger.info(f"Scraping sitemap: {sitemap}")
        r = requests.get(sitemap)
        soup = self.b4s_parse(r.text, 'xml')
        links = soup.find_all('url')
        logger.info(f"Found {len(links)} links.")
        # Find all <url> tags where <loc> ends with .html
        links = [
            url for url in links
            if url.loc and url.loc.text.endswith(".html")
        ]
        logger.info(f"Found {len(links)} product links.")
        new_products = []
        old_products = []
        for link in links:
            product = Product()
            # Extract required fields
            product.url = link.loc.text if link.loc else None
            # Skip if URL is not found
            if not product.url:
                continue
            url_hash = self.create_hash(product.url)
            p = Product.get_by_hash(hash=url_hash)
            product.website_updated_at = link.lastmod.text if link.lastmod else None
            # skip if product already exists in db or has been updated
            if p:
                logger.info("Product already exists")
                logger.info(f"Product date on website: {product.website_updated_at}")
                logger.info(f"Product date in database: {p.website_updated_at}")
                if p.website_updated_at != product.website_updated_at:
                    p.website_updated_at = product.website_updated_at
                    old_products.append(p)
                continue
            else:
                links_crawled.append(product.url)
            product.hash = url_hash
            product.website = CategoryWebsite.BOTICINAL
            # Extract product name from PageMap
            data_object = link.find("DataObject", {"type": "thumbnail"})
            if data_object:
                name_attr = data_object.find("Attribute", {"name": "name"})
                if name_attr:
                    product.name = name_attr["value"]
            image_loc = link.find("image:loc")
            if image_loc:
                product.image = self.get_clean_url(image_loc.text)
            product.crawler_updated_at = product.website_updated_at
            new_products.append(product)
            logger.info(f"Product: {product}")
        
        if len(new_products) > 0:
            Product.bulk_insert(new_products)
            logger.info(f"{len(new_products)} products inserted.")
        if len(old_products) > 0:
            Product.bulk_insert(old_products)
            logger.info(f"{len(old_products)} products updated.")
    return links_crawled
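
The example above also relies on a few helpers that aren't shown (b4s_parse, create_hash, get_clean_url), plus the project-specific Product and CategoryWebsite models. The models depend on the author's database layer, but the helpers could plausibly be as simple as the following sketch; treat these as assumptions rather than the article's actual implementation:

import hashlib
from urllib.parse import urlsplit, urlunsplit

from bs4 import BeautifulSoup

def b4s_parse(markup: str, features: str = "xml") -> BeautifulSoup:
    """Thin wrapper around BeautifulSoup, which is what b4s_parse appears to be."""
    return BeautifulSoup(markup, features)

def create_hash(url: str) -> str:
    """Stable fingerprint of a URL, used to detect already-known products."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def get_clean_url(url: str) -> str:
    """Strip the query string and fragment so image URLs compare consistently."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))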

Conclusion

Using sitemap indexes for web crawling is a powerful optimization strategy. It allows for faster data extraction, better resource efficiency, and improved accuracy by focusing only on updated content. By implementing this approach, businesses can maintain up-to-date product information while reducing unnecessary load on target websites and their own infrastructure.

Adopting this technique can significantly enhance the performance of a web crawling solution whenever the target site publishes sitemaps, making data collection both smarter and more sustainable.