Web Crawler and Scraper - Rust Implementation

  • Scraper vs. crawler?
  • Scraping is the process of turning unstructured web data into structured data.
  • Crawling is the process of traversing a large amount of interlinked data, e.g., web pages.
  • Practically, it makes the most sense to use them together.
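
To make the scraping half concrete, here is a minimal sketch that turns a raw HTML string into a structured value. The `Page` struct and the `between` and `scrape` helpers are hypothetical names for illustration, and the naive string matching is an assumption to keep the example dependency-free; real pages would call for a proper HTML parser.

```rust
/// Structured result of scraping one page (hypothetical shape for illustration).
#[derive(Debug, PartialEq)]
struct Page {
    title: Option<String>,
    links: Vec<String>,
}

/// Returns the text between the first `open` and the next `close` after it.
fn between<'a>(text: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = text.find(open)? + open.len();
    let end = text[start..].find(close)? + start;
    Some(&text[start..end])
}

/// Scraping: turn unstructured HTML into a structured `Page`.
/// Assumption: attributes are written exactly as `href="..."`.
fn scrape(html: &str) -> Page {
    let title = between(html, "<title>", "</title>").map(|t| t.trim().to_string());

    let marker = "href=\"";
    let mut links = Vec::new();
    let mut rest = html;
    while let Some(pos) = rest.find(marker) {
        // Skip past the marker, then read up to the closing quote.
        rest = &rest[pos + marker.len()..];
        match rest.find('"') {
            Some(end) => {
                links.push(rest[..end].to_string());
                rest = &rest[end + 1..];
            }
            None => break, // unterminated attribute: stop scanning
        }
    }
    Page { title, links }
}
```

A crawler would feed `links` back into its queue of URLs to visit, which is why the two are usually used together.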

What do you need?

  • Starting URLs
  • Spider (specific implementations for different websites, e.g., GitHub, or Wikipedia)
    1. A scraper that fetches URLs, parses and structures the data, and extracts the next links to continue the crawl.
    2. A processing function or loop that runs in a potentially separate thread.
  • Control loop
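
The pieces listed above can be sketched as a trait plus a loop. This is a single-threaded, standard-library-only sketch, not a definitive design: the method names (`start_urls`, `scrape`, `process`), the `max_pages` limit, and the `InMemorySpider` that stands in for real HTTP requests are all assumptions made for illustration.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// A site-specific spider. `Item` is the structured record it produces.
trait Spider {
    type Item;

    /// URLs the crawl starts from.
    fn start_urls(&self) -> Vec<String>;
    /// Fetch and parse one URL: returns scraped items plus links to follow.
    fn scrape(&self, url: &str) -> Result<(Vec<Self::Item>, Vec<String>), String>;
    /// Handle one scraped item (store it, print it, ...).
    fn process(&mut self, item: Self::Item);
}

/// Control loop: breadth-first over the queue, skipping URLs already seen.
/// Returns the number of URLs scraped successfully.
fn crawl<S: Spider>(spider: &mut S, max_pages: usize) -> usize {
    let mut queue: VecDeque<String> = spider.start_urls().into();
    let mut visited_urls: HashSet<String> = HashSet::new();
    let mut scraped = 0;

    while let Some(url) = queue.pop_front() {
        if scraped >= max_pages {
            break;
        }
        // `insert` returns false if the URL was already in the set.
        if !visited_urls.insert(url.clone()) {
            continue;
        }
        match spider.scrape(&url) {
            Ok((items, links)) => {
                scraped += 1;
                for item in items {
                    spider.process(item);
                }
                queue.extend(links);
            }
            Err(err) => eprintln!("skipping {}: {}", url, err),
        }
    }
    scraped
}

/// Hypothetical spider over an in-memory "site" (URL -> outgoing links).
struct InMemorySpider {
    pages: HashMap<String, Vec<String>>,
    seen: Vec<String>,
}

impl Spider for InMemorySpider {
    type Item = String;

    fn start_urls(&self) -> Vec<String> {
        vec!["/".to_string()]
    }

    fn scrape(&self, url: &str) -> Result<(Vec<String>, Vec<String>), String> {
        let links = self.pages.get(url).ok_or_else(|| "not found".to_string())?;
        Ok((vec![url.to_string()], links.clone()))
    }

    fn process(&mut self, item: String) {
        self.seen.push(item);
    }
}
```

Calling `crawl(&mut spider, 10)` on an `InMemorySpider` visits each reachable page once, even when pages link back to each other. Moving `process` onto its own thread would typically mean sending items over a channel instead of calling it inline.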

Rust features

  • Associated types let you simplify the definition and usage of traits.
    • They are placeholder types declared in a trait; each implementation chooses the concrete type.
  • Atomic types: types that can be safely shared and updated across threads without a lock.
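
Both features can be seen together in a small sketch. The `Parser` trait, the `WordCount` type, and the `parse_all` function are hypothetical names chosen for illustration; only `Arc`, `AtomicUsize`, `Ordering`, and `thread::spawn` come from the standard library.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// With an associated type, each implementation fixes `Output` once,
/// so callers can write `P: Parser` and refer to `P::Output`.
trait Parser {
    type Output;
    fn parse(&self, raw: &str) -> Self::Output;
}

/// Hypothetical parser that counts the words in a page body.
struct WordCount;

impl Parser for WordCount {
    type Output = usize;

    fn parse(&self, raw: &str) -> usize {
        raw.split_whitespace().count()
    }
}

/// Parses each page on its own thread and counts finished pages in a
/// shared atomic counter; the counter itself needs no `Mutex`.
fn parse_all<P>(parser: P, pages: Vec<String>) -> (Vec<P::Output>, usize)
where
    P: Parser + Send + Sync + 'static,
    P::Output: Send + 'static,
{
    let parser = Arc::new(parser);
    let done = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = pages
        .into_iter()
        .map(|page| {
            let parser = Arc::clone(&parser);
            let done = Arc::clone(&done);
            thread::spawn(move || {
                let out = parser.parse(&page);
                done.fetch_add(1, Ordering::SeqCst);
                out
            })
        })
        .collect();

    // Joining in spawn order keeps results aligned with the input pages.
    let results: Vec<P::Output> = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .collect();
    (results, done.load(Ordering::SeqCst))
}
```

One thread per page is only reasonable for a handful of inputs; a real crawler would more likely bound concurrency with a worker pool or an async runtime.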

Implementing the crawler

  • visited_urls