Web Crawler and Scraper - Rust Implementation
- Scraper vs. crawler: what is the difference?
- Scraping is the process of turning unstructured web data into structured data.
- Crawling is the process of traversing a large set of interlinked data, e.g., web pages.
- Practically, it makes the most sense to use them together.
What do you need?
- Starting URLs
- Spider (a specific implementation for each website, e.g., GitHub or Wikipedia)
- Scraper that fetches URLs, parses and structures the data, and extracts the next links to continue the crawl
- A processing function or loop that may run in a separate thread
- Control loop
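
The components listed above can be sketched roughly as follows. This is a minimal, single-threaded illustration rather than a definitive design: the names `Spider`, `crawl`, and `InMemorySpider` are hypothetical, and the "website" is an in-memory map instead of real HTTP requests, so the sketch only needs the standard library. In a real crawler the processing step could be moved to a separate thread.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical trait: one implementation per website.
trait Spider {
    /// The structured data this spider produces.
    type Item;

    /// Starting URLs for the crawl.
    fn start_urls(&self) -> Vec<String>;

    /// Scrape one URL: return the structured items and the next links.
    fn scrape(&self, url: &str) -> (Vec<Self::Item>, Vec<String>);

    /// Process one scraped item (store it, print it, ...).
    fn process(&mut self, item: Self::Item);
}

/// Control loop: breadth-first traversal that skips already visited URLs.
/// Returns the number of distinct URLs visited.
fn crawl<S: Spider>(spider: &mut S) -> usize {
    let mut queue: VecDeque<String> = spider.start_urls().into();
    let mut visited: HashSet<String> = HashSet::new();

    while let Some(url) = queue.pop_front() {
        // `insert` returns false if the URL was already in the set.
        if !visited.insert(url.clone()) {
            continue;
        }
        let (items, links) = spider.scrape(&url);
        for item in items {
            spider.process(item);
        }
        for link in links {
            if !visited.contains(&link) {
                queue.push_back(link);
            }
        }
    }
    visited.len()
}

/// Stand-in for a real site: maps each URL to its outgoing links.
struct InMemorySpider {
    pages: HashMap<String, Vec<String>>,
    collected: Vec<String>,
}

impl InMemorySpider {
    fn new() -> Self {
        let mut pages = HashMap::new();
        pages.insert("/a".to_string(), vec!["/b".to_string(), "/c".to_string()]);
        pages.insert("/b".to_string(), vec!["/a".to_string()]);
        pages.insert("/c".to_string(), Vec::new());
        InMemorySpider { pages, collected: Vec::new() }
    }
}

impl Spider for InMemorySpider {
    type Item = String;

    fn start_urls(&self) -> Vec<String> {
        vec!["/a".to_string()]
    }

    fn scrape(&self, url: &str) -> (Vec<String>, Vec<String>) {
        let links = self.pages.get(url).cloned().unwrap_or_default();
        (vec![format!("page {}", url)], links)
    }

    fn process(&mut self, item: String) {
        self.collected.push(item);
    }
}
```

Calling `crawl(&mut InMemorySpider::new())` walks `/a`, `/b`, and `/c` once each, even though `/b` links back to `/a`, because the visited set filters out repeats.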
Rust features
- Associated types: they simplify the definition and usage of traits.
- They are placeholder types declared in a trait, which allow each implementation to specify the concrete type.
- Atomic types: Types that can be safely shared across threads.
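
As a small illustration of both features, the sketch below uses a hypothetical `Extract` trait with an associated type and a shared `AtomicUsize` counter. The names `Extract`, `TitleLength`, `Words`, `extract_all`, and `count_pages_in_parallel` are made up for this example. Note that the atomic itself is safe to share, but it is wrapped in an `Arc` so that several threads can own a handle to it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical trait: each implementation picks its own `Item` type.
trait Extract {
    type Item;
    fn extract(&self, raw: &str) -> Self::Item;
}

/// Produces the length of the trimmed text.
struct TitleLength;

impl Extract for TitleLength {
    type Item = usize;
    fn extract(&self, raw: &str) -> usize {
        raw.trim().len()
    }
}

/// Produces the whitespace-separated words.
struct Words;

impl Extract for Words {
    type Item = Vec<String>;
    fn extract(&self, raw: &str) -> Vec<String> {
        raw.split_whitespace().map(str::to_string).collect()
    }
}

/// Callers only name `E`; the item type follows from it as `E::Item`,
/// so no second generic parameter is needed.
fn extract_all<E: Extract>(extractor: &E, pages: &[&str]) -> Vec<E::Item> {
    pages.iter().map(|page| extractor.extract(page)).collect()
}

/// Several threads increment one shared atomic counter without a mutex.
fn count_pages_in_parallel(workers: usize, pages_per_worker: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..pages_per_worker {
                    counter.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    counter.load(Ordering::SeqCst)
}
```

A crawler could use a counter like this to track how many pages are in flight across worker threads, and an associated type to let each spider define the shape of its scraped items.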
Implementing the crawler
- visited_urls