Bounded Concurrent Crawler (Rust)
Topics: graphs, BFS, concurrency, worker pool, visited dedup
Problem
Crawl every page reachable from a seed by following its links, fetching pages concurrently but
without hammering the site: cap the number of in-flight fetches, and never fetch the same page
twice.
pub fn crawl<F>(seed: &str, workers: usize, fetch: F) -> Vec<String>
where F: Fn(&str) -> Vec<String> + Sync;
fetch(page) returns the pages linked from page; it is the only way to discover the graph.
- Visit every page reachable from
seed (including seed itself).
- Each distinct page is fetched at most once, even when it is linked many times or sits on a
cycle.
- At most
workers fetches run concurrently.
- Return the set of visited pages, sorted (the crawl order is nondeterministic; the result is
not).
{a:[b], b:[c], c:[]}; seed=a → ["a","b","c"]
{a:[b,c], b:[a], c:[]}; seed=a → ["a","b","c"] (cycle visited once)
{home:[about], island:[home]}; seed=home → ["about","home"] (island unreachable)
This is a concurrent flood fill: workers pull pages from a shared frontier, fetch their links
outside the lock, and enqueue any not-yet-seen page. A BTreeSet of visited pages dedups, stops
cycles, and yields the sorted result; an in-flight counter (plus a Condvar) lets a worker tell
"more links may still arrive" apart from "the graph is exhausted, exit." thread::scope lets the
workers borrow the shared state directly.
Grading
Your solution.rs is compiled together with a trusted tests.rs (which include!s it) using
rustc --test. Only solution.rs is yours to edit.
Sign in to submit your solution.