Bounded Concurrent Crawler (Go)
Topics: graphs, BFS, concurrency, worker pool, visited dedup
Problem
Crawl every page reachable from a seed by following its links, fetching pages concurrently but
without hammering the site: cap the number of in-flight fetches, and never fetch the same page
twice.
func Crawl(seed string, workers int, fetch func(page string) []string) []string
fetch(page) returns the pages linked from page; it is the only way to discover the graph.
- Visit every page reachable from
seed (including seed itself).
- Each distinct page is fetched at most once, even when it is linked many times or sits on a
cycle.
- At most
workers fetches run concurrently.
- Return the set of visited pages, sorted (the crawl order is nondeterministic; the result is
not).
{a:[b], b:[c], c:[]}; seed=a → [a, b, c]
{a:[b,c], b:[a], c:[]}; seed=a → [a, b, c] (cycle visited once)
{home:[about], island:[home]}; seed=home → [about, home] (island unreachable)
This is a concurrent flood fill: workers pull pages from a shared frontier, fetch their links
outside the lock, and enqueue any not-yet-seen page. The visited set both dedups and stops
cycles; an in-flight counter (plus a sync.Cond) lets a worker tell "more links may still arrive"
apart from "the graph is exhausted, exit."
Run
go test -v ./challenges/bounded-crawler/go/
Sign in to submit your solution.