Examples of modifying crawl scope

Introduction


Before running a test crawl, scope the crawl so that it captures the desired content and excludes unwanted content. There are several ways to refine the crawl scope:

  1. Seed types can limit the crawl

    1. Standard: the default crawl scope
    2. Standard+: crawls the seed site, plus one layer of links out from the seed page
    3. One page: crawls only the first page of each seed
    4. One page+: crawls only the first page of each seed, plus one layer of links out from that page

  2. A trailing slash at the end of the seed URL limits the crawl to a single directory of the site

  3. A trailing slash at the end of the seed URL also limits the crawl so that it excludes sub-domains

  4. Enable the "Ignore robots.txt" rule to crawl YouTube links

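  5. Set a collection-level or seed-level scope rule to block URLs that match the regular expression ^.*lang=(?!en).*$ (for Twitter feeds). This pattern matches any URL containing "lang=" that is not immediately followed by "en"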

  6. Set a collection-level or seed-level scope rule to block URLs that contain "jcalpro" (to block calendars that trap the web crawler)

  7. Set a seed-level scope rule to expand the scope to include URLs from a related site that a standard crawl would otherwise block (to capture PDFs or text files hosted in a different domain)
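
The two block rules in items 5 and 6 can be checked against sample URLs before they are entered as scope rules. The following is a minimal sketch in Python, not the crawler's own implementation: the function name is_blocked and the sample URLs are hypothetical, and it assumes the regular expression is applied to the whole URL. Only the pattern and the "jcalpro" string come from the list above.

```python
import re

# Pattern copied from item 5: matches a URL that contains "lang="
# not immediately followed by "en".
NON_ENGLISH_LANG = re.compile(r"^.*lang=(?!en).*$")

# String copied from item 6: calendar URLs that can trap the crawler.
CALENDAR_MARKER = "jcalpro"


def is_blocked(url: str) -> bool:
    """Return True if either illustrative block rule applies to the URL.

    Hypothetical helper for testing the rules locally; it assumes the
    regular expression must match the entire URL.
    """
    if NON_ENGLISH_LANG.fullmatch(url):
        return True
    return CALENDAR_MARKER in url


if __name__ == "__main__":
    # Hypothetical sample URLs, used only to exercise the rules.
    samples = [
        "https://twitter.com/example?lang=fr",
        "https://twitter.com/example?lang=en",
        "https://twitter.com/example",
        "https://example.org/index.php?option=com_jcalpro&date=2010-01",
    ]
    for url in samples:
        print("blocked" if is_blocked(url) else "allowed", url)
```

Note that the lookahead only checks the two characters after "lang=", so a value such as "en-gb" is allowed as well as "en".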

Archive-It Help Center


The Archive-It help center has additional information on modifying the crawl scope: