Examples of modifying crawl scope
Introduction
Before running a test crawl, scope the crawl so that it captures the desired content and excludes unwanted content. The crawl scope can be refined in several ways:
- Seed types can limit the crawl:
  - Standard: the default crawl scope
  - Standard+: the seed site, plus one layer of links out from the seed page
  - One page: only the first page of each seed
  - One page+: only the first page of each seed, plus one layer of links out from that page
- Trailing slashes at the end of the URL will limit the crawl to a single directory of a site
- Trailing slashes at the end of the URL will limit the crawl so it excludes sub-domains
- Enable the "Ignore robots.txt" rule to crawl YouTube links.
- Set a collection-level or seed-level scope rule to block URLs that match the RegEx ^.*lang=(?!en).*$ (for Twitter feeds). This pattern matches any URL containing a "lang=" parameter whose value does not begin with "en".
- Set a collection-level or seed-level scope rule to block URLs that contain "jcalpro" (to block calendars that trap the web crawler).
- Set a seed-level scope rule to expand the scope to include URLs from a related site that a standard crawl would otherwise exclude (to capture PDFs or text files hosted in a different domain).
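Before saving a block rule, it can help to check it against a few sample URLs. The sketch below is only an illustration: it uses Python's re module rather than the crawler's own matcher, and it assumes the two engines treat this pattern the same way. The function name is_blocked and the sample URLs are hypothetical and do not come from Archive-It.

```python
import re

# RegEx copied from the block rule above: matches URLs that contain
# "lang=" followed by a value that does not start with "en".
LANG_RULE = re.compile(r"^.*lang=(?!en).*$")

# Substring from the "jcalpro" block rule above.
CALENDAR_TOKEN = "jcalpro"


def is_blocked(url: str) -> bool:
    """Hypothetical helper: True if either block rule would match the URL."""
    return bool(LANG_RULE.match(url)) or CALENDAR_TOKEN in url


if __name__ == "__main__":
    # Sample URLs are made up for illustration only.
    samples = [
        "https://twitter.com/example?lang=en",
        "https://twitter.com/example?lang=fr",
        "http://example.org/index.php?option=com_jcalpro&view=month",
        "https://example.org/about",
    ]
    for url in samples:
        print("blocked" if is_blocked(url) else "allowed", url)
```

Note that the pattern tests every "lang=" in the URL, so a URL that contains a second "lang=" with a non-English value would still match even if the first value is "en".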
Archive-It Help Center
The Archive-It Help Center has additional information on modifying the crawl scope:
- Pre-Crawl Scoping (video tutorial): https://support.archive-it.org/hc/en-us/articles/216489103-Archive-It-Video-Curriculum-#gettingstartedPreCrawl
- PDF-only crawl (video tutorial): https://support.archive-it.org/hc/en-us/articles/216489103-Archive-It-Video-Curriculum-#gettingstartedPDFonly