How to Read the URL Report
The URL Report is a great way to troubleshoot errors in bulk jobs, and is also useful for general logging purposes.
The report is a comma-separated values (CSV) file. It is available to download from your bulk job or crawl job status page on the Dashboard as soon as a job has begun.
The URL Report may also be retrieved programmatically via the Bulk Job Data API or the Crawl Job Data API.
The Bulk API service shares much of its underlying architecture with the Crawl API. Beyond the URL Report, the two APIs also share other operational conventions.
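As an illustration of programmatic retrieval, here is a minimal Python sketch that builds a request for a job's URL Report and saves the CSV to disk. The endpoint paths and the `token`, `name`, and `type=urls` query parameters are assumptions made for this sketch, not values confirmed on this page; check them against the Bulk Job Data API and Crawl Job Data API references before relying on them. The helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# ASSUMPTION: endpoint paths follow this layout; confirm in the Data API reference.
DATA_ENDPOINTS = {
    "crawl": "https://api.diffbot.com/v3/crawl/data",
    "bulk": "https://api.diffbot.com/v3/bulk/data",
}


def url_report_request(api, token, job_name):
    """Build the request URL for a job's URL Report.

    ASSUMPTION: the report is selected with type=urls and the job is
    identified by its name.
    """
    query = urlencode({"token": token, "name": job_name, "type": "urls"})
    return f"{DATA_ENDPOINTS[api]}?{query}"


def download_url_report(api, token, job_name, dest_path):
    """Download the CSV report to dest_path and return that path."""
    with urlopen(url_report_request(api, token, job_name)) as response:
        data = response.read()
    with open(dest_path, "wb") as out:
        out.write(data)
    return dest_path
```

Because the two services share conventions, the same helper can serve both; only the `api` key ("bulk" or "crawl") changes.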
Each row of the URL Report corresponds to a single evaluated URL and provides the following information:
Column | Description |
---|---|
URL | Web page URL (normalized). Note that due to URL normalization, URL Report values may not match submitted URLs exactly. |
Doc ID | Document ID of the crawled page. This corresponds to the parentUrlDocId field returned in crawl or bulk job JSON data. |
URL Discovered Time | Time the URL was first seen/encountered. |
Crawled Time | Time the URL was crawled (downloaded and its source spidered for links). |
Content Length | Number of characters comprising the HTML source. |
Duplicate Of | If the page source is an exact duplicate of another page, the Doc ID of that other page will be returned. |
Redirects | Number of redirects pursued before arriving at the final destination URL. |
Redirected To | Ultimate destination URL if redirected. |
Robots.txt Crawl Delay (ms) | If the page is subject to a robots.txt "crawl delay", the value in milliseconds will be returned. |
Crawl Round | If the bulk job is a repeating/recurring job, the crawl "round" in which this URL was evaluated. Note: URLs will be duplicated for each round in which they are processed. |
Crawl Try # | Crawl Only: If there is an error crawling the page (spidering for links), any retries will be enumerated. |
Hop Count | Crawl Only: This indicates the page's distance from seed(s): "1" indicates the URL was linked-to from a seed; "2" indicates the URL appeared on a page that itself was linked-to from a seed. Hops can be used to narrow crawling via Crawlbot's maxHops argument. |
Crawl Status | Returns "Success" if the page was successfully crawled (spidered for links). |
Diffbot URI | If the page was processed via a Diffbot API, and an object—product, article, image, discussion, etc.—found, the object's diffbotUri will be returned. |
Process Attempted | Indicates if the page was sent to a Diffbot API for processing. |
Process Response | Indicates whether or not the Diffbot processing was successful. |
Proxy Used | Indicates whether or not a proxy IP address was used for the URL. Read more on proxies. |
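For troubleshooting, a downloaded report can be summarized with a few lines of Python. The sketch below assumes the CSV header row uses the column names exactly as listed in the table above ("URL", "Crawl Status"); adjust the keys if your file differs. The function name and the sample rows, including the "Example Error" status value, are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarize_url_report(csv_text):
    """Return (status_counts, failed_urls) for a URL Report given as text.

    ASSUMPTION: header names match the column names in the table above.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    status_counts = Counter()
    failed_urls = []
    for row in reader:
        status = (row.get("Crawl Status") or "").strip()
        status_counts[status] += 1
        # The table states that "Success" marks a successfully crawled page;
        # any other value is treated here as needing attention.
        if status != "Success":
            failed_urls.append(row.get("URL", ""))
    return status_counts, failed_urls


# Made-up sample data, not real report output.
sample = (
    "URL,Crawl Status,Redirects\n"
    "http://example.com/a,Success,0\n"
    "http://example.com/b,Example Error,0\n"
)
counts, failed = summarize_url_report(sample)
print(counts["Success"], failed)
```

For recurring jobs, keep in mind that a URL appears once per crawl round, so the counts reflect rows rather than unique URLs.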