Manage a Crawl Job

A single endpoint handles both status and control requests for the crawl jobs that belong to a given token.

View the Status of Crawl Jobs

To list your active crawl jobs (including any active bulk jobs), send a GET request with only the token parameter to the base crawl endpoint, https://api.diffbot.com/v3/crawl. The jobs are returned in a jobs array.

Note that when this endpoint is called with only a token, it returns exactly the same output as its Bulk Job equivalent.

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN' \
     --header 'Accept: application/json'

To retrieve a single crawl job's details, provide the job's name in addition to your token in your request.

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest' \
     --header 'Accept: application/json'
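
The two status requests above differ only in whether the name parameter is present. A minimal Python sketch of the same calls follows. The helper names build_crawl_url and fetch_jobs are hypothetical (they are not part of any Diffbot client library), and fetch_jobs assumes the response carries a top-level jobs array.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def build_crawl_url(token, name=None):
    """Return the status URL for all jobs, or for one job when name is given."""
    params = {"token": token}
    if name is not None:
        params["name"] = name
    # urlencode escapes special characters in the token and job name.
    return f"{CRAWL_ENDPOINT}?{urlencode(params)}"


def fetch_jobs(token, name=None, timeout=30):
    """Send the GET request and return the list stored under the "jobs" key."""
    request = Request(
        build_crawl_url(token, name),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: an absent "jobs" key is treated as "no jobs".
    return payload.get("jobs", [])
```

Calling fetch_jobs("YOURDIFFBOTTOKEN") would list every active job, while fetch_jobs("YOURDIFFBOTTOKEN", "crawlTest") would request a single job.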

Pause a Crawl Job

To pause a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to pause, and the pause parameter set to 1.

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&pause=1' \
     --header 'Accept: application/json'

To resume a paused crawl job, pass pause=0 in the same GET request.
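
Pausing and resuming use the same request and differ only in the value of pause. As a rough illustration, the sketch below builds both URLs in Python. The function name pause_url and its paused flag are hypothetical conveniences, not part of the API.

```python
from urllib.parse import urlencode

CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def pause_url(token, name, paused):
    """Build the URL that pauses (paused=True) or resumes (paused=False) a job."""
    params = {"token": token, "name": name, "pause": 1 if paused else 0}
    return f"{CRAWL_ENDPOINT}?{urlencode(params)}"


# Resume the job that the curl example above paused.
print(pause_url("YOURDIFFBOTTOKEN", "crawlTest", paused=False))
```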

Delete a Crawl Job

To delete a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to delete, and the delete parameter set to 1. Job deletions are irreversible.

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&delete=1' \
     --header 'Accept: application/json'

Restart a Crawl Job

To restart a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to restart, and the restart parameter set to 1. This will erase all previously processed data and re-process all of the submitted URLs.

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&restart=1' \
     --header 'Accept: application/json'
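
Because delete is irreversible and restart erases previously processed data, a caller may want a guard in front of both. The following is one possible sketch in Python. The function destructive_url and its confirm flag are hypothetical and exist only in this example; the API itself has no confirmation step.

```python
from urllib.parse import urlencode

CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"
# Both actions discard data, so this sketch requires explicit confirmation.
DESTRUCTIVE_ACTIONS = {"delete", "restart"}


def destructive_url(token, name, action, confirm=False):
    """Build a delete or restart URL, refusing unless confirm=True."""
    if action not in DESTRUCTIVE_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if not confirm:
        raise RuntimeError(f"{action} discards data; pass confirm=True to proceed")
    params = {"token": token, "name": name, action: 1}
    return f"{CRAWL_ENDPOINT}?{urlencode(params)}"
```

The guard lives entirely on the client side, so it only protects callers that go through this helper.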

Response

All requests will return a JSON response. The following is a sample response.

{
  "jobs": [
    {
      "name": "crawlJob",
      "type": "crawl",
      "jobCreationTimeUTC": 1427410692,
      "jobCompletionTimeUTC": 1427410798,
      "jobStatus": {
        "status": 9,
        "message": "Job has completed and no repeat is scheduled."
      },
      "sentJobDoneNotification": 1,
      "objectsFound": 177,
      "urlsHarvested": 2152,
      "pageCrawlAttempts": 367,
      "pageCrawlSuccesses": 365,
      "pageCrawlSuccessesThisRound": 365,
      "pageProcessAttempts": 210,
      "pageProcessSuccesses": 210,
      "pageProcessSuccessesThisRound": 210,
      "maxRounds": 0,
      "repeat": 0.0,
      "crawlDelay": 0.25,
      "obeyRobots": 1,
      "maxToCrawl": 100000,
      "maxToProcess": 100000,
      "onlyProcessIfNew": 1,
      "seeds": "http://docs.diffbot.com",
      "roundsCompleted": 0,
      "roundStartTime": 0,
      "currentTime": 1443822683,
      "currentTimeUTC": 1443822683,
      "apiUrl": "https://api.diffbot.com/v3/analyze",
      "urlCrawlPattern": "",
      "urlProcessPattern": "",
      "pageProcessPattern": "",
      "urlCrawlRegEx": "",
      "urlProcessRegEx": "",
      "maxHops": -1,
      "downloadJson": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_data.json",
      "downloadUrls": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_urls.csv",
      "notifyEmail": "[email protected]",
      "notifyWebhook": "http://www.diffbot.com"
    }
  ]
}
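
To show how the counters in this response might be read, here is a small Python sketch that derives a run duration and success rates from one job entry. The function summarize_job and the keys of its result are hypothetical. The sketch also assumes that the two job timestamps are Unix epoch seconds, which the sample values suggest but the text does not state.

```python
def summarize_job(job):
    """Derive a few convenience figures from one entry of the "jobs" array."""

    def rate(successes, attempts):
        # Avoid dividing by zero for jobs that have not attempted anything yet.
        return successes / attempts if attempts else None

    created = job.get("jobCreationTimeUTC")
    completed = job.get("jobCompletionTimeUTC")
    # Assumption: both timestamps are epoch seconds; a missing or zero
    # completion time is treated as "not finished".
    duration = completed - created if created and completed else None

    return {
        "name": job["name"],
        "status": job["jobStatus"]["status"],
        "duration_seconds": duration,
        "crawl_success_rate": rate(job["pageCrawlSuccesses"], job["pageCrawlAttempts"]),
        "process_success_rate": rate(job["pageProcessSuccesses"], job["pageProcessAttempts"]),
    }


# A subset of the fields from the sample response above.
sample_job = {
    "name": "crawlJob",
    "jobCreationTimeUTC": 1427410692,
    "jobCompletionTimeUTC": 1427410798,
    "jobStatus": {"status": 9, "message": "Job has completed and no repeat is scheduled."},
    "pageCrawlAttempts": 367,
    "pageCrawlSuccesses": 365,
    "pageProcessAttempts": 210,
    "pageProcessSuccesses": 210,
}

print(summarize_job(sample_job))
```

For the sample job, the duration works out to 106 seconds, with 365 of 367 crawl attempts and all 210 processing attempts succeeding.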

Status Codes

The jobStatus object in each job reports one of the following status codes and its associated message:

Status	Message
0	Job is initializing
1	Job has reached maxRounds limit
2	Job has reached maxToCrawl limit
3	Job has reached maxToProcess limit
4	Next round to start in _ seconds
5	No URLs were added to the crawl
6	Job paused
7	Job in progress
8	All crawling temporarily paused by root administrator for maintenance
9	Job has completed and no repeat is scheduled
10	Failed to crawl any seed. Indicates a problem retrieving links from the seed URL(s)
11	Job automatically paused because crawl is inefficient. Successfully downloaded 10000+ consecutive pages without a single successfully processed page
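
Client code can carry the table above as a lookup. The Python sketch below is one way to do that. The names STATUS_MESSAGES, PROBLEM_STATUSES, and describe_status are hypothetical, and treating codes 5, 10, and 11 as "problem" states is an assumption of this example based on their messages, not a classification given by the API.

```python
STATUS_MESSAGES = {
    0: "Job is initializing",
    1: "Job has reached maxRounds limit",
    2: "Job has reached maxToCrawl limit",
    3: "Job has reached maxToProcess limit",
    4: "Next round to start in _ seconds",
    5: "No URLs were added to the crawl",
    6: "Job paused",
    7: "Job in progress",
    8: "All crawling temporarily paused by root administrator for maintenance",
    9: "Job has completed and no repeat is scheduled",
    10: "Failed to crawl any seed",
    11: "Job automatically paused because crawl is inefficient",
}

# Assumption of this sketch: these codes signal that the job needs attention.
PROBLEM_STATUSES = {5, 10, 11}


def describe_status(job_status):
    """Return (code, message, needs_attention) for a jobStatus object."""
    code = job_status["status"]
    # Prefer the message sent by the API; fall back to the table above.
    message = job_status.get("message") or STATUS_MESSAGES.get(code, "Unknown status")
    return code, message, code in PROBLEM_STATUSES
```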