Pause, delete, restart, or view the status of a crawl job.
A single endpoint handles both control and status requests for the active crawl jobs that belong to a given token.
View the Status of Crawl Jobs
Your token's active crawl jobs (along with any active bulk jobs) are returned in a jobs object when a GET request supplying only the token parameter is made to this endpoint.
Note that calling this endpoint with no query parameters other than token returns exactly the same output as its Bulk Job equivalent.
curl --request GET \
--url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN' \
--header 'Accept: application/json'
To retrieve a single crawl job's details, provide the job's name in addition to your token in your request.
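As an illustration, the two status requests above could be issued from Python roughly as follows. This is a minimal sketch using only the standard library, not an official client: the helper names `build_status_url` and `fetch_jobs` are hypothetical, and the sketch assumes that a single-job request also returns its result under the jobs key.

```python
import json
import urllib.parse
import urllib.request

CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def build_status_url(token, name=None):
    """Return the status URL for all jobs, or for one job when name is given."""
    params = {"token": token}
    if name is not None:
        params["name"] = name
    # urlencode percent-escapes values, so job names with spaces stay valid.
    return CRAWL_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_jobs(token, name=None, timeout=30):
    """Send the GET request and return the list stored under "jobs".

    Assumption: the response body is a JSON object with a "jobs" array.
    """
    request = urllib.request.Request(
        build_status_url(token, name),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("jobs", [])


if __name__ == "__main__":
    # Only builds the URLs; call fetch_jobs() with a real token to hit the API.
    print(build_status_url("YOURDIFFBOTTOKEN"))
    print(build_status_url("YOURDIFFBOTTOKEN", name="crawlTest"))
```

Keeping URL construction separate from the network call makes the request easy to inspect or log before it is sent.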
Pause a Crawl Job
To pause a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to pause, and the pause parameter set to 1.
curl --request GET \
--url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&pause=1' \
--header 'Accept: application/json'
To resume a paused crawl job, pass pause=0 in the same GET request.
Delete a Crawl Job
To delete a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to delete, and the delete parameter set to 1. Job deletions are irreversible.
curl --request GET \
--url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&delete=1' \
--header 'Accept: application/json'
Restart a Crawl Job
To restart a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to restart, and the restart parameter set to 1. This will erase all previously processed data and re-process all of the submitted URLs.
curl --request GET \
--url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&restart=1' \
--header 'Accept: application/json'
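The pause, resume, delete, and restart requests above differ only in one query parameter, so they can share a single URL builder. The following is a minimal sketch, not an official client: `build_control_url`, `CONTROL_ACTIONS`, and the `confirm` guard are hypothetical names introduced for illustration. The guard reflects the notes above that deletion is irreversible and that a restart erases previously processed data.

```python
import urllib.parse

CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"

# Each action maps to the (parameter, value) pair described in the sections above.
CONTROL_ACTIONS = {
    "pause": ("pause", 1),
    "resume": ("pause", 0),
    "delete": ("delete", 1),
    "restart": ("restart", 1),
}

# Hypothetical safety net: these actions discard data, so require confirm=True.
DESTRUCTIVE_ACTIONS = {"delete", "restart"}


def build_control_url(token, name, action, confirm=False):
    """Return the GET URL that applies `action` to the crawl job `name`."""
    if action not in CONTROL_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if action in DESTRUCTIVE_ACTIONS and not confirm:
        raise ValueError(f"{action!r} discards data; pass confirm=True to proceed")
    param, value = CONTROL_ACTIONS[action]
    query = urllib.parse.urlencode({"token": token, "name": name, param: value})
    return f"{CRAWL_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(build_control_url("YOURDIFFBOTTOKEN", "crawlTest", "pause"))
    print(build_control_url("YOURDIFFBOTTOKEN", "crawlTest", "resume"))
    print(build_control_url("YOURDIFFBOTTOKEN", "crawlTest", "restart", confirm=True))
```

The resulting URL can be sent with any HTTP client using a GET request and an `Accept: application/json` header, as in the curl examples.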
Response
All requests will return a JSON response. The following is a sample response.
{
  "jobs": [
    {
      "name": "crawlJob",
      "type": "crawl",
      "jobCreationTimeUTC": 1427410692,
      "jobCompletionTimeUTC": 1427410798,
      "jobStatus": {
        "status": 9,
        "message": "Job has completed and no repeat is scheduled."
      },
      "sentJobDoneNotification": 1,
      "objectsFound": 177,
      "urlsHarvested": 2152,
      "pageCrawlAttempts": 367,
      "pageCrawlSuccesses": 365,
      "pageCrawlSuccessesThisRound": 365,
      "pageProcessAttempts": 210,
      "pageProcessSuccesses": 210,
      "pageProcessSuccessesThisRound": 210,
      "maxRounds": 0,
      "repeat": 0.0,
      "crawlDelay": 0.25,
      "obeyRobots": 1,
      "maxToCrawl": 100000,
      "maxToProcess": 100000,
      "onlyProcessIfNew": 1,
      "seeds": "http://docs.diffbot.com",
      "roundsCompleted": 0,
      "roundStartTime": 0,
      "currentTime": 1443822683,
      "currentTimeUTC": 1443822683,
      "apiUrl": "https://api.diffbot.com/v3/analyze",
      "urlCrawlPattern": "",
      "urlProcessPattern": "",
      "pageProcessPattern": "",
      "urlCrawlRegEx": "",
      "urlProcessRegEx": "",
      "maxHops": -1,
      "downloadJson": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_data.json",
      "downloadUrls": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_urls.csv",
      "notifyEmail": "[email protected]",
      "notifyWebhook": "http://www.diffbot.com"
    }
  ]
}
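To show how a response like the sample above might be consumed, the sketch below parses an abbreviated copy of it and derives a few summary values. The function names `summarize_job` and `ratio` are hypothetical, and the code assumes the `*TimeUTC` fields are Unix timestamps in seconds, which the sample values suggest but the text does not state.

```python
import json
from datetime import datetime, timezone

# Abbreviated copy of the sample response above (only the fields used here).
SAMPLE_RESPONSE = """
{
  "jobs": [
    {
      "name": "crawlJob",
      "type": "crawl",
      "jobCreationTimeUTC": 1427410692,
      "jobCompletionTimeUTC": 1427410798,
      "jobStatus": {
        "status": 9,
        "message": "Job has completed and no repeat is scheduled."
      },
      "pageCrawlAttempts": 367,
      "pageCrawlSuccesses": 365,
      "pageProcessAttempts": 210,
      "pageProcessSuccesses": 210
    }
  ]
}
"""


def ratio(successes, attempts):
    """Return successes / attempts, or None when nothing was attempted."""
    return successes / attempts if attempts else None


def summarize_job(job):
    """Condense one entry of the "jobs" array into a few derived values."""
    # Assumption: the *TimeUTC fields are Unix timestamps in seconds.
    created_ts = job["jobCreationTimeUTC"]
    # Assumption: a missing or zero completion time means "not completed yet".
    completed_ts = job.get("jobCompletionTimeUTC") or None
    created = datetime.fromtimestamp(created_ts, tz=timezone.utc)
    return {
        "name": job["name"],
        "status": job["jobStatus"]["status"],
        "created": created.isoformat(),
        "duration_seconds": completed_ts - created_ts if completed_ts else None,
        "crawl_success_ratio": ratio(job["pageCrawlSuccesses"], job["pageCrawlAttempts"]),
        "process_success_ratio": ratio(job["pageProcessSuccesses"], job["pageProcessAttempts"]),
    }


summaries = [summarize_job(job) for job in json.loads(SAMPLE_RESPONSE)["jobs"]]
for summary in summaries:
    print(summary)
```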
Status Codes
The jobStatus object will return the following status codes and associated messages:
Status | Message |
---|---|
0 | Job is initializing |
1 | Job has reached maxRounds limit |
2 | Job has reached maxToCrawl limit |
3 | Job has reached maxToProcess limit |
4 | Next round to start in _ seconds |
5 | No URLs were added to the crawl |
6 | Job paused |
7 | Job in progress |
8 | All crawling temporarily paused by root administrator for maintenance |
9 | Job has completed and no repeat is scheduled |
10 | Failed to crawl any seed. Indicates a problem retrieving links from the seed URL(s) |
11 | Job automatically paused because crawl is inefficient. Successfully downloaded 10000+ consecutive pages without a single successfully processed page |
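A client that polls this endpoint usually needs to decide whether a job is still running. The sketch below copies the codes from the table into a lookup and groups them into coarse states. The grouping into "finished", "paused", and "active" is an assumption made for illustration and is not defined by the API, and `classify_status` is a hypothetical helper name.

```python
# Status codes and messages, copied from the table above.
STATUS_MESSAGES = {
    0: "Job is initializing",
    1: "Job has reached maxRounds limit",
    2: "Job has reached maxToCrawl limit",
    3: "Job has reached maxToProcess limit",
    4: "Next round to start in _ seconds",
    5: "No URLs were added to the crawl",
    6: "Job paused",
    7: "Job in progress",
    8: "All crawling temporarily paused by root administrator for maintenance",
    9: "Job has completed and no repeat is scheduled",
    10: "Failed to crawl any seed",
    11: "Job automatically paused because crawl is inefficient",
}

# Assumption: this grouping is an interpretation of the messages, not API behavior.
FINISHED_CODES = {1, 2, 3, 5, 9, 10}
PAUSED_CODES = {6, 8, 11}
ACTIVE_CODES = {0, 4, 7}


def classify_status(job):
    """Map a job's jobStatus.status code to 'finished', 'paused', 'active', or 'unknown'."""
    code = job["jobStatus"]["status"]
    if code in FINISHED_CODES:
        return "finished"
    if code in PAUSED_CODES:
        return "paused"
    if code in ACTIVE_CODES:
        return "active"
    return "unknown"


if __name__ == "__main__":
    for code, message in STATUS_MESSAGES.items():
        state = classify_status({"jobStatus": {"status": code}})
        print(f"{code:>2}  {state:<8}  {message}")
```

Returning "unknown" for unlisted codes keeps a polling loop from treating a status that is not in the table as either finished or running.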