Pause, delete, restart, or view the status of a crawl job.
A single endpoint allows both control and status requests for one or more active crawl jobs with any given token.
View the Status of Crawl Jobs
To list all active crawl jobs (including any active bulk jobs), send a GET request supplying just the token parameter to the base crawl endpoint, https://api.diffbot.com/v3/crawl. The jobs are returned in a jobs object.
Note that when called with no query parameters other than token, this endpoint returns exactly the same output as its Bulk Job equivalent.
curl --request GET \
  --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN' \
  --header 'Accept: application/json'
To retrieve a single crawl job's details, provide the job's name in addition to your token in your request.
curl --request GET \
  --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest' \
  --header 'Accept: application/json'
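The two status requests above can also be issued from a script. The following is a minimal Python sketch using only the standard library; the helper names `status_url` and `fetch_status` are illustrative and not part of any Diffbot client, and the token and job name are placeholders.

```python
import json
import urllib.parse
import urllib.request

CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def status_url(token, name=None):
    """Build the status URL: all jobs when name is None, else one job."""
    params = {"token": token}
    if name is not None:
        params["name"] = name
    # urlencode escapes special characters in the token or job name.
    return CRAWL_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_status(token, name=None, timeout=30):
    """Send the GET request and return the decoded JSON response."""
    request = urllib.request.Request(
        status_url(token, name), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Building and printing the URLs does not contact the API.
    print(status_url("YOURDIFFBOTTOKEN"))
    print(status_url("YOURDIFFBOTTOKEN", "crawlTest"))
```

Building the query string with `urlencode` avoids the shell-quoting concerns of the curl form, since `&` and `?` never pass through a shell.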
Pause a Crawl Job
To pause a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to pause, and the pause parameter set to 1.
curl --request GET \
  --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&pause=1' \
  --header 'Accept: application/json'
To resume a paused crawl job, pass pause=0 in the same GET request.
Delete a Crawl Job
To delete a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to delete, and the delete parameter set to 1. Job deletions are irreversible.
curl --request GET \
  --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&delete=1' \
  --header 'Accept: application/json'
Restart a Crawl Job
To restart a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to restart, and the restart parameter set to 1. This erases all previously processed data and re-processes all of the submitted URLs.
curl --request GET \
  --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&restart=1' \
  --header 'Accept: application/json'
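The pause, delete, and restart requests above differ only in the control parameter, so a script can build them with one helper. A hedged sketch in Python follows; `control_url` and its `confirm` flag are hypothetical names introduced here, and the confirmation guard is a design choice of this example rather than anything the API requires.

```python
import urllib.parse

CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"

# Control parameters and the values shown for them in this section.
ALLOWED_VALUES = {"pause": (0, 1), "delete": (1,), "restart": (1,)}
# Deletion is irreversible and a restart erases processed data.
DESTRUCTIVE = {"delete", "restart"}


def control_url(token, name, action, value=1, confirm=False):
    """Build the GET URL for a control request on one crawl job."""
    if action not in ALLOWED_VALUES:
        raise ValueError(f"unknown action: {action!r}")
    if value not in ALLOWED_VALUES[action]:
        raise ValueError(f"{action} does not accept value {value!r}")
    if action in DESTRUCTIVE and not confirm:
        # Hypothetical safety check added by this example, not by the API.
        raise ValueError(f"{action} discards data; pass confirm=True")
    params = {"token": token, "name": name, action: value}
    return CRAWL_ENDPOINT + "?" + urllib.parse.urlencode(params)


if __name__ == "__main__":
    # Printing the URLs does not contact the API.
    print(control_url("YOURDIFFBOTTOKEN", "crawlTest", "pause"))
    print(control_url("YOURDIFFBOTTOKEN", "crawlTest", "pause", value=0))
    print(control_url("YOURDIFFBOTTOKEN", "crawlTest", "restart", confirm=True))
```

Requiring `confirm=True` for delete and restart makes it harder to discard crawl data by accident when the action name comes from user input or configuration.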
Response
All requests will return a JSON response. The following is a sample response.
{
  "jobs": [
    {
      "name": "crawlJob",
      "type": "crawl",
      "jobCreationTimeUTC": 1427410692,
      "jobCompletionTimeUTC": 1427410798,
      "jobStatus": {
        "status": 9,
        "message": "Job has completed and no repeat is scheduled."
      },
      "sentJobDoneNotification": 1,
      "objectsFound": 177,
      "urlsHarvested": 2152,
      "pageCrawlAttempts": 367,
      "pageCrawlSuccesses": 365,
      "pageCrawlSuccessesThisRound": 365,
      "pageProcessAttempts": 210,
      "pageProcessSuccesses": 210,
      "pageProcessSuccessesThisRound": 210,
      "maxRounds": 0,
      "repeat": 0.0,
      "crawlDelay": 0.25,
      "obeyRobots": 1,
      "maxToCrawl": 100000,
      "maxToProcess": 100000,
      "onlyProcessIfNew": 1,
      "seeds": "http://docs.diffbot.com",
      "roundsCompleted": 0,
      "roundStartTime": 0,
      "currentTime": 1443822683,
      "currentTimeUTC": 1443822683,
      "apiUrl": "https://api.diffbot.com/v3/analyze",
      "urlCrawlPattern": "",
      "urlProcessPattern": "",
      "pageProcessPattern": "",
      "urlCrawlRegEx": "",
      "urlProcessRegEx": "",
      "maxHops": -1,
      "downloadJson": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_data.json",
      "downloadUrls": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_urls.csv",
      "notifyEmail": "[email protected]",
      "notifyWebhook": "http://www.diffbot.com"
    }
  ]
}
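Because every request returns JSON of this shape, a script can read job progress directly from the jobs array. A small Python sketch follows, working on a copy of the sample response trimmed to a few fields; `summarize_job` and `ratio` are illustrative helpers, and the success ratios are derived by this example, not fields returned by the API.

```python
import json

# A trimmed copy of the sample response above.
SAMPLE = """
{"jobs": [{
  "name": "crawlJob",
  "type": "crawl",
  "jobStatus": {"status": 9,
                "message": "Job has completed and no repeat is scheduled."},
  "objectsFound": 177,
  "pageCrawlAttempts": 367,
  "pageCrawlSuccesses": 365,
  "pageProcessAttempts": 210,
  "pageProcessSuccesses": 210
}]}
"""


def ratio(successes, attempts):
    """Return successes / attempts, or None when nothing was attempted."""
    return successes / attempts if attempts else None


def summarize_job(job):
    """Collect the status and derived success ratios for one job."""
    return {
        "name": job["name"],
        "status": job["jobStatus"]["status"],
        "message": job["jobStatus"]["message"],
        "crawl_success": ratio(job["pageCrawlSuccesses"],
                               job["pageCrawlAttempts"]),
        "process_success": ratio(job["pageProcessSuccesses"],
                                 job["pageProcessAttempts"]),
    }


if __name__ == "__main__":
    for job in json.loads(SAMPLE)["jobs"]:
        print(summarize_job(job))
```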
Status Codes
The jobStatus object returns one of the following status codes and its associated message:
Status | Message |
---|---|
0 | Job is initializing |
1 | Job has reached maxRounds limit |
2 | Job has reached maxToCrawl limit |
3 | Job has reached maxToProcess limit |
4 | Next round to start in _ seconds |
5 | No URLs were added to the crawl |
6 | Job paused |
7 | Job in progress |
8 | All crawling temporarily paused by root administrator for maintenance |
9 | Job has completed and no repeat is scheduled |
10 | Failed to crawl any seed. Indicates a problem retrieving links from the seed URL(s) |
11 | Job automatically paused because crawl is inefficient. Successfully downloaded 10000+ consecutive pages without a single successfully processed page |
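For scripts that poll a job, the table above can be carried as a lookup. A Python sketch follows; the messages are copied from the table, while the grouping of codes into "expected to continue without user action" is an assumption made for this example and is not something this page documents. `describe` is a hypothetical helper name.

```python
# Status codes and messages as listed in the table above.
STATUS_MESSAGES = {
    0: "Job is initializing",
    1: "Job has reached maxRounds limit",
    2: "Job has reached maxToCrawl limit",
    3: "Job has reached maxToProcess limit",
    4: "Next round to start in _ seconds",
    5: "No URLs were added to the crawl",
    6: "Job paused",
    7: "Job in progress",
    8: "All crawling temporarily paused by root administrator for maintenance",
    9: "Job has completed and no repeat is scheduled",
    10: "Failed to crawl any seed",
    11: "Job automatically paused because crawl is inefficient",
}

# Assumption of this example: codes after which the job is expected to
# continue on its own, so polling again is worthwhile.
WORTH_POLLING = {0, 4, 7, 8}


def describe(job_status):
    """Return (message, worth_polling) for a jobStatus object."""
    code = job_status["status"]
    # Fall back to the message in the response for codes not in the table.
    message = STATUS_MESSAGES.get(
        code, job_status.get("message", "Unknown status")
    )
    return message, code in WORTH_POLLING


if __name__ == "__main__":
    print(describe({"status": 7}))
    print(describe({"status": 9}))
```

Keeping the polling set separate from the message table makes it easy to adjust the assumption, for example to treat a manually paused job (code 6) as worth waiting on.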