October 24th, 2016

by Jerome Choo

The diffbotUri value for Custom APIs is now calculated from the entirety of custom content (and can be used to detect changes between extractions).
Various improvements to caption detection and parsing in the Article API.
Crawlbot now honors robots.txt directives addressed to the "Diffbot" user-agent, so that site owners can explicitly whitelist our crawler on partner or other sites.
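
As an illustration of the robots.txt change, here is a minimal sketch of how a site owner might whitelist the "Diffbot" user-agent while blocking other crawlers. The robots.txt content and the helper name can_crawl are hypothetical; the check uses Python's standard urllib.robotparser and is not Crawlbot's actual implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block every crawler by default,
# but allow the "Diffbot" user-agent everywhere.
ROBOTS_TXT = """\
User-agent: Diffbot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print("Diffbot allowed:", can_crawl("Diffbot", page))
    print("Other bot allowed:", can_crawl("SomeOtherBot", page))
```

With a file like this, a crawler identifying as "Diffbot" is permitted while crawlers that fall under the wildcard rule are not.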

October 4th, 2016

by Jerome Choo

September 1st, 2016

by Jerome Choo

Numerous improvements to normalizedSpecs in the Product API.
Diffbot Automatic APIs now process PDFs. PDF URLs will be converted to HTML and then analyzed for extractable content. PDFs are not currently supported while crawling.
Crawlbot fixes to reduce DNS errors when starting new crawls or crawl rounds.
Crawlbot and Bulk Processing: deletion of a nonexistent job will no longer return a "success" message.
Improved handling of UTF-8 encoded characters within Crawlbot.

June 24th, 2016

by Jerome Choo

Fixed an issue where large Crawlbot and Bulk job downloads would prematurely terminate.
Added beta support for executing custom JavaScript before processing a page via an extraction API. See Analyze API example (works with all Automatic and Custom APIs).

June 17th, 2016

by Jerome Choo

Added support for custom headers to the Crawlbot and Bulk Job interfaces.
Added the beta field inferredCategory_beta to the Product API, which provides an automatically determined category for any extracted product.
The Bulk Service will no longer normalize URLs before processing them—all pages will be sent to the specified Diffbot API as-is.
Various improvements to performance and quality of Diffbot's rendering engine.
Corrected an issue in Custom APIs where a replacement rule would erroneously trim blank spaces.

May 24th, 2016

by Jerome Choo

Crawlbot now supports custom headers while crawling; Bulk Processing jobs now support custom headers for all URLs.
Fixed an issue in Crawlbot where internal JSON objects were sometimes being returned in JSON data downloads.
Various improvements to date-parsing and normalization in the Article API.
Improvements to "Replacement" and "Ignore" filters in Custom APIs and manual rules.

May 12th, 2016

by Jerome Choo

Added a column to the Crawlbot and Bulk API URL Report indicating whether a proxy IP was used.
Added the argument useProxies to Crawlbot, which allows proxy IPs to be used for specific crawls.
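
For illustration, a crawl request that opts in to proxies might be assembled as in the sketch below. Only the useProxies argument name comes from this changelog; the endpoint URL, the other parameter names (token, name, seeds, apiUrl), and the value 1 for enabling proxies are assumptions and should be checked against the Crawlbot API documentation. The helper build_crawl_params is hypothetical, and the code only builds the request body without sending anything.

```python
from typing import Dict, List, Union
from urllib.parse import urlencode

# Assumed Crawlbot endpoint; verify against the current API documentation.
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def build_crawl_params(
    token: str,
    name: str,
    seeds: List[str],
    api_url: str,
    use_proxies: bool = False,
) -> Dict[str, Union[str, int]]:
    """Assemble form parameters for a new crawl (hypothetical helper)."""
    params: Dict[str, Union[str, int]] = {
        "token": token,
        "name": name,
        "seeds": " ".join(seeds),  # assumed: space-separated seed URLs
        "apiUrl": api_url,
    }
    if use_proxies:
        # Assumed: a value of 1 turns proxy IPs on for this crawl only.
        params["useProxies"] = 1
    return params


if __name__ == "__main__":
    body = urlencode(
        build_crawl_params(
            token="YOUR_TOKEN",
            name="proxyCrawl",
            seeds=["https://example.com"],
            api_url="https://api.diffbot.com/v3/analyze",
            use_proxies=True,
        )
    )
    print("POST", CRAWL_ENDPOINT)
    print(body)
```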

May 5th, 2016

by Jerome Choo

Added proxy usage tracking to the Account API.

April 28th, 2016

by Jerome Choo

Added available colors (if found on the page) to the normalizedSpecs object in the Product API. Updated format of normalizedSpecs to return multiple values, if available, for a single key.
Fixed an issue where image URLs with spaces in the filename would be incorrectly returned.
Improved proxy support in the extraction APIs to help diversify origin IPs.
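
Because normalizedSpecs can now return multiple values for a single key, client code that previously expected one value per key may need to accept both forms. The sketch below assumes a hypothetical response shape in which a key maps to either a single value or a list of values; the actual normalizedSpecs format may differ, and spec_values is an illustrative helper, not part of any Diffbot library.

```python
from typing import Any, Dict, List


def spec_values(normalized_specs: Dict[str, Any], key: str) -> List[Any]:
    """Return the values for key as a list, whether one or many were given."""
    value = normalized_specs.get(key)
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


if __name__ == "__main__":
    # Hypothetical normalizedSpecs object from a Product API response.
    specs = {"color": ["black", "silver"], "weight": "1.2 kg"}
    print(spec_values(specs, "color"))
    print(spec_values(specs, "weight"))
    print(spec_values(specs, "material"))
```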

April 15th, 2016

by Jerome Choo

Released the tagConfidence argument in the Article API, allowing for the return of tags with lower relevance scores if desired.
Improved Crawlbot handling of DNS and other connection issues; increased range of TLDs supported by Crawlbot.
Fixed an issue where duplicate tags were returned when sentiment analysis was performed.
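
As a rough sketch of using tagConfidence, the snippet below builds an Article API request URL with a lowered threshold. Only the tagConfidence argument name comes from this changelog; the endpoint, the token and url parameters, and the 0 to 1 range check are assumptions to verify against the Article API documentation. The helper article_url is hypothetical and makes no network call.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode

# Assumed Article API endpoint; verify against the current documentation.
ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def article_url(
    token: str, page_url: str, tag_confidence: Optional[float] = None
) -> str:
    """Build an Article API URL, optionally lowering the tag threshold."""
    params: Dict[str, Union[str, float]] = {"token": token, "url": page_url}
    if tag_confidence is not None:
        # Assumed: the threshold is a relevance score between 0 and 1.
        if not 0.0 <= tag_confidence <= 1.0:
            raise ValueError("tag_confidence must be between 0 and 1")
        params["tagConfidence"] = tag_confidence
    return f"{ARTICLE_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(article_url("YOUR_TOKEN", "https://example.com/post", 0.3))
```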