2016-10-24

by Jerome Choo
  • The diffbotUri value for Custom APIs is now calculated from the entirety of custom content (and can be used to detect changes between extractions).
  • Various improvements to caption detection and parsing in the Article API.
  • Crawlbot now honors robots.txt directives addressed to the "Diffbot" user-agent, so that partner and other sites can whitelist our crawler specifically.
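
As a rough illustration of the robots.txt change above, the sketch below uses Python's standard `urllib.robotparser` to show how a site could allow the "Diffbot" user-agent while blocking other crawlers. The robots.txt content, the `is_allowed` helper, and the example URL are hypothetical; this is not Crawlbot's actual implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow the "Diffbot" user-agent, block all others.
ROBOTS_TXT = """\
User-agent: Diffbot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("Diffbot", "https://example.com/page"))
    print(is_allowed("SomeOtherBot", "https://example.com/page"))
```

A site owner can run a check like this against their own robots.txt before relying on it to whitelist a specific crawler.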

2016-10-04

by Jerome Choo
  • Increased the size limits for content POSTed to Diffbot APIs.
  • Bulk Service jobs now require a minimum of 50 URLs for Startup plan customers.
  • Bulk Service and Crawlbot jobs now automatically retry failing URLs.

2016-09-12

by Jerome Choo
  • Numerous improvements to normalizedSpecs in the Product API.
  • Diffbot Automatic APIs now process PDFs. PDF URLs will be converted to HTML and then analyzed for extractable content. PDFs are not currently supported while crawling.
  • Crawlbot fixes to reduce DNS errors when starting new crawls or crawl rounds.
  • Crawlbot and Bulk Processing: deletion of a nonexistent job will no longer return a "success" message.
  • Improved handling of UTF-8 encoded characters within Crawlbot.

2016-06-24

by Jerome Choo
  • Fixed an issue where large Crawlbot and Bulk job downloads would prematurely terminate.
  • Added beta support for executing custom JavaScript before processing a page via an extraction API. See Analyze API example (works with all Automatic and Custom APIs).

2016-06-17

by Jerome Choo
  • Added support for custom headers to the Crawlbot and Bulk Job interfaces.
  • Added beta field inferredCategory_beta to the Product API, which provides an automatically determined category for any extracted product.
  • The Bulk Service will no longer normalize URLs before processing them—all pages will be sent to the specified Diffbot API as-is.
  • Various improvements to performance and quality of Diffbot's rendering engine.
  • Corrected an issue in Custom APIs where a replacement rule would erroneously trim blank spaces.

2016-05-24

by Jerome Choo
  • Crawlbot now supports custom headers while crawling; Bulk Processing jobs now support custom headers for all URLs.
  • Fixed an issue in Crawlbot where internal JSON objects were sometimes being returned in JSON data downloads.
  • Various improvements to date-parsing and normalization in the Article API.
  • Improvements to "Replacement" and "Ignore" filters in Custom APIs and manual rules.

2016-05-12

by Jerome Choo
  • Added a column to the Crawlbot and Bulk API URL Report indicating whether a proxy IP was used.
  • Added the argument useProxies to Crawlbot, which allows proxy IPs to be used for specific crawls.
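
A minimal sketch of how the useProxies argument might be passed when creating a crawl. Only the argument name comes from the note above; the endpoint URL, the other parameter names (`token`, `name`, `seeds`), the value `1`, and the `build_crawl_request` helper are assumptions for illustration and should be checked against the Crawlbot documentation. The code only builds the request URL and sends nothing.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the Crawlbot API documentation.
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def build_crawl_request(token: str, name: str, seeds: list[str],
                        use_proxies: bool = False) -> str:
    """Build a crawl-creation URL, optionally enabling proxy IPs."""
    # token, name, and seeds are assumed parameter names.
    params = {"token": token, "name": name, "seeds": " ".join(seeds)}
    if use_proxies:
        # useProxies is the argument named in the changelog; the value 1 is assumed.
        params["useProxies"] = 1
    return f"{CRAWL_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_crawl_request("YOUR_TOKEN", "example-crawl",
                              ["https://example.com"], use_proxies=True))
```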

2016-05-05

by Jerome Choo
  • Added proxy usage tracking to the Account API.

2016-04-28

by Jerome Choo
  • Added available colors (if found on the page) to the normalizedSpecs object in the Product API. Updated format of normalizedSpecs to return multiple values, if available, for a single key.
  • Fixed an issue where image URLs with spaces in the filename would be incorrectly returned.
  • Improved proxy support in the extraction APIs to help diversify origin IPs.

2016-04-15

by Jerome Choo
  • Released the tagConfidence argument in the Article API, allowing for the return of tags with lower relevance scores if desired.
  • Improved Crawlbot handling of DNS and other connection issues; increased range of TLDs supported by Crawlbot.
  • Fixed an issue where duplicate tags were returned when sentiment analysis was performed.