2016-03-25

by Jerome Choo
  • Updated Crawlbot seeds behavior, so that if a non-www subdomain is specified as the only seed URL, crawling will restrict itself to that subdomain.
  • Significant updates to the beta normalizedSpecs field in the Product API. See more details.
  • Added the field parentUrlDocId to Crawlbot and Bulk Processing JSON objects. This field can be used to match objects to URLs in the Crawlbot or Bulk Processing URL Report.
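
The seed behavior described in the first item above can be illustrated with a short scope check. This is only a sketch under stated assumptions, not Crawlbot's actual logic: the function name `in_crawl_scope` is hypothetical, a `www.` seed is assumed to allow every host under the same domain, and any other seed host is treated as a specific subdomain (the sketch does not try to tell a bare domain from a subdomain).

```python
from urllib.parse import urlparse


def in_crawl_scope(seed_url: str, candidate_url: str) -> bool:
    """Hypothetical scope check for a crawl started from a single seed URL."""
    seed_host = (urlparse(seed_url).hostname or "").lower()
    cand_host = (urlparse(candidate_url).hostname or "").lower()
    if not seed_host or not cand_host:
        return False

    if seed_host.startswith("www."):
        # Assumption: a www seed permits the whole domain, including subdomains.
        root = seed_host[len("www."):]
        return cand_host == root or cand_host.endswith("." + root)

    # Non-www seed: restrict the crawl to that exact host.
    return cand_host == seed_host
```

With a single seed of `http://blog.example.com/`, this check would accept `http://blog.example.com/post/1` and reject `http://shop.example.com/`.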

2016-03-10

by Jerome Choo
  • Added the originalType field to extracted objects when utilizing the Analyze API's fallback argument.
  • Fixed an issue with our Semantria integration that could lead to errant timeout responses.

2016-02-25

by Jerome Choo
  • Fixed an issue in the Article API where inline JavaScript and CSS from unsupported video players could be returned in the html field.
  • Discussion API: Improved extraction from single-post (no reply) conversations.
  • Improvements to video extraction within the Video API.

2016-01-29

by Jerome Choo
  • Added beta fields quantityPrices, priceRange and multiplePrices to the Product API.
  • Improved availability detection and extraction in the Product API.
  • Improved offerPrice detection in the Product API to reduce the chance of returning an incorrect value from unavailable products or items without a visible price.

2016-01-26

by Jerome Choo

  • Significant speed improvements to the Global Index.

2016-01-21

by Jerome Choo
  • Released an official endpoint for Custom API management. Please see the documentation for information on programmatic management of custom rules and APIs.
  • Improved video extraction in the Article API to include new providers and HTML5 <video> elements.
  • max:date queries in the Search and Global Index APIs are now inclusive of the date specified.
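
As a rough illustration of what an inclusive max:date bound means for timestamped documents, the sketch below treats the specified date as covering the whole day. The function name `on_or_before` is hypothetical, naive datetimes are assumed, and this is not a description of how the Search or Global Index APIs evaluate the filter internally.

```python
from datetime import date, datetime, timedelta


def on_or_before(doc_time: datetime, max_date: date) -> bool:
    """Return True if doc_time falls on or before max_date (inclusive of that day)."""
    # Midnight at the start of the day after max_date is the exclusive cutoff.
    cutoff = datetime.combine(max_date, datetime.min.time()) + timedelta(days=1)
    return doc_time < cutoff
```

Under this reading, a document dated late in the evening of the specified day still matches, while one dated at midnight of the following day does not.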

2016-01-14

by Jerome Choo
  • Improved specification extraction in the Product API.
  • Fixed an issue where the estimatedDate field (Article API) was sometimes computed incorrectly.

2016-01-07

by Jerome Choo
  • Fixed an issue where the <base> element could be incorrectly used to calculate relative paths.
  • Added initial functionality to categorize articles in the Article API based on article text content. If you would like to test this beta feature, contact us.
  • Improved handling of media sources without a specified protocol (e.g. src="//www.youtube.com...). Media element URLs will now match the protocol of the analyzed page.
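
The protocol-matching behavior for media sources can be sketched with standard URL resolution, which gives a protocol-relative reference the scheme of the page it appears on. This uses Python's `urllib.parse.urljoin` purely as an illustration; the helper name `resolve_media_src` and the example URLs are hypothetical, and the sketch does not claim to mirror Diffbot's implementation.

```python
from urllib.parse import urljoin


def resolve_media_src(page_url: str, src: str) -> str:
    """Resolve a media src attribute against the URL of the analyzed page."""
    # A src starting with "//" has no scheme, so it inherits the page's scheme.
    return urljoin(page_url, src)


# A page served over HTTPS yields an HTTPS media URL.
https_result = resolve_media_src(
    "https://example.com/article", "//www.youtube.com/embed/abc123"
)

# The same src on an HTTP page yields an HTTP media URL.
http_result = resolve_media_src(
    "http://example.com/article", "//www.youtube.com/embed/abc123"
)
```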

2015-12-21

by Jerome Choo
  • Crawlbot and Bulk jobs pending deletion (per your Diffbot plan) are now identified in the Crawlbot and Bulk interfaces.
  • The API Toolkit now uses Diffbot's custom rendering engine for live web page previews. This should reduce inaccuracies when creating custom rules.

2015-12-18

by Jerome Choo
  • Fixed an issue where plain text POSTed to the Article API would not undergo text analysis (tags, sentiment, language detection).
  • Improved Crawlbot behavior on Ajax-heavy sites so that pages with the exact same HTML source are no longer deduplicated.
  • Fixed an issue within the Crawlbot and Bulk interfaces where the "Last 500" URL Report was incorrectly returning the first 500.
  • Improved author detection within the Article API.