Getting Started with Extract
Extract uses computer vision and natural language processing to automatically categorize web pages and extract their contents into clean, structured JSON.
Diffbot Extract is a popular solution for replacing high-volume web scraping pipelines, as rule-based web scraping tends to become costly and frustrating to maintain at scale.
Instead of a set of rules, Diffbot Extract uses computer vision to "read" a web page, categorize it into a standard page type, and extract its contents based on a standard schema.
If your use case involves scraping potentially thousands of pages across several different sites, you could define rules for each individual page, or you could just use Diffbot Extract. You can test drive Diffbot Extract for your use case (no sign up required) at diffbot.com/testdrive.
While Diffbot Extract is most productive as a developer API, a UI is available on the Dashboard and diffbot.com for quick plug-and-play use cases.
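As a rough illustration of the developer API route, the sketch below builds a request URL for a single page. The endpoint (https://api.diffbot.com/v3/analyze) and the token and url query parameters are assumptions here and should be checked against the Extract API reference; build_extract_url and fetch_json are hypothetical helper names, not part of any Diffbot SDK.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import urlopen

# Assumed endpoint; confirm the exact path in the Extract API reference.
EXTRACT_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_extract_url(token, page_url):
    """Return a GET URL asking Extract to process page_url.

    The parameter names "token" and "url" are assumptions for this sketch.
    """
    query = urlencode({"token": token, "url": page_url})
    return f"{EXTRACT_ENDPOINT}?{query}"


def fetch_json(request_url, timeout=30):
    """Fetch a URL and decode the JSON body (performs a network call)."""
    with urlopen(request_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder token and page; no request is sent here.
    print(build_extract_url("YOUR_TOKEN", "https://example.com/some-page"))
```

Note that no per-site rules appear anywhere in the request: the only inputs are a token and the page URL, which is the point of the approach described above.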
No Rules? How Does That Work?
Instead of site-specific rules, Diffbot Extract relies on a standard ontology that describes most page types on the web. It can classify any page on the web into one of these standard page types, and then "read" the page using pre-trained ML models to look for standard fields like offerPrice for product pages and author for article pages.
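To make the idea of standard fields concrete, here is a minimal sketch that reads fields by page type from an Extract-style JSON response. The sample response is made up for illustration, and its shape (an objects list whose items carry a type) is an assumption; only offerPrice and author are taken from the text above, while title and the helper name pick_standard_fields are hypothetical.

```python
import json

# Hypothetical, trimmed response for illustration only.
# A real response will contain many more fields.
SAMPLE_RESPONSE = json.loads("""
{
  "objects": [
    {"type": "product", "title": "Example Kettle", "offerPrice": "$24.99"},
    {"type": "article", "title": "Example Post", "author": "Jane Doe"}
  ]
}
""")

# Fields to read for each page type. "offerPrice" and "author" come from
# the text above; "title" is an assumed field for this sketch.
FIELDS_BY_TYPE = {
    "product": ["title", "offerPrice"],
    "article": ["title", "author"],
}


def pick_standard_fields(response):
    """Return one flat dict per extracted object, keyed by standard field."""
    rows = []
    for obj in response.get("objects", []):
        page_type = obj.get("type")
        row = {"type": page_type}
        # Unknown page types yield a row with only the type.
        for name in FIELDS_BY_TYPE.get(page_type, []):
            row[name] = obj.get(name)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in pick_standard_fields(SAMPLE_RESPONSE):
        print(row)
```

Because the field names depend on the page type rather than on the site, the same lookup table would apply to a product page from any store, which is what replaces site-specific rules.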
Some Extract APIs, like List API, have only a few standard fields but are designed to be as adaptable as possible to any kind of list on any website.
Others, like Product API, feature more opinionated ontologies that make it easy to integrate with an existing product database.
A full list of Extract APIs is available here.
Next Steps
While a Dashboard interface exists for Extract, it is still primarily a technical product. If you're familiar with APIs, head on over to Introduction to Extract API to start using the API.
If you are less technical, you might find the pre-crawled and extracted data in the Diffbot Knowledge Graph more accessible.
If none of the above methods apply to you, consider rule-based web scraping solutions. These are often a bit simpler to understand and implement. Here are a few options (no affiliation):
- Scrapy — popular open-source web scraping library in Python
- BeautifulSoup — another open-source web scraping library in Python
- Octoparse — a UI-based web scraping tool that's easy to use for non-technical users