A set of rules and parameters defining what a Custom API actually extracts.
Every instance of a Custom API is defined by a JSON ruleset object, which includes a rules array of rule objects, the api being customized, and a urlPattern matching the URLs to be extracted with this API.
A simple ruleset object looks like this.
{
  "rules": [
    {
      "name": "Description",
      "selector": ".entry-content p"
    }
  ],
  "api": "/api/list",
  "urlPattern": "(http(s)?://)?(.*\\.)?blog.diffbot.com.*",
  "testUrl": "https://blog.diffbot.com/knowledge-graph-glossary/"
}
In this ruleset, the List API is extended to also extract a Description field for URLs matching the urlPattern.
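Before saving a ruleset, it can help to check locally which URLs a urlPattern would select. The sketch below is only an approximation: it uses Python's re module rather than Diffbot's Java regex engine, and it assumes the pattern has to match the whole URL, which this page does not state. The helper name url_matches and the second sample URL are hypothetical.

```python
import re

# urlPattern from the example ruleset, after JSON unescaping ("\\." becomes "\.").
URL_PATTERN = r"(http(s)?://)?(.*\.)?blog.diffbot.com.*"

def url_matches(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the pattern (assumed semantics)."""
    return re.fullmatch(pattern, url) is not None

# The testUrl from the example ruleset is selected by the pattern.
print(url_matches(URL_PATTERN, "https://blog.diffbot.com/knowledge-graph-glossary/"))  # True
# A URL on a different subdomain is not.
print(url_matches(URL_PATTERN, "https://www.diffbot.com/products/"))  # False
```

Because the scheme group is optional, a URL written without http:// or https:// is matched as well.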
A complete Custom API ruleset contains (at minimum) all of the following fields.
Field | Description |
---|---|
urlPattern | Regular expression used to match URLs to the appropriate rule. |
api | Diffbot API against which the ruleset should be applied. The api value should include the /api/ string, e.g. /api/article. |
rules | An array of rules applying to individual fields of the Diffbot API. The rules array can be empty (rules=[]). More on rules. |
↳name | Field to correct (e.g., title ) or add (e.g., customField ). |
↳selector | CSS selector to find the appropriate content on the page. |
↳value | Optional: a specific value to hard-code, in lieu of a selector. |
↳filters | Optional: additional options to replace content, ignore selectors, or extract HTML attribute values. See below. |
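The required fields in this table can be checked mechanically before a ruleset is submitted. The following is a minimal sketch of such a check, not part of any Diffbot tooling; the function name check_ruleset and its messages are hypothetical.

```python
REQUIRED_FIELDS = ("urlPattern", "api", "rules")

def check_ruleset(ruleset: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in ruleset]
    # The api value should include the /api/ string, e.g. /api/article.
    if "api" in ruleset and "/api/" not in str(ruleset["api"]):
        problems.append("api should include the /api/ string, e.g. /api/article")
    # rules must be an array, although it is allowed to be empty.
    if "rules" in ruleset and not isinstance(ruleset["rules"], list):
        problems.append("rules should be an array (it may be empty)")
    return problems

complete = {"urlPattern": ".*example\\.com.*", "api": "/api/list", "rules": []}
incomplete = {"api": "list", "rules": []}

print(check_ruleset(complete))    # []
print(check_ruleset(incomplete))
```

The second call reports the missing urlPattern and the api value that lacks the /api/ prefix.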
In addition, Custom API rulesets may also include these optional parameters.
Field | Description |
---|---|
testUrl | Optional: A sample URL used to preview your rule within the Custom API Toolkit in the Dashboard. |
prefilters | Optional: An array of selectors that should be completely dropped from the DOM. These selectors will be fully ignored by all Diffbot processing. |
renderOptions | Optional: Querystring arguments to be passed to the Diffbot rendering engine, e.g. wait=5000. More on renderOptions. |
xForwardHeaders | Optional: An object containing any custom headers to be passed along in all requests to URLs matching the urlPattern. Header values can either be a single string, or an array of strings (from which one will be selected at request-time). Custom headers can include: |
↳User-Agent | Optional: User agent to use in place of Diffbot default. |
↳Referrer | Optional: Custom referrer to use in place of Diffbot default. |
↳Cookie | Optional: Custom cookie content to be sent with all requests. |
↳Accept-Language | Optional: Custom accept-language header to be sent. |
↳X-Evaluate | Optional: Custom JavaScript to be executed at render-time. |
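To show how the optional parameters sit alongside the required ones, here is a sketch that builds a ruleset as a Python dictionary and serializes it to JSON. The selectors and header values are hypothetical, and passing renderOptions as a plain querystring string is an assumption based on the wait=5000 example in the table.

```python
import json

ruleset = {
    "urlPattern": "(http(s)?://)?(.*\\.)?blog.diffbot.com.*",
    "api": "/api/article",
    "rules": [],  # the rules array is allowed to be empty
    "testUrl": "https://blog.diffbot.com/knowledge-graph-glossary/",
    "prefilters": [".sidebar", "#comments"],  # hypothetical selectors to drop from the DOM
    "renderOptions": "wait=5000",             # assumed to be a querystring-style string
    "xForwardHeaders": {
        "Accept-Language": "en-US",           # a single string value
        # An array of strings: one entry is selected at request-time.
        "User-Agent": ["ExampleAgent/1.0", "ExampleAgent/2.0"],
    },
}

print(json.dumps(ruleset, indent=2))
```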
Defining a Rule
To recap: a single Custom API instance is defined by a JSON ruleset object. This ruleset object contains an array of rule objects as well as the parameters listed above.
In this section, we look at what defines a single rule object that lives within the rules field of a complete Custom API ruleset.
Here's an example of a simple rule object:
{
  "selector": ".entry-content p",
  "name": "text"
}
A Custom API with this rule will:
- Look for a DOM element corresponding to the CSS selector .entry-content p
- Extract the text content of that element
- Return it in the response of the Custom API under the field named text
Custom API rules can be used to "correct" individual fields of an Extract API.
To correct a field that isn't extracting automatically, define a custom rule using the same name as the incorrectly extracted field.
Experience with CSS selectors will be very helpful in defining Custom API rules. A reference of all supported selectors and operators is available here.
Should multiple elements match a selector, the text contents of all the matched elements will be concatenated into a single string in the output value.
A rule may also extract the value of an attribute on the selected element. To do this, we can use a rule filter.
Using Rule Filters
filters may be used in a Custom API rule to get an attribute value of an element, replace extracted content, or exclude certain sections of content.
Here's an example of a rule filter that extracts the src value of all img elements.
{
  "selector": "img",
  "name": "url",
  "filters": [
    {
      "args": [
        "src"
      ],
      "type": "attribute"
    }
  ]
}
A filter object is constructed with an args and a type field.
- type specifies the filter type to be used (attribute, exclude, or replace)
- args is an array of arguments to be provided to the filter
A rule may contain multiple filters, which is why filters is represented in a rule as a JSON array.
More details on the use of each available filter are given below.
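When rulesets are generated programmatically, a small helper can keep filter objects consistent with the type and args structure described here. This is a sketch only; make_filter is a hypothetical name, not a Diffbot function.

```python
import json

FILTER_TYPES = {"attribute", "exclude", "replace"}

def make_filter(filter_type: str, *args: str) -> dict:
    """Build a filter object, rejecting types other than the three documented ones."""
    if filter_type not in FILTER_TYPES:
        raise ValueError(f"unknown filter type: {filter_type!r}")
    return {"type": filter_type, "args": list(args)}

# Reproduces the img/src rule: filters is a list, so a rule can carry several filters.
rule = {
    "selector": "img",
    "name": "url",
    "filters": [make_filter("attribute", "src")],
}

print(json.dumps(rule, indent=2))
```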
Filter Type: attribute
Retrieves the value of the element attribute specified in args.
For example, to extract the link http://blog.diffbot.com from the anchor tag <a href="http://blog.diffbot.com" class="outbound">, we may use the following rule:
{
  "selector": "a.outbound",
  "name": "link",
  "filters": [
    {
      "args": [
        "href"
      ],
      "type": "attribute"
    }
  ]
}
Filter Type: exclude
Ignores the selectors supplied in args (and all of their descendants) if they are found within the CSS selector of the parent rule.
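This page gives no sample for the exclude filter, so the following sketch shows what such a rule could look like, built in Python and printed as JSON. Both selectors are hypothetical: the rule extracts the text of .entry-content while skipping a nested .related-posts block.

```python
import json

rule = {
    "selector": ".entry-content",  # hypothetical parent selector
    "name": "text",
    "filters": [
        {
            "type": "exclude",
            # Hypothetical selector to ignore, together with all of its descendants.
            "args": [".related-posts"],
        }
    ],
}

print(json.dumps(rule, indent=2))
```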
Filter Type: replace
Uses regular expression syntax to extract only specific sections of text from the original extraction output. Supply your regular expression as the first element of the args array and the replacement (for example, the regex group to keep) as the second.
For example, this is how you would extract just the numerical price (12.99) from a pricing element (.offerPrice) that extracts as "$12.99" by default.
{
  "selector": ".offerPrice",
  "name": "price",
  "filters": [
    {
      "args": [
        "^\\$(.*)$",
        "$1"
      ],
      "type": "replace"
    }
  ]
}
Back references are also supported. For example, you can prepend text by using the regular expression (^.*$) with the replacement prefix: $1
Diffbot uses a Java implementation for its regular expression parsing. Regular-Expressions.info offers an excellent overview of language-specific distinctions.
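The two replace examples can be tried locally before they are put into a ruleset. The sketch below uses Python's re module as a stand-in for the Java engine, so it is an approximation: Python writes group references as \1 where the ruleset uses $1, and apply_replace is a hypothetical helper name.

```python
import json
import re

def apply_replace(text: str, pattern: str, replacement: str) -> str:
    """Approximate a replace filter with re.sub (Python group syntax, not Java's)."""
    return re.sub(pattern, replacement, text)

# Keep only the number from "$12.99"; the ruleset equivalent of the replacement is "$1".
price = apply_replace("$12.99", r"^\$(.*)$", r"\1")
# Prepend text using a back reference; the ruleset equivalent is "prefix: $1".
labelled = apply_replace("12.99", r"(^.*$)", r"prefix: \1")

print(price)     # 12.99
print(labelled)  # prefix: 12.99

# Serializing the filter shows that the backslash must be doubled inside JSON.
print(json.dumps({"args": [r"^\$(.*)$", "$1"], "type": "replace"}))
```

The last line prints the pattern as "^\\$(.*)$", which is the form a JSON ruleset needs; a single backslash before $ is not a valid JSON escape.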
Extracting Multiple Elements into a List
If a CSS selector matches multiple elements on a page, the text values of all the matched elements will be concatenated into a single output value for the field.
To structure the output into an array instead, we can nest rules within rules. We call this a collection.
This is an example of a collection and the HTML structure it will extract.
<div class="img-thumbnail">
  <img src="img-1.png" />
  <span class="img-caption">Image #1's caption.</span>
</div>
<div class="img-thumbnail">
  <img src="img-2.png" />
  <span class="img-caption">Image #2's caption.</span>
</div>
{
  "selector": ".img-thumbnail",
  "name": "images",
  "rules": [
    {
      "selector": "img",
      "name": "url",
      "filters": [
        {
          "args": [
            "src"
          ],
          "type": "attribute"
        }
      ]
    }
  ]
}
We start by defining the largest parent element enclosing the repeating elements (.img-thumbnail). We then define a nested rules array that extracts the src attribute of every img element inside the repeating parent element.
Notice that each .img-thumbnail element also encloses a caption. We can extract that caption alongside the src of each image by adding an additional rule at the same nested level as the src extraction rule.
{
  "selector": ".img-thumbnail",
  "name": "images",
  "rules": [
    {
      "selector": "img",
      "name": "url",
      "filters": [
        {
          "args": [
            "src"
          ],
          "type": "attribute"
        }
      ]
    },
    {
      "selector": "span.img-caption",
      "name": "caption"
    }
  ]
}
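This page does not show the response produced by a collection. The sketch below illustrates the intended shape under an assumption: each matched parent element yields one object keyed by the names of the nested rules. It is a local imitation written with Python's html.parser, not Diffbot's extractor, and the class name ThumbnailCollector is hypothetical.

```python
from html.parser import HTMLParser

HTML = """
<div class="img-thumbnail">
  <img src="img-1.png" />
  <span class="img-caption">Image #1's caption.</span>
</div>
<div class="img-thumbnail">
  <img src="img-2.png" />
  <span class="img-caption">Image #2's caption.</span>
</div>
"""

class ThumbnailCollector(HTMLParser):
    """Collect one {url, caption} object per .img-thumbnail element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.current = None      # object being built for the open thumbnail
        self.in_caption = False  # True while inside span.img-caption

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "div" and "img-thumbnail" in classes:
            self.current = {}
        elif self.current is not None and tag == "img":
            self.current["url"] = attr_map.get("src")  # like the attribute filter
        elif self.current is not None and tag == "span" and "img-caption" in classes:
            self.in_caption = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_caption = False
        elif tag == "div" and self.current is not None:
            # The sample markup has no nested divs, so this closes the thumbnail.
            self.items.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.in_caption and self.current is not None:
            self.current["caption"] = self.current.get("caption", "") + data

collector = ThumbnailCollector()
collector.feed(HTML)
collector.close()
print(collector.items)
```

Under that assumption the images field holds two objects, one per thumbnail, each with its own url and caption, instead of a single concatenated string.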
Deleting fields
If you do not want a particular field to appear in the output JSON, you can accomplish this via a rule. The rule below ensures that images will not appear in the output. Notice that the delete field is set to the boolean true, i.e. without quotes.
{
  "rules": [
    {
      "name": "images",
      "delete": true
    }
  ]
}
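When the ruleset is generated from code, the unquoted true corresponds to a real boolean rather than a string. A short Python sketch of the difference:

```python
import json

# A Python boolean serializes to the unquoted JSON literal true.
delete_rule = {"name": "images", "delete": True}
print(json.dumps({"rules": [delete_rule]}))
# {"rules": [{"name": "images", "delete": true}]}

# A string serializes with quotes, which is not what the rule expects.
print(json.dumps({"name": "images", "delete": "true"}))
# {"name": "images", "delete": "true"}
```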
Forcing extraction from a particular section of the page for ListAPI
You can force list extraction from specific node(s). More precisely, you can specify multiple containers from which to force List extraction by separating their XPaths with a pipe (|). ListAPI will treat each container specified for extraction separately. To make it possible to distinguish which list item corresponds to which user-defined container, the resulting listings will contain an extra key, containerXpath. To do this, simply specify the XPaths of the container nodes in your rules for the field items like this:
{
  "rules": [
    {
      "name": "items",
      "selector": "/html/body/div[1]/div/div[1]/div/div/section[4]/div/div | /html/body/div[1]/div/div[1]/div/div/section[5]/div/div/div/div/div/div/div/div/article/div/div[2]/div[2]/table"
    }
  ],
  "api": "/api/list"
}
The XPaths of the two container nodes would be:
/html/body/div[1]/div/div[1]/div/div/section[4]/div/div
/html/body/div[1]/div/div[1]/div/div/section[5]/div/div/div/div/div/div/div/div/article/div/div[2]/div[2]/table
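Long pipe-separated selectors are easier to maintain when the container XPaths are kept as a list and joined at the last moment. A minimal sketch using the two XPaths above; the variable names are hypothetical.

```python
import json

container_xpaths = [
    "/html/body/div[1]/div/div[1]/div/div/section[4]/div/div",
    "/html/body/div[1]/div/div[1]/div/div/section[5]/div/div/div/div/div/div/div/div/article/div/div[2]/div[2]/table",
]

# Join the containers with a pipe, matching the selector format shown above.
items_rule = {"name": "items", "selector": " | ".join(container_xpaths)}
ruleset = {"rules": [items_rule], "api": "/api/list"}

print(json.dumps(ruleset, indent=2))
```

Splitting the selector on " | " recovers the original list, which can be handy when comparing against the containerXpath key of each returned listing.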