Pass referrers, user-agents, and cookies to Extract APIs.
Diffbot supports sending the following custom headers with the Extract API, Bulk Extract API, and Crawl API. Diffbot uses these headers when it requests content from third-party sites:
- User-Agent
- Referer
- Cookie
- Accept-Language
- X-Evaluate
User-Agent, Referer, Accept-Language
Create a new RequestBin. We'll use this to test that our custom headers are coming through.
There are several ways to attach custom headers to API requests.
Direct
Use X-Forward- as a prefix with any header you want forwarded. For example, to send the User-Agent foobar, we would use the header X-Forward-User-Agent: foobar.
Here's an example as a cURL request:
curl --location --request GET 'https://api.diffbot.com/v3/article?token=MY_TOKEN&url=https%3A%2F%2Fen17uofqrlcgv.x.pipedream.net%2F' \
--header 'X-Forward-User-Agent: foobar' \
--header 'X-Forward-Referer: Diffbot.com' \
--header 'X-Forward-Accept-Language: hr'
These headers apply only to this call. They are discarded afterwards, so they must be added again on each subsequent call.
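The same request can be sketched in Python using only the standard library. This is an illustrative sketch, not an official client: the helper names forward_headers and build_request are hypothetical, and MY_TOKEN and the RequestBin URL are the placeholders from the cURL example above. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Article API endpoint, as used in the cURL example.
API_ENDPOINT = "https://api.diffbot.com/v3/article"


def forward_headers(headers):
    """Prefix each header name with X-Forward- so Diffbot forwards it."""
    return {f"X-Forward-{name}": value for name, value in headers.items()}


def build_request(token, target_url, headers):
    """Build (but do not send) a GET request with forwarded headers."""
    query = urlencode({"token": token, "url": target_url})
    return Request(
        f"{API_ENDPOINT}?{query}",
        headers=forward_headers(headers),
        method="GET",
    )


request = build_request(
    "MY_TOKEN",  # placeholder token
    "https://en17uofqrlcgv.x.pipedream.net/",  # RequestBin URL from the example
    {"User-Agent": "foobar", "Referer": "Diffbot.com", "Accept-Language": "hr"},
)
print(request.full_url)

# To actually send the request with a real token:
# body = urlopen(request).read()
```

Because the headers apply to a single call, a helper like build_request would need to be given the same headers for every request.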
Rule-based
To permanently attach headers to a rule for an API, you can download raw rule data, modify it, and upload it back to your token, replacing the old rule setup.
A rule with permanently attached headers might look like this:
{
  "xForwardHeaders": {
    "User-Agent": [
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Safari/604.1.38"
    ],
    "X-Evaluate": "function() { start(); setTimeout(function() { setTimeout(function() { end(); }, 5000); }, 25000);}",
    "Cookie": "foo=bar"
  },
  "rules": [],
  "api": "all",
  "urlPattern": "(http(s)?://)?(.*\\.)?mysite.com.*",
  "testUrl": ""
}
Re-uploading this JSON will add these headers to all calls issued towards mysite.com. Notice that User-Agent is a JSON array. If you supply User-Agent as an array (only possible through this method), Diffbot will randomly pick one entry from the list when accessing a site. This helps throw off bot-detection algorithms.
Note: X-Evaluate is explained below.
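If you edit rule data programmatically, a short Python sketch like the one below can assemble a rule and serialize it to JSON before upload. It assumes only the field names shown in the example rule. The build_rule helper and the second User-Agent string are hypothetical, and the download and upload steps are not shown.

```python
import json


def build_rule(url_pattern, forward_headers, api="all"):
    """Assemble a rule dict with the same fields as the example rule."""
    return {
        "xForwardHeaders": forward_headers,
        "rules": [],
        "api": api,
        "urlPattern": url_pattern,
        "testUrl": "",
    }


rule = build_rule(
    r"(http(s)?://)?(.*\.)?mysite.com.*",
    {
        # A list of User-Agents: one is picked at random per request.
        "User-Agent": [
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/604.1.38 "
            "(KHTML, like Gecko) Version/11.0 Safari/604.1.38",
            "ExampleBrowser/1.0",  # hypothetical second entry for illustration
        ],
        "Cookie": "foo=bar",
    },
)

# json.dumps quotes the keys and escapes the backslash in the URL pattern.
rule_json = json.dumps(rule, indent=2)
print(rule_json)
```

Serializing with json.dumps rather than writing the text by hand avoids quoting and escaping mistakes in the URL pattern.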
Dashboard
The Dashboard also allows you to permanently add some headers to a rule. When creating a new custom rule, use the Custom Headers section to enter any headers you wish to add. This will save the headers in the same way as the JSON approach above.
Cookie
The Cookie header allows you to:
- simulate a login session
- remove annoying GDPR and newsletter popups
- ignore ads and content you don't want to extract
For a comprehensive guide on using the Cookie header to simulate a login session, please see How do I extract content behind logins?
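As a rough illustration, a Cookie header value is a list of name=value pairs separated by a semicolon and a space. The Python sketch below builds such a value and attaches it using the X-Forward- prefix described earlier. The cookie_header helper and the gdpr_consent cookie name are hypothetical; the cookies a given site actually expects must be taken from a real browser session.

```python
def cookie_header(cookies):
    """Join cookie name/value pairs into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Hypothetical cookies: a session value plus a consent flag that hides a popup.
cookies = {"foo": "bar", "gdpr_consent": "accepted"}

# Forwarded to the target site as its Cookie header.
headers = {"X-Forward-Cookie": cookie_header(cookies)}
print(headers)
```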
X-Evaluate
See Custom Javascript.