A simple query looks like this:
{
"url": "http://chrisnewtn.com",
"type": "html",
"selector": "ul.social li a",
"extract": "href",
}
It says to go to a friend’s website and for noodle to expect a html document.
Then to select anchor elements in a list and for each one extract the href
attribute’s value.
The type
property is used to tell noodle if you are wanting to scrape a html
page, json document etc. If no type is specified then a html page will be
assumed by default.
A similar query can be constructed to extract information from a JSON document.
JSONSelect is used as the underlying library to do this. It supports common CSS3
selector functionality. You can familiarize yourself with it here.
{
"url": "https://search.twitter.com/search.json?q=friendship",
"selector": ".results .from_user",
"type": "json"
}
An extract
property is not needed for a query on JSON documents as json
properties have no metadata and just a single value were as a html element
can have text, the inner html or an attribute like href
.
Different types (html, json, feed & xml)
html
Note: Some xml documents can be parsed by noodle under the html type!
The html type is the only type to have the extract
property. This is because
the other types are converted to JSON.
The extract
property (optional) could be the HTML element’s attribute
but it is not required.
Having "html"
or "innerHTML"
as the extract
value will return the
containing HTML within that element.
Having "text"
as the extract
value will return only the text. noodle will
strip out any new line characters found in the text.
Return data looks like this:
[
{
"results": [
"http://twitter.com/chrisnewtn",
"http://plus.google.com/u/0/111845796843095584341"
],
"created": "2012-08-01T16:22:14.705Z"
}
]
Having no specific extract rule will assume a default of extracting "text"
from the selector
.
It is also possible to request multiple properties to extract in one query if
one uses an array.
Query:
{
"url": "http://chrisnewtn.com",
"selector": "ul.social li a",
"extract": ["href", "text"]
}
Response:
[
{
"results": [
{
"href": "http://twitter.com/chrisnewtn",
"text": "Twitter"
},
{
"href": "http://plus.google.com/u/0/111845796843095584341",
"text": "Google+"
}
],
"created": "2012-08-01T16:23:41.913Z"
}
]
In the query’s selector
property use the standard CSS DOM selectors.
json and xml
The same rules apply from html to the json and xml types. Only that the
extract
property should be ommitted from queries as the JSON node value(s)
targetted by the selector
is always assumed.
In the query’s selector
property use
JSONSelect style selectors.
feeds
The same rules apply to the json and xml types. Only that the extract
property
should be ommitted from queries as the JSON node value(s) targetted by the
selector
is always assumed.
In the query’s selector
property use
JSONSelect style selectors.
The feed type is based upon
node-feedparser so it
supports Robust RSS, Atom, and RDF standards.
Familiarize yourself with its normalisation format before you use JSONSelect style
selector.
Getting the entire web document
If no selector
is specified than the entire document is returned. This is a
rule applied to all types of docments. The extract
rule will be ignored if
included.
Query:
{
"url": "https://search.twitter.com/search.json?q=friendship"
}
Response:
[
{
"results": ["<full document contents>"],
"created": "2012-10-24T15:37:29.796Z"
}
]
Mapping a query to familiar properties
Queries can also be written in noodle’s map notation. The map notation allows
for the results to be accessible by your own more helpful property names.
In the example below map is used to create a result object of a person and
their repos.
{
"url": "https://github.com/chrisnewtn",
"type": "html",
"map": {
"person": {
"selector": "span[itemprop=name]",
"extract": "text"
},
"repos": {
"selector": "li span.repo",
"extract": "text"
}
}
}
With results looking like this:
[
{
"results": {
"person": [
"Chris Newton"
],
"repos": [
"cmd.js",
"simplechat",
"sitestatus",
"jquery-async-uploader",
"cmd-async-slides",
"elsewhere",
"pablo",
"jsonpatch.js",
"jquery.promises",
"llamarama"
]
},
"created": "2013-03-25T15:38:01.918Z"
}
]
Within a query include the headers
property with an array value listing the
headers you wish to recieve back as an object structure. 'all'
may also be
used as a value to return all of the server headers.
Headers are treated case-insensitive and the returned property names will
match exactly to the string you requested with.
Query:
{
"url": "http://github.com",
"headers": ["connection", "content-TYPE"]
}
Result:
[
{
"results": [...],
"headers": {
"connection": "keep-alive",
"content-TYPE": "text/html"
}
"created":"2012-11-14T13:06:02.521Z"
}
]
noodle provides a shortcut to the server Link header with the query
linkHeader
property set to true
. Link headers are useful as some web APIs
use them to expose their pagination.
The Link header will be parsed to an object structure. If you wish to have the Link header in its usual formatting then include it in the headers
array instead.
Query:
{
"url": "https://api.github.com/users/premasagar/starred",
"type": "json",
"selector": ".language",
"headers": ["connection"],
"linkHeader": true
}
Result:
[
{
"results": [
"JavaScript",
"Ruby",
"JavaScript",
],
"headers": {
"connection": "keep-alive",
"link": {
"next": "https://api.github.com/users/premasagar/starred?page=2",
"last": "https://api.github.com/users/premasagar/starred?page=21"
}
},
"created": "2012-11-16T15:48:33.866Z"
}
]
Querying to a POST url
noodle allows for post data to be passed along to the target web server
specified in the url. This can be optionally done with the post
property
which takes an object map of the post data key/values.
{
"url": "http://example.com/login.php",
"post": {
"username": "john",
"password": "123"
},
"select": "h1.username",
"type": "html"
}
Take not however that queries with the post
property will not be cached.
Querying without caching
If cache
is set to false
in your query then noodle will not cache the
results or associated page and it will get the data fresh. This is useful for
debugging.
{
"url": "http://example.com",
"selector": "h1",
"cache": "false"
}
Query errors
noodle aims to give errors for the possible use cases were a query does
not yield any results.
Each error is specific to one result object and are contained in the error
property as a string message.
Response:
[
{
"results": [],
"error": "Document not found"
}
]
noodle also falls silently with the 'extract'
property by ommitting any
extract results from the results object.
Consider the following JSON response to a partially incorrect query.
Query:
{
"url": "http://chrisnewtn.com",
"selector": "ul.social li a",
"extract": ["href", "nonexistent"]
}
Response:
The extract “nonexistent” property is left out because it was not found
on the element.
[
{
"results": [
{
"href": "http://twitter.com/chrisnewtn"
},
{
"href": "http://plus.google.com/u/0/111845796843095584341"
}
],
"created": "2012-08-01T16:28:19.167Z"
}
]
Multiple queries
Multiple queries can be made per request to the server. You can mix between
different types of queries in the same request as well as queries in the map
notation.
Query:
[
{
"url": "http://chrisnewtn.com",
"selector": "ul.social li a",
"extract": ["text", "href"]
},
{
"url": "http://premasagar.com",
"selector": "#social_networks li a.url",
"extract": "href"
}
]
Response:
[
{
"results": [
{
"href": "http://twitter.com/chrisnewtn",
"text": "Twitter"
},
{
"href": "http://plus.google.com/u/0/111845796843095584341",
"text": "Google+"
}
],
"created": "2012-08-01T16:23:41.913Z"
},
{
"results": [
"http://dharmafly.com/blog",
"http://twitter.com/premasagar",
"https://github.com/premasagar",
],
"created": "2012-08-01T16:22:13.339Z"
}
]
Proxy Support
When calling a page multiple times some sites can and will ban your IP address, Adding support for proxy IP addresses allows the rotation of IP addresses.
Query:
{
"url": "http://chrisnewtn.com",
"selector": "ul.social li a",
"extract": ["text", "href"],
"proxy": "XXX.XXX.XXX.XXX"
}