HTML
HTML parser
Below is an example of the HTML parser source file:
{
"type": "html",
// ...
"scrape": {
"container": ".classNameExample",
"endpoint": "example.org",
"article": {
// information to pull from article.
...
}
}
}
container: Refers to the HTML class in which all the articles are located. Usually we search with the inspect element tool of the browser for the class that covers all the content we want.endpoint: The domain of the website. It is used complete the article links.article: This field contains all the information about what the html parser should scrape from the desired website. In order to receive this information, each piece of data must be identified separately.
For each value we want to take from an article we write it as follows
"value": { }
For resolving the title, contents, publish date or attachments of the article, replace <value> with:
titlefor the title of the article.contentfor the contents of the article.pubDatefor the publish date of the article.attachmentsfor the attachments of the article.linkfor the link of the article.
Any other information will be saved at the extras field of the article.
Each value is placed somewhere in the website, so we will need some information to find them:
"value": {
"class": "my_container" // string
"attributes": [ "href", "src", "value", ... ], // array
"find": [ ".class1", "li", "#id" ], // array
"multiple": true | false // boolean
}
class: this value represents the html class where the title is displayed.
attributes: a string array that contains html attributes that are going to used to scrape some specific information like links or images. Some popular attributes are:
hrefto get the link from a view.srcto get the link from an image.valueto get the actual value that is contained in a view<a>my_value</a>.
Notice, if the attributes are not found, then they will be empty.
find: in case the actual element cannot be found by class alone, we use the find option. A string array that contains multiple classes, ids or tags in the right order. It will describe the path to reach the desired value. It will search for the value starting from the inside of the class option above. Notice, classes must start with . and ids with #.
mutliple: in case there are multiple links, images or other attributes in the desired article, enable this option so it can scrape them all.
An example using one greek university UniPi:
{
"url": "https://www.unipi.gr/unipi/el/%CE%B1%CE%BD%CE%B1%CE%BA%CE%BF%CE%B9%CE%BD%CF%8E%CF%83%CE%B5%CE%B9%CF%82.html",
"name": "Univeristy of Piraeus",
"type": "html",
"retryInterval": 1800 * 1000, // 0.5 hours
"scrapeInterval": 3600 * 1000, // 1 hour
// Scraping options
"scrape": {
"container": ".catItemView",
"endPoint": "unipi.gr",
"article": {
// Get the title of the article
"title": {
"class": ".catItemTitle",
"find": [ "a" ],
"multiple": false
},
// Get the content of the article
"content": {
"class": ".catItemBody",
"find": null,
"multiple": false
},
// Get the publish date of the article
"pubDate": {
"class": ".catItemDateCreated",
"attributes": [ "value", "href" ],
"find": null,
"multiple": false
},
// Get the embedded links of the article
"attachments": {
"class": ".catItemLinks",
"attributes": [ "value", "href" ],
"find": [ ".catItemAttachmentsBlock", "li", "a" ], // carefull on the order
"multiple": true
}
}
}
}