Configuration
Each Saffron instance needs a configuration file. Its structure is described below.
Mode
Default value: main
Accepts the values main and worker. It specifies whether the Saffron instance will run as the main instance or as a worker-only instance.
Database
The database functions that Saffron will call. They must be implemented by the developer. Any error thrown inside a database function will be caught by Saffron and will abort the upload.
pushArticles
Takes as its argument an array of new articles that need to be pushed to the database.
getArticles
Takes as its argument an object containing the fields tableName and count.
tableName: The table (or collection) from which the articles will be retrieved.
count: How many articles will be retrieved from the database.
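As a minimal sketch of the two functions described above, the example below backs them with an in-memory array instead of a real database. The store array, the tableName field on each article, and the return value of getArticles are assumptions made for illustration; the text above does not specify them.

```javascript
// Hypothetical in-memory store standing in for a real database.
const store = [];

const database = {
  // Receives an array of new articles. Throwing here aborts the upload.
  pushArticles: (articles) => {
    if (!Array.isArray(articles)) {
      throw new TypeError("articles must be an array");
    }
    store.push(...articles);
  },

  // Receives an object with the fields tableName and count.
  getArticles: ({ tableName, count }) => {
    // Assumption: each stored article carries a tableName field.
    const matching = store.filter((article) => article.tableName === tableName);
    // Return at most the `count` most recently pushed articles.
    return matching.slice(Math.max(matching.length - count, 0));
  },
};
```

In a real setup both functions would typically delegate to a database driver; the in-memory array only keeps the sketch self-contained.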
Sources
path
Default value: ./sources
The folder where the source files are located (subfolders are searched as well).
includeOnly
This field exists mainly for testing, when only specific sources should be included. Sources are identified by their source name (not the source file name).
exclude
This field exists mainly for testing, when specific sources should be excluded. Sources are identified by their source name (not the source file name).
info
This option also excludes sources listed in includeOnly, if includeOnly is set.
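The interaction between includeOnly and exclude can be sketched as follows. The selectSources helper is hypothetical and not part of Saffron, and it assumes that both options are arrays of source names.

```javascript
// Hypothetical helper, not part of Saffron's API.
function selectSources(sourceNames, { includeOnly = [], exclude = [] } = {}) {
  // When includeOnly is non-empty, start from that subset; otherwise from all sources.
  const included = includeOnly.length > 0
    ? sourceNames.filter((name) => includeOnly.includes(name))
    : sourceNames;
  // exclude is applied last, so it also removes names listed in includeOnly.
  return included.filter((name) => !exclude.includes(name));
}

const names = ["source-a", "source-b", "source-c"];
console.log(selectSources(names, {
  includeOnly: ["source-a", "source-b"],
  exclude: ["source-b"],
}));
// prints [ 'source-a' ]
```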
Scheduler
jobsInterval
Default value: 3600000 (60 minutes)
The default interval between source scraping jobs (it can be overridden for each source). The scheduler also uses this option to spread the source files evenly across a specific timeframe.
caution
The value should not be lower than checksInterval.
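One way to picture how jobs could be spread evenly across jobsInterval is sketched below. The initialOffsets helper is hypothetical; the scheduler's actual algorithm is not described here, so this is only an illustration of the idea.

```javascript
// Hypothetical helper: computes a start offset (in ms) for each source so that
// the first runs are spaced evenly across one jobsInterval.
function initialOffsets(sourceCount, jobsInterval = 3600000) {
  const step = jobsInterval / sourceCount;
  return Array.from({ length: sourceCount }, (_, index) => Math.round(index * step));
}

console.log(initialOffsets(4));
// prints [ 0, 900000, 1800000, 2700000 ]
```

With the default interval and four sources, one job would start every 15 minutes instead of all four starting at once.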
heavyJobFailureInterval
Default value: 86400000 (24 hours)
Sometimes a source scraping job fails continuously (for example, when the website is down).
After 10 consecutive failed attempts, the scheduler freezes the job and reschedules it after heavyJobFailureInterval ms.
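The freeze rule above can be sketched with a simple failure counter. The nextDelay helper and the failures field are hypothetical, and resetting the counter after a freeze is an assumption rather than documented behavior.

```javascript
const MAX_FAILURES = 10; // threshold stated above

// Hypothetical helper: returns the delay (in ms) before the job's next attempt.
function nextDelay(job, succeeded, config) {
  // A success resets the streak; a failure extends it.
  job.failures = succeeded ? 0 : (job.failures || 0) + 1;
  if (job.failures >= MAX_FAILURES) {
    job.failures = 0; // assumption: the counter restarts after the freeze
    return config.heavyJobFailureInterval;
  }
  return config.jobsInterval;
}

const schedulerConfig = { jobsInterval: 3600000, heavyJobFailureInterval: 86400000 };
const job = {};
let delay;
for (let attempt = 0; attempt < 10; attempt++) {
  delay = nextDelay(job, false, schedulerConfig);
}
console.log(delay); // prints 86400000
```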
checksInterval
Default value: 120000 (2 minutes)
The scheduler periodically checks whether a source scraping job has finished or failed. This option is the interval between these checks.
randomizeInterval
Default value: -500 to 500 seconds
The scheduler adds/subtracts a random time from the interval of each job to avoid making requests at fixed intervals. The result of the function must be in milliseconds.
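randomizeInterval: () => {
  // Assumed bounds in seconds, chosen to match the documented default range.
  const low = 0;
  const high = 500;
  const random = Math.floor(Math.random() * (high - low) + low) * 1000;
  return Math.random() >= 0.5 ? random : -random;
}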
Workers
nodes
Default value: 1
The number of worker nodes that will be started with the Saffron instance. When there are multiple workers, each worker runs on a different thread. This is useful when there is a large number of source files.
userAgent
The User-Agent that will accompany the request.
jobs.timeout
Default value: 10000
The request timeout of a source scraping job (it can be overridden for each source).
articles.amount
Default value: 10
The number of articles that will be parsed for each source scraping job.
Grid
distributed
Default value: false
Whether the grid network will be started.
useHTTP
Default value: false
Use HTTP instead of HTTPS.
serverAddress
The server's address without the http:// or https:// prefix.
The address is the same as the main node's address.
serverPort
Default value: 3000
The port where the grid will listen.
authToken
The token that will authenticate the worker nodes with the main node.
key
The key used by the HTTPS server.
cert
The certificate used by the HTTPS server.
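The sketch below shows why serverAddress is given without a scheme: the scheme follows from useHTTP and the port from serverPort. The gridUrl helper and the example address are hypothetical and only illustrate how the three options could combine.

```javascript
// Hypothetical helper: builds the URL a worker node could use to reach the main node.
function gridUrl({ useHTTP = false, serverAddress, serverPort = 3000 }) {
  // useHTTP defaults to false, so HTTPS is used unless it is explicitly enabled.
  const scheme = useHTTP ? "http" : "https";
  return `${scheme}://${serverAddress}:${serverPort}`;
}

console.log(gridUrl({ serverAddress: "grid.example.com" }));
// prints https://grid.example.com:3000
```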
Misc
log
Default value: all
The Saffron log level. Can be set to all, info, errors or none.
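Putting the options above together, a configuration that spells out the documented defaults might look like the sketch below. The lowercase section names and the nesting of jobs.timeout and articles.amount under workers are assumptions inferred from the headings, and the function bodies are placeholders.

```javascript
// Section names and nesting are assumed from the headings above.
const config = {
  mode: "main",
  database: {
    pushArticles: (articles) => { /* placeholder: store the articles */ },
    getArticles: ({ tableName, count }) => [], // placeholder return value
  },
  sources: {
    path: "./sources",
  },
  scheduler: {
    jobsInterval: 3600000,             // 60 minutes
    heavyJobFailureInterval: 86400000, // 24 hours
    checksInterval: 120000,            // 2 minutes
  },
  workers: {
    nodes: 1,
    jobs: { timeout: 10000 },
    articles: { amount: 10 },
  },
  grid: {
    distributed: false,
    useHTTP: false,
    serverPort: 3000,
  },
  misc: {
    log: "all",
  },
};
```

Options without a documented default (userAgent, serverAddress, authToken, key, cert, includeOnly, exclude) are left out of the sketch.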
Environment
Saffron supports different configurations based on the environment.
Saffron first reads the root configuration:
{
  database: {
    pushArticles: (articles) => {
      // push to production database
    },
    // ...
  },
  misc: {
    log: "errors" // log only errors
  },
  // ...
}
and then reads the child object development, production or testing based on the environment:
{
  database: {
    pushArticles: (articles) => {
      // push to production database
    },
    // ...
  },
  misc: {
    log: "all"
  },
  // ...
  development: {
    database: {
      pushArticles: (articles) => {
        // push to local database
      },
      // ...
    },
    misc: {
      log: "all" // log everything
    },
    // ...
  }
}
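The two-step resolution described above can be sketched as a merge of the root configuration with the matching child object. The resolveConfig and deepMerge helpers are hypothetical; whether Saffron merges deeply or shallowly, and how it detects the environment, is not stated here and is assumed for illustration.

```javascript
// Hypothetical helpers, not part of Saffron's API.
const ENVIRONMENTS = ["development", "production", "testing"];

const isPlainObject = (value) =>
  value !== null && typeof value === "object" && !Array.isArray(value);

// Recursively overlays `override` on top of `base` without mutating either.
function deepMerge(base, override) {
  const result = { ...base };
  for (const [key, value] of Object.entries(override)) {
    result[key] = isPlainObject(value) && isPlainObject(base[key])
      ? deepMerge(base[key], value)
      : value;
  }
  return result;
}

function resolveConfig(config, environment) {
  // Step 1: take the root configuration, without the environment child objects.
  const root = Object.fromEntries(
    Object.entries(config).filter(([key]) => !ENVIRONMENTS.includes(key))
  );
  // Step 2: overlay the child object that matches the environment, if any.
  return deepMerge(root, config[environment] || {});
}

const example = {
  misc: { log: "errors" },
  development: { misc: { log: "all" } },
};
console.log(resolveConfig(example, "development").misc.log); // prints all
console.log(resolveConfig(example, "production").misc.log);  // prints errors
```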