4. Installation¶
Note
To be able to install XFM you need a token (defined as ${REPO_TOKEN}
below). This token provides access to the MGRID package repository. You can
request a token by sending an e-mail to support@mgrid.net.
4.1. Quickstart¶
For a single node install, you can execute
$ REPO_TOKEN=<YOUR_TOKEN>
$ curl -s http://${REPO_TOKEN}:@www.mgrid.net/quickstart/install.sh | sudo bash
This installs and sets up a base XFM configuration that accepts CDA R2 messages with RIM version 2.33R1 Normative Edition 2011. The RIM database is optimized for high message rates.
If successful, you can access the XFM command line tools by running:
$ su - xfmadmin
For available commands and testing, see Using xfmctl.
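As a quick smoke test (assuming the quickstart placed an xfm.json configuration in the xfmadmin home directory, as the detailed procedure below does for manual installs), the configured nodes can be listed from that account:
$ xfmctl list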
4.2. Detailed Procedure¶
XFM runs on a set of (virtual) machines, and is bundled with a command line tool
(xfmctl) to help set up and manage an XFM deployment.
xfmctl is installed on a machine which manages the XFM deployment, and
should have network access to the target machines.
Install xfmctl and its dependencies (xfmctl is written in Python
and it is recommended to run it in a virtualenv).
First set some environment variables:
export REPO_TOKEN=<YOUR_TOKEN>
export ADMIN=xfmadmin
export ADMIN_HOME=/home/$ADMIN
export RELEASE=$(cat /etc/redhat-release | grep -oE '[0-9]+\.[0-9]+' | cut -d"." -f1)
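For reference, RELEASE holds the major version of the distribution; on a CentOS 7 machine, for example, the pipeline above reduces the release string to a single digit (the exact release string differs per system):
$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
$ echo $RELEASE
7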
Then add the MGRID package repository:
curl -s https://${REPO_TOKEN}:@packagecloud.io/install/repositories/mgrid/mgrid3/script.rpm.sh | sudo bash
This adds a Yum repository to the system, so the xfmctl package becomes
available for installation.
Add the install group and user:
groupadd $ADMIN && useradd -d $ADMIN_HOME -m -g $ADMIN $ADMIN
Install the XFM command line tools (xfmctl). This requires the Extra
Packages for Enterprise Linux (EPEL) repository.
yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-${RELEASE}.noarch.rpm
yum install -y mgridxfm3-xfmctl
The next command switches to the $ADMIN user, and preserves
the environment variables (-E switch).
sudo -E -u $ADMIN bash
cd $ADMIN_HOME
Create SSH keys for key-based authentication (this allows xfmctl to connect to the
nodes).
ssh-keygen -t rsa -b 2048 -f $ADMIN_HOME/.ssh/xfm_rsa -P ""
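This writes the key pair to the .ssh directory (if ssh-keygen reports that the directory does not exist, create it first with mkdir -m 700 $ADMIN_HOME/.ssh and rerun the command). The private key is referenced later as the sshkey setting in xfm.json, and the public key is what ssh-copy-id distributes to the nodes:
$ ls $ADMIN_HOME/.ssh/xfm_rsa*
/home/xfmadmin/.ssh/xfm_rsa  /home/xfmadmin/.ssh/xfm_rsa.pub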
Create the virtual environment:
virtualenv ${ADMIN_HOME}/pyenv
source ${ADMIN_HOME}/pyenv/bin/activate
echo source ${ADMIN_HOME}/pyenv/bin/activate >> ${ADMIN_HOME}/.bashrc
pip install /opt/mgrid/xfm3-xfmctl/xfmctl-*.tar.gz
If successful, the xfmctl command is available in your virtualenv. The
activation line is also added to the .bashrc file, so the virtualenv is activated when
logging in as $ADMIN. Before xfmctl can be used, it needs a configuration file.
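A quick check is to verify that the xfmctl entry point resolves inside the virtualenv (the path shown assumes the ADMIN_HOME and virtualenv locations chosen above):
$ which xfmctl
/home/xfmadmin/pyenv/bin/xfmctl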
4.3. xfmctl Configuration¶
xfmctl reads its configuration from a file xfm.json (expected in the
directory where xfmctl is run).
xfmctl supports several target environments through so-called backends; it
can set up and manage XFM nodes on existing cloud infrastructure, interact
directly with (virtual) machines, or use Docker containers. The backends are:
plain
    This backend is the most basic and does not create XFM instances itself (i.e. the XFM instances are already brought up by some other, possibly manual, process). xfmctl only needs to know the instance names and how to access them using Secure Shell (SSH).
amazon
    For the amazon backend xfmctl uses the EC2 web service to create and manage instances. A valid Amazon EC2 account is required. When an instance is available, xfmctl also connects to it using SSH.
The xfmctl distribution contains sample configuration files for each
backend; see Example configuration file for an example single node configuration for the
plain backend. Available settings are:
xfm
    backend
        Selected backend. For the selected backend a configuration section backend_<NAME> must exist; for example backend_plain.
    config
        XFM deployment configuration. Site-specific components (e.g., custom workers, messaging configuration, parser definitions) can be packaged and selected with this configuration item. base is the default XFM configuration.
    repo
        Name of the MGRID package repository. Default is mgrid3.
    repotoken
        Access token for the MGRID package repository.
    sshkey
        Path to the private key used to access XFM instances.
    persistence
        (true, false) Set whether message persistence should be enabled for all messages (i.e. the broker stores messages to disk before sending acknowledgements). Enable this option when messages should survive broker restarts or failures; when persistence is disabled, messages are kept in memory only. Note that writing messages to disk affects performance.
    enable_metrics
        (true, false) Set whether XFM components should send metrics to the metrics server.
gateway
    hostname
        Hostname used by nodes (e.g., ingesters, command-line tools) to access the gateway RabbitMQ instance.
    username
        RabbitMQ username for messaging with the gateway.
    password
        RabbitMQ password for messaging with the gateway.
broker
    hostname
        Hostname used by nodes (e.g., ingesters, transformers, loaders, command-line tools) to access the broker RabbitMQ instance.
    username
        RabbitMQ username for messaging with the broker.
    password
        RabbitMQ password for messaging with the broker.
ingester
    procspernode
        The number of processes to start on a node/instance.
    prefetch
        The number of messages that are prefetched from the gateway.
    flowcontrol
        Subcategory for flow control settings.
        threshold
            Number of messages queued for transformers and loaders (summed) before flow control is activated (to limit intake from the gateway).
        period
            Period in milliseconds for querying queue sizes. This determines how often the actual queue sizes are checked against the threshold (because of this polling mechanism the combined queue sizes can exceed the configured threshold).
transformer
    procspernode
        The number of processes to start on a node/instance.
    prefetch
        The number of messages that are prefetched from the broker.
    json
        Settings for the JSON transformer (not used in this tutorial).
        partitions
            Number of database table partitions.
        table
            Name of the destination table.
        column
            Name of the destination table column.
loader
    procspernode
        The number of processes to start on a node/instance.
    group
        Subcategory for group settings.
        timeout
            Timeout in milliseconds for grouping (aggregating) messages. If fewer messages than the group size are received before the timeout, the (partial) group is processed. This avoids stalling of messages.
        size_low, size_high
            The loaders choose a random group size at startup to avoid running in lockstep when uploading towards the data lake. These parameters control the lower and upper bound of the chosen group size. The prefetch size is chosen as 2 times the group size.
    pond
        Settings for the pond databases.
        pgversion
            PostgreSQL database version to use; 9.4 or 9.5.
        port
            Listening port of the pond database server.
metrics
    Metrics backend (Graphite).
    hostname
        Hostname used by nodes (e.g., ingesters, loaders, command-line tools) to access the metrics server.
    port
        Port for sending metric data. Note that this is the server port as used by clients.
    secret
        Key for accessing the metrics web API.
lake
    pgversion
        PostgreSQL database version to use; 9.4 or 9.5.
    datadir
        Directory of the lake database files.
    hostname
        Hostname of the lake database.
    port
        Port of the lake database.
    name
        Name of the lake database.
    username
        Username to access the lake database.
    password
        Password to access the lake database. Note this is used verbatim in a pgpass file, so : and \ should be escaped.
backend_plain
    username
        Username used to access an instance.
    hosts
        Key-value pairs of XFM instances. The key is the role name, and the value is a list of IP addresses or hostnames of the nodes belonging to that role (nodes should only have a single role). Typically each role represents a group of nodes with the same configuration profile, such as all ingester nodes. The role name is used to determine the configuration profile (i.e. installation instructions).
In its base configuration XFM contains the following roles (this can be extended through site-specific configurations):
gateway
    Gateway broker (RabbitMQ). Should contain at most 1 entry.
broker
    Messaging broker (RabbitMQ). Should contain at most 1 entry.
ingester
    Nodes running Ingesters. Can contain 1 or more entries.
transformer
    Nodes running Transformers. Can contain 1 or more entries.
loader
    Nodes running Loaders. Can contain 1 or more entries.
lake
    Lake database (PostgreSQL). Should contain at most 1 entry.
rabbitmq
    Combination of gateway and broker on a single RabbitMQ instance. Should contain at most 1 entry.
worker
    Combination of Ingesters, Transformers and Loaders. Can contain 1 or more entries.
singlenolake
    Combination of all components on a single node except the Lake. Should contain at most 1 entry.
singlenode
    Combination of all components on a single node. Should contain at most 1 entry.
backend_amazon
    imageid
        Identifier of the Amazon Machine Image (AMI) to use.
    username
        Username used to access an instance (typically ec2-user).
    keyname
        Name of the key pair.
    cert
        Path to the certificate to access the Amazon AWS EC2 endpoint.
    securitygroup
        Security group to use for an instance.
    management, gateway, broker, ingester, transformer, loader
        sizeid
            The size identifier of an instance hardware configuration (e.g., t1.micro).
The SSH key created in a previous step should be used as the sshkey setting in the
xfm.json configuration file.
4.4. Installing nodes¶
Before xfmctl can start installation of a node, it must be able to access the node using key-based authentication. To copy the created key to a node, do:
$ ssh-copy-id -i $ADMIN_HOME/.ssh/xfm_rsa <USERNAME>@<HOSTNAME>
Substitute <USERNAME> with the username set as the backend username in
xfm.json, and <HOSTNAME> with each hostname (or IP address) used in the
backend hosts section (a loop over all configured hosts is sketched below). To list the configured addresses run:
$ xfmctl list
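With those addresses at hand, a small shell loop avoids repeating the command by hand; the addresses below are illustrative placeholders for the entries in your backend hosts section:
$ for HOST in 192.168.1.110 192.168.1.120 192.168.1.150; do ssh-copy-id -i $ADMIN_HOME/.ssh/xfm_rsa <USERNAME>@$HOST; done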
Now xfmctl should have access to the node. To test, do:
$ xfmctl --hosts=<HOSTNAME> -- cat /etc/redhat-release
When successful, it should print the distribution version, without prompting for a username or password.
xfmctl needs to know on which nodes to run its commands. Above, the
--hosts parameter was used to select individual nodes, but the --roles
parameter can be used to run commands on all nodes listed for that role in the
backend hosts section in xfm.json. Multiple roles can be provided,
separated with commas, for example:
$ xfmctl --roles=ingester,transformer,loader -- cat /etc/redhat-release
Now that xfmctl can access the nodes, it should be given privileged access so
that it can make system-wide changes to the nodes (e.g., install and configure
software).
To enable privileged access using xfmctl, execute the following
command for all roles (prompts for the root password on each node):
xfmctl --roles=singlenode -- su -c "\"mkdir -p /etc/sudoers.d && echo $ADMIN 'ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/999_xfm\""
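To verify that passwordless sudo is now in place, a plain sudo check can be run through xfmctl (sudo -n fails instead of prompting if a password would still be required, so this should complete without any prompt):
$ xfmctl --roles=singlenode -- sudo -n true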
When ready, a node can be installed:
$ xfmctl --roles=singlenode setup update_config update
The installation commands are as follows:
setup
    Adds the repositories to the node that are needed to install XFM dependencies in addition to the base RedHat and CentOS repositories, installs the XFM bootstrap package, and configures Puppet. Installed repositories are:
- MGRID
- Puppet
- PostgreSQL (9.4 and 9.5)
- Extra Packages for Enterprise Linux (EPEL)
- Software Collections (SCL), only for RedHat/CentOS 6.
update_config
    Copies the settings in xfm.json to a node so that they are available during installation.
update
    Installs or updates a node. The actual steps performed depend on the role of the node as configured in the backend hosts section in xfm.json.
4.5. Amazon backend: Creating nodes¶
When using the Amazon backend, there is an additional command create.
This command requires a role parameter. As was already seen when
creating the management server instance, this is done by passing the role after a
colon:
$ xfmctl create:management
Note that while the create command returns after the instance is running with
networking enabled, it can take some additional time before access using SSH is
possible. If subsequent update commands time out, it often helps to wait a
bit and retry.
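For example, assuming the other roles (such as broker and ingester) are configured in the backend_amazon section, their instances can be created the same way before running the install commands:
$ xfmctl create:broker
$ xfmctl create:ingester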
4.5.1. Updating configuration¶
When setting up the management server, the configuration from xfm.json was
uploaded. When this file changes (e.g., to change the prefetch of the
ingesters), the affected instances should be updated to reflect the changes:
$ xfmctl --roles=ingester update_config update
4.5.2. Example configuration file¶
{
"xfm": {
"backend": "plain",
"config": "base",
"repo": "mgrid3",
"repotoken": "${REPO_TOKEN}",
"sshkey": "${ADMIN_HOME}/.ssh/id_rsa",
"persistence": false,
"enable_metrics": false
},
"gateway": {
"hostname": "localhost",
"username": "xfm",
"password": "tr4nz00m"
},
"broker": {
"hostname": "localhost",
"username": "xfm",
"password": "tr4nz00m"
},
"ingester": {
"procspernode": 1,
"prefetch": 50,
"flowcontrol": {
"threshold": 5000,
"period": 2000
}
},
"transformer": {
"procspernode": 1,
"prefetch": 50,
"json": {
"partitions": 400,
"table": "document",
"column": "document"
}
},
"loader": {
"procspernode": 1,
"group": {
"timeout": 1000,
"size_low": 50,
"size_high": 100
},
"pond": {
"pgversion": "9.4",
"port": 5433
}
},
"metrics": {
"hostname": "localhost",
"port": 2003,
"secret": "verysecret"
},
"lake": {
"pgversion": "9.4",
"datadir": "/var/lib/pgsql/9.4/lake",
"hostname": "localhost",
"port": 5432,
"name": "lake",
"username": "xfmuser",
"password": "lake"
},
"backend_plain": {
"username": "${ADMIN}",
"hosts": {
"singlenode": [
"127.0.0.1"
]
}
}
}
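Because xfmctl reads this file as JSON, a malformed file (for example a trailing comma) is a common source of errors. A generic syntax check, such as Python's json.tool (not an XFM-specific tool), catches this early:
$ python -m json.tool xfm.json > /dev/null && echo "xfm.json is valid JSON"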
4.6. Multinode install¶
Below is an example of how to install XFM on multiple nodes, matching the configuration file that follows:
- 1 Management server
- 1 RabbitMQ node (combined gateway and broker)
- 1 Ingester node
- 2 Transformer nodes
- 2 Loader nodes
- 1 Lake node
Add all hosts to xfm.json (see xfmctl Configuration for an explanation of the settings):
{
"xfm": {
"backend": "plain",
"config": "base",
"repo": "mgrid3",
"repotoken": "${REPO_TOKEN}",
"sshkey": "${ADMIN_HOME}/.ssh/id_rsa",
"persistence": false,
"enable_metrics": false
},
"gateway": {
"hostname": "192.168.1.110",
"username": "xfm",
"password": "tr4nz00m"
},
"broker": {
"hostname": "192.168.1.110",
"username": "xfm",
"password": "tr4nz00m"
},
"ingester": {
"procspernode": 1,
"prefetch": 50,
"flowcontrol": {
"threshold": 5000,
"period": 2000
}
},
"transformer": {
"procspernode": 1,
"prefetch": 50,
"json": {
"partitions": 400,
"table": "document",
"column": "document"
}
},
"loader": {
"procspernode": 1,
"group": {
"timeout": 1000,
"size_low": 50,
"size_high": 100
},
"pond": {
"pgversion": "9.4",
"port": 5433
}
},
"metrics": {
"hostname": "192.168.1.110",
"port": 2003,
"secret": "${METRICS_API_SECRET}"
},
"lake": {
"pgversion": "9.4",
"datadir": "/var/lib/pgsql/9.4/lake",
"hostname": "localhost",
"port": 5432,
"name": "lake",
"username": "xfmuser",
"password": "lake"
},
"backend_plain": {
"username": "${ADMIN}",
"hosts": {
"rabbitmq": [
"192.168.1.110"
],
"ingester": [
"192.168.1.120"
],
"transformer": [
"192.168.1.130",
"192.168.1.131"
],
"loader": [
"192.168.1.140",
"192.168.1.141"
],
"lake": [
"192.168.1.150"
]
}
}
}
After preparing each node for installation (see Installing nodes), start the installation:
xfmctl --parallel --roles=rabbitmq,lake,ingester,transformer,loader setup update_config update
The --parallel switch is optional but allows installation of multiple nodes
in parallel.
When successful, the host-based access configuration of the lake should be
edited to allow loader access. To do this, edit the file /etc/xfm/lake_hba.conf
on the lake node. For example, for the loaders in the xfm.json above
(assuming a CIDR mask length of 24):
host lake xfmuser 192.168.1.0/24 trust
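The line above follows the host/database/user/address/method layout also used by PostgreSQL host-based authentication files. If a /24 network is broader than desired, each loader from the example configuration can be listed individually instead (a sketch, assuming the same format applies to single-host addresses):
host lake xfmuser 192.168.1.140/32 trust
host lake xfmuser 192.168.1.141/32 trust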