At LunaNova, we have opted to run our own hardware in high-availability datacentres, to ensure we are providing a truly decentralized service to blockchain networks. A great deal of thought has gone into how we configure our infrastructure. Below are some highlights and technical details of the choices we have made to provide a robust, vigilant and performant service.
A strong datacentre is the foundation of any competent blockchain validator offering. In 2018 we shortlisted candidates within a 150-mile radius of our operations centre and personally inspected each of them before deciding on our primary location. It is a well-run, top-tier facility with redundant power, networking and cooling. 24/7 support staff with stringent security and operational procedures underpin their service. In autumn 2020 we deployed more of our own hardware to a second, equally strong, high-availability datacentre located in London. This has increased our capacity and provided a further layer of redundancy for our operations.
Several decades of engineering and system-building experience enable us to custom-build, test and deploy our own hardware. This affords a much greater degree of control over the specification and performance of our systems and ensures that, in the event of any issues, we are not unnecessarily reliant on third parties to resolve them. It is also a cost-effective strategy, which means we can afford to provision redundant standby systems for our high-availability operations.
Our primary platform uses AMD EPYC systems in clusters. These systems are assembled with enterprise-grade dual power supplies and enterprise SSDs in ZFS mirror configurations for high availability. We run node-specific deployments in individual virtual machines (VMs) on top of these systems, for customisation, portability and security reasons. By deploying our server hardware in clusters we can migrate our VMs from system to system without service downtime, which is vital for the prompt roll-out of critical security patches.
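To illustrate the storage side of this setup, a two-way ZFS mirror for VM images can be created with a single command; here it is sketched as an Ansible task (the pool name, device paths and task itself are hypothetical examples, not our actual configuration):

```yaml
# Illustrative only: pool and device names are examples and would be
# parameterised per host in practice.
- name: Create a mirrored ZFS pool for VM storage
  ansible.builtin.command:
    cmd: zpool create vmpool mirror /dev/disk/by-id/ssd0 /dev/disk/by-id/ssd1
    creates: /vmpool
```

With a mirror, either SSD can fail without data loss or downtime, and the faulted disk can be replaced and resilvered while the VMs keep running.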
For each blockchain we service, we carefully study the software requirements before deciding our approach. For example, for Solana, to produce the highest levels of validator service, we custom-built our own high-performance GPU system. This is engineered to have far greater thermal performance than any off-the-shelf GPU system we have encountered, enabling incredibly high transactions per second without thermal throttling or reduced component life.
We exclusively use Linux server distributions for our validator systems, as their stability and flexibility are superior. We follow an "infrastructure as code" approach: our systems are deployed, configured and maintained with custom-built Ansible scripts. These ensure that our systems are consistently and reliably configured. Our Ansible scripts also enable us to quickly and capably redeploy individual systems, or whole clusters, in the case of any emergencies.
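As a simplified sketch of what this approach looks like in practice (the host group, role names and layout below are illustrative, not our actual repository), a top-level playbook applies a common baseline before any node-specific roles:

```yaml
# Hypothetical top-level playbook: role names are examples only.
- hosts: validators
  become: true
  roles:
    - common          # users, SSH policy, firewall, monitoring agents
    - zfs_storage     # pools and datasets for VM images
    - validator_node  # chain-specific software and services
```

Because every host is built from the same roles, rebuilding a failed system is a matter of re-running the playbook against replacement hardware.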
We use Prometheus servers to monitor our hardware and VMs. Their deployment is fully integrated into our Ansible scripts, and they allow us to capture critical hardware, network-connectivity and performance metrics about our deployments with minimal overhead.
We use multiple Prometheus servers (two internally, and two externally collecting public network-connectivity data) in a high-availability configuration. The data gathered is displayed via a number of custom Grafana dashboards showing hardware and service performance. This provides our sysadmins with a clear view of our infrastructure status at all times.
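Prometheus achieves high availability simply by running identical, independent servers that scrape the same targets, so either can serve the full dataset. A minimal sketch of such a scrape configuration (job name, interval and target hostnames are illustrative) might look like:

```yaml
# prometheus.yml fragment -- deployed identically to each Prometheus
# server so any one of them holds a complete copy of the metrics.
scrape_configs:
  - job_name: node
    scrape_interval: 15s
    static_configs:
      - targets: ['host1.internal:9100', 'host2.internal:9100']
```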
Our hardware systems transfer their system and auth logs to separately configured remote servers, ensuring a durable external audit trail in the hopefully-unlikely event of any system compromise or failure.
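Assuming rsyslog is the log daemon in use (a common choice on Linux servers; the mechanism and hostnames here are illustrative), remote forwarding can be enabled with a single Ansible-managed configuration line:

```yaml
# Hypothetical task: forwards all syslog messages over TCP to a
# remote log host, so a local compromise cannot erase the trail.
- name: Forward system and auth logs to remote log server
  ansible.builtin.lineinfile:
    path: /etc/rsyslog.d/90-remote.conf
    line: '*.* @@loghost.internal:514'
    create: true
  notify: restart rsyslog
```

The `@@` prefix selects TCP delivery, which, unlike UDP, will not silently drop log lines under load.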
We utilize Prometheus's Alertmanager in a 4-node cluster to provide high-availability alerting. Notifications are routed according to severity and type. The highest-priority, service-based alerts are routed through PagerDuty in the first instance and, in failover, via a secondary route through Twilio. These systems phone or text our on-call sysadmins, with escalation and repeat alerts to ensure a timely response. Email is used exclusively for lower-priority alerting and as a backup for higher-priority alerts. These emails are PGP-encrypted to avoid potentially sensitive system-architecture details leaking via plaintext.
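A simplified Alertmanager routing tree along these lines (receiver names, labels and intervals are illustrative examples, not our production configuration) could be:

```yaml
# alertmanager.yml fragment: critical service alerts go to paging,
# everything else to (PGP-encrypted) email.
route:
  receiver: email-team            # default for lower-priority alerts
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      repeat_interval: 15m        # re-page until acknowledged
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: <redacted>
  - name: email-team
    email_configs:
      - to: 'ops@example.com'
```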
Care has been taken to ensure that only critical, service-based incidents result in 24/7 audible alerting and non-response escalation. This enables on-call staff to focus on what is necessary and appropriate.
Security, critical when operating blockchain nodes, derives from our thorough approach to network architecture, system hardening and firewall configuration. Our initial system-setup scripts apply specific hardening measures to reduce the attack surface of our systems.
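As a hedged illustration of the kind of baseline such setup scripts can apply (the ports and policy below are generic examples, not our actual rule set), a default-deny firewall can be expressed as Ansible tasks:

```yaml
# Hypothetical hardening tasks: deny all inbound traffic by default,
# then allow SSH and the node's peer-to-peer port only.
- name: Deny all inbound traffic by default
  community.general.ufw:
    state: enabled
    policy: deny
    direction: incoming

- name: Allow SSH and the validator's P2P port
  community.general.ufw:
    rule: allow
    port: "{{ item }}"
    proto: tcp
  loop: ['22', '26656']   # 26656 is an example P2P port only
```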
By focusing on a select number of blockchains, we are able to keep a close eye on their technical communities, ensuring a prompt response to software updates and security issues. Where practical, our servers use "unattended upgrade" options to automatically install security updates as soon as they become available.
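On Debian-based systems this is typically the unattended-upgrades package. A sketch of enabling it via Ansible (the package name and APT settings are standard; the tasks themselves are illustrative):

```yaml
# Hypothetical tasks: install unattended-upgrades and switch on
# periodic package-list updates and automatic upgrades.
- name: Install unattended-upgrades
  ansible.builtin.apt:
    name: unattended-upgrades
    state: present

- name: Enable periodic automatic security updates
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/20auto-upgrades
    content: |
      APT::Periodic::Update-Package-Lists "1";
      APT::Periodic::Unattended-Upgrade "1";
```

By default, unattended-upgrades restricts itself to the distribution's security pocket, which is exactly the behaviour wanted here.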
For services that require the highest levels of availability, automatic updates pose a real risk of unnecessary disruption. For these, we keep a close eye on the relevant security mailing lists and apply any urgent updates manually, as soon as practical. Our many years of experience enable us to avoid needlessly applying updates that are not relevant to our server configurations and may disrupt uptime, a common failing among less experienced validator services.
In parallel, our comprehensive monitoring system informs our sysadmins which systems have security updates available and, crucially, which require a full restart once updates have been applied.
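On Debian-based systems, the presence of /var/run/reboot-required signals that a restart is pending. Assuming this is exposed to Prometheus as a metric (node_reboot_required below is an assumed name from a node-exporter textfile-collector script, not a built-in metric), an alert rule can surface it at low priority:

```yaml
# Hypothetical Prometheus alert rule; node_reboot_required is assumed
# to be published by a textfile-collector script on each host.
groups:
  - name: maintenance
    rules:
      - alert: RebootRequired
        expr: node_reboot_required > 0
        for: 24h
        labels:
          severity: low
        annotations:
          summary: "{{ $labels.instance }} needs a restart to finish applying updates"
```

Routing this at `severity: low` keeps it out of the 24/7 paging path while still ensuring pending restarts are scheduled.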