Incident Report 07 March 2026
Issue with Kong servers that caused downtime on various services on Prop Data Manage
Summary
On the morning of 7th March 2026, we noticed that several services were degraded or not working. We identified that the recently upgraded Kong gateway servers had run out of disk space and were not passing requests through to the back-end. We cleared the logs and service resumed. The disks filled again over the weekend, causing another outage on Monday morning (9th March 2026). We cleared the logs again, restoring service. We then identified and fixed the root cause, and service has been stable since.
Impact
Several services on Manage were degraded or unavailable: pages loaded slowly or not at all.
Root Cause
The “Kong” servers are a cluster of gateways which direct incoming requests to the correct back-end server. They had run stably for many years, but were badly out of date and becoming a security risk. We scheduled a full upgrade for the evening of Friday, 6th March. The upgrades completed successfully, passed testing, and were put into service. However, Kong expects to be able to send its metrics to a service called statsd, which was not present on the new servers. Kong was still routing requests correctly, which is why testing passed, but the missing statsd service caused Kong to generate a storm of error messages, which made the log files grow until they filled the disk. Once the disks were full, the service stopped responding.
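For reference, the statsd integration that triggered the errors is typically enabled as a Kong plugin. A minimal sketch using Kong's declarative configuration format (the host and port shown are the conventional statsd defaults, not necessarily the values used on our servers):

```yaml
_format_version: "3.0"

plugins:
  - name: statsd          # Kong's bundled statsd plugin
    config:
      host: 127.0.0.1     # address of the local statsd listener
      port: 8125          # default statsd UDP port
```

If the plugin is enabled but nothing is listening on the configured host and port, Kong logs errors on every attempted metric send, which is consistent with the log growth we observed.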
Resolution
Once the Prop Data development team confirmed the root cause, the following steps were taken: First, the original Kong servers were returned to service and the new ones were removed. With service restored, we connected to the new servers, cleared out the log files, and corrected the misconfiguration by installing AWS’s “CloudWatch Agent”, which provides a statsd service. The new servers were then returned to service alongside the original servers.
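The CloudWatch Agent's statsd listener is enabled through its JSON configuration file. A minimal sketch of the relevant section, using the agent's documented defaults (the actual configuration deployed on the servers may differ):

```json
{
  "metrics": {
    "metrics_collected": {
      "statsd": {
        "service_address": ":8125",
        "metrics_collection_interval": 60,
        "metrics_aggregation_interval": 60
      }
    }
  }
}
```

With this section present, the agent listens for statsd traffic on UDP port 8125, satisfying Kong's expectation of a local statsd endpoint.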
When it was found that the logs were still filling (albeit much more slowly now that the primary cause was corrected), we repeated the process and turned on log rotation, to ensure that stale logging data would be discarded, protecting the disk from being filled.
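Log rotation of this kind is commonly handled with logrotate. A minimal sketch, assuming Kong's default log directory of /usr/local/kong/logs (the path, retention period, and rotation frequency are illustrative, not the exact values deployed):

```
/usr/local/kong/logs/*.log {
    daily            # rotate once per day
    rotate 7         # keep one week of rotated logs
    compress         # gzip rotated files to save disk space
    delaycompress    # leave the most recent rotation uncompressed
    missingok        # do not error if a log file is absent
    notifempty       # skip rotation for empty files
    copytruncate     # truncate in place so Kong keeps its open file handle
}
```

The copytruncate option avoids having to signal Kong to reopen its log files after each rotation, at the cost of possibly losing a few lines written during the truncation.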
Communications (Internal)
Conducted via Teams chat and calls.
Communications (External)
Comms were handled on an ad-hoc basis by the account management team.
Knowledge gained
The Kong software expects the statsd service to be present. On the old servers, this service was available because the AWS CloudWatch Agent was already installed. The new servers did not have CloudWatch Agent installed by default, and so the statsd service was not present. We are now aware of these requirements and can make sure that they are correctly installed and configured on future builds.
Action Items (Recommendation)
- Engineering Team to ensure that systems documentation is updated to include Kong’s requirement for the AWS CloudWatch Agent (statsd service).