This incident has been fully resolved and all services are operating normally.
Summary: We experienced a partial service degradation beginning yesterday evening. The impact was initially limited, but as activity increased overnight and into the morning, additional services were affected. When monitoring showed errors spreading across a broader portion of the platform, the incident was escalated and handled at the highest priority.
Root cause: The issue was caused by an incorrect configuration in our internal message queueing infrastructure, which is used for system-to-system communication. The misconfiguration caused new message listeners to be created continuously on a single internal topic, until the platform limit of 2,000 listeners for that topic was reached. Once at the limit, new application instances could not start because they were unable to create their required listener, which reduced available capacity and led to degraded service.
Remediation and prevention: We have corrected the configuration, restored normal operation, and verified service recovery. To reduce the risk of recurrence, we are implementing safeguards against unintended listener growth, tightening validation of configuration changes, and enhancing monitoring and alerting so that abnormal patterns are detected and addressed earlier.
We apologize for the disruption and appreciate your patience and understanding.