Resolved
This incident has been fully resolved and all services are operating normally.
Summary: We experienced a partial service degradation beginning yesterday evening. The initial impact was limited, but as activity increased overnight and into the morning, additional services were affected. As monitoring showed increasing errors across a broader part of the platform, the incident was escalated and handled with the highest priority.
Root cause: The issue was caused by an incorrect configuration in our internal message queueing infrastructure (used for system-to-system communication). Because of this misconfiguration, new message listeners were continuously created on a single internal topic, and over time the count reached the platform limit of 2,000 listeners for that topic. Once the limit was reached, new application instances were unable to start because they could not create the listener they require, which reduced available capacity and led to the degraded service.
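To illustrate the failure mode in simplified form, the sketch below simulates it: it is not our production code, the topic and listener names are placeholders, and only the per-topic limit of 2,000 listeners comes from the description above.

```python
# Illustrative simulation of the failure mode described above.
# Topic names, listener names, and the registry class are hypothetical;
# only the 2,000-listener per-topic limit is taken from the incident summary.

LISTENER_LIMIT_PER_TOPIC = 2_000


class TopicRegistry:
    """Tracks listeners registered on a single topic and enforces the platform limit."""

    def __init__(self, limit: int = LISTENER_LIMIT_PER_TOPIC):
        self.limit = limit
        self.listeners: set[str] = set()

    def register(self, listener_name: str) -> None:
        if len(self.listeners) >= self.limit:
            raise RuntimeError("listener limit reached; cannot create listener")
        self.listeners.add(listener_name)


def start_instance_misconfigured(registry: TopicRegistry, instance_id: int) -> None:
    # Misconfigured behaviour: every (re)start registers a brand-new, uniquely
    # named listener, so the listener count on the topic only ever grows.
    registry.register(f"orders-listener-{instance_id}")


def start_instance_corrected(registry: TopicRegistry) -> None:
    # Corrected behaviour: all instances share one stable listener name, so
    # restarts and scale-ups do not add new listeners to the topic.
    if "orders-listener" not in registry.listeners:
        registry.register("orders-listener")


if __name__ == "__main__":
    registry = TopicRegistry()
    try:
        for i in range(2_100):  # restarts and deploys accumulating over time
            start_instance_misconfigured(registry, i)
    except RuntimeError as exc:
        print(f"instance failed to start after {len(registry.listeners)} listeners: {exc}")
```

The corrected configuration corresponds to the shared-listener pattern in the sketch: instances reuse a stable listener rather than creating a new one on each start, so the per-topic count stays flat.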
Remediation and prevention: We have corrected the configuration, restored normal operation, and verified service recovery. To reduce the risk of recurrence, we are implementing safeguards to prevent unintended listener growth, tightening validation around configuration changes, and enhancing monitoring and alerting so abnormal patterns are detected and addressed earlier.
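As one hypothetical example of the kind of alerting being added, a check like the following would flag a topic well before it reaches the hard listener limit; the threshold, topic names, and metric source are illustrative assumptions rather than our actual tooling.

```python
# Illustrative sketch of a listener-count alerting check.
# The warning ratio, topic names, and metric source are assumptions.

LISTENER_LIMIT_PER_TOPIC = 2_000
WARNING_RATIO = 0.5  # alert well before the hard limit is reached


def topics_needing_attention(listener_counts: dict[str, int]) -> list[str]:
    """Return topics whose listener count has crossed the warning threshold."""
    threshold = int(LISTENER_LIMIT_PER_TOPIC * WARNING_RATIO)
    return [topic for topic, count in listener_counts.items() if count >= threshold]


if __name__ == "__main__":
    # In practice, listener counts would come from the queueing platform's metrics.
    sample = {"orders": 1_250, "notifications": 40}
    for topic in topics_needing_attention(sample):
        print(f"ALERT: topic '{topic}' has unusually many listeners "
              f"({sample[topic]} of {LISTENER_LIMIT_PER_TOPIC} allowed)")
```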
We apologize for the disruption and appreciate your patience and understanding.
Monitoring
We’ve implemented a solution to the issue causing API outages and are now monitoring the results. All services should be fully available again.
We’ll continue to monitor closely and provide further updates if anything changes.
Identified
The issue causing API outages has been identified, and our team is actively implementing a solution to restore service.
Investigating
We’re currently investigating an issue causing API outages. We’ll provide an update as soon as possible.