OpenAI has revealed that a newly deployed telemetry service caused widespread disruptions to its services including ChatGPT, Sora, and its API on Wednesday. According to Kyle Wiggers reporting for TechCrunch, the three-hour outage began around 3 p.m. Pacific time. The company explained that the new service’s configuration inadvertently overwhelmed their Kubernetes API servers, affecting critical infrastructure components including DNS resolution systems. The problem was detected minutes before users experienced issues, but engineers struggled to implement fixes due to locked-out systems. OpenAI acknowledged falling short of expectations and announced several preventive measures, including improved monitoring systems and new mechanisms to ensure engineer access to API servers during emergencies. The company emphasized that the incident was not related to security issues or recent product launches.
OpenAI identifies telemetry service as cause of major outage
Related posts:
Tags: OpenAI