At 10:34 AM, We experienced a service disruption caused by connection timeouts to our database. These timeouts prevented our application layer from communicating with the data store, leading to a cascade of pod failures and service instability.
The connection timeouts triggered a failure in the pods' startup probes, causing them to crash. We have restored service by refreshing the database and cycling the pods. We are currently reviewing our connection pool settings and liveness probe thresholds to ensure better resilience against transient database latency in the future.