During the incident, the system hit 3000 IOPS and stayed there. It didn’t recover on its own. We had to kill connections manually. Why? What was actually happening inside Postgres that made the situation self-sustaining? I had a wrong mental model for days. This post is about fixing that.