Today's outage: post mortem
This is a quite technical walkthrough of the process of finding the cause of today's outage. It is aimed at other server admins, so that we can learn from each other.
Breakfast
It's 9 o'clock in the morning, I've got my porridge and a coffee ☕, ready to watch the Tagesschau, but my laptop greets me with "foodsharing is down!!1!1!" instead.
Dammit! But okey-dokey -> $ ssh dragonfruit.foodsharing.de
First step whenever we get a 502 Bad Gateway error: what does nginx say?
$ tail -F /var/log/nginx/error.log
2021/03/30 06:40:12 [error] 1625#1625: *914973060 connect() to unix:/var/run/php7-fpm-production.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 2003:*****:29bd, server: dragonfruit.foodsharing.de, request: "GET /api/stores/31605/pickups HTTP/2.0", upstream: "fastcgi://unix:/var/run/php7-fpm-production.sock:", host: "foodsharing.de", referrer: "https://foodsharing.de/?page=fsbetrieb&id=31605"
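(To get a rough feeling for how many requests are hitting this, a quick count over the same log helps - the pattern is just the socket path from the error above:)
$ grep -c 'php7-fpm-production.sock failed' /var/log/nginx/error.log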
Seems like php-fpm is down or not accepting connections anymore. A restart might bring it back, but let's have a look into the logs first to find out what is causing the issue. None of the following sources yielded any hints though:
journalctl -ef -u php7.4-fpm
/var/log/php7.4-fpm.log
/var/www/production/log/php-error.log
Okay, let's restart php-fpm: $ systemctl restart php7.4-fpm
Limits?
It works again! … for a while … then requests take forever … and then 502 errors again. Seems like something is building up. Which limits are we running into here? Some research on that nginx error suggests that we might be hitting a connection backlog limit. So let's temporarily increase all the limits we can find, to see whether anything changes:
$ sysctl -w net.core.somaxconn=60000
$ sysctl -w net.core.netdev_max_backlog=60000
and listen.backlog in /etc/php/7.4/fpm/pool.d/production.conf
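A minimal sketch of that pool change (the concrete value here is an assumption, chosen to mirror the sysctl values above; the kernel caps the effective backlog at net.core.somaxconn anyway):
; /etc/php/7.4/fpm/pool.d/production.conf
; let more pending connections queue up on the FPM socket
listen.backlog = 60000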
After another $ systemctl restart php7.4-fpm
I had the feeling that it might have worked a little bit longer this time before we got the same error again. Further rounds of playing with settings and restarting didn't lead to any more information.
I already suspected that there could be an issue in the PHP code, but I had already deactivated the beta ($ touch /var/www/beta/current/maintenance_mode_on) and there haven't been any recent changes to production.
Okay, but maybe something really is running for a long time somewhere, stacking up open requests?
Eureka!
After activating the slow log (request_slowlog_timeout = 5s in /etc/php/7.4/fpm/pool.d/production.conf), there are a lot of entries in /var/www/production/log/php-slow.log like this one:
[30-Mar-2021 10:32:08]  [pool production] pid 24884
script_filename = /var/www/production/current/restApi.php
[0x00007fae4ec14ef0] curl_exec() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Handler/CurlHandler.php:40
[0x00007fae4ec14e50] __invoke() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Handler/Proxy.php:28
[0x00007fae4ec14da0] GuzzleHttp\Handler\{closure}() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Handler/Proxy.php:51
[0x00007fae4ec14d00] GuzzleHttp\Handler\{closure}() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/PrepareBodyMiddleware.php:66
[0x00007fae4ec14c20] __invoke() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Middleware.php:29
[0x00007fae4ec14b70] GuzzleHttp\{closure}() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/RedirectMiddleware.php:70
[0x00007fae4ec14ad0] __invoke() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Middleware.php:59
[0x00007fae4ec14a40] GuzzleHttp\{closure}() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/HandlerStack.php:71
[0x00007fae4ec149b0] __invoke() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Client.php:351
[0x00007fae4ec148a0] transfer() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Client.php:162
[0x00007fae4ec147c0] requestAsync() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Client.php:182
[0x00007fae4ec14730] request() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Client.php:95
[0x00007fae4ec14660] __call() /var/www/production/releases/42/src/Lib/WebSocketConnection.php:28
[0x00007fae4ec145d0] post() /var/www/production/releases/42/src/Lib/WebSocketConnection.php:37
[0x00007fae4ec14510] sendSock() /var/www/production/releases/42/src/Modules/Bell/BellGateway.php:218
[0x00007fae4ec144a0] updateFoodsaverClient() /var/www/production/releases/42/src/Modules/Bell/BellGateway.php:181
[0x00007fae4ec14400] delBellForFoodsaver() /var/www/production/releases/42/src/RestApi/BellRestController.php:105
[0x00007fae4ec14360] deleteBellAction() /var/www/production/releases/42/vendor/symfony/http-kernel/HttpKernel.php:157
[0x00007fae4ec14280] handleRaw() /var/www/production/releases/42/vendor/symfony/http-kernel/HttpKernel.php:79
[0x00007fae4ec141d0] handle() /var/www/production/releases/42/vendor/symfony/http-kernel/Kernel.php:195
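(For reference, enabling the slow log boils down to two pool directives; the slowlog path matches the log file above and may well have been configured already:)
; /etc/php/7.4/fpm/pool.d/production.conf
request_slowlog_timeout = 5s
slowlog = /var/www/production/log/php-slow.log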
Aaah, apparently the calls to our websocket chatserver take ages. Let's deactivate them for now in src/Lib/WebSocketConnection.php, and tada 🎉 foodsharing.de runs smoothly again!
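The "deactivation" was essentially just stubbing out the call path; a sketch of what that can look like (class and method names are taken from the slow-log trace above, the parameter list and the original body are assumptions):

<?php
// src/Lib/WebSocketConnection.php - temporary hotfix sketch
class WebSocketConnection
{
    public function sendSock($fsId, $app, $method, array $params = [])
    {
        // Hotfix: bail out before any HTTP call to the chatserver,
        // so hanging chatserver requests can no longer block PHP-FPM workers.
        return;

        // original behaviour (roughly): forward the call via $this->post(...)
    }
}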
A manual request confirms that the chatserver request really runs forever:
$ curl -v http://127.0.0.1:1338/users/1/is-online
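(Side note: with curl's --max-time flag such a check gives up after a few seconds instead of hanging along with the chatserver, e.g. $ curl -v --max-time 5 http://127.0.0.1:1338/users/1/is-online)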
Chatserver
A look into its log yields the following:
$ journalctl -efu fs-chatserver
error: Error: This socket has been ended by the other party
at Socket.writeAfterFIN [as write] (net.js:455:14)
at /var/www/production/releases/31/chat/node_modules/tedis/build/core/base.js:83:26
at new Promise (<anonymous>)
at Tedis.Base.command (/var/www/production/releases/31/chat/node_modules/tedis/build/core/base.js:78:16)
at SessionIdProvider.fetchSessionIdsForUser (/var/www/production/releases/31/chat/src/SessionIdProvider.ts:24:43)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at RestController.userIsConnected (/var/www/production/releases/31/chat/src/RestController.ts:26:28) {
code: 'EPIPE'
}
Okay, let's debug this! What is happening in /var/www/production/releases/31/chat/src/SessionIdProvider.ts? Wait, the file doesn't exist? We are already at release 42, but the chatserver is still running on release 31!
Judging from the stack trace, the chatserver keeps writing to a Redis connection that has already been closed on the other end, and never recovers from that. Apparently this might be a known bug in tedis, the Redis JS library we use: https://github.com/silkjs/tedis/issues/49. Because I had no idea how to analyze the state any further, and the error might already be solved in the newest release, I did a $ systemctl restart fs-chatserver and the chatserver responds properly again.
… at least for now. But the fact that a tedis bug could take foodsharing.de down at all involves a whole chain of issues that we should solve soon:
Issues
- Tedis can become unusable: tedis#49
- Chatserver just throws an error to the console, but doesn’t respond to the running http request: foodsharing#1068
- No timeout for the chatserver requests from PHP: foodsharing#1069 (see the sketch after this list)
- No timeout for php requests: dragonfruit-ansible#6
It took the combination of all of these single issues to cause the outage.
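For foodsharing#1069 the fix is conceptually simple; here is a minimal sketch with Guzzle (the base URI, endpoint and timeout values are assumptions, not the final ones):

<?php
// Sketch: talk to the chatserver with hard timeouts, so a hanging
// chatserver can no longer tie up PHP-FPM workers indefinitely.
use GuzzleHttp\Client;
use GuzzleHttp\Exception\GuzzleException;

require __DIR__ . '/vendor/autoload.php';

$client = new Client([
    'base_uri' => 'http://127.0.0.1:1338',
    'connect_timeout' => 1.0, // seconds to establish the connection
    'timeout' => 2.0,         // seconds for the whole request
]);

try {
    $response = $client->get('/users/1/is-online');
    $body = (string) $response->getBody(); // response format assumed
} catch (GuzzleException $e) {
    // Treat a slow or broken chatserver as "not online" instead of blocking.
    $body = 'false';
}

On the FPM side, request_terminate_timeout would be the matching safety net for dragonfruit-ansible#6.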
Additional findings
- No monitoring for the chatserver: dragonfruit-ansible#7
- Chatserver should be restarted after deployments: foodsharing#1070
Thanks to Ansible, cleaning up all the mess from fiddling with the settings is quite an easy task, but I'd rather run the playbook tonight - who knows what unexpected side effects it will have :D