Today's outage: post mortem
This is a quite technical walkthrough of the process of finding the cause of today's outage. It is aimed at other server admins, so that we can learn from each other.
Breakfast
It's 9 o'clock in the morning, I've got my porridge and a coffee ☕, ready to watch the Tagesschau, but my laptop greets me with "foodsharing is down!!1!1!" instead.
Dammit! But okey-dokey -> $ ssh dragonfruit.foodsharing.de
First step whenever we get a 502 Bad Gateway error: what does nginx say?
$ tail -F /var/log/nginx/error.log
2021/03/30 06:40:12 [error] 1625#1625: *914973060 connect() to unix:/var/run/php7-fpm-production.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 2003:*****:29bd, server: dragonfruit.foodsharing.de, request: "GET /api/stores/31605/pickups HTTP/2.0", upstream: "fastcgi://unix:/var/run/php7-fpm-production.sock:", host: "foodsharing.de", referrer: "https://foodsharing.de/?page=fsbetrieb&id=31605"
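(To get a rough feeling for how many requests are hitting this, a quick count over the same log helps - the pattern is just the socket path from the error above:)
$ grep -c 'php7-fpm-production.sock failed' /var/log/nginx/error.log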
Seems like php-fpm is down or not accepting connections anymore. A restart might bring it back, but let's have a look into the logs first to find out what is causing the issue. None of the following sources yielded any hints though:
journalctl -ef -u php7.4-fpm
/var/log/php7.4-fpm.log
/var/www/production/log/php-error.log
Okay, let's restart php-fpm: $ systemctl restart php7.4-fpm
Limits?
It works again! … for a while … then requests take forever … and then 502 errors again. Seems like something is building up. Which limits are we running into here? Some research on that nginx error suggests that we might be hitting a connection backlog limit. So let's temporarily increase all the limits we can find, to see whether anything changes:
$ sysctl -w net.core.somaxconn=60000
$ sysctl -w net.core.netdev_max_backlog=60000
and listen.backlog in /etc/php/7.4/fpm/pool.d/production.conf
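A minimal sketch of that pool change (the concrete value here is an assumption, chosen to mirror the sysctl values above; the kernel caps the effective backlog at net.core.somaxconn anyway):
; /etc/php/7.4/fpm/pool.d/production.conf
; let more pending connections queue up on the FPM socket
listen.backlog = 60000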
After another $ systemctl restart php7.4-fpm
I had the feeling that it might have worked a little bit longer this time before we got the same error again. Further rounds of playing with settings and restarting didn't lead to any more information.
I already suspected that there could be an issue in the PHP code, but I had already deactivated the beta ($ touch /var/www/beta/current/maintenance_mode_on) and there haven't been any recent changes to production.
Okay, but maybe something really is running for a long time somewhere, stacking up open requests?
Eureka!
After activating the slow log (request_slowlog_timeout = 5s in /etc/php/7.4/fpm/pool.d/production.conf), there are a lot of entries in /var/www/production/log/php-slow.log like this one:
[30-Mar-2021 10:32:08]  [pool production] pid 24884
script_filename = /var/www/production/current/restApi.php
[0x00007fae4ec14ef0] curl_exec() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Handler/CurlHandler.php:40
[0x00007fae4ec14e50] __invoke() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Handler/Proxy.php:28
[0x00007fae4ec14da0] GuzzleHttp\Handler\{closure}() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Handler/Proxy.php:51
[0x00007fae4ec14d00] GuzzleHttp\Handler\{closure}() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/PrepareBodyMiddleware.php:66
[0x00007fae4ec14c20] __invoke() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Middleware.php:29
[0x00007fae4ec14b70] GuzzleHttp\{closure}() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/RedirectMiddleware.php:70
[0x00007fae4ec14ad0] __invoke() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Middleware.php:59
[0x00007fae4ec14a40] GuzzleHttp\{closure}() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/HandlerStack.php:71
[0x00007fae4ec149b0] __invoke() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Client.php:351
[0x00007fae4ec148a0] transfer() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Client.php:162
[0x00007fae4ec147c0] requestAsync() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Client.php:182
[0x00007fae4ec14730] request() /var/www/production/releases/42/vendor/guzzlehttp/guzzle/src/Client.php:95
[0x00007fae4ec14660] __call() /var/www/production/releases/42/src/Lib/WebSocketConnection.php:28
[0x00007fae4ec145d0] post() /var/www/production/releases/42/src/Lib/WebSocketConnection.php:37
[0x00007fae4ec14510] sendSock() /var/www/production/releases/42/src/Modules/Bell/BellGateway.php:218
[0x00007fae4ec144a0] updateFoodsaverClient() /var/www/production/releases/42/src/Modules/Bell/BellGateway.php:181
[0x00007fae4ec14400] delBellForFoodsaver() /var/www/production/releases/42/src/RestApi/BellRestController.php:105
[0x00007fae4ec14360] deleteBellAction() /var/www/production/releases/42/vendor/symfony/http-kernel/HttpKernel.php:157
[0x00007fae4ec14280] handleRaw() /var/www/production/releases/42/vendor/symfony/http-kernel/HttpKernel.php:79
[0x00007fae4ec141d0] handle() /var/www/production/releases/42/vendor/symfony/http-kernel/Kernel.php:195
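(For reference, enabling the slow log boils down to two pool directives; the slowlog path matches the log file above and may well have been configured already:)
; /etc/php/7.4/fpm/pool.d/production.conf
request_slowlog_timeout = 5s
slowlog = /var/www/production/log/php-slow.log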
Aaah, apparently the calls to our websocket chatserver take ages. Let's deactivate them for now in src/Lib/WebSocketConnection.php, and tada 🎉 foodsharing.de runs smoothly again!
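The "deactivation" was essentially just stubbing out the call path; a sketch of what that can look like (class and method names are taken from the slow-log trace above, the parameter list and the original body are assumptions):

<?php
// src/Lib/WebSocketConnection.php - temporary hotfix sketch
class WebSocketConnection
{
    public function sendSock($fsId, $app, $method, array $params = [])
    {
        // Hotfix: bail out before any HTTP call to the chatserver,
        // so hanging chatserver requests can no longer block PHP-FPM workers.
        return;

        // original behaviour (roughly): forward the call via $this->post(...)
    }
}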
A manual request confirms that the chatserver request really runs forever:
$ curl -v http://127.0.0.1:1338/users/1/is-online
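(Side note: with curl's --max-time flag such a check gives up after a few seconds instead of hanging along with the chatserver, e.g. $ curl -v --max-time 5 http://127.0.0.1:1338/users/1/is-online)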
Chatserver
A look into its log yields the following:
$ journalctl -efu fs-chatserver
error: Error: This socket has been ended by the other party
at Socket.writeAfterFIN [as write] (net.js:455:14)
at /var/www/production/releases/31/chat/node_modules/tedis/build/core/base.js:83:26
at new Promise (<anonymous>)
at Tedis.Base.command (/var/www/production/releases/31/chat/node_modules/tedis/build/core/base.js:78:16)
at SessionIdProvider.fetchSessionIdsForUser (/var/www/production/releases/31/chat/src/SessionIdProvider.ts:24:43)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at RestController.userIsConnected (/var/www/production/releases/31/chat/src/RestController.ts:26:28) {
code: 'EPIPE'
}
Okay, let's debug this! What is happening in /var/www/production/releases/31/chat/src/SessionIdProvider.ts? Wait, the file doesn't exist? We are already at release 42, but the chatserver is still running on release 31!
Judging from the stack trace, the chatserver keeps writing to a Redis connection that has already been closed on the other end, and never recovers from that. Apparently this might be a known bug in tedis, the Redis JS library we use: https://github.com/silkjs/tedis/issues/49. Because I had no idea how to analyze the state any further, and the error might already be solved in the newest release, I did a $ systemctl restart fs-chatserver and the chatserver responds properly again.
… at least for now. But the fact that a tedis bug could take foodsharing.de down at all involves a whole chain of issues that we should solve soon:
Issues
- Tedis can become unusable: tedis#49
- Chatserver just throws an error to the console, but doesn’t respond to the running http request: foodsharing#1068
- No timeout for the chatserver requests from PHP: foodsharing#1069 (see the sketch after this list)
- No timeout for php requests: dragonfruit-ansible#6
It took the combination of all of these single issues to cause the outage.
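For foodsharing#1069 the fix is conceptually simple; here is a minimal sketch with Guzzle (the base URI, endpoint and timeout values are assumptions, not the final ones):

<?php
// Sketch: talk to the chatserver with hard timeouts, so a hanging
// chatserver can no longer tie up PHP-FPM workers indefinitely.
use GuzzleHttp\Client;
use GuzzleHttp\Exception\GuzzleException;

require __DIR__ . '/vendor/autoload.php';

$client = new Client([
    'base_uri' => 'http://127.0.0.1:1338',
    'connect_timeout' => 1.0, // seconds to establish the connection
    'timeout' => 2.0,         // seconds for the whole request
]);

try {
    $response = $client->get('/users/1/is-online');
    $body = (string) $response->getBody(); // response format assumed
} catch (GuzzleException $e) {
    // Treat a slow or broken chatserver as "not online" instead of blocking.
    $body = 'false';
}

On the FPM side, request_terminate_timeout would be the matching safety net for dragonfruit-ansible#6.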
Additional findings
- No monitoring for the chatserver: dragonfruit-ansible#7
- Chatserver should be restarted after deployments: foodsharing#1070
Thanks to Ansible, cleaning up all the mess from fiddling with the settings is quite an easy task, but I'd rather run the playbook tonight - who knows what unexpected side effects it will have :D