Bug: IOException almost daily during manual adventuring

VladYvhuce

This is one I have no idea how to trigger, and since I don't know how to read debug logs, I may not be able to capture it in a log small enough to post to the forums. But it seems to happen at some point on any day that I do a bunch of manual adventuring. In the browser, it shows up simply as the adventuring pane not loading. It can happen during combat or noncombat; this time it happened in the middle of a choice adventure.

This is what shows up in Mafia's graphical CLI tab:
Code:
IOException retrieving server reply (choice.php?pwd&whichchoice=923&option=1) -- /192.168.0.52:52425: GOAWAY received.
 
That message is usually due to a server-side error.

Agreed. But there was one place where, if I understood what I was reading, it was claimed this could be a client error: the client calculated the header size as if the protocol were HTTP/1, but the request was actually sent as HTTP/2. Since HTTP/2 compresses that portion of the header, the sizes differ between the two protocols, and the server returned GOAWAY because the header was not the size it claimed to be.

I suppose someone who knows protocols better than I do, and is motivated, could check the code and decide whether that applies and whether something is being built for the wrong protocol.
 

If that were the case, then ~every request would result in a GOAWAY.

We do not have our own hand-rolled HTTP codec, as we're relying on Java's standard library implementation. As such, I'm very skeptical of the notion that Mafia is doing something that's standards-noncompliant.
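
For context, a minimal sketch of what that stdlib usage looks like (the URL is just for illustration):
Code:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StdlibHttpDemo {
    public static void main(String[] args) throws Exception {
        // The JDK client defaults to HTTP/2 and transparently falls back
        // to HTTP/1.1 when the server doesn't support it.
        HttpClient client = HttpClient.newBuilder()
            .version(HttpClient.Version.HTTP_2)
            .build();

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://www.kingdomofloathing.com/game.php"))
            .GET()
            .build();

        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());

        // response.version() reports which protocol was actually negotiated.
        System.out.println(response.version() + " " + response.statusCode());
    }
}
ALPN negotiation, HTTP/2 framing, and HPACK header compression all happen inside the JDK here, so a client-side framing bug of the sort described above would have to live in the standard library itself.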

We have seen similar reports in the past, which were nominally attributed to AWS ALB. I have no reason to believe this case is different. https://kolmafia.us/threads/ioexception-goaway-received.27397/

Per observations in that thread (as well as updates to the GitHub issue I linked), it seems that AWS ALB will send a GOAWAY and a TCP FIN after 10k requests on the same connection, or if any of the headers are too big (> 8 KB) post-compression. With that new information in hand, @Ryo_Sangnoir's suggestion to recreate the HttpClient every X requests (for some value < 10000) seems a lot more sensible.

Also, for what it's worth, if it is actually initiated by AWS ALB, then it's likely safe to retry, since said request would not be forwarded to the backend (i.e. KoL's actual webserver) at all.
 
That said, r26388 implemented the nominal "reset the HttpClient every X requests" (which was later refined in r27798). I would be a little surprised if we're running into the headers-too-big problem. I'd be more inclined to believe that something isn't actually using the ResettingHttpClient.
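
For anyone curious, the shape of that approach is roughly as follows. This is a sketch of the idea only, not the actual r26388/r27798 code; the class name, counter, and limit are all illustrative:
Code:
import java.net.http.HttpClient;

// Sketch of the "reset the HttpClient every X requests" idea. Discarding
// the client tears down its pooled HTTP/2 connections before the server's
// per-connection request cap (ALB: 10k, nginx: 1000 by default) is hit.
public class ResettingClientSketch {
    private static final int REQUEST_LIMIT = 900; // stay under nginx's default of 1000

    private HttpClient client = newClient();
    private int requestCount = 0;

    private static HttpClient newClient() {
        return HttpClient.newBuilder().version(HttpClient.Version.HTTP_2).build();
    }

    // Hand out the shared client, replacing it once the limit is reached;
    // the old client (and its connections) becomes garbage once any
    // in-flight requests on it complete.
    public synchronized HttpClient getClient() {
        if (++requestCount > REQUEST_LIMIT) {
            client = newClient();
            requestCount = 1;
        }
        return client;
    }
}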
 
From what I can tell, the request /is/ forwarded to KoL's server, which then responds with GOAWAY. Last I checked, I thought it might come from nginx: I believe the default config sends GOAWAY after a large number of requests on the same connection.
 
KoL is apparently using both nginx/1.18.0 and nginx/1.8.1, per the returned headers (the AWSALB cookie pins you to a particular backend). Perhaps they've recently started a migration, which would explain why this started happening again. (HTTP/2 support in nginx was added in 1.9.5.)
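
If you want to check which backend you're pinned to without pulling a full debug log, something along these lines should work (a sketch; whether this particular request carries the AWSALB set-cookie is an assumption on my part):
Code:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackendCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://www.kingdomofloathing.com/game.php"))
            .GET()
            .build();
        HttpResponse<Void> response =
            client.send(request, HttpResponse.BodyHandlers.discarding());

        // "server" names the nginx version on whichever backend you hit;
        // "set-cookie" should include the AWSALB cookie doing the pinning.
        response.headers().firstValue("server")
            .ifPresent(s -> System.out.println("server: " + s));
        response.headers().allValues("set-cookie")
            .forEach(c -> System.out.println("set-cookie: " + c));
    }
}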

https://nginx.org/en/docs/http/ngx_http_v2_module.html#http2_max_requests defaults to 1000; hitting that limit also results in a GOAWAY. We likely need to lower our threshold; nominally, 900 should be fine. The overhead of setting up a new HTTP/2 connection (~3 round trips, since it looks like my curl request is being downgraded to TLS 1.2) should be fairly small whether amortized over 100 or 1000 requests.
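
To put rough numbers on that (assuming a ~50 ms round trip, which is a guess rather than a measurement): three round trips is ~150 ms of setup, which amortizes to under 0.2 ms per request at a limit of 900.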

(Per the linked docs, that particular directive is merged into keepalive_requests in newer versions of nginx; between 1.19.7 and 1.19.10 the default drops briefly to 100.)

Even so, if the GOAWAY comes from nginx, it should be safe to retry since it never actually makes it to its backend, namely the PHP server which handles the business logic.
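
If we do lean on that, a retry might look something like the following sketch; the method name is made up, matching on the exception message is admittedly fragile, and deciding which requests are safe to resend is still on us:
Code:
import java.io.IOException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class GoAwayRetry {
    // Send a request, retrying once if the connection died to a GOAWAY.
    // Safe under the assumption above: the proxy never forwarded the
    // request to the backend, so resending can't double-apply it.
    public static HttpResponse<String> sendWithRetry(
            HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString());
        } catch (IOException e) {
            String message = String.valueOf(e.getMessage());
            if (!message.contains("GOAWAY")) {
                throw e; // some other I/O failure; don't mask it
            }
            // The old connection is gone; the client opens a fresh one.
            return client.send(request, HttpResponse.BodyHandlers.ofString());
        }
    }
}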
 
This one happened mid-fight. Again, manual adventuring. This is the graphical CLI log. The fight results never showed up in the relay browser.
Code:
[535] The Castle in the Clouds in the Sky (Basement)
Preference lastEncounter changed from Alphabet Giant to Neckbeard Giant
Encounter: Neckbeard Giant
Preference _lastCombatStarted changed from 20241013183846 to 20241013183907
Round 0: Rick Tyger wins initiative!
Round 1: Your toy train moves ahead to the Coal Hopper. Your train takes on coal to power the next stop.
Preference trainsetPosition changed from 457 to 458
Preference _currentDartboard changed from 7513:butt,7514:head,7515:torso,7516:arm,7517:leg to 7513:torso,7514:leg,7515:arm,7516:butt,7517:head
Preference _dartsLeft changed from 2 to 3
Round 1: Rick Tyger casts INFECTIOUS BITE!
Round 2: Jed claws, claws, and bites your opponent for 14 (+16) damage.
Round 2: Neckbeard Giant takes 30 damage.
Round 2: Neckbeard Giant takes 16 damage.
Round 2: Rick Tyger casts SURPRISINGLY SWEET STAB!
IOException retrieving server reply (fight.php?action=skill&whichskill=7489) -- /192.168.0.52:52178: GOAWAY received.
Preference _concoctionDatabaseRefreshes changed from 366 to 367
 
Hm. Next time this happens, mind grabbing a debug log at the start of your session, then again right after it occurs?

Specifically,
Code:
debug on; text game.php; debug off
should be enough to grab some response headers, which'll tell us which version of nginx is issuing the GOAWAY.

Code:
...
Retrieved: https://www.kingdomofloathing.com/game.php
11 header fields
Field: :status = [200]
Field: cache-control = [no-store, no-cache, must-revalidate, post-check=0, pre-check=0]
Field: content-encoding = [gzip]
Field: content-type = [text/html; charset=UTF-8]
Field: date = [Mon, 14 Oct 2024 00:34:02 GMT]
Field: expires = [Thu, 19 Nov 1981 08:52:00 GMT]
Field: pragma = [no-cache]
Field: server = [nginx/1.18.0]
...

We probably could just set HTTP_CLIENT_REQUEST_LIMIT to 700 and call it a day... although I couldn't seem to get a GOAWAY when I tried to reproduce, so there might be something else at play.
 
Can you explain a bit more clearly? Do you want me to try logging from the beginning until it occurs and hope the log isn't too big? Or just turn it on and off? Or turn it on, do something, then turn it off?
 
The command I specifically listed should open a log, make a single (small) request, and then turn off the log, so you can run it both at the start of your session and right after encountering the problem.
 