Bug - Fixed Java Bug: On some versions of Java, weird things happen to strings with special characters

Recently, when using OpenJDK 17.0.3, I ran into the following error:

Internal exception: Wrapped net.sourceforge.kolmafia.textui.ScriptException: Bad item value: Our Daily Candlesâ�¢ order form (file:/C:/Users/shhhhh/Desktop/mafia/scripts/phccs/phccs_gash.js#3387)

That code was compiled from this TypeScript code, which previously ran without incident. The line in question is here; it runs as part of the main function located here.

Rolling my Java back to JDK 17+35 made the error go away, so the issue seems to lie somewhere in the handling of the ™ special character.

When I was still using 17.0.3, I ran some quick tests in the CLI, and came up short of any answer:

> js toItem("Our Daily Candles™ order form")
Returned: Our Daily Candles™ order form

> js Item.get("Our Daily Candles™ order form")
Returned: Our Daily Candles™ order form

I suspect that it has something to do with saving and later retrieving the string, but have been unable to track down the issue.

This very well may be a bug report that belongs in Rhino, or in Adoptium or what have you, but it feels like mafia is for sure the first place to report it.
 

MCroft

Developer
Staff member
The 17.0.3 release notes talk about changes to xpath, and the script uses xpath, so if Rhino relies on jaxb for xpath, I suspect you're correct that it lies with Oracle, Adoptium, or Mozilla. I don't see anything specific in the Rhino GitHub project bug list on this. I think they (and we) would like a further reduced test case to get at the root of the problem.

I don't know that KoLmafia can do much for you, but I'd suggest starting by printing out the two values on either side of the !== operator and seeing which one changed between 17.0.2 and 17.0.3.

JavaScript:
  if (eudora && (0,external_kolmafia_.eudoraItem)().name !== eudora) {
    var eudoraNumber = 1 + eudorae.indexOf(eudora);

    if (!(0,external_kolmafia_.xpath)((0,external_kolmafia_.visitUrl)("account.php?tab=correspondence"), "//select[@name=\"whichpenpal\"]/option/@value").includes(eudoraNumber.toString()) && throwOnFail) {
      throw new Error("I'm sorry buddy, but you don't seem to be subscribed to ".concat(eudora, ". Which makes it REALLY hard to correspond with them."));
    } else {
      (0,external_kolmafia_.visitUrl)("account.php?actions[]=whichpenpal&whichpenpal=".concat(eudoraNumber, "&action=Update"), true);
    }

    if ((0,external_kolmafia_.eudoraItem)() !== external_kolmafia_.Item.get(eudora) && throwOnFail) {
      throw new Error("We really thought we changed your eudora to a ".concat(eudora, ", but Mafia is saying otherwise."));
    }
  }
 
The fact that I got a "Bad item value" exception suggests that the issue here is from the Item.get() call, rather than from the eudoraItem() function. I similarly didn't throw an error from the xpath itself, which suggests that the xpath went alright.
 

Veracity

Developer
Staff member
The item name contains an HTML character entity for the (tm) symbol - not an actual character in (some) character encoding.

ASH to_item(string) calls ItemDatabase.getItemId(String) which calls StringUtilities.getCanonicalName(String) which calls StringUtilities.getEntityEncode(String utf8String), which will convert UTF-8 characters into HTML entities.

It actually does this:

Code:
      if (utf8String.contains("&") && utf8String.contains(";")) {
        entityString = CharacterEntities.escape(CharacterEntities.unescape(utf8String));
      } else {
        entityString = CharacterEntities.escape(utf8String);
      }

... which makes sure that the "&" character in existing HTML encoded strings doesn't, itself, get encoded. :)
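To illustrate why the unescape-then-escape step matters, here is a minimal sketch. The `escape` and `unescape` methods below are hypothetical stand-ins that handle only `&` and ™; they are not KoLmafia's `CharacterEntities` implementation, and only the branching logic mirrors the snippet above.

```java
final class EntityEncodeSketch {
    // Hypothetical stand-in for CharacterEntities.escape: handles only '&' and the TM sign.
    // '&' is replaced first so the '&' introduced by "&trade;" is not escaped again.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("\u2122", "&trade;");
    }

    // Hypothetical stand-in for CharacterEntities.unescape: the inverse of escape above.
    static String unescape(String s) {
        return s.replace("&trade;", "\u2122").replace("&amp;", "&");
    }

    // Same branching as the quoted snippet: strings that already look entity-encoded
    // are decoded first, so their '&' is not encoded a second time.
    static String entityEncode(String s) {
        if (s.contains("&") && s.contains(";")) {
            return escape(unescape(s));
        }
        return escape(s);
    }

    public static void main(String[] args) {
        String raw = "Our Daily Candles\u2122 order form";
        String encoded = "Our Daily Candles&trade; order form";
        System.out.println(entityEncode(raw));     // raw character becomes an entity
        System.out.println(entityEncode(encoded)); // already-encoded input is left unchanged
        System.out.println(escape(encoded));       // naive escape double-encodes the '&'
    }
}
```

With these stand-ins, both the raw and the already-encoded name end up in the same entity form, while calling `escape` directly on the encoded name turns `&trade;` into `&amp;trade;`.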

I guess my question is, what character encoding is the (tm) symbol in your script? If it is not UTF-8, we won't parse it.
 
Veracity said:
"I guess my question is, what character encoding is the (tm) symbol in your script? If it is not UTF-8, we won't parse it."
You know, I'm really not sure. I'll start looking into that.

Do you have any idea why mafia's ability to roll with the punch here would change between java versions?
 

Ryo_Sangnoir

Developer
Staff member
There's something odd with special characters even in 17+35.

Code:
> ash string x = "™"; print(x)

�
Returned:    void

> js var x = "™"; print(x)

™
Returned:    null

Also, that encoding looks like what KoL does: if you wish "to encode ™", the game states that you wished "to encode â�¢".

Possibly none of this is related.
 

Veracity

Developer
Staff member
encoding.ash:
Code:
string x = "™"; print(x)
yields
Code:
> ash string x = "™"; print(x)

™
Returned: void

> encoding.ash

™
This is on a Mac. I think that the input charset as seen by the gCLI is OS dependent.
 

Ryo_Sangnoir

Developer
Staff member
Mine is Windows, and indeed â�¢ is what you get when you encode ™ using UTF-8 and decode using latin-1.
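The round trip can be reproduced in plain Java. This is a small sketch using only the standard library; the class and method names are made up for illustration. The ™ sign is written as `\u2122` so the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeSketch {
    // Encode the text as UTF-8, then (incorrectly) decode those bytes as ISO-8859-1 (latin-1).
    static String utf8ReadAsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = utf8ReadAsLatin1("\u2122");
        // Print code points rather than glyphs, since the middle character is not printable.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The three UTF-8 bytes of ™ (E2 84 A2) come back as U+00E2 (â), U+0084 (a control character, usually rendered as a placeholder), and U+00A2 (¢).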
 

Irrat

Member

Yes, in a few places in the source, strings are not read using UTF-8.

This doesn't happen for js, however, so I'm not sure it's related to this issue.

Edit: Just realized there's a PR open regarding this.
 

Rinn

Developer
This can be minimally reproduced with a .js script that contains

Code:
module.exports.main = function() { require("kolmafia").Item.get("Our Daily Candles™ order form") }
 

Irrat

Member
The issue didn't exist for me until I switched to https://adoptium.net/
Now I can confirm I'm getting the issue.

The issue is that Adoptium has registered the .js MIME type as "text/javascript", while other JDKs see it as "unknown/other".
When Rhino sees the "text" prefix, it uses "8859_1".
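As a rough sketch of the decision described in this thread (this is not Rhino's actual code, and `chooseCharset` is a made-up name): an explicit `charset=` parameter wins, a `text/` prefix without one means 8859_1, and everything else is assumed here to fall back to UTF-8, which would explain why the script works on JDKs that don't report a text type for .js files.

```java
final class CharsetChoiceSketch {
    // Hypothetical helper illustrating the behaviour described above; not Rhino's API.
    static String chooseCharset(String contentType) {
        if (contentType != null) {
            // An explicit "charset=..." parameter takes priority.
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    return p.substring(8).trim();
                }
            }
            // No explicit charset: text/* types are treated as latin-1.
            if (contentType.startsWith("text/")) {
                return "8859_1";
            }
        }
        // Assumption: anything else is read as UTF-8.
        return "utf-8";
    }

    public static void main(String[] args) {
        System.out.println(chooseCharset("text/javascript"));
        System.out.println(chooseCharset("unknown/other"));
        System.out.println(chooseCharset("text/javascript; charset=utf-8"));
    }
}
```

Under these assumptions, only the bare "text/javascript" case ends up with 8859_1, which matches the garbled ™ seen in the error message.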

 

Ryo_Sangnoir

Developer
Staff member
Nice sleuthing. Where did you find out that Adoptium registered the .js MIME type? Could we ask them to use "application/javascript" instead?

Rhino's behaviour is also arguably outdated: the code has been in there since 2010, and I think that when no encoding is given explicitly, UTF-8 is the more likely one nowadays. I see there's a check for an explicit encoding: is there any way we can provide one?
 

Irrat

Member

Probably these two files.

The URLConnection is a FileURLConnection, which is a sun class and protected. We can't access that.
The MIME type is declared inside FileURLConnection, and the methods Rhino uses to read the encoding are static.

Unless we want to copy/paste a large Rhino method just to change one line, there aren't many options.
I'm not doing anything further on this, just going to settle for my crappy workaround.
 

heeheehee

Developer
Staff member
Ryo_Sangnoir said:
"I see there's a check for an explicit encoding: is there any way we can provide one?"
ParsedContentType in Rhino looks for charset=... in the MIME type. If we can't control the MIME type, then we can't control the explicit encoding.

A longer-term solution might be to send a PR upstream to Rhino to stop treating these as 8859-1. Short-term it makes sense to put in some workaround as in comment #12.
 

Irrat

Member
Oh, the system property completely slipped my mind. I blame lack of sleep.
That by itself would work as a fix, assuming we're able to hook it in early enough.

I wonder if it's possible to modify the build process so that the property is set when the jar is invoked.
 

Ryo_Sangnoir

Developer
Staff member
Setting it at the same time as the other properties in KoLmafia.java is early enough: I tried it and it worked.

Don't think we'll be able to use the content-type.properties from before they changed, though -- I checked the repo and it's GPLv2, while we're BSD 3-clause. A blank file will fix our issue and hopefully not break anything.
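For reference, a minimal sketch of what such a workaround could look like. The thread does not name the system property; this assumes it is the JDK's `content.types.user.table`, which `java.net.URLConnection` documents as the location of a user-specific MIME table. The class and method names are hypothetical, and this is not KoLmafia's actual code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class MimeTableWorkaroundSketch {
    // Assumption: this is the property meant in the thread; it is not named there.
    static final String USER_TABLE_PROPERTY = "content.types.user.table";

    // Create an empty properties file and point the JDK's user MIME table at it.
    // This has to run before the first file: URL is opened, since the table is loaded lazily.
    static Path installBlankMimeTable() {
        try {
            Path blank = Files.createTempFile("content-types", ".properties");
            blank.toFile().deleteOnExit();
            System.setProperty(USER_TABLE_PROPERTY, blank.toString());
            return blank;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path table = installBlankMimeTable();
        System.out.println(USER_TABLE_PROPERTY + " = " + table);
    }
}
```

The idea is that with no entry for .js in the table, the file would no longer be reported as a text type. Whether a blank table has side effects elsewhere would need checking.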
 