Feature ASH language server features

Thanks, I figured that token examination existed (and I agree that it makes sense to include in this initial patch, as it is the API you want to expose). That looks like what I was asking for; I can augment with whatever test cases I had in mind (e.g. multiline comments, which were missing from the existing set of tests).

My preference is to keep errorText separate from resultTokens, as that better conveys intent. I'll take a look in a bit.
 
Some initial observations of existing implementation quirks (that are preserved), from adding some extra test cases:

Map literal: The parsing of $item[foo] is {"$", "item", "[foo", "]"}. That's a little bit surprising, given int[class] parses as {"int", "[", "class", "]"}, but I suppose it's mentioned in existing tests that you've added.

Array literal: We check for too many specified elements, but not too few (int[10] {1, 2, 3}).

Lots of places multiline comments don't work that they probably ought to (e.g. inside plural typed constants).


Some implementation-specific comments (I can continue working through these, just thought I'd post a brief update on where I am):

- It's a little strange that you have Token.equals(String) defined but not actually Token.equals(Token) (or the usual Token.equals(Object)).

- I personally think Line.substring() should return empty string if Line.content is null, for consistency with how I normally think about substring() -- namely, that it always returns a non-null String.

Token substring methods are untested. I see a reference in parseAssignment, so maybe I'll add some tests for that too. I don't see any references to the one-argument version, though -- is that coming later, or is it just for consistency with existing String methods?
 
Token substring methods are untested. I see a reference in parseAssignment, so maybe I'll add some tests for that too. I don't see any references to the one-argument version, though -- is that coming later, or is it just for consistency with existing String methods?
Unused; for consistency and/or in case it's needed in the future.
 
Incorrect. If that was an aggregate type, no rewinding happens.
The rewinding happens if the type is a simple type, like int, boolean, string, buffer, location, bounty, etc.
It most likely is always only going to be a single "word".
This code is weird.

The only non-erroring test case I could come up with to exercise rewinding was creating a typedef with the same name as a variable, then using that variable in a parenthesized expression.

In general, we can't get very far in Parser.parseValue except from places like parseCommand that don't normally expect types. Even an ordinary aggregate literal expression ("int[5] { 1, 2, 3, 4, 5}") follows a completely different codepath; ditto with an assignment or variable initialization. But suddenly when I wrap it in parentheses, Parser decides to use this path. (I then looked at the blame to see when this crazy logic was added... and saw it was part of the aggregate literals change I added 5 years ago...)

(I think this is an opportunity in the future for some cleanup / refactoring for how this logic works. Not now, though.)


Anyways. I'm getting close to a place where I'm happy with test coverage, after some amount of additional tweaking.

One case that comes to mind is multiline comment handling in nextToken. Apparently there's support for something like
Java:
-/*negative*/1.23;
. I'm not sure why you'd ever want to do that, though, although it is valid in other languages like C++ (probably because it's applying the unary minus in that case).

Even weirder is the quiet support for
Java:
1./*decimal*/23;
This isn't valid in other languages...

I'm noting there are a lot of similarities in Parser.currentToken() vs Parser.nextToken(), but one notable difference is that currentToken creates explicit tokens for skipped comments (even though neither method returns them).

Token() doesn't have the same indexOf trick that we use in Line().

Either way, I'll try to post a patch later today with my edits and tests, and then hopefully we can get this behemoth merged if you think everything checks out.
 
(There's just one more place where I still want some time to add some more test coverage: parseString as it applies to templates. I don't know if I fully understand all the changes yet, and adding tests is a useful way of thinking about examples that exercise the corner cases.)
 
For the weird numeric handling: I'd be in favor of changing readIntegerToken to readNumericToken to handle these more consistently, although I suspect it would break method invocations like "1.to_string()".

It's kinda weird that "1.23" creates three tokens! There's also some handling for "1." that never gets called because we run into "Record expected" errors first.
 
I have two diffs if anyone wants to take a look:

- ParserWithManyTests.patch is basically your changes, plus my tweaks, diffed against head.
- ParserPatch.patch is basically the diff of my tweaks vs your changes. Mostly minor changes in Parser (wording, formatting, the occasional logic change). And... 48 test cases, by my count (plus the ones you had provided earlier but commented out. And one new one that's commented out because it doesn't work).

I think this is mostly consistent with your changes. If anyone wants to look at this, I'd recommend just checking out the ParserTest changes to see what the resulting tokens are, and if those make sense. (I added judicious line wrapping, so hopefully the long lists of tokens are easier to read.)

If everything checks out, then I'll commit this later tonight.

edit: also, if anyone is curious regarding the final number for Parser.java coverage, we're at ~60%. I didn't particularly try to target things that weren't touched by this patch, so there's lots of room to improve if people care about, um, working through switch statement and record parsing.
 

Attachments

Here is the null checks patch. It doesn't conflict with the clearCurrentToken patch in any way, so you should be free to look at/apply them in any order you want.

Here are the few (type of) cases it contains, with what was done for each:
  1. A null check that was genuinely only there as a NullPointerException prevention
    Example:
    Java:
    if ( this.currentToken() != null && this.currentToken().equals( "!" ) )
  2. A null check that acted as a "end of file detection", but was also clearly in a context where a name is expected
    Example:
    Java:
    Token typeName = this.currentToken();
     
    if ( typeName == null )
    {
        throw this.parseException( "Type name expected" );
    }
    
    if ( !this.parseIdentifier( typeName.content ) )
    {
        throw this.parseException( "Invalid type name '" + typeName + "'" );
    }
  3. A null check in a "while" loop, that acted as an infinite loop prevention
    Example:
    Java:
    while ( this.currentToken() != null && !this.currentToken().equals( ")" ) )
    {
        <continue>;
    }
  4. Forgotten null checks (on types other than Tokens) that currently resulted in NPEs.
For each, the outcome was, respectively:
  1. Deleted
  2. Replaced with <token>.equals( ";" ), which catches the end of line, as well as including actual ";"s, as users in this situation most likely didn't put a name at all, rather than trying to use ";" as a name.
  3. Replaced with this.atEndOfFile(), which *actually* checks for the end of the file.
  4. Added a null check.
 

Attachments

Ah, good, you added back the atEndOfFile function that I threw away because it was unused.

(One of these changes is technically not a null check, but a String bounds check.)

These look reasonable, I'll likely commit them after work with some minor changes (e.g. test case descriptions) once I've had a bit more time to look at them more closely.


Tangentially related: something that's on my mind is memory footprint (as a result of a recent bug report). A few things immediately come to mind:

- Do we need to continue storing all the Line contents after a successful parse? (asking as a bigger picture thing, since I don't know if the language server and/or protocol relies on providing full context, or just indices)

- We could trade off speed of accessing tokens in exchange for memory by not storing the token contents directly, but rather storing index + length, then using an implicit substring() call instead of .content. We'd have to be careful not to weaken one of our safety guarantees in the process (that Token.content != null).
 
- Do we need to continue storing all the Line contents after a successful parse? (asking as a bigger picture thing, since I don't know if the language server and/or protocol relies on providing full context, or just indices)
Since I know java language clients let you peek source code, I'm guessing there's a feature that asks for server-side context, but I haven't actually looked at *every* feature of the LSP in detail yet, so I don't quite know.
Either way, the Parser.rewindBackTo() method pretty much require that we keep the lines intact (in case there's a comment between the type and the next token), so that one seems unlikely to be a good idea.
- We could trade off speed of accessing tokens in exchange for memory by not storing the token contents directly, but rather storing index + length, then using an implicit substring() call instead of .content. We'd have to be careful not to weaken one of our safety guarantees in the process (that Token.content != null).
That... seems like it could be possible?
 
- We could trade off speed of accessing tokens in exchange for memory by not storing the token contents directly, but rather storing index + length, then using an implicit substring() call instead of .content. We'd have to be careful not to weaken one of our safety guarantees in the process (that Token.content != null).
Tokens are *already* planned to be able to store this information...
Further down the line, we'll be adding the LSP4J (Language Server Protocol for Java) library to KoLmafia.
One of the classes this gives us access to is called the Range, which is composed of two Positions, which, themselves, are composed of a line and a character (as 0-indexed integers).
Token is planned to extend Range, so it is very much already planned to be able to tell where exactly in its Line its content starts, and ends...
(the Token.offset field is just temporary. Once Token extends Range, we'll replace it with Token.getStart().getCharacter() )

So really, this wouldn't need much work...
 
@heeheehee quick bugfix for a bug I found.

Remember how you noticed that "you could put line breaks *anywhere*, in plural typed constants"?
Well, it seems to also apply to the parsing of the two successive slashes needed to make a comment, i.e. "//"...

This means that
Java:
$booleans[tr/
          /ue]
was seen as a comment...

Currently, with the latest patch, this results in a StringIndexOutOfBoundsException.

It's up to you, you have two choices:
  1. Preserve the existing flawed behaviour by simply preventing the exception.
    This effectively preserves a most likely unwanted behaviour that *may* be used by someone, but is *very* unlikely, as it has absolutely no beneficial use.
  2. Fix the behaviour by checking and resetting our slash flag on newlines. This effectively gets rid of a most likely unwanted behaviour, that is unlikely to have been exploited by anyone, but maybe, *maybe* has been.
 

Attachments

I mean, when you put it like that...

With these very weird edge cases in parsing logic, I am inclined to just fix all of them because surely nobody should rely on them, given that they're non-obvious compared to a simpler alternative.

IMO this and other multi-character tokens ought to be parsed atomically, although I guess technically we're in a weird place parsing-wise with multiline plural constants.

I've also added a test for $booleans[tr/\n**/ue] just to prevent future backsliding along the same lines.

Anyways, clearCurrentToken and red_pill fix_flawed_behaviour are now r20836 and 20837, respectively.
 
(I'm going to take a closer look at the null checks patch again, just to make sure I didn't miss anything when I skimmed it this morning. I only pulled it aside because it's nominally more involved than the other two.)
 
And here is the parseString refactor.

- that method is now exclusively for string parsing, allowing a lot of typed constant handling to be removed from it.
- Since the method is now just for strings, it now has a string character check at the start (', " or `).
- The typed constant handling is now in its own method.
- Escaped character handling is now in its own method, which both aforementioned methods now call.
- Since the parsing of typed constants no longer needs to include the opening [ (which it previously did because the parsing of *plural* constant was done in parseString, and parseString needed to include the start character to make sure to not trim leading whitespace), said character is now read before looping through characters, allowing it to be its own token.
- Hexadecimal, Unicode and octal character escapes now also catch IndexOutOfBoundsExceptions, making things like "\u1" safe.
 

Attachments

Back
Top