Feature - ASH language server features

Diff:
+            try
+            {
+                line = commandStream.readLine();
+            }
+            catch ( IOException e )
+            {
+                // This should not happen.  Therefore, print
+                // a stack trace for debug purposes.
+
+                StaticEntity.printStackTrace( e );
+                line = null;

Might I suggest initializing this.content / this.lineNumber / this.offset here
and then returning? That way you don't have to keep track of "maybe we caught an
error condition". Guards are generally useful for helping to reduce nesting and
the amount of mental state needed.

https://en.wikipedia.org/wiki/Guard_(computer_science)
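
For illustration, the early return from the catch block could look something like this (just a sketch, assuming the same Line fields used elsewhere in this thread -- content, lineNumber, offset, previousLine):

Java:
            catch ( IOException e )
            {
                // This should not happen.  Therefore, print
                // a stack trace for debug purposes.
                StaticEntity.printStackTrace( e );

                // Guard: initialize the fields and return early, so the rest
                // of the constructor can assume line is non-null.
                this.content = null;
                this.lineNumber = this.previousLine != null ? this.previousLine.lineNumber : 0;
                this.offset = this.previousLine != null ? this.previousLine.offset : 0;
                return;
            }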

[...]

Instead of needing to think about both these cases, can we add an early return
from the catch block as requested above? Regardless, I'd recommend moving this
up to just after the try/catch statement, as follows:

if ( line == null )
{
    this.lineNumber = this.previousLine != null ? this.previousLine.lineNumber : 0;
    this.offset = this.previousLine != null ? this.previousLine.offset : 0;
    this.content = null;
    return;
}

Then, you can go through the rest of the constructor without worrying if line
might be null.
This is understandable, and easy enough. Done.
Diff:
+                // Remove whitespace at front and end
+                final String trimmed = line.trim();
+                final int ltrim = line.indexOf( trimmed );

This is more expensive than likely needed, but until it's a proven performance
issue, it's likely better to leave it as the more obvious solution.

Java's String.indexOf( String ) is mostly optimized for this situation anyway
(although we still need to read through to the end of the line to make sure the
strings are identical).

The "better" (at least, more efficient) implementation might look something like...

if ( !trimmed.isEmpty() )
{
    // Here, line.indexOf( trimmed ) == line.indexOf( trimmed.charAt( 0 ) )
    offset += line.indexOf( trimmed.charAt( 0 ) );
}

+
+                if ( ltrim > 0 )
+                {
+                    offset += ltrim;
+                }
+
+                line = trimmed;
+            }
That's smart! It may be "less obvious", but a simple comment should be enough to clarify why we don't check for the whole "trimmed" string.
Easily done.
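
(Roughly, the combination of your one-character indexOf and a clarifying comment -- an illustrative sketch, not the exact committed code:)

Java:
                if ( !trimmed.isEmpty() )
                {
                    // trimmed.charAt( 0 ) cannot appear in the leading
                    // whitespace, so indexOf( trimmed.charAt( 0 ) ) finds the
                    // same index as indexOf( trimmed ), without comparing the
                    // whole string.
                    offset += line.indexOf( trimmed.charAt( 0 ) );
                }

                line = trimmed;
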
Diff:
+                this.lineNumber = this.previousLine != null ? this.previousLine.lineNumber : 0;
+                this.offset = this.previousLine != null ? this.previousLine.offset : 0;
+            }
+        }
+
+        private String substring( final int beginIndex )
+        {
+            // subtract "offset" from beginIndex, since
+            // we already removed it
+            return this.content.substring( beginIndex - this.offset );

Is it okay for this function to be unchecked?
The point was meant to be that, just like String.substring, you're meant to already have done a nullcheck before calling it, but... Yeah, it sure doesn't hurt.
 

The point was meant to be that, just like String.substring, you're meant to already have done a nullcheck before calling it, but... Yeah, it sure doesn't hurt.
It's subtly different, since the Line itself can be non-null, but .substring() could trigger a null dereference.

My concern here is that you've now deferred the null check around Line.substring() to the caller, which may in turn propagate it even further. For instance, restOfLine() directly returns substring(), adding another layer of indirection.
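
(For comparison, a checked version of Line.substring() is a one-line guard -- a sketch, assuming content is null only on the synthetic end-of-file Line:)

Java:
        private String substring( final int beginIndex )
        {
            // content is null on the "end of file" line; return an empty
            // string rather than letting the caller dereference null.
            if ( this.content == null )
            {
                return "";
            }

            // subtract "offset" from beginIndex, since we already removed it
            return this.content.substring( beginIndex - this.offset );
        }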

I've started a small collection of ParserTest cases, focusing on NPEs related to this behavior.

Code:
            {
                "bare directive",
                "script",
                "NPE"
            },
            {
                "unclosed cli_execute",
                "cli_execute {",
                "NPE"
            },
            {
                "unterminated string ending in newline",
                "'\\\n",
                "NPE"
            },

(There are also a number of checks of this.currentToken() == null scattered throughout the code not updated by the patch, which are now most likely incorrect.)
 
(There are also a number of checks of this.currentToken() == null scattered throughout the code not updated by the patch, which are now most likely incorrect.)
This is literally what the next patch will address. Removing the useless ones, replacing some by a check that actually helps, and adding other useful null checks, fixing some NPEs that I found myself.
 
Okay. So, some context that I'm keeping in mind as I work through this latest diff (Parser_Tokens_5... I wish we were using proper tooling for code review so we could avoid versioning through naming).

- Pre-patch, Parser.currentToken() was guaranteed to be a non-null String, as we discussed. Post-patch, Parser.currentToken() is guaranteed to be a non-null Token holding a non-null String content. As such, any hunks that perform the direct transformation of Parser.currentToken() -> Parser.currentToken().content are safe, as are hunks that perform the similar transformation String token = this.currentToken() ... token -> Token token = this.currentToken() ... token.content (up to local variable renaming -- also, null checks against token are just as moot as before).

You could have potentially reduced the cognitive load / state tracking across hunks for this second transformation by instead storing `String token = this.currentToken().content;` from the get-go, but I can deal.

- There are a number of situations where you move a readToken() call earlier so you can operate on Parser.currentToken() instead of Parser.nextToken(). This seems good from a safety perspective, since we now have guarantees in place.

Any other tips I should keep in mind to help reduce the effective size of the diff?

Of the three NPEs I found, two were in existing code, and the third seems like it was previously an infinite loop; I think crashing is better than spinning indefinitely. Both are failure modes, but crashing actually yields a failure.

(Also, an unterminated multiline singular typed constant ("$item[\\\n") yields a StringIndexOutOfBoundsException both with and without your patch. Another test case to be aware of for the follow-up.)
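
In the same ParserTest case format as before, that case might look like the following (the expected-failure label is hypothetical):

Code:
            {
                "unterminated multiline singular typed constant",
                "$item[\\\n",
                "SIOOBE"
            },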


Some misc observations as I continue my first read-through of the patch.

- Is it a concern that we're buffering the whole file in memory (by parsing it into Lines, which are basically nodes in a linked list)? I guess we were already doing that to some extent, but now we're basically storing the entire file 2x (once in Lines, once in the tokenized representation).

- If you wanted to avoid manually linking your Lines (or storing your line numbers), you could just store them in a list of some variety, and use a ListIterator (see the sketch after this list). That's not clearly better to me from a clarity perspective, so I'm not going to require it one way or another.

- I'm noting that you wrap your comments at a very narrow column width (~60 characters). We've been discussing adopting the Google Java style guide, which dictates 100-character line limits. Also with Google style in mind: there are a number of guidelines re: comment style / wording. (Some of them, like "there must be an empty comment line between paragraphs", help tooling not, um, break your formatting entirely.)

- Another thing that is... probably not correct: Parser is happy to receive an unterminated multiline comment. This is surprising, but I guess not harmful. I hope that nobody actually relies on this behavior...
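
(Sketch of the ListIterator alternative mentioned above -- purely illustrative, not something I'm asking for:)

Java:
        // Hypothetical: keep the Lines in a list and walk them with a
        // ListIterator instead of hand-maintained previousLine/nextLine links.
        List<Line> lines = new ArrayList<Line>();
        // ... fill "lines" while reading the command stream ...

        ListIterator<Line> it = lines.listIterator();
        while ( it.hasNext() )
        {
            int lineNumber = it.nextIndex() + 1; // line numbers fall out of the iterator
            Line line = it.next();
            // ... tokenize "line" ...
        }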
 
How hard would it be for you to split out the parseString / parsePluralTypedConstant / parseEscapeSequence changes into yet another patch?

(I recognize that you'd have to continue referencing the old variables in that change.)

That might be a more focused change that I can review without having to worry about Line / Token semantics.

For instance, parseString might look something like...

Java:
    private Value parseString( final BasicScope scope )
    {
        if ( !this.currentToken().equals( "\"" ) &&
             !this.currentToken().equals( "'" ) &&
             !this.currentToken().equals( "`" ) )
        {
            return null;
        }

        // Directly work with currentLine - ignore any "tokens" you meet until
        // the string is closed

        char startCharacter = this.currentToken().charAt( 0 );
        this.readToken();
        char stopCharacter = startCharacter;
        boolean template = startCharacter == '`';

        Concatenate conc = null;
        StringBuilder resultString = new StringBuilder();
        for ( int i = 0; ; ++i )
        {
            if ( i == this.currentLine.length() )
            {
                // Plain strings can't span lines
                throw this.parseException( "No closing " + stopCharacter + " found" );
            }

            char ch = this.currentLine.charAt( i );

            // Handle escape sequences
            if ( ch == '\\' )
            {
                i = this.parseEscapeSequence( resultString, i );
                continue;
            }
...

edit: my view of your latest patch is that it's still combining a few things: there's a refactor to split several bits of functionality into distinct methods, and then there's the move to Line / Token.
 
How hard would it be for you to split out the parseString / parsePluralTypedConstant / parseEscapeSequence changes into yet another patch?

(I recognize that you'd have to continue referencing the old variables in that change.)

That might be a more focused change that I can review without having to worry about Line / Token semantics.
There were some minor bumps, but I reckon that it was... simpler than anticipated.
 

- Is it a concern that we're buffering the whole file in memory (by parsing it into Lines, which are basically nodes in a linked list)? I guess we were already doing that to some extent, but now we're basically storing the entire file 2x (once in Lines, once in the tokenized representation).
This is what the Language Server needs.
This may seem simple currently, but later patches will add more functionality to Tokens and Lines (mostly to Tokens), which the language server relies upon.
- I'm noting that you wrap your comments at a very narrow column width (~60 characters). We've been discussing adopting the Google Java style guide, which dictates 100-character line limits. Also with Google style in mind: there are a number of guidelines re: comment style / wording. (Some of them, like "there must be an empty comment line between paragraphs", help tooling not, um, break your formatting entirely.)
I'm simply trying my best to mimic the current code's style/format. VS Code already soft-wraps lines starting at that line's indentation, so the narrow wrapping isn't for my own benefit.
If I've been going too short, just say so. I simply aim for consistency.

edit: my view of your latest patch is that it's still combining a few things: there's a refactor to split several bits of functionality into distinct methods, and then there's the move to Line / Token.
I'm not entirely sure of what you're referencing here. Could you point to an example?
 
I'm not entirely sure of what you're referencing here. Could you point to an example?
Oh, this was the parseString / parseEscapeSequence / etc refactor that I mentioned. I think you've already tackled it, thanks! I'll take a look momentarily.
 
My only substantial comment re: constructor changes: you implicitly assume throughout this change that this.currentLine can never be null.

We should either nuke the zero-argument Parser constructor, since nobody uses it, or add some logic to ensure that if stream != null and file != null, then this.currentLine is initialized to... something. Otherwise it'll blow up immediately in parse() when you try to get the first token to compare against "script".

I'm going to dump the comments I have so far, since there's still a fair amount I have yet to read. (I feel like I keep hitting a brick wall with parseString, since there are a number of changes and I'm still not sure I understand the mechanics of what's happening.)

Diff:
+        Parser parser = new Parser( scriptFile, null, this.imports );
+        Scope result = parser.parseScope( scope, null, null, scope.getParentScope(), false, false );
+        if ( parser.currentLine.nextLine != null )
         {

These could probably be final.

Diff:
@@ -653,4 +643,4 @@
             if ( this.currentToken() == null )
             {
-                throw this.parseException( "}", "EOF" );
+                throw this.parseException( "}", this.currentToken() );
             }

Hm. There are a few cases like this where you modify an error message
that'll never get run. I guess that's because parseException no longer takes (String, String).

Diff:
@@ -1438,6 +1428,11 @@
 
     private boolean parseIdentifier( final String identifier )
     {
+        if ( identifier == null )
+        {
+            return false;
+        }
+
         if ( !Character.isLetter( identifier.charAt( 0 ) ) && identifier.charAt( 0 ) != '_' )
         {
             return false;
@@ -1456,6 +1451,11 @@
 
     private boolean parseScopedIdentifier( final String identifier )
     {
+        if ( identifier == null )
+        {
+            return false;
+        }
+
         if ( !Character.isLetter( identifier.charAt( 0 ) ) && identifier.charAt( 0 ) != '_' )
         {
             return false;

Safety checks are good, I suppose. Are these reachable by targeted scripts?

Diff:
@@ -1697,5 +1698,10 @@
 
         while ( this.currentToken() != null && !this.currentToken().equals( "}" ) )
         {
+            this.currentToken = null;
+            this.currentLine.tokens.removeLast();
+
+            final String line = this.restOfLine();
+
             try
             {

We're... dropping tokens in cli_execute { ... } blocks? Am I reading this correctly?

I believe this Line.restOfLine() call causes one of the crashes I mentioned down
the line, although that's no worse than the current state.

Diff:
@@ -1710,8 +1716,12 @@
                 StaticEntity.printStackTrace( e );
             }
 
-            this.currentLine = "";
-            this.fixLines();
+            if ( line.length() > 0 )
+            {
+                this.currentLine.makeToken( line.length() );
+            }
+            this.currentLine = this.currentLine.nextLine;
+            this.currentIndex = this.currentLine.offset;
         }
 
         if ( this.currentToken() == null )

(context: parseBasicScript) We're... turning each line into a token? This seems
to be consistent with existing behavior (where we return if/when we find a
closing brace on its own line). Seems fragile, though.

Diff:
@@ -2036,7 +2048,7 @@
             return null;
         }
 
-        this.readToken(); // final
+        this.readToken(); // static
 
         Scope result = new StaticScope( parentScope );

Hooray, typo fixes.

Diff:
@@ -3193,6 +3203,8 @@
 
         else
         {
+            Token anchor = this.currentToken();
+
             Type baseType = this.parseType( scope, true, false );
             if ( baseType != null && baseType.getBaseType() instanceof AggregateType )
             {
@@ -3207,7 +3219,7 @@
             {
                 if ( baseType != null )
                 {
-                    this.replaceToken( baseType.name );
+                    this.rewindBackTo( anchor );
                 }
                 if ( ( result = this.parseVariableReference( scope ) ) != null )
                 {

To confirm: you need the rewinding logic because you may have parsed multiple
tokens while trying to parse an aggregate type?

Diff:
@@ -3320,10 +3332,11 @@
 
     private Value parseString( final BasicScope scope, Type type )
     {
+        this.currentToken = null;
+        this.currentLine.tokens.removeLast();
+

Which token are we getting rid of? The opening quote?

Either way, this is something that I'm seeing a bunch of times and might be
clearer if you defined it as a helper method such as `removeLastToken()`.

Bonus points for comments indicating which tokens are being removed.

Diff:
@@ -3346,22 +3359,39 @@
 
         Concatenate conc = null;
         StringBuilder resultString = new StringBuilder();
-        for ( int i = 0; ; ++i )
+        for ( int i = 1; ; ++i )
         {
-            if ( i == this.currentLine.length() )
+            final String line = this.restOfLine();
+
+            if ( i == line.length() )
             {
+                if ( i == 0 && this.currentIndex == this.currentLine.offset )
+                {
+                    // Empty lines are OK.
+                    this.currentLine = this.currentLine.nextLine;
+                    this.currentIndex = this.currentLine.offset;
+                    i = -1;
+                    continue;
+                }
+

Can you help me understand what case the last if-statement is checking?
Something about an empty line? When does it trigger?
 
My only substantial comment re: constructor changes: you implicitly assume throughout this change that this.currentLine can never be null.

We should either nuke the zero-argument Parser constructor, since nobody uses it, or add some logic to ensure that if stream != null and file != null, then this.currentLine is initialized to... something. Otherwise it'll blow up immediately in parse() when you try to get the first token to compare against "script".
That's what I wanted to do (the "nuking the zero-argument constructor" part).

But that zero-argument version is used for the "default" value of AshRuntime's "parser" field, which, itself, is clearly assumed to never be null.

So it's just a matter of "Parser can't/isn't allowed to parse anything if the zero-argument version was used to initialize it".
This shouldn't be hard to enforce, since, of the parsing methods, only Parser.parse is public.
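
(A minimal sketch of that enforcement -- the field name for the stream is assumed here, not taken from the actual code:)

Java:
    public Scope parse()
    {
        // "istream" stands in for whatever field holds the InputStream given
        // to the three-argument constructor; the zero-argument constructor
        // would leave it null.
        if ( this.istream == null )
        {
            throw new IllegalStateException( "No script to parse" );
        }

        // ... existing parsing logic ...
    }
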
Diff:
+        Parser parser = new Parser( scriptFile, null, this.imports );
+        Scope result = parser.parseScope( scope, null, null, scope.getParentScope(), false, false );
+        if ( parser.currentLine.nextLine != null )
         {

These could probably be final.
This project really doesn't fancy the use of *final* on local variables (as opposed to fields -- and even then...), so I tried to avoid peppering in too many of them...
Diff:
@@ -653,4 +643,4 @@
             if ( this.currentToken() == null )
             {
-                throw this.parseException( "}", "EOF" );
+                throw this.parseException( "}", this.currentToken() );
             }

Hm. There are a few cases like this where you modify an error message
that'll never get run. I guess that's because parseException no longer takes (String, String).
Correct.
Diff:
@@ -1438,6 +1428,11 @@
 
     private boolean parseIdentifier( final String identifier )
     {
+        if ( identifier == null )
+        {
+            return false;
+        }
+
         if ( !Character.isLetter( identifier.charAt( 0 ) ) && identifier.charAt( 0 ) != '_' )
         {
             return false;
@@ -1456,6 +1451,11 @@
 
     private boolean parseScopedIdentifier( final String identifier )
     {
+        if ( identifier == null )
+        {
+            return false;
+        }
+
         if ( !Character.isLetter( identifier.charAt( 0 ) ) && identifier.charAt( 0 ) != '_' )
         {
             return false;

Safety checks are good, I suppose. Are these reachable by targeted scripts?
No, these checks are not currently reachable.
The only way for these methods to receive null is if they are handed the result of Parser.nextToken().
- The null check in parseScopedIdentifier is pretty much impossible to reach. It's only there to conform with parseIdentifier.
- The null check in parseIdentifier is *currently* unused, but *will* be used. Twice:
once for a work-in-progress feature of the language server,
and once for a check that should currently be done, but isn't (a bug): at the start of Parser.parseFor.
Diff:
@@ -1697,5 +1698,10 @@
 
         while ( this.currentToken() != null && !this.currentToken().equals( "}" ) )
         {
+            this.currentToken = null;
+            this.currentLine.tokens.removeLast();
+
+            final String line = this.restOfLine();
+
             try
             {

We're... dropping tokens in cli_execute { ... } blocks? Am I reading this correctly?
The first two lines (which I really should have turned into a method, something like "eraseUnreadTokens()") aren't "dropping" tokens.

- When not null, Parser.currentToken contains a token that was *not yet read*.
- A token, the moment it is created, instantly gets added to the end of its Line's tokens field, even though it hasn't been read yet.

Imagine you have
Code:
1- cli_execute
2- {
3-     something, something else, another action, some other command...
4- }
and we are at line 3.

By calling Parser.currentToken() to check for the presence of "}", we are marking "something" as a token.
If we were to continue without removing it, the line would have two tokens:
"something" and ",something else, another action, some other command...".
We don't want that. We want a single token: "something, something else, another action, some other command..."
Nothing was read. Nothing was discarded.
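
(The helper itself would be tiny -- something like:)

Java:
    /**
     * Forget the current, unread token, removing it from its Line,
     * so that the underlying text can be re-tokenized differently.
     */
    private void eraseUnreadTokens()
    {
        this.currentToken = null;
        this.currentLine.tokens.removeLast();
    }
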
I believe this Line.restOfLine() call causes one of the crashes I mentioned down
the line, although that's no worse than the current state.
As mentioned previously, the null checks are what's coming in the next patch, which includes adding end-of-file checks anywhere that is about to use Parser.restOfLine().
Diff:
@@ -1710,8 +1716,12 @@
                 StaticEntity.printStackTrace( e );
             }
 
-            this.currentLine = "";
-            this.fixLines();
+            if ( line.length() > 0 )
+            {
+                this.currentLine.makeToken( line.length() );
+            }
+            this.currentLine = this.currentLine.nextLine;
+            this.currentIndex = this.currentLine.offset;
         }
 
         if ( this.currentToken() == null )

(context: parseBasicScript) We're... turning each line into a token? This seems
to be consistent with existing behavior (where we return if/when we find a
closing brace on its own line). Seems fragile, though.
Again, remember that this is not all just to parse a script. This is for a language server, and the language server needs *everything* to be in tokens (even comments).
Here, for example, the lines inside cli_execute {...} blocks will be marked as "string"s (well, "macro"s...), so that language clients can color them accordingly with Semantic Tokens/Semantic Highlighting.
Diff:
@@ -3193,6 +3203,8 @@
 
         else
         {
+            Token anchor = this.currentToken();
+
             Type baseType = this.parseType( scope, true, false );
             if ( baseType != null && baseType.getBaseType() instanceof AggregateType )
             {
@@ -3207,7 +3219,7 @@
             {
                 if ( baseType != null )
                 {
-                    this.replaceToken( baseType.name );
+                    this.rewindBackTo( anchor );
                 }
                 if ( ( result = this.parseVariableReference( scope ) ) != null )
                 {

To confirm: you need the rewinding logic because you may have parsed multiple
tokens while trying to parse an aggregate type?
Incorrect. If that was an aggregate type, no rewinding happens.
The rewinding happens if the type is a simple type, like int, boolean, string, buffer, location, bounty, etc.
It's most likely always going to be a single "word".

We need to rewind because the "word" was *read*, and we don't want that (because the only thing we accept at that point is an aggregate literal, or a variable reference/call with a variable.function() syntax).
Diff:
@@ -3320,10 +3332,11 @@
 
     private Value parseString( final BasicScope scope, Type type )
     {
+        this.currentToken = null;
+        this.currentLine.tokens.removeLast();
+

Which token are we getting rid of? The opening quote?

Either way, this is something that I'm seeing a bunch of times and might be
clearer if you defined it as a helper method such as `removeLastToken()`.

Bonus points for comments indicating which tokens are being removed.
(._. )
Diff:
@@ -3346,22 +3359,39 @@
 
         Concatenate conc = null;
         StringBuilder resultString = new StringBuilder();
-        for ( int i = 0; ; ++i )
+        for ( int i = 1; ; ++i )
         {
-            if ( i == this.currentLine.length() )
+            final String line = this.restOfLine();
+
+            if ( i == line.length() )
             {
+                if ( i == 0 && this.currentIndex == this.currentLine.offset )
+                {
+                    // Empty lines are OK.
+                    this.currentLine = this.currentLine.nextLine;
+                    this.currentIndex = this.currentLine.offset;
+                    i = -1;
+                    continue;
+                }
+

Can you help me understand what case the last if-statement is checking?
Something about an empty line? When does it trigger?
this:
Code:
string x = "abc\
def\





\


ghi\
jkl";

This string will read:
"abc
def

ghi
jkl".
You don't get an error if you don't put a \ on every line.
(This was previously done automatically by Parser.fixLines().)
 
Okay. Thanks.

I think the weird multiline thing should have been an error but never was due to a quirk of implementation (namely, calling fixLines()). It's especially awkward because that string you've presented is equivalent to "abc\ndef\n\nghi\njkl"... and I really really hope nobody's relying on this behavior.

Moving on from parseString (since the other bit of the changed logic of interest seems to be around tracking this.currentIndex), and onto parseDirective:

Diff:
@@ -3945,2 +3994,4 @@
-        int directiveEndIndex = this.currentLine.indexOf( ";" );
-        if ( directiveEndIndex == -1 )
+        // atEndOfFile() calls currentToken() (which we still want, to trim whitespace and skip comments).
+        // Remove the token that was generated.
+        this.currentToken = null;
+        this.currentLine.tokens.removeLast();

Is this comment still relevant? I see an atEndOfFile() method but can't find any references to it other than this comment.

Then... we combine the three checks for directive strings (< >, ' ', " "), with some more token manipulation.

Right around when we get to Tokenizer, the diff is... worse at matching lines, especially with currentToken(), which reshuffled a bunch of the logic for comment-handling.

In general, one challenge is that it's hard to inspect which tokens are being generated as a result of a parse. I don't know how exactly you plan to use this in the language server, but it might be helpful (for testability, debuggability, etc) to expose the list of resulting tokens. One step beyond "does this parse or not" as in ParserTest.java is "what does this parse into?" which could be something like "int a = 5;" -> {"int", "a", "=", "5", ";"}.

I can continue trying to read this patch to make sure that the code is clearer, or less buggy, or more efficient, but I think there's still a gap with regards to correctness of the generated Lines / Tokens.
 
Diff:
@@ -3945,2 +3994,4 @@
-        int directiveEndIndex = this.currentLine.indexOf( ";" );
-        if ( directiveEndIndex == -1 )
+        // atEndOfFile() calls currentToken() (which we still want, to trim whitespace and skip comments).
+        // Remove the token that was generated.
+        this.currentToken = null;
+        this.currentLine.tokens.removeLast();

Is this comment still relevant? I see an atEndOfFile() method but can't find any references to it other than this comment.
Right, I forgot to (temporarily) change that comment. The check will be added in the next patch (as I said, "anything that is about to use Parser.restOfLine()").
Currently, though, the same result is achieved by this (seemingly useless) check:

Java:
        if ( this.currentToken() == null )
        {
            throw this.parseException( "<", this.currentToken() );
        }
Right around when we get to Tokenizer, the diff is... worse at matching lines, especially with currentToken(), which reshuffled a bunch of the logic for comment-handling.
If it helps, here is the code (if you want me to summarize it/explain anything, just say it)
Java:
    /**
     * Returns {@link #currentToken} if non-null.
     * Otherwise, move in front of the next non-comment
     * token that we can find, before assigning it to
     * {@link #currentToken} and returning it.
     * <p>
     * Never returns {@code null}.
     */
    private Token currentToken()
    {
        // If we've already parsed a token, return it
        if ( this.currentToken != null )
        {
            return this.currentToken;
        }

        boolean inMultiLineComment = false;

        // Repeat until we get a token
        while ( true )
        {
            // at "end of file"
            if ( this.currentLine.content == null )
            {
                // will make an "end of file" token
                return this.currentToken = this.currentLine.makeToken( 0 );
            }

            final String restOfLine = this.restOfLine();

            if ( inMultiLineComment )
            {
                final int commentEnd = restOfLine.indexOf( "*/" );

                if ( commentEnd == -1 )
                {
                    this.currentLine.makeComment( restOfLine.length() );

                    this.currentLine = this.currentLine.nextLine;
                    this.currentIndex = this.currentLine.offset;
                }
                else
                {
                    this.currentToken = this.currentLine.makeComment( commentEnd + 2 );
                    this.readToken();
                    inMultiLineComment = false;
                }

                continue;
            }

            if ( restOfLine.length() == 0 )
            {
                this.currentLine = this.currentLine.nextLine;
                this.currentIndex = this.currentLine.offset;
                continue;
            }

            // "#" was "supposed" to start a whole-line comment,
            // but a bad implementation made it act just like "//"

            // "//" starts a comment which consumes the rest of the line
            if ( restOfLine.startsWith( "#" ) ||
                 restOfLine.startsWith( "//" ) )
            {
                this.currentLine.makeComment( restOfLine.length() );

                this.currentLine = this.currentLine.nextLine;
                this.currentIndex = this.currentLine.offset;
                continue;
            }

            // "/*" starts a comment which is terminated by "*/"
            if ( restOfLine.startsWith( "/*" ) )
            {
                final int commentEnd = restOfLine.indexOf( "*/", 2 );

                if ( commentEnd == -1 )
                {
                    this.currentLine.makeComment( restOfLine.length() );

                    this.currentLine = this.currentLine.nextLine;
                    this.currentIndex = this.currentLine.offset;
                    inMultiLineComment = true;
                }
                else
                {
                    this.currentToken = this.currentLine.makeComment( commentEnd + 2 );
                    this.readToken();
                }

                continue;
            }

            return this.currentToken = this.currentLine.makeToken( this.tokenLength( restOfLine ) );
        }
    }

    /**
     * Calls {@link #currentToken()} to make sure we are
     * currently in front of an unread token.
     * Then, returns a string version of the next
     * token that can be found after that.
     * @return the content of the next token
     *         to come after the token we are currently
     *         in front of, or {@code null} if we are
     *         at the end of the file.
     */
    private String nextToken()
    {
        int offset = this.currentToken().restOfLineStart;
        Line line = this.currentLine;
        boolean inMultiLineComment = false;

        while ( true )
        {
            // at "end of file"
            if ( line.content == null )
            {
                return null;
            }

            final String restOfLine = line.substring( offset ).trim();

            if ( inMultiLineComment )
            {
                final int commentEnd = restOfLine.indexOf( "*/" );

                if ( commentEnd == -1 )
                {
                    line = line.nextLine;
                    offset = line.offset;
                }
                else
                {
                    offset += commentEnd + 2;
                    inMultiLineComment = false;
                }

                continue;
            }

            // "#" was "supposed" to start a whole-line comment,
            // but a bad implementation made it act just like "//"

            if ( restOfLine.length() == 0 ||
                 restOfLine.startsWith( "#" ) ||
                 restOfLine.startsWith( "//" ) )
            {
                line = line.nextLine;
                offset = line.offset;
                continue;
            }

            if ( restOfLine.startsWith( "/*" ) )
            {
                offset += 2;
                inMultiLineComment = true;
                continue;
            }

            return restOfLine.substring( 0, this.tokenLength( restOfLine ) );
        }
    }

    /**
     * Forget every token up to {@code destinationToken},
     * so that we can resume parsing from there
     */
    private void rewindBackTo( final Token destinationToken )
    {
        this.currentToken();

        while ( this.currentToken != destinationToken )
        {
            this.currentLine.tokens.removeLast();

            while ( this.currentLine.tokens.isEmpty() )
            {
                // Don't do null checks. If previousLine is null, it means we never
                // saw the destination token, meaning we'd want to throw an error anyway.
                this.currentLine = this.currentLine.previousLine;
            }

            this.currentToken = this.currentLine.tokens.getLast();
            this.currentIndex = this.currentToken.offset;
        }
    }

    /**
     * If we are not at the end of the file, null out
     * {@link #currentToken} (allowing a new one to be gathered
     * next time we call {@link #currentToken()}), and move
     * {@link #currentIndex} forward.
     */
    private void readToken()
    {
        // at "end of file"
        if ( this.currentToken().getLine().content == null )
        {
            return;
        }

        this.currentIndex = this.currentToken.restOfLineStart;
        this.currentToken = null;
    }
In general, one challenge is that it's hard to inspect which tokens are being generated as a result of a parse. I don't know how exactly you plan to use this in the language server, but it might be helpful (for testability, debuggability, etc) to expose the list of resulting tokens. One step beyond "does this parse or not" as in ParserTest.java is "what does this parse into?" which could be something like "int a = 5;" -> {"int", "a", "=", "5", ";"}.
Rest assured, this is very much *already* a thing on my side. It's currently used by the "Semantic Tokens" feature (which I referenced earlier).
I can continue trying to read this patch to make sure that the code is clearer, or less buggy, or more efficient, but I think there's still a gap with regards to correctness of the generated Lines / Tokens.
Could you elaborate on that? If it is just "worries", know that I had ample time, occasions, and ways to test this.
 
If it helps, here is the code (if you want me to summarize it/explain anything, just say it)
I mean, I can apply the diff myself and read the code directly (and in fact, I have done so multiple times). What I'm not confident about is that behavior is entirely unchanged, and I would need more time to reason through the code.

As I already mentioned, the current parser tests check some edge cases of parser behavior. They say nothing about the tokens that are generated. Sure, we also have CustomScriptTest which integrates with the AshRuntime, but that only provides second-order observability.

I get that we'll be able to see this in action once we start speaking LSP. But surely there's some intermediate state between a complete lack of visibility into what the Parser is actually doing, and however many thousands of lines of changes we'd need to have a fully-fledged language server.
Could you elaborate on that? If it is just "worries", know that I had ample time, occasions, and ways to test this.
I don't doubt that. But there's fundamentally a difference between confirming that it works at one point in time, and ensuring that it continues to work in the future, regardless of whether we change the code, the build system, the environment. That we don't break it while fixing some other feature, or exacerbate existing problems.

It's great that you've gotten something working end-to-end. Let's imagine for a second that I say "okay, let's merge all this code even though you're the only one who understands how it works" and everything works magically without any problems or breakages of existing use cases. What happens if something breaks down the line? Or the code bitrots over time? Do we expect you to continue fixing this for all eternity?

I'm not fundamentally opposed to committing these patches, but any significant change warrants careful examination. Testing serves two purposes here: assurance of correctness today (helping convince your reviewer that the code is safe and works as intended), and safeguarding against breakages tomorrow.
 
What I'm not confident about is that behavior is entirely unchanged
Wait... don't you mean "that behaviour is entirely *changed*"?
I don't get how it's a bad thing that it's unchanged. Isn't that the point? Making sure that nothing changes on the existing front, while making sure it adds support for a new feature?
 
Yes, I agree that it's good that behavior does not change as a result of a refactor.

I'm not confident at this time that your patch meets that standard.
 
My only substantial comment re: constructor changes: you implicitly assume throughout this change that this.currentLine can never be null.

We should either nuke the zero-argument Parser constructor, since nobody uses it, or add some logic to ensure that if stream != null and file != null, then this.currentLine is initialized to... something. Otherwise it'll blow up immediately in parse() when you try to get the first token to compare against "script".
That's what I wanted to do (the "nuking the zero-argument constructor" part).

But that zero-argument version is used for the "default" value of AshRuntime's "parser" field, which, itself, is clearly assumed to never be null.

So it's just a matter of "Parser can't/isn't allowed to parse anything if the zero-argument version was used to initialize it"
This shouldn't be hard to enforce, since, of the parsing methods, only Parser.parse is public.
Parser_Tokens_7.patch addresses this concern, by making it impossible to call Parser.parse() if the zero-argument constructor was used, and by giving a default to Parser.getCurrentLine() (which was, yes, a liability so far).

Differences between this patch and the previous one are shown in the attached TortoiseMerge screenshots.
 


Anything I can do to help meanwhile? Tests or something?
I think the most useful tests at this point would validate the output of parse() in some fashion, whether that's by somehow processing the returned BasicScope, or exposing the internal tokens themselves.

Even just something like
Code:
class Parser {
...
public List<String> getTokens()
{
  List<String> tokens = new ArrayList<String>();
  for ( Line line = this.getFirstLine(); line.nextLine != null; line = line.nextLine )
  {
    for ( Token token : line.tokens )
    {
      tokens.add( token.content );
    }
  }
  return tokens;
}
...
}
which you could then use to extend ParserTest.java (or write a new test file) to check the results of the happy path.
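
A hypothetical happy-path check built on that accessor (the constructor usage and the JUnit wiring are assumptions, not the existing test setup):

Java:
    @Test
    public void tokenizesSimpleDeclaration() throws Exception
    {
        // Assumes the ( File, InputStream, imports ) constructor accepts an
        // in-memory stream and a null file/imports; adjust as needed.
        ByteArrayInputStream istream =
            new ByteArrayInputStream( "int a = 5;".getBytes( StandardCharsets.UTF_8 ) );
        Parser parser = new Parser( null, istream, null );
        parser.parse();

        assertEquals( Arrays.asList( "int", "a", "=", "5", ";" ), parser.getTokens() );
    }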

I don't know how you plan to expose the tokens from the Parser to whatever class implements the language server, but that may be even better than creating something just for testing purposes (as you'd then be testing the actual public API).
 
I think the most useful tests at this point would validate the output of parse() in some fashion, whether that's by somehow processing the returned BasicScope, or exposing the internal tokens themselves.

Even just something like
Code:
class Parser {
...
public List<String> getTokens()
{
  List<String> tokens = new ArrayList<String>();
  for ( Line line = this.getFirstLine(); line.nextLine != null; line = line.nextLine )
  {
    for ( Token token : line.tokens )
    {
      tokens.add( token.content );
    }
  }
  return tokens;
}
...
}
which you could then use to extend ParserTest.java (or write a new test file) to check the results of the happy path.

I don't know how you plan to expose the tokens from the Parser to whatever class implements the language server, but that may be even better than creating something just for testing purposes (as you'd then be testing the actual public API).
I guess it's my bad. When I mentioned this:
In general, one challenge is that it's hard to inspect which tokens are being generated as a result of a parse. I don't know how exactly you plan to use this in the language server, but it might be helpful (for testability, debuggability, etc) to expose the list of resulting tokens. One step beyond "does this parse or not" as in ParserTest.java is "what does this parse into?" which could be something like "int a = 5;" -> {"int", "a", "=", "5", ";"}.
Rest assured, this is very much *already* a thing on my side. It's currently used by the "Semantic Tokens" feature (which I referenced earlier).
I should have clarified; the "way to get all of a parser's tokens" that I was referencing was very much already in Parser.java. It is "called" by the Semantic Tokens feature.

I can very much just put it into the current patch (minus a datatype that's not yet available, that's supposed to act as a "filter" of some sort).
 
Here. This latest patch adds token examination to the tests.

n.b. You'll note that, in typed constants and plural constants, the opening square bracket is part of the content token.
This is a side-effect of having to (temporarily) throw out the parseString/parsePluralConstant/parseEscapeSequence refactor.
That bracket will become its own token once that refactor can come back into play.
 
