the DOM, regex, scalability, and other jargony words

roippi

Developer
So, I'm on to the next thing that catches my whimsy, and this one's a doozy. But unlike many other Roippi Projects™, this one I can't do alone - I'm asking for a cultural shift amongst fellow developers, as much as anything else.

Executive summary

Instead of parsing HTML with regex, we should be using an HTML parser. Any time there is code like this:

Code:
			pattern = Pattern.compile( "auce:(?:</small>)?</td><td align=left><b><font color=black>(?:<span>)?(\\d+)<" );
			matcher = pattern.matcher( responseText );
			if ( matcher != null && matcher.find() )
			{...}

we should instead be using an HTML parser's API to search within the DOM for that element.
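To make that concrete, here's a minimal sketch of the parser-based version. It uses the JDK's built-in DOM parser and XPath engine on a tiny, made-up, well-formed fragment (in mafia, HTMLCleaner would first clean the real, messy HTML into a well-formed tree); the fragment and the path are illustrative, not mafia's actual code.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomQuery
{
    // A tiny, well-formed stand-in for the fragment the regex above targets.
    // In mafia, HTMLCleaner would first normalize the real HTML into a
    // well-formed tree like this.
    static final String FRAGMENT =
        "<td align=\"left\"><b><font color=\"black\"><span>1234</span></font></b></td>";

    static String sauceCount( String fragment ) throws Exception
    {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse( new InputSource( new StringReader( fragment ) ) );
        // Address the element directly instead of scanning the whole page:
        return (String) XPathFactory.newInstance().newXPath()
            .evaluate( "//td/b/font/span/text()", doc, XPathConstants.STRING );
    }

    public static void main( String[] args ) throws Exception
    {
        System.out.println( sauceCount( FRAGMENT ) );  // prints 1234
    }
}
```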

That's the gist. More words in the next post.
 

roippi

Developer
- Motivation

There is a *lot* of code in mafia. Mafia just does a lot of stuff. But if you were to characterize what mafia spends all of its processor time doing, you would find that the lion's share is spent processing text. Searching for stuff in strings, really - that's what we spend our time doing. And - perhaps horrifyingly - we are doing a lot of it completely wrong.

Okay, that's alarmist and hyperbolic. I should say, rather, that we could do a much better job of searching for stuff in strings. And it all starts with the fact that we're searching for stuff in a particular kind of string - HTML.

- So, about regex...

Parsing HTML with regex is considered by many to be a cardinal sin. There are really good reasons for this that I won't get into, but it all reduces to

  • bugs
  • performance
Regarding bugs: parsing HTML with regex is really fragile. To give an example, look at this recent bug report. Note that our regex was broken by something that KoL added to the alt text of a completely unrelated element. Also note that debugging this was particularly difficult.

As for performance: every time we apply a regex to some page's responseText, we are saying "hey, globally search within this text, see if anything matches." That's so much more work than needs to be done! Really what you want to say is:

DOM.PNG (screenshot: the path to my charpane's HP, as shown by Firefox's element inspector)

go there in the DOM and get me that element's text. Ignore everything else.

On my (pretty fast) computer, parsing the charpane takes around 20ms. That's not a lot in the big scheme of things, but think of how often the charpane gets loaded. What if it took 1 ms? How much "zippier" would mafia be, if we applied similar savings elsewhere in the codebase?

- What should we be doing?

In a lot of cases, we're already doing the right thing. Any time we do a responseText.contains( blah ), that's completely fine. Frequently we're just looking for a snippet of text in a large blob of unstructured text, and an HTML parser doesn't help with that - parsing noncombat results, for example: big blob of text, see if some phrase is in it. We're good.

Whenever we need to fetch a specific element from the DOM, we should be using an HTML parser. We even have one already in mafia's source tree - HTMLCleaner. As an example, the charpane is a well-structured piece of HTML that's frequently loaded; we should be grabbing things like HP, MP, etc. by querying the parse tree generated by the parser. It's less bug-prone and scales well.

- The downside

I'm asking people to learn a new tool. And the best way to use it, in my opinion, is through XPath expressions, which have a somewhat steep learning curve - it's a bit like learning regex all over again. (Really, it's closer to learning CSS selectors from scratch, which is not so bad.)

This is a burden. I realize that. The upsides of spending your time on this are frankly not so huge - who cares if a page loads a couple of ms faster, right? Or, by spending time learning this tool, you'll save yourself time in the future by not having to fix some bug? Woo?

I think that the net effect of all of us agreeing to use a dedicated parser, when appropriate, could be huge. A bunch of little improvements that add up to a big thing. At worst, I believe it is a tool worth learning so that future code that we write is better.
 

roippi

Developer
tl;dr: I'm going to refactor the charpane to do its parsing using a dedicated HTML parser. It would be cool if other developers took note of this and used an HTML parser in the future when grabbing specific things from the DOM of other pages.
 

Veracity

Developer
Staff member
I will watch with interest what you do with the charpane. I hope you will use HTMLCleaner, rather than loading in a second parser. I did a fair amount of comparison shopping when I installed that one, and I think it was the best choice - at that time, at least.
 

roippi

Developer
Heh. I recently did some comparison shopping too, and I agree - you chose well. I love that it supports XPath queries.

I might update it, though - it's still well-maintained and our version is from 2007 :)
 

Bale

Minion
I'm asking people to learn a new tool. And the best way to use it, in my opinion, is through XPath expressions, which have a somewhat steep learning curve - it's a bit like learning regex all over again. (Really, it's closer to learning CSS selectors from scratch, which is not so bad.)

I know this is being directed at mafia devs, not scripters, but I have an interesting question. Have you considered making XPath parsing of HTML available to scripters? I have ChIT in mind, since it would apparently be helpful to have access to this powerful tool for extracting useful data from the charpane instead of regexp.

I am certainly willing to learn a new tool if it means I can shave off a lot of processing time and bug-proof my code at the same time.
 

roippi

Developer
I've considered it, yes :)

It's something that should happen, in some form or other. Recreating the full API in ASH is a bit much, but we should be able to do something with XPath. The only question is how far down the rabbit hole to go, heh.
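To sketch what that might look like - purely hypothetical, not a committed design - here's a minimal xpath( html, expression ) helper built on the JDK's XPath engine. The name, signature, and sample fragment are all made up for illustration; a real ASH binding would presumably sit on top of HTMLCleaner instead.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathRuntime
{
    // A made-up, well-formed charpane-like fragment for illustration.
    static final String CHARPANE =
        "<table><tr>"
        + "<td><span class=\"black\">760 / 809</span></td>"
        + "<td><span class=\"black\">718 / 1168</span></td>"
        + "</tr></table>";

    // Hypothetical scripter-facing helper: evaluate an XPath expression
    // against (already well-formed) markup, returning each match's text.
    static String[] xpath( String html, String expression ) throws Exception
    {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse( new InputSource( new StringReader( html ) ) );
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate( expression, doc, XPathConstants.NODESET );
        String[] results = new String[ nodes.getLength() ];
        for ( int i = 0; i < results.length; ++i )
        {
            results[ i ] = nodes.item( i ).getTextContent();
        }
        return results;
    }

    public static void main( String[] args ) throws Exception
    {
        for ( String s : xpath( CHARPANE, "//span[@class='black']" ) )
        {
            System.out.println( s );  // prints 760 / 809, then 718 / 1168
        }
    }
}
```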
 

Veracity

Developer
Staff member
The other side of this is that once you have an HTML tree, you can edit it. Have you ever looked at the CharPaneDecorator? It is a heck of a lot cleaner than it used to be and deals with things like "Familiars above vs. below effects" which used to confuse it. But, it is chock-full of regular expressions.
 

roippi

Developer
That's one thing I haven't looked into. I know that HTMLCleaner works by creating a valid XML tree out of arbitrary HTML, in a fault-tolerant way like browsers do. I know that it provides some *Serializer helper classes, but I haven't checked whether you can serialize back to HTML equivalent to its input. If so, then yes, we could edit the tree directly and then write it back out to HTML, rather than doing a whole bunch of string munging in the various decorators.
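As a sketch of the edit-then-serialize idea (using the JDK's DOM and Transformer as stand-ins, since I haven't checked HTMLCleaner's own serializers): parse a fragment, append an element to the tree, and write the modified tree back out as markup. The fragment and the decoration are invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DecorateTree
{
    // Parse a (well-formed) fragment, edit the tree in place, and write it
    // back out - the approach a decorator could take instead of string
    // munging. The JDK Transformer does the serializing here, standing in
    // for whatever serializer HTMLCleaner provides.
    static String decorate( String fragment ) throws Exception
    {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse( new InputSource( new StringReader( fragment ) ) );

        // Edit: append a new element to the tree.
        Element note = doc.createElement( "b" );
        note.setTextContent( "decorated!" );
        doc.getDocumentElement().appendChild( note );

        // Serialize the modified tree back to markup.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );
        StringWriter out = new StringWriter();
        t.transform( new DOMSource( doc ), new StreamResult( out ) );
        return out.toString();
    }

    public static void main( String[] args ) throws Exception
    {
        System.out.println( decorate( "<div><span>HP: 760</span></div>" ) );
    }
}
```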
 

Veracity

Developer
Staff member
I think that I did not import all of HTMLCleaner into our tree. I wanted it for parsing fight.php output but did not expect to rewrite the HTML text from the tree, so I may have omitted those files. But, if you update it, by all means, take the whole package.
 

lostcalpolydude

Developer
Staff member
I learned what little I know of regex from examples in mafia and from Google; I'm sure I can do the same with XPath. I had wondered before about the performance impact of all that string checking, but I didn't have any ideas for reducing the impact other than avoiding checks when possible (gating them behind cheap boolean tests for things that often don't need checking).
 

xKiv

Active member
Two thoughts here.

On my (pretty fast) computer, parsing the charpane takes around 20ms. That's not a lot in the big scheme of things, but think of how often the charpane gets loaded. What if it took 1 ms? How much "zippier" would mafia be, if we applied similar savings elsewhere in the codebase?

I assume you are talking about the fully-automated experience.
With the relay browser, most of the (perceived) lag is caused by the following difference:
- a browser connected directly to KoL can start rendering the page as soon as it receives the first chunk of HTML from KoL
- a browser connected to mafia doesn't receive anything until mafia has received and processed the entire page
(it will *finish* rendering at pretty much the same time in both cases, but the difference in start time is significant for how it feels to a human, and for how soon a human can start reacting)
For this reason, I am convinced that relay browser times will be dominated by network times.
(you should still do it, better code is better code)

Frequently we're just looking for a snippet of text in a large blob of unstructured text, and an HTML parser doesn't help with that.

But it can help with isolating the blob in the (presumably larger) response.
 

roippi

Developer
I assume you are talking about the fully-automated experience.
With the relay browser, most of the (perceived) lag is caused by the following difference:

Well for the charpane specifically we actually only load it during relay adventuring. But a lot of my points were meant in a general sense.

I'm not terribly interested in exhaustively profiling performance improvements here. If we can cut our processing time by an order of magnitude, I imagine it will make things feel faster, if somewhat subtly. But I think the best improvement is algorithmic - if we're just doing a bunch of cheap, targeted lookups instead of full-text scans, it won't matter when we quadruple the amount of stuff we parse in the future, since constant-time operations obviously scale better.

But it can help with isolating the blob in the (presumably larger) response.

True, the response does contain a bunch of extraneous stuff like inline javascript that we could be isolating out. But I wouldn't expect enormous time savings on that front, actually - presumably java's String.contains() implements something like Boyer-Moore under the hood to achieve sublinear performance on substring searches.
 

Darzil

Developer
I don't know for sure, but I suspect there are opportunities for a lot of performance improvements. I noticed when looking at the AreaCombatData stuff that we were managing to hit its recalculate function 20+ times after some combats! I'm pretty sure we recalculate concoctions a lot too. Now, there are good reasons for that sort of thing, as they can change when all sorts of stuff changes (e.g. area combat data changes with zone, mainstat, muscle, moxie, init, ML, equipment, experience changers, familiar, meat drop, item drop). Whether we actually want to recalculate after each stat/meat/item gain in one set of combat results, rather than just doing it at the end, is another matter.

I am unconvinced that those occasions where Mafia feels laggy, which are resolved by a restart (and which I don't get as much any more), are caused by slow parsing. On those occasions I'll see the results of mafia parsing the page in the CLI 2-3 seconds before the relay browser gets the update.

But I'm all for faster parsing, and learning new stuff is part of why I'm working on mafia in the first place, as it helps my (limited) java skills.
 

Veracity

Developer
Staff member
We recalculate concoctions every time an item goes into or out of inventory, storage, closet (since, depending on your settings, any of those can be the source for an ingredient). I believe we defer the refresh when we completely refresh one of those locations, but we still refresh multiple times for each adventure. I've spent a fair amount of time trying to optimize this, but didn't really solve the problem.
 

Veracity

Developer
Staff member
I am unconvinced that those occasions where Mafia feels laggy, which are resolved by a restart (and which I don't get as much any more), are caused by slow parsing. On those occasions I'll see the results of mafia parsing the page in the CLI 2-3 seconds before the relay browser gets the update.
I've seen this too. I don't know if we are taking our time actually sending the response down to the browser, if the OS is taking its time delivering the response to the browser after it receives it from KoLmafia, or if the browser is taking its time displaying the response. Some good instrumentation would help. I started that, with the "debug trace" feature, which logs requests/responses between browser and KoLmafia and between KoLmafia and KoL with millisecond precision.
 

roippi

Developer
p.s. just to convince people that xpath is cool

HTML:
> test xpath charpane.html //span[@class="black"]/text()

1: 760 / 809
2: 718 / 1168
3: 21,756,038
4: 194

My HP, MP, meat, and adv.

HTML:
> test xpath charpane.html //center[3]//td[@valign="center"]/text()

1: The Sonata of Sneakiness (2)
2: Smooth Movements (7)
3: Peeled Eyeballs (7)
4: Pisces in the Skyces (12)
5: Elemental Saucesphere (52)
6: Empathy (815)
7: Fat Leon's Phat Loot Lyric (822)
8: Springy Fusilli (825)
9: Leash of Linguini (825)
10: Polka of Plenty (873)
11: Spirit of Bacon Grease (∞)

Effects.

Both of those xpath expressions took me under a minute to write (with the help of firefox's dev tools) and I find them super readable. Of course I've been writing/reading xpath for a while so some others will find it more arcane. :) I imagine I'll end up writing an xpath primer at some point.
 

Darzil

Developer
It's a shame that those won't also work with the compact panel! (but I'm sure you can quickly do ones that would)

It's the one really annoying thing about the character pane in particular: there are different styles to support, and different classes/avatars with different resources to handle. And the HTML is not consistent!
 