- Motivation
There is a *lot* of code in mafia. Mafia just does a lot of
stuff. But if you were to characterize where mafia spends its processor time, you would find that the lion's share goes to processing text - searching for things in strings, mostly. And - perhaps horrifyingly - we are doing a lot of that searching completely wrong.
Okay, that's alarmist and hyperbolic. I should say, rather, that we could do a much better job of searching strings. And it all starts with the fact that the strings we're searching are a particular kind of string - HTML.
- So, about regex...
Parsing HTML with regex is considered by many to be a cardinal sin. There are
really good reasons for this that I won't get into here, but for our purposes it reduces to two things: bugs and performance.
Regarding bugs: parsing HTML with regex is really fragile. For an example,
look at this recent bug report. Note that our regex was broken by something that KoL added to the alt text of a completely unrelated element - and that debugging it was
particularly difficult.
As for performance: every time we apply a regex to some page's responseText, we are saying "hey, globally search within this text and see if anything matches." That's so much more work than needs to be done! Really, what you want to say is: "here's the path to my charpane's HP, as reported by Firefox's element inspector - go there in the DOM and get me that element's text. Ignore everything else."
On my (pretty fast) computer, parsing the charpane takes around 20 ms. That's not a lot in the grand scheme of things, but think of how often the charpane gets loaded. What if it took 1 ms instead? How much "zippier" would mafia be if we applied similar savings elsewhere in the codebase?
- What should we be doing?
In a lot of cases, we're already doing the right thing. Any time we do a
responseText.contains( blah ), that's completely fine. Frequently we're just looking for a snippet of text in a large blob of unstructured text, and an HTML parser doesn't help with that. Parsing noncombat results, for example: big blob of text, see if some phrase is in it. We're good.
Whenever we need to fetch a specific element from the DOM, however, we should be using an HTML parser. We even have one in mafia's source tree already - HTMLCleaner. The charpane, for example, is a well-structured piece of HTML that's loaded constantly; we should be grabbing things like HP and MP by querying the parse tree the parser generates. It's less bug-prone, and the same approach works for any well-structured page we load.
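As a rough sketch of what "query the tree instead of scanning the text" could look like - this uses the JDK's built-in XPath engine over a tiny made-up charpane fragment (the real charpane markup differs, and HTMLCleaner's own API differs in the details, but the shape of the code is the same):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class CharpaneSketch {
    // Hypothetical stand-in for the charpane's responseText; real markup differs.
    static final String CHARPANE =
        "<html><body><table><tr><td align=\"center\">"
        + "HP: <span class=\"black\">123/150</span>"
        + "</td></tr></table></body></html>";

    // Point the parser at one node and take its text - no global scan of the page.
    static String query(String html, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            return "";
        }
    }

    public static void main(String[] args) {
        // "Go there in the DOM and get me that element's text."
        System.out.println(query(CHARPANE, "//span[@class='black']/text()")); // 123/150
    }
}
```

If KoL later adds alt text to some unrelated image, this query doesn't care - which is exactly the fragility win over a regex against the whole responseText.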
- The downside
I'm asking people to learn a new tool. And the best way to use it, in my opinion, is through XPath expressions, which come with a bit of a learning curve - it's somewhat like learning regex all over again. (Really, it's closer to learning CSS selectors from scratch, which is not so bad.)
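For a taste of that syntax, here are a few common XPath patterns next to their rough CSS-selector analogues, again evaluated with the JDK's XPath engine over a made-up fragment (element names are illustrative, not real KoL markup):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathCrashCourse {
    // Illustrative fragment, not real charpane HTML.
    static final String HTML =
        "<html><body>"
        + "<img src=\"hp.gif\" alt=\"Hit Points\"/>"
        + "<table><tr><td class=\"hp\">123</td><td class=\"mp\">45</td></tr></table>"
        + "</body></html>";

    // Evaluate one XPath expression against the fragment above.
    static String eval(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            return "";
        }
    }

    public static void main(String[] args) {
        // //td finds any <td>; [@class='hp'] filters on an attribute,
        // much like the CSS selector td.hp
        System.out.println(eval("//td[@class='hp']")); // 123

        // Attributes are addressable too - e.g. an image's alt text
        System.out.println(eval("//img/@alt"));        // Hit Points

        // XPath also has functions over node sets, like count()
        System.out.println(eval("count(//td)"));       // 2
    }
}
```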
This is a burden. I realize that. The upsides of spending your time on this are frankly not so huge - who cares if a page loads a couple of ms faster, right? Or, by spending time learning this tool, you'll save yourself time in the future by not having to fix some bug? Woo?
I think that the net effect of all of us agreeing to use a dedicated parser, when appropriate, could be huge. A bunch of little improvements that add up to a big thing. At worst, I believe it is a tool worth learning so that future code that we write is better.