Thread: the DOM, regex, scalability, and other jargony words

  1. #1
    Developer roippi's Avatar
    Join Date
    Aug 2010
    Posts
    2,663

    So, I'm on to the next thing that catches my whimsy, and this one's a doozy. But unlike many other Roippi Projects™, this one I can't do alone - I'm asking for a cultural shift amongst fellow developers, as much as anything else.

    Executive summary

    Instead of parsing HTML with regex, we should be using an HTML parser. Any time there is code like this:

    Code:
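    			// Scrape a number by matching the exact markup around it
    			pattern = Pattern.compile( "auce:(?:</small>)?</td><td align=left><b><font color=black>(?:<span>)?(\\d+)<" );
    			matcher = pattern.matcher( responseText );
    			// Pattern.matcher() never returns null, so find() is the only check needed
    			if ( matcher.find() )
    			{...}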
    we should instead be using an HTML parser's API to search within the DOM for that element.
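
    A rough sketch of what that could look like. To keep it runnable without extra jars it uses the JDK's own XML parser and XPath rather than HTMLCleaner, and the markup, the "Stat:" label and the class name are all made up for illustration. Real KoL pages are not well-formed XML, so in practice a tolerant HTML parser would have to build the tree first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StatLookupSketch
{
	// Hypothetical, well-formed stand-in for the fragment a regex would target
	static final String SAMPLE =
		"<table><tr><td align=\"right\">Stat:</td>"
		+ "<td align=\"left\"><b><font color=\"black\"><span>42</span></font></b></td></tr></table>";

	// Return the text of the cell that follows the cell labelled "Stat:"
	static String findStat( String html )
	{
		try
		{
			Document doc = DocumentBuilderFactory.newInstance()
				.newDocumentBuilder()
				.parse( new InputSource( new StringReader( html ) ) );
			String query = "//td[normalize-space(.)='Stat:']/following-sibling::td[1]";
			return XPathFactory.newInstance().newXPath().evaluate( query, doc ).trim();
		}
		catch ( Exception e )
		{
			throw new IllegalStateException( "could not parse or query the markup", e );
		}
	}

	public static void main( String[] args )
	{
		System.out.println( findStat( SAMPLE ) ); // prints 42
	}
}
```

    The query only cares about the label cell and its neighbour, so optional wrappers such as the span do not need to be spelled out the way they are in the regex.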

    That's the gist. More words in the next post.

  2. #2
    Developer roippi's Avatar
    Join Date
    Aug 2010
    Posts
    2,663

    - Motivation

    There is a *lot* of code in mafia. Mafia just does a lot of stuff. But if one were to characterize what mafia spends all of its processor time doing, you would find that the lion's share is spent processing text. Searching for stuff in strings, really - that's what we spend our time doing. And - perhaps horrifyingly - we are doing a lot of it completely wrong.

    Okay, that's alarmist and hyperbolic. I should say, rather, that we could do a much better job of searching for stuff in strings. And it all starts with the fact that we're searching for stuff in a particular kind of string - HTML.

    - So, about regex...

    Parsing HTML with regex is considered by many to be a cardinal sin. There are really good reasons for this that I won't get into, but it all reduces to

    • bugs
    • performance

    Regarding bugs: parsing HTML with regex is really fragile. To give an example, look at this recent bug report. Note that our regex was broken by something that KoL added to the alt text of a completely unrelated element. Also note that debugging this was particularly difficult.
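
    To make that failure mode concrete, here is a small sketch with made-up markup and a made-up pattern (not the page or the regex from that bug report). The pattern is meant to read a number out of a table cell, but once an unrelated image gains alt text containing the same label, it quietly matches the wrong number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FragileRegexSketch
{
	// Hypothetical pattern: first number that appears after the label "Meat:"
	static final Pattern MEAT = Pattern.compile( "Meat:.*?(\\d+)" );

	// Page as originally written
	static final String BEFORE =
		"<img src=\"meat.gif\" alt=\"meat\"><td>Meat:</td><td><b>1500</b></td>";

	// Same page after the unrelated image gets more descriptive alt text
	static final String AFTER =
		"<img src=\"meat.gif\" alt=\"Meat: you have 3 new items\"><td>Meat:</td><td><b>1500</b></td>";

	static String scrape( String html )
	{
		Matcher m = MEAT.matcher( html );
		return m.find() ? m.group( 1 ) : null;
	}

	public static void main( String[] args )
	{
		System.out.println( scrape( BEFORE ) ); // prints 1500
		System.out.println( scrape( AFTER ) );  // prints 3
	}
}
```

    Nothing throws and nothing logs an error; the value is simply wrong, which is part of why this kind of bug is hard to track down.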

    As for performance: every time we apply a regex to some page's responseText, we are saying "hey, globally search within this text, see if anything matches." That's so much more work than needs to be done! Really what you want to say is:

    [Image: DOM.PNG, the path to my charpane's HP as shown by Firefox's element inspector]

    go there in the DOM and get me that element's text. Ignore everything else.

    On my (pretty fast) computer, parsing the charpane takes around 20 ms. That's not a lot in the grand scheme of things, but think of how often the charpane gets loaded. What if it took 1 ms? How much "zippier" would mafia be if we applied similar savings elsewhere in the codebase?

    - What should we be doing?

    In a lot of cases, we're already doing the right thing. Any time we do a responseText.contains( blah ), that's completely fine. Frequently we're just looking for a snippet of text in a large blob of unstructured text, and an HTML parser doesn't help with that. Like, parsing noncombat results for example - big blob of text, see if some phrase is in it. We're good.

    Whenever we need to fetch a specific element from the DOM, we should be using an HTML parser. We even have one already in mafia's source tree - HTMLCleaner. As an example, the charpane is a well-structured piece of HTML that's frequently loaded; we should be grabbing things like HP, MP, etc. by querying the DOM tree built by the parser. It's less bug-prone and very scalable.
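
    As a sketch of the "parse once, query many times" idea: the snippet below builds one tree and answers several lookups from it. The markup is invented and much simpler than the real charpane, the class and method names are hypothetical, and it uses the JDK's XML parser and XPath as a stand-in for HTMLCleaner so that it runs as-is.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharPaneSketch
{
	// Made-up, well-formed markup; the real charpane is laid out differently
	static final String SAMPLE =
		"<body><table>"
		+ "<tr><td><img alt=\"HP\" /></td><td><span>57 / 60</span></td></tr>"
		+ "<tr><td><img alt=\"MP\" /></td><td><span>12 / 30</span></td></tr>"
		+ "</table></body>";

	// Parse the page once, then answer every lookup from the same tree
	static Map<String, String> readStats( String html, String... labels )
	{
		try
		{
			Document doc = DocumentBuilderFactory.newInstance()
				.newDocumentBuilder()
				.parse( new InputSource( new StringReader( html ) ) );
			XPath xpath = XPathFactory.newInstance().newXPath();
			Map<String, String> stats = new LinkedHashMap<>();
			for ( String label : labels )
			{
				// Row whose icon carries this alt text, then the second cell of that row
				String query = "//tr[td/img/@alt='" + label + "']/td[2]";
				stats.put( label, xpath.evaluate( query, doc ).trim() );
			}
			return stats;
		}
		catch ( Exception e )
		{
			throw new IllegalStateException( "could not parse or query the markup", e );
		}
	}

	public static void main( String[] args )
	{
		System.out.println( readStats( SAMPLE, "HP", "MP" ) ); // prints {HP=57 / 60, MP=12 / 30}
	}
}
```

    Each extra value costs one more short query against a tree that already exists, instead of another regex pass over the whole responseText.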

    - The downside

    I'm asking people to learn a new tool. And the best way to use it, in my opinion, is through XPath expressions, which have a bit of a steep learning curve - it's somewhat like learning regex all over again. (Really, it's closer to learning CSS selectors from scratch, which is not so bad.)

    This is a burden. I realize that. The upsides of spending your time on this are frankly not so huge - who cares if a page loads a couple of ms faster, right? Or, by spending time learning this tool, you'll save yourself time in the future by not having to fix some bug? Woo?

    I think that the net effect of all of us agreeing to use a dedicated parser, when appropriate, could be huge. A bunch of little improvements that add up to a big thing. At worst, I believe it is a tool worth learning so that future code that we write is better.

  3. #3
    Developer roippi's Avatar
    Join Date
    Aug 2010
    Posts
    2,663

    tl;dr: I'm going to refactor the charpane to do its parsing using a dedicated HTML parser. It would be cool if other developers took note of this and used an HTML parser in the future when grabbing specific things from the DOM of other pages.

  4. #4
    Developer Veracity's Avatar
    Join Date
    Mar 2006
    Location
    The Unseelie Court
    Posts
    11,551

    I will watch with interest what you do with the charpane. I hope you will use HTMLCleaner, rather than loading in a second parser. I did a fair amount of comparison shopping when I installed that one, and I think it was the best choice - at that time, at least.
    Ph'nglui mglw'nafh Cthulhu
    R'lyeh wgah-nagl fhtagn.

  5. #5
    Developer roippi's Avatar
    Join Date
    Aug 2010
    Posts
    2,663

    Heh. I recently did some comparison shopping too, and I agree - you chose well. I love that it supports XPath queries.

    I might update it, though - it's still well-maintained and our version is from 2007.

  6. #6
    Developer roippi's Avatar
    Join Date
    Aug 2010
    Posts
    2,663

    Also also, I won't be committing anything on this before the upcoming point release.

  7. #7
    Minion Bale's Avatar
    Join Date
    Jun 2008
    Posts
    13,287

    Originally Posted by roippi:
    "I'm asking people to learn a new tool. And the best way to use it, in my opinion, is through XPath expressions, which have a bit of a steep learning curve - it's somewhat like learning regex all over again. (Really, it's closer to learning CSS selectors from scratch, which is not so bad.)"

    I know this is being directed at mafia devs, not scripters, but I have an interesting question. Have you considered making xpath expression parsing of html available to scripters? I have ChIT in mind since it would apparently be helpful if I had access to this powerful tool for extracting useful data from the charpane instead of regexp.

    I am certainly willing to learn a new tool if it means that I can shave off a lot of the processing time and bug-proof my code at the same time.
    If people like my scripts, please send me stuffed Hodgmen.
    Universal Recovery, OCD Inventory Control, CounterChecker, newLife, ChIT.


  8. #8
    Developer roippi's Avatar
    Join Date
    Aug 2010
    Posts
    2,663

    I've considered it, yes.

    It's something that should happen, in some form or other. Recreating the full API in ASH is a bit much, but we should be able to do something with XPath. The only question is how far down the rabbit hole to go, heh.

  9. #9
    Developer Veracity's Avatar
    Join Date
    Mar 2006
    Location
    The Unseelie Court
    Posts
    11,551

    The other side of this is that once you have an HTML tree, you can edit it. Have you ever looked at the CharPaneDecorator? It is a heck of a lot cleaner than it used to be and deals with things like "Familiars above vs. below effects" which used to confuse it. But, it is chock-full of regular expressions.
    Ph'nglui mglw'nafh Cthulhu
    R'lyeh wgah-nagl fhtagn.

  10. #10
    Developer roippi's Avatar
    Join Date
    Aug 2010
    Posts
    2,663

    That's one thing I haven't looked into. I know that HTMLCleaner works by creating a valid XML tree out of arbitrary HTML, in a fault-tolerant way like browsers do. I know that it provides some *Serializer helper classes, but I haven't checked whether you can serialize back to HTML that is equivalent to the input. If so, then yes, we could edit the tree directly and then write it out as HTML, rather than doing a whole bunch of string munging in various decorators.
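
    For what it's worth, the general shape of "edit the tree, then write it back out" can be sketched with the JDK's DOM and Transformer classes. This says nothing about whether HTMLCleaner's serializers round-trip real pages, which is still the open question above; the input here is a tiny well-formed snippet and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TreeEditSketch
{
	// Parse the markup, append a <p> holding the note to the root element,
	// and serialize the edited tree back to a string
	static String decorate( String html, String note )
	{
		try
		{
			Document doc = DocumentBuilderFactory.newInstance()
				.newDocumentBuilder()
				.parse( new InputSource( new StringReader( html ) ) );

			Element p = doc.createElement( "p" );
			p.setTextContent( note );
			doc.getDocumentElement().appendChild( p );

			Transformer transformer = TransformerFactory.newInstance().newTransformer();
			// Leave out the <?xml ...?> header so the result is just the markup
			transformer.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );
			StringWriter out = new StringWriter();
			transformer.transform( new DOMSource( doc ), new StreamResult( out ) );
			return out.toString();
		}
		catch ( Exception e )
		{
			throw new IllegalStateException( "could not edit or serialize the markup", e );
		}
	}

	public static void main( String[] args )
	{
		System.out.println( decorate( "<div><b>HP</b></div>", "edited" ) );
	}
}
```

    A decorator written this way adds or moves nodes instead of splicing strings, so it does not need to know what the surrounding markup looks like.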
