xpath primer

roippi · Oct 3, 2014

Yep. It's an xpath primer. This is going to be entirely example-driven (because honestly, xpath is completely impenetrable without examples). First, here's some example HTML:

HTML:

<html>
  <body>
    <h1>A header</h1>
    <table>
      <tr id="first">
        <td id="first_cell">Cell 1</td>
        <td>Cell 2</td>
      </tr>
      <tr id="second" class="apple">
        <td>Cell 3<p> with a <b>paragraph</b> in it</p></td>
      </tr>
    </table>
    <p>This has <span class="banana"> some spans</span>
      <div class="banana">and a div</div>
      <div id="second" class="apple">and a <div>nested div with a <span class="banana">span</span></div></div>
    </p>
    <a href="http://www.google.com"></a>
  </body>
</html>

You can feel free to save this as data/test.html in your mafia installation - this is exactly how I'm generating all of the examples - and run your own xpath queries against it. I will note that the "test xpath" CLI command provides different output than the xpath() ASH function - instead of returning the full innerHTML content of each node, it just returns each node's name. So don't be confused by this.

Okay. Before I dive into all of the "real" examples, let's get this one out of the way.

-- "give me all of the nodes."

Code:

> test xpath //*

1: head
2: body
3: h1
4: table
5: p
6: div
7: div
8: a
9: tbody
10: tr
11: tr
12: td
13: td
14: td
15: p
16: b
17: span
18: div
19: span

Peruse that, compare to the original HTML, and now ask the obvious question. "head? tbody?! where did these nodes come from?!!" Good question. Have some bold text.

HTML is an error-tolerant language. Some things, when you omit them from the source, are still included in the DOM. For example, <tr> elements in a <table> always have an implicit <tbody> declaration surrounding them, even if not explicitly written in the HTML. Browsers and HTML parsers will insert these into the DOM - be aware!

You are free to explore this phenomenon in your browser-of-choice's dev tools; ultimately the takehome is that you should look at a given page's DOM in your browser's dev tools or the output of mafia's HTML parser (these should hopefully be equivalent). Writing xpath queries directly from the page source is going to give you headaches. KoL, by the way, is especially fond of not closing its tags, leaving it up to the browser to figure out where to close them; there are in fact very complex rules on how to close orphaned elements like this - rules which humans are bad at remembering - so do yourself a favor and just look at the parsed results.

Examples to follow.

roippi · Oct 3, 2014

-- "give me all <tr> nodes that are the direct child of a <table> node."

Code:

> test xpath //table/tr

no matches.

Did you pay attention to the bold text in the above post? Go read it again.

-- "give me all <tbody> nodes that are the direct child of a <table> node."

Code:

> test xpath //table/tbody

1: tbody

Good, you learned!

-- "give me all <tr> nodes that are contained by a <table> node."

Code:

> test xpath //table//tr

1: tr
2: tr

And thus, the difference between a single forward slash and two: the former means "direct children named x" and the latter means "descendants named x, at any depth inside these nodes". I'm sure I'm not exactly precise on the terminology, but you catch my drift.
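If you want to poke at the slash difference outside of mafia, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. Assumptions, loudly: ElementTree is a strict XML parser with only a subset of xpath, not the htmlcleaner that mafia uses, so the <tbody> is written out by hand to mimic the parsed DOM, and queries start with a dot because they are relative to the root element. The tags() helper is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the *parsed* DOM: the <tbody> is explicit here
# because ElementTree, unlike an HTML parser, will not insert it for us.
DOC = """
<html><body><table><tbody>
  <tr id="first"><td id="first_cell">Cell 1</td><td>Cell 2</td></tr>
  <tr id="second" class="apple"><td>Cell 3</td></tr>
</tbody></table></body></html>
"""
root = ET.fromstring(DOC)


def tags(query):
    """Return the tag name of every node matched by query (hypothetical helper)."""
    return [node.tag for node in root.findall(query)]


direct_tr = tags(".//table/tr")        # single slash: direct children only
direct_tbody = tags(".//table/tbody")  # <tbody> is the direct child
nested_tr = tags(".//table//tr")       # double slash: any depth below <table>

print(direct_tr, direct_tbody, nested_tr)
```

The three results should line up with the three "test xpath" runs above: nothing, one tbody, two tr.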

-- "give me all <td> nodes that have attribute id equal to 'first_cell'"

Code:

> test xpath //td[@id="first_cell"]

1: td

Two new things here. First, @ means attribute. We've been searching for node (aka tag) names before, and now this is the syntax for attributes.

Second, [this is a predicate]. When I write something[something_else], that translates to "give me the somethings WHICH HAVE A something_else". Useful.
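The same attribute predicate can be tried in Python's standard xml.etree.ElementTree, which understands the [@attr='value'] form. This is only a sketch under the assumption that a tiny well-formed fragment is a fair stand-in for the parsed page; it is not how mafia evaluates the query.

```python
import xml.etree.ElementTree as ET

# A small well-formed fragment standing in for the first table row.
root = ET.fromstring(
    '<tr id="first"><td id="first_cell">Cell 1</td><td>Cell 2</td></tr>'
)

# something[something_else]: keep only the <td> nodes whose id attribute
# equals "first_cell".
matches = root.findall(".//td[@id='first_cell']")
found = [(td.tag, td.get("id")) for td in matches]

print(found)
```

Only one of the two <td> nodes survives the predicate, same as the single match above.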

-- "give me THE TEXT OF all <td> nodes that have attribute id equal to 'first_cell'"

Code:

> test xpath //td[@id="first_cell"]/text()

1: Cell 1

Same query as last time, but this time we applied the text() function to it. This is a special function that will recursively concatenate the text contents of a given node's children and return that, instead of the node itself. Very useful.

-- "give me the text of all <p> nodes that are the direct child (dc) of a <td> node that is the dc of the second <tr> child of a <tbody> node that is the dc of a <table> node that is the dc of a <body> node."

Code:

> test xpath /body/table/tbody/tr[2]/td/p/text()

1: with a paragraph in it

Here, I've not used any global (//) searches, instead specifying the exact full path to a node. If you care about eking every last drop of performance out of your xpath queries, this is the way to do it - the engine doesn't have to do any backtracking, it just goes right to the desired spot in the DOM. (In general, don't worry about performance; xpath is fast.) Note that you can specify by index which nodes you want - there are two <tr> nodes in that <tbody> node; I selected the second. Note also that indexing is one-based (sigh).

As you can see, the English translations of some of these xpath queries are going to get quite long. That's kind of the point - xpath is a very compact, expressive language. It does its job better than English.
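To see the one-based indexing in action outside of mafia, here is a hedged Python sketch with xml.etree.ElementTree. Assumptions: the fragment is hand-written to match the parsed DOM (explicit <tbody>), the path starts with "./" because ElementTree queries are relative to the root <html> element, and itertext() plus strip() is my stand-in for the text() output shown above, not what mafia does internally.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<html><body><table><tbody>"
    '<tr id="first"><td>Cell 1</td><td>Cell 2</td></tr>'
    '<tr id="second"><td>Cell 3<p> with a <b>paragraph</b> in it</p></td></tr>'
    "</tbody></table></body></html>"
)
root = ET.fromstring(DOC)

# tr[2] is the second <tr>, not the third: xpath indexes start at 1.
paragraphs = root.findall("./body/table/tbody/tr[2]/td/p")

# itertext() walks the node and all of its descendants, so the text inside
# the nested <b> is included; strip() drops the leading space.
texts = ["".join(p.itertext()).strip() for p in paragraphs]

print(texts)
```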

-- "give me the text of all <tr> nodes that have a class attribute."

Code:

> test xpath //tr[@class]/text()

1: Cell 3 with a paragraph in it

Pretty straightforward. If you don't specify what the attribute needs to be equal to, you're just saying it needs to exist.
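A quick way to convince yourself of the "exists" versus "equals" distinction is the sketch below, again in Python's xml.etree.ElementTree rather than mafia's parser (that swap is an assumption of the example, as is the 'pear' value, which is made up).

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<tbody>"
    '<tr id="first"><td>Cell 1</td></tr>'
    '<tr id="second" class="apple"><td>Cell 3</td></tr>'
    "</tbody>"
)

# [@class] only asks "is there a class attribute at all?"
has_class = [tr.get("id") for tr in root.findall(".//tr[@class]")]

# [@class='...'] additionally requires a specific value.
is_apple = [tr.get("id") for tr in root.findall(".//tr[@class='apple']")]
is_pear = [tr.get("id") for tr in root.findall(".//tr[@class='pear']")]

print(has_class, is_apple, is_pear)
```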

-- "give me any nodes which have any direct children which have a class equal to banana."

Code:

> test xpath //*[*[@class="banana"]]

1: body
2: p
3: div

Now we're getting somewhere. You can nest predicates, that's pretty cool. And you can specify * to match any node, that's cool too. This is a tool/trick that will come up a lot - you have one exact spot in the DOM you want to match, but then you want to "backtrack" up a few nodes to grab more text around it. This is how you do it.

But wait - if you look at the HTML source... why did this match <body>? Two class="banana" nodes have the same <p> parent, there should only have been two matches, right? Bzzt. Go back and read the bold text in the above post. Your browser and the parser implicitly closed the <p> tag right before the <div> tags that were nested inside it. Did you catch that on first reading through the HTML? Of course not - you're a human. Only look at the parsed DOM, don't trust the page source.
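If the <body> match still looks like magic, it can help to spell the query out by hand. The sketch below is plain Python over xml.etree.ElementTree, which (as far as I know) does not accept the nested-predicate form, so the predicate is written as an ordinary function. Assumptions: the fragment is my hand-written version of the parsed DOM, with the <p> already closed before the <div> tags, and has_banana_child() is a made-up helper.

```python
import xml.etree.ElementTree as ET

# The DOM as the parser sees it: the <p> is closed *before* the <div> tags,
# so the divs end up as children of <body>, not of <p>.
DOC = (
    "<html><body>"
    '<p>This has <span class="banana"> some spans</span></p>'
    '<div class="banana">and a div</div>'
    '<div id="second" class="apple">and a '
    '<div>nested div with a <span class="banana">span</span></div></div>'
    "</body></html>"
)
root = ET.fromstring(DOC)


def has_banana_child(node):
    """True if any *direct* child of node has class="banana" (hypothetical helper)."""
    return any(child.get("class") == "banana" for child in node)


# Hand-rolled equivalent of //*[*[@class="banana"]], in document order.
matches = [node.tag for node in root.iter() if has_banana_child(node)]

print(matches)
```

The <body> shows up because the class="banana" div is its direct child once the <p> has been closed early.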

roippi · Oct 3, 2014

Getting more advanced now. Honestly if you master the stuff in the above post, you can probably already grab 95% of the things you need to grab out of the DOM.

-- "give me any nodes which have "Cell 1" as their inner text."

Code:

> test xpath //*[text() = "Cell 1"]

1: td

I introduced text() pretty early on, and it was this pseudo-magical thing that you applied at the end of an expression to turn everything into text. But really you can use it anywhere in the expression to fetch text contents - here I'm using it in a predicate to match a node's inner text. Do remember that text() recursively retrieves a node's entire innerHTML; thus //*[text() = "Cell 3 with a paragraph in it"] would match both the enclosing <tr> and <td>, since they have identical innerHTML.
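For a rough feel of that recursive matching, Python's xml.etree.ElementTree (3.7 or newer) has a [.='text'] predicate that compares against a node's complete text content, descendants included. That is a different spelling from text() and a different engine from mafia's htmlcleaner, so treat this as an analogy under those assumptions rather than a reproduction.

```python
import xml.etree.ElementTree as ET

# No whitespace between tags, so a parent's text content is exactly the
# concatenation of its children's text.
DOC = (
    "<tbody>"
    '<tr id="first"><td id="first_cell">Cell 1</td><td>Cell 2</td></tr>'
    '<tr id="second"><td>Cell 3<p> with a <b>paragraph</b> in it</p></td></tr>'
    "</tbody>"
)
root = ET.fromstring(DOC)

# Only the first <td> has exactly "Cell 1" as its full text content.
exact = [n.tag for n in root.findall(".//*[.='Cell 1']")]

# The second <tr> and its only <td> have identical text content, so both match.
nested = [n.tag for n in root.findall(".//*[.='Cell 3 with a paragraph in it']")]

print(exact, nested)
```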

-- "give me the text of each <td> that is the last child of a node."

Code:

> test xpath //td[position() = last()]/text()

1: Cell 2
2: Cell 3 with a paragraph in it

The position() and last() functions can be useful for selecting a subset of nodes from a collection. Note that there is no first() function - that's just the number 1.

You can use arithmetic operations on these, by the way. position() > 4, last() / 2, etc.
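Here is a small sketch of "last" versus "first" using Python's xml.etree.ElementTree. Assumptions: ElementTree only supports the short [last()] and [1] forms (not position() = last() or comparisons like position() > 4), and it counts position among same-named siblings, so this illustrates the idea rather than mafia's exact behavior.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<tbody>"
    "<tr><td>Cell 1</td><td>Cell 2</td></tr>"
    "<tr><td>Cell 3</td></tr>"
    "</tbody>"
)

# The last <td> within each row.
last_cells = [td.text for td in root.findall(".//td[last()]")]

# There is no first(): the first <td> within each row is just index 1.
first_cells = [td.text for td in root.findall(".//td[1]")]

print(last_cells, first_cells)
```

Note that "Cell 3" lands in both lists: it is the only <td> in its row, so it is first and last at the same time.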

This is the point where I'd explain how the count() and contains() functions work, but the former seems to be broken in our current htmlcleaner and the latter is not supported. Eventually when I manage to upgrade our current version of htmlcleaner those issues will go away.

xKiv · Oct 3, 2014

Ha. So we can do

Code:

> test xpath //come/on/baby/light/my/fire

no matches.

(or //we/didnt/start/the/fire)

roippi said:
Code:

> test xpath //table//tr

1: tr
2: tr

I would like to point out here that this will return all TR tags anywhere under any table. It might not be immediately obvious to everyone that

Code:

<table id='a'><tr><td><table id='b'><tr><td></td></tr></table></td></tr></table>
> test xpath //table[@id='a']//tr

will also have two results. XPath doesn't care that there's another table in the way.
(also, //table//tr would also return the same two results, because each element can only be selected or not selected, it doesn't iterate over all possible ways to select it)
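The nested-table case is easy to check with a few lines of Python and xml.etree.ElementTree. Assumptions: this is a strict XML parser, so no <tbody> gets inserted, and the fragment is wrapped in a <body> (my addition) because ElementTree's ".//" only looks below the root element.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<body><table id='a'><tr><td>"
    "<table id='b'><tr><td></td></tr></table>"
    "</td></tr></table></body>"
)
root = ET.fromstring(DOC)

# Only table 'a' passes the predicate, but // then reaches every <tr> below
# it, including the one that belongs to the inner table 'b'.
rows = root.findall(".//table[@id='a']//tr")

print(len(rows))
```

Both rows come back even though only one of them is a row "of" table a in the everyday sense.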

And an example that combines several concepts that you will no doubt want to explain later anyway:

Code:

> test xpath //tr[td[contains(text(),'2')]]/td

(you can: 1) select on functions, including inner text and substring, 2) select only elements that have certain children [specified recursively by xpath], 3) continue selecting (multiple) children even after using a predicate)

roippi · Oct 3, 2014

xKiv said:
I would like to point out here that this will return all TR tags anywhere under any table. It might not be immediately obvious to everyone that

Code:

<table id='a'><tr><td><table id='b'><tr><td></td></tr></table></td></tr></table>
> test xpath //table[@id='a']//tr

will also have two results.

Yep. I think my English translation of that query - "give me all <tr> nodes that are contained by a <table> node." - is accurate, but you're right that there are some behaviors with // searches that can trip some people up. I'll see if I can expand on that.

And an example that combines several concepts that you will no doubt want to explain later anyway:

Yup, I'm ramping up to combining concepts like that.

roippi · Oct 3, 2014

Updated with some more examples.

Frustratingly our current htmlcleaner's implementation of count() seems to be broken, and it doesn't support contains() at all, which is a bummer - that's a rather useful function. I do still plan on upgrading our version of htmlcleaner so that people have access to those functions, but I have to make certain that doing so doesn't break the extant parsing we're doing in FightRequest.
