-- "give me all <tr> nodes that are the direct child of a <table> node."
Code:
> test xpath //table/tr
no matches.
Did you pay attention to the bold text in the above post? Go read it again.
-- "give me all <tbody> nodes that are the direct child of a <table> node."
Code:
> test xpath //table/tbody
1: tbody
Good, you learned!
-- "give me all <tr> nodes that are contained by a <table> node."
Code:
> test xpath //table//tr
1: tr
2: tr
And thus, the difference between a single forward slash and two: the former means "direct children named x" and the latter means "descendants named x, at any depth inside these nodes". I'm sure I'm not exactly precise on the terminology, but you catch my drift.
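If you want to poke at the single-slash versus double-slash difference outside the tool, here is a rough sketch using Python's standard-library ElementTree, which understands a small subset of XPath. The document in DOC is a hypothetical stand-in I made up to resemble the parsed DOM, not the actual page from this thread, and ElementTree wants relative paths, hence the leading dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the parsed DOM (assumption, not the thread's page).
DOC = """
<body>
  <table>
    <tbody>
      <tr><td id="first_cell">Cell 1</td><td>Cell 2</td></tr>
      <tr class="odd"><td>Cell 3 <p>with a paragraph in it</p></td></tr>
    </tbody>
  </table>
</body>
"""

root = ET.fromstring(DOC)  # root is the <body> element

direct_tr = root.findall(".//table/tr")        # <tr> as a direct child of <table>
direct_tbody = root.findall(".//table/tbody")  # <tbody> as a direct child of <table>
nested_tr = root.findall(".//table//tr")       # <tr> at any depth below <table>

print(len(direct_tr), len(direct_tbody), len(nested_tr))  # prints: 0 1 2
```

Same story as above: the <tr> nodes only show up once you allow the search to go deeper than one level.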
-- "give me all <td> nodes that have attribute id equal to 'first_cell'"
Code:
> test xpath //td[@id="first_cell"]
1: td
Two new things here. First, @ means attribute. Up to now we've been searching by node (aka tag) name; this is the syntax for attributes.
Second, [this is a predicate]. When I write something[something_else], that translates to "give me the somethings WHICH HAVE A something_else". Useful.
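A minimal sketch of the same attribute predicate in Python's ElementTree, assuming a made-up one-row fragment (the variable names and the markup are mine, purely for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment (assumption): one row with two cells.
row = ET.fromstring('<tr><td id="first_cell">Cell 1</td><td>Cell 2</td></tr>')

# [@id='first_cell'] keeps only the <td> nodes whose id attribute has that value.
matches = row.findall(".//td[@id='first_cell']")

print([td.text for td in matches])  # prints: ['Cell 1']
```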
-- "give me THE TEXT OF all <td> nodes that have attribute id equal to 'first_cell'"
Code:
> test xpath //td[@id="first_cell"]/text()
1: Cell 1
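Same query as last time, but this time we applied text() to it. In this tool, that returns the concatenated text contents of the node and everything nested inside it, instead of the node itself. Very useful. (Strictly speaking, in standard XPath 1.0 text() selects only the text nodes that are direct children of a node, and string() is what concatenates all the nested text, so other engines may return less than what you see here.)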
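Python's ElementTree has no text() step in its XPath subset, so this sketch only illustrates the two notions of "the text of a node" that are easy to mix up: the text sitting directly in the node, and the text of the node plus everything nested inside it. The <td> fragment is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell (assumption): some text, then a nested paragraph.
cell = ET.fromstring("<td>Cell 3 <p>with a paragraph in it</p></td>")

direct_text = cell.text              # only the text before the first child element
all_text = "".join(cell.itertext())  # text of the node and all of its descendants

print(repr(direct_text))  # prints: 'Cell 3 '
print(repr(all_text))     # prints: 'Cell 3 with a paragraph in it'
```

Which of the two you get from a given engine is worth checking before you rely on it.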
-- "give me the text of all <p> nodes that are direct children of a <td> node that is a direct child of the second <tr> child of a <tbody> node that is a direct child of a <table> node that is a direct child of the <body> node."
Code:
> test xpath /body/table/tbody/tr[2]/td/p/text()
1: with a paragraph in it
Here, I've not used any global (//) searches, instead specifying the exact full path to a node. If you care about eking every last drop of performance out of your xpath queries, this is the way to do it: the engine doesn't have to search the whole tree, it just walks straight down to the desired spot in the DOM. (In general, don't worry about performance; xpath is fast.) Note that you can specify by index which nodes you want: there are two <tr> nodes in that <tbody> node, and I selected the second. Note also that indexing is one-based (sigh).
As you can see, the English translations of some of these xpath queries are going to get quite long. That's kind of the point - xpath is a very compact, expressive language. It does its job better than English.
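Here is a hedged sketch of the explicit path and the one-based index in Python's ElementTree. The DOC string is a hypothetical reconstruction, and since ElementTree paths are relative to the element you call them on, the <body> root is written as a leading dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the parsed DOM (assumption, not the thread's page).
DOC = """
<body>
  <table>
    <tbody>
      <tr><td id="first_cell">Cell 1</td><td>Cell 2</td></tr>
      <tr class="odd"><td>Cell 3 <p>with a paragraph in it</p></td></tr>
    </tbody>
  </table>
</body>
"""

body = ET.fromstring(DOC)

# Every step is a direct child; tr[2] picks the second <tr> (indexing starts at 1).
paragraphs = body.findall("./table/tbody/tr[2]/td/p")
print([p.text for p in paragraphs])  # prints: ['with a paragraph in it']

# tr[1] is the first row, not the second.
first_cell = body.find("./table/tbody/tr[1]/td")
print(first_cell.text)  # prints: Cell 1
```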
-- "give me the text of all <tr> nodes that have a class attribute."
Code:
> test xpath //tr[@class]/text()
1: Cell 3 with a paragraph in it
Pretty straightforward. If you don't specify what the attribute needs to be equal to, you're just saying it needs to exist.
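The existence test looks like this in Python's ElementTree. This is a sketch on a hypothetical two-row fragment; itertext() is used to collect the nested text, since ElementTree has no text() step.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment (assumption): only the second row carries a class attribute.
tbody = ET.fromstring(
    "<tbody>"
    "<tr><td>Cell 1</td></tr>"
    "<tr class='odd'><td>Cell 3 <p>with a paragraph in it</p></td></tr>"
    "</tbody>"
)

# [@class] with no value means "the attribute exists", whatever it is set to.
rows = tbody.findall(".//tr[@class]")
texts = ["".join(row.itertext()) for row in rows]

print(texts)  # prints: ['Cell 3 with a paragraph in it']
```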
-- "give me any nodes which have any direct children which have a class equal to banana."
Code:
> test xpath //*[*[@class="banana"]]
1: body
2: p
3: div
Now we're getting somewhere. You can nest predicates, that's pretty cool. And you can specify * to match any node, that's cool too. This is a tool/trick that will come up a lot - you have one exact spot in the DOM you want to match, but then you want to "backtrack" up a few nodes to grab more text around it. This is how you do it.
But wait - if you look at the HTML source... why did this match <body>? Two class="banana" nodes have the same <p> parent, so there should only have been two matches, right? Bzzt. Go back and read the bold text in the above post. A <div> is not allowed inside a <p>, so your browser and the parser implicitly closed the <p> tag right before the <div> tags that were nested inside it, and those <div> nodes ended up as children of <body>. Did you catch that on first reading through the HTML? Of course not - you're a human. Only look at the parsed DOM, don't trust the page source.
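As far as I know, ElementTree's limited XPath subset does not accept nested predicates, so this sketch spells out the meaning of //*[*[@class="banana"]] in plain Python instead: walk every node, keep the ones with at least one direct child whose class is banana. The markup is a hypothetical version of the DOM after parsing, with a banana <div> sitting directly under <body>.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsed DOM (assumption): the banana <div> is a child of <body>,
# as it would be after the parser closed the <p> early.
DOC = """
<body>
  <p><span class="banana">one</span></p>
  <div class="banana">two</div>
  <div><span class="banana">three</span></div>
</body>
"""

body = ET.fromstring(DOC)

# Equivalent of //*[*[@class="banana"]]: any node with a direct banana child.
matches = [
    node
    for node in body.iter()  # body itself plus all descendants, in document order
    if any(child.get("class") == "banana" for child in node)
]

print([node.tag for node in matches])  # prints: ['body', 'p', 'div']
```

<body> shows up for the same reason it did above: one of the banana nodes is its direct child in the parsed tree.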