roippi
Developer
Yep. It's an xpath primer. This is going to be entirely example-driven (because honestly, xpath is completely impenetrable without examples). First, here's some example HTML:
HTML:
<html>
<body>
<h1>A header</h1>
<table>
<tr id="first">
<td id="first_cell">Cell 1</td>
<td>Cell 2</td>
</tr>
<tr id="second" class="apple">
<td>Cell 3<p> with a <b>paragraph</b> in it</p></td>
</tr>
</table>
<p>This has <span class="banana"> some spans</span>
<div class="banana">and a div</div>
<div id="second" class="apple">and a <div>nested div with a <span class="banana">span</span></div></div>
</p>
<a href="http://www.google.com"></a>
</body>
</html>
Feel free to save this as data/test.html in your mafia installation (this is exactly how I'm generating all of the examples) and run your own xpath queries against it. One thing to note: the "test xpath" CLI command gives different output than the xpath() ASH function. Instead of returning the full innerHTML content of each matched node, it returns only each node's name. So don't be confused by this.
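To make the difference between those two output shapes concrete, here is a rough sketch in Python. It is not mafia's code and does not call mafia at all; it uses the standard library's ElementTree on one cell from the example page, and node_name and inner_markup are hypothetical helper names made up for illustration.

```python
import xml.etree.ElementTree as ET

# One cell from the example page. It happens to be well-formed XML, so the
# strict ElementTree parser accepts it.
SNIPPET = '<td>Cell 3<p> with a <b>paragraph</b> in it</p></td>'


def node_name(el):
    """The short form: just the element's name."""
    return el.tag


def inner_markup(el):
    """The long form: everything between the element's start and end tags."""
    parts = [el.text or ""]
    for child in el:
        # tostring() serializes the child together with the text that follows it
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


td = ET.fromstring(SNIPPET)
for el in [td] + td.findall(".//*"):
    print(node_name(el), "|", inner_markup(el))
```

The left column is the kind of thing a name-only listing shows; the right column is the kind of thing you get back when the full inner content of each match is returned.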
Okay. Before I dive into all of the "real" examples, let's get this one out of the way.
-- "give me all of the nodes."
Code:
> test xpath //*
1: head
2: body
3: h1
4: table
5: p
6: div
7: div
8: a
9: tbody
10: tr
11: tr
12: td
13: td
14: td
15: p
16: b
17: span
18: div
19: span
Peruse that, compare to the original HTML, and now ask the obvious question. "head? tbody?! where did these nodes come from?!!" Good question. Have some bold text.
HTML is an error-tolerant language. Some elements end up in the DOM even when you omit them from the source. For example, <tr> elements in a <table> always get an implicit <tbody> wrapped around them, even if it is not explicitly written in the HTML. Browsers and HTML parsers will insert these into the DOM - be aware!
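Here is a small Python sketch of why this matters for your queries. It is only an illustration under stated assumptions: ElementTree is a strict XML parser that adds nothing, so SOURCE stands in for "the page as written", and PARSED is a hand-made approximation of what an HTML parser would build (the tbody is added by string replacement, not by a real HTML parser). row_ids is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The table from the example page, as written in the source (cells trimmed).
SOURCE = (
    '<table>'
    '<tr id="first"><td id="first_cell">Cell 1</td><td>Cell 2</td></tr>'
    '<tr id="second" class="apple"><td>Cell 3</td></tr>'
    '</table>'
)

# Assumption: a hand-written stand-in for the parsed DOM, with the implicit
# <tbody> made explicit.
PARSED = SOURCE.replace("<table>", "<table><tbody>").replace(
    "</table>", "</tbody></table>"
)


def row_ids(markup, path):
    """Return the id of every <tr> matched by path inside a <body> wrapper."""
    body = ET.fromstring("<body>" + markup + "</body>")
    return [tr.get("id") for tr in body.findall(path)]


for label, markup in (("source", SOURCE), ("parsed", PARSED)):
    for path in (".//table/tr", ".//table/tbody/tr", ".//tr"):
        print(label, path, row_ids(markup, path))
```

A path written by reading the source (table/tr) matches nothing once the tbody is there, while a descendant query (.//tr) matches the rows in both versions.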
You are free to explore this phenomenon in your browser-of-choice's dev tools. The takeaway is that you should look at a given page's DOM in your browser's dev tools or at the output of mafia's HTML parser (these should hopefully be equivalent). Writing xpath queries directly from the page source is going to give you headaches. KoL, by the way, is especially fond of not closing its tags, leaving it up to the browser to figure out where to close them. There are in fact very complex rules for closing orphaned elements like this, rules which humans are bad at remembering, so do yourself a favor and just look at the parsed results.
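The example page has one of these cases in it: a <p> cannot contain a <div>, so the paragraph gets closed as soon as the first <div> starts, which is consistent with p, div, div and a showing up next to each other in the listing. Below is a toy Python sketch of just that one rule. It is not mafia's parser and nowhere near a full HTML parser (it does not insert head or tbody, for instance); the Outline class, the outline helper and the CLOSES_P set are all made up for illustration.

```python
from html.parser import HTMLParser

# Tags whose start tag implicitly ends an open <p>. Real parsers use a much
# longer list; this subset is just enough for the example page.
CLOSES_P = {"div", "p", "table", "h1"}


class Outline(HTMLParser):
    """Records (depth, tag) for every element, applying one auto-close rule."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of the currently open elements
        self.nodes = []  # (depth, tag) in document order

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and "p" in self.stack:
            # Close the innermost open <p> and anything still open inside it.
            last_p = len(self.stack) - 1 - self.stack[::-1].index("p")
            del self.stack[last_p:]
        self.nodes.append((len(self.stack), tag))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            last = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[last:]
        # An end tag with no matching open element (the stray </p>) is dropped.


def outline(markup):
    parser = Outline()
    parser.feed(markup)
    parser.close()
    return parser.nodes


PAGE = """<body>
<p>This has <span class="banana"> some spans</span>
<div class="banana">and a div</div>
<div id="second" class="apple">and a <div>nested div with a <span class="banana">span</span></div></div>
</p>
<a href="http://www.google.com"></a>
</body>"""

for depth, tag in outline(PAGE):
    print("  " * depth + tag)
```

In the printed outline the two outer divs sit at the same depth as the p instead of inside it, even though the source nests them between <p> and </p>. That is one rule out of many, which is the whole reason to read the parsed tree rather than the source.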
Examples to follow.