Profile Photo

Jamie Skipworth


Technology Generalist | Software & Data


Extracting text from XML

Earlier on I wrote a post where I performed word counts using programs written in Go. The text data came from StackOverflow’s bitcoin forum. However, those files are XML, so I had to perform some jiggery-pokery to extract the text from them. So, because my memory is terrible, I thought I’d spend some time here to document how I did it (well, for the Posts.xml file).

You’ll need to install the html-xml-utils package for your system if you don’t already have it.

In the Stack Overflow dataset there’s a Posts.xml file, which contains row elements. Those row elements have a Body attribute, and that’s what I want to extract. Here’s the first row:

<row Id="1" PostTypeId="1" CreationDate="2011-08-30T21:12:34.090" Score="22" ViewCount="22542" Body="&lt;p&gt;What open source miner applications are there? Especially to see how the mining process works.&lt;/p&gt;&#xA;" OwnerUserId="12" LastEditorUserId="5406" LastEditDate="2016-01-21T15:10:59.710" LastActivityDate="2016-01-21T15:10:59.710" Title="What open source miner applications are there?" Tags="&lt;mining-setup&gt;&lt;miner-configuration&gt;&lt;mining-software&gt;" AnswerCount="10" CommentCount="1" FavoriteCount="8" />

So the first thing we’ll do is extract the Body attribute data with hxselect

cat Posts.xml | hxselect -c "row::attr(Body)" 

The -c option supresses the output of the attribute itself (IE, the Body="blah" bits). That command yields the following unholy mess.

&lt;p&gt;The official list is on &lt;a href=&quot;https://en.bitcoin.it/wiki/Software#Mining_apps&quot; rel=&quot;nofollow&quot;&gt;The Wiki&lt;/a&gt; and is constantly being modified since new software shows up fairly often, but here's the quick breakdown:&lt;/p&gt;&#xA;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Poclbm - Python/OpenCL GPU miner (GUI)&lt;/li&gt;&#xA;&lt;li&gt;DiabloMiner - Java/OpenCL GPU miner (MAC OS X GUI)&lt;/li&gt;&#xA;&lt;li&gt;RPC Miner - remote RPC miner (MAC OS X GUI)&lt;/li&gt;&#xA;&lt;li&gt;Phoenix miner - miner&lt;/li&gt;&#xA;&lt;li&gt;Cpu Miner - miner&lt;/li&gt;&#xA;&lt;li&gt;Ufasoft miner - miner&lt;/li&gt;&#xA;&lt;li&gt;Pyminer - Python miner, reference implementation&lt;/li&gt;&#xA;&lt;li&gt;Remote miner - mining pool software&lt;/li&gt;&#xA;&lt;li&gt;Open Source FGPA Bitcoin Miner - a miner that makes use of an FPGA Board&lt;/li&gt;&#xA;&lt;/ul&gt;

Good, I’ve earned a biscuit.

Now, the body text is user-generated and often contains HTML tags (links, code etc). I didn’t want this markup to be included in the word counts, so I strip it out.

This is probably a good idea anyway because humans are unreliable and we have to assume all this HTML is badly formed. First we decode the escapes using hxunent.

cat Posts_sml.xml | hxselect -s '\n' -c "row::attr(body)" | hxunent 

That hxunent command converts all the escapes like &lt; into the proper characters like <.

<p>The official list is on <a href="https://en.bitcoin.it/wiki/Software#Mining_apps" rel="nofollow">The Wiki</a> and is constantly being modified since new software shows up fairly often, but here's the quick breakdown:</p>

<ul>
<li>Poclbm - Python/OpenCL GPU miner (GUI)</li>
<li>DiabloMiner - Java/OpenCL GPU miner (MAC OS X GUI)</li>
<li>RPC Miner - remote RPC miner (MAC OS X GUI)</li>
<li>Phoenix miner - miner</li>
<li>Cpu Miner - miner</li>
<li>Ufasoft miner - miner</li>
<li>Pyminer - Python miner, reference implementation</li>
<li>Remote miner - mining pool software</li>
<li>Open Source FGPA Bitcoin Miner - a miner that makes use of an FPGA Board</li>
</ul>

That’s much better, I can actually read it now. Next, I want to remove those HTML tags. I could use other html-xml-util tools but they don’t seem to behave as expected for me (probably user-error). Instead I’ll use a plain old boring regex in sed:

cat Posts_sml.xml | hxselect -s '\n' -c "row::attr(body)" | hxunent | sed 's/<[^>]*>//g' 

Then, I’ll collapse any large sequences of newlines into just one with tr:

cat Posts_sml.xml | hxselect -s '\n' -c "row::attr(body)" | hxunent | sed 's/<[^>]*>//g' | tr -s '\n'

And lo and behold we have plain-text:

The official list is on The Wiki and is constantly being modified since new software shows up fairly often, but here's the quick breakdown:
Poclbm - Python/OpenCL GPU miner (GUI)
DiabloMiner - Java/OpenCL GPU miner (MAC OS X GUI)
RPC Miner - remote RPC miner (MAC OS X GUI)
Phoenix miner - miner
Cpu Miner - miner
Ufasoft miner - miner
Pyminer - Python miner, reference implementation
Remote miner - mining pool software
Open Source FGPA Bitcoin Miner - a miner that makes use of an FPGA Board

It’s probably not the best way to do it, but it’s good enough.