Word count for HTML documents

Ok, I’ll have to be careful, so this site doesn’t become a Komodo-only fansite, but I still feel the need to praise the exceptional value of this editor.

I’m currently busy preparing a paper, and I have to report an approximate word count for it. Most people would just say “use wc”. Normally, this is a good suggestion, but as the paper I’m writing is HTML, wc isn’t necessarily accurate: it ends up counting markup fragments surrounded by whitespace as “words”, which is entirely correct according to the tool’s design. It’s correct, but not correct for me, so I needed to roll my own, and this is where Komodo comes in very handy.

Since Komodo is a XUL application, it also conveniently exposes a document object, which we can reuse for this purpose. The macro below may not qualify as ‘beautiful’, but it’s still extremely handy:

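komodo.assertMacroVersion(2);
if (komodo.view) { komodo.view.setFocus(); }
// Use the selection if there is one, otherwise the whole buffer. Strip processing
// instructions, comments and <!...> declarations (non-greedy, every occurrence).
var text = (komodo.editor.selText || komodo.editor.text)
    .replace(/<\?[\s\S]*?\?>/g, "")
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<![^>]*>/g, "");
// Let the XUL document parse the markup inside a throwaway XHTML element.
var doc = document.createElementNS('http://www.w3.org/1999/xhtml','floble'); // implementation.createDocument('http://www.w3.org/1999/xhtml',null,null)
doc.innerHTML = text;
// Trim the text content, then split on runs of whitespace; empty text counts as zero words.
text = doc.textContent.replace(/^\s+|\s+$/g, '');
var words = text ? text.split(/\s+/) : [];
alert("Wordcount: " + words.length);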
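The counting step doesn’t depend on Komodo, so the difference between counting raw markup and counting text can be tried on its own. This is only a rough sketch in plain JavaScript: countTokens and stripTags are made-up helper names, the sample string is invented, and the regex tag stripping is a crude stand-in for the real parsing the macro gets from innerHTML and textContent.

```javascript
// Count whitespace-separated tokens, which is roughly what `wc -w` reports.
function countTokens(text) {
  const trimmed = text.trim();
  // An empty string would otherwise split into [""] and count as one word.
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Crude regex tag stripping, for illustration only (an assumption of this
// sketch, not what the macro does).
function stripTags(markup) {
  return markup.replace(/<[^>]*>/g, " ");
}

const sample = '<a href="x.html" title="A link">click here</a>';
console.log(countTokens(sample));            // 5: tag fragments count as words
console.log(countTokens(stripTags(sample))); // 2: "click" and "here"
```

The gap grows with the amount of markup, which is why a plain wc over an HTML file overshoots.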

So, you’re wondering what more you can do? Well, mostly anything you could ever think of doing in a browser.

Comments

Comment from Brandon on 2007-04-19 14:40

Have you been entering these macros into the Komodo Extensibility Contest on support.activestate.com? The contest closes on Friday.

Comment from Andrew Gregory on 2007-04-19 15:20

Maybe not wc by itself, but…

lynx -dump -nolist foo.html | wc -w

:)

Comment from Jakob Breivik Grimstveit on 2007-04-19 20:31

I just read this post via RSS and decided to find the solution myself - very similar to Andrew’s:

links -dump http://www.grimstveit.no/ | wc -w

Comment from Arve on 2007-04-20 15:46

Andrew, Jakob: Of course, this is entirely possible. There are a few differences, though (and I leave it to you to do this):

1. Needs to work on selections.
2. Needs to work on any generic XML.

Comment from Anon on 2007-05-22 17:37

javascript:alert("Wordcount: "+ document.evaluate("normalize-space(/*)", document, null, XPathResult.STRING_TYPE, null).stringValue.split(" ").length)

Comment from Arve on 2007-05-22 17:52

Anon: Whoah. That’s just elegant. I’m going to thwack myself for not thinking of it myself first.

Comment from Anon on 2007-05-22 18:57

Doesn’t quite meet spec, as it can’t do arbitrary selections, but it could be improved to handle relevant sub-blocks by passing them in as the context node (document.body, say, for HTML) and using that (.) rather than the document element (/*), or by just matching directly (//*[@id='postcontent']).
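The context-node variant described above might look something like the following. Treat it as a sketch under assumptions: countWordsUnder is a made-up name, the id "postcontent" is just the example from the comment, and the constant 2 is written out in place of XPathResult.STRING_TYPE so that the function only touches the document it is handed.

```javascript
// Numeric value of XPathResult.STRING_TYPE in the DOM XPath API.
const STRING_TYPE = 2;

// Count the words in the text under `contextNode`.
// `doc` is the owning document; `contextNode` could be document.body
// or any element looked up by id. Both parameter names are illustrative.
function countWordsUnder(doc, contextNode) {
  // "." refers to the context node, so only its subtree is normalized.
  const text = doc.evaluate("normalize-space(.)", contextNode, null,
                            STRING_TYPE, null).stringValue;
  // normalize-space() gives "" for whitespace-only content; report 0, not 1.
  return text === "" ? 0 : text.split(" ").length;
}

// In a browser console, for example:
//   countWordsUnder(document, document.body);
//   countWordsUnder(document, document.getElementById("postcontent"));
```

Passing the node in keeps the XPath expression fixed, whereas the direct match (//*[@id='postcontent']) puts the choice of block into the expression itself.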
