Word count for HTML documents

Ok, I’ll have to be careful, so this site doesn’t become a Komodo-only fansite, but I still feel the need to praise the exceptional value of this editor.
I’m currently busy preparing a paper, and I have to report an approximate word count for it. Most people would just say “use wc”. Normally, this is a good suggestion, but as the paper I’m writing is HTML, wc isn’t necessarily accurate: it ends up counting elements surrounded by whitespace as “words”, which is, according to the tool’s design, entirely correct. It’s correct, but not correct for me, so I needed to roll my own, and this is where Komodo comes in very handy:
Since Komodo is a XUL application, it also conveniently exposes a document object, which we can reuse for this purpose. Below is a macro that may not qualify as ‘beautiful’, but it’s still extremely handy:
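bc. komodo.assertMacroVersion(2);
if (komodo.view) { komodo.view.setFocus(); }
// Count the selection if there is one, otherwise the whole buffer.
var text = (komodo.editor.selText || komodo.editor.text)
    .replace(/<\?.*\?>/, "")           // drop the XML declaration / processing instruction
    .replace(/<!DOCTYPE[^>]*>/i, "");  // assumption: the pattern here did not survive in the listing; a DOCTYPE is the likely target
// Parse the markup in a throwaway XHTML element so that only text nodes are counted.
var doc = document.createElementNS('http://www.w3.org/1999/xhtml', 'floble'); // implementation.createDocument('http://www.w3.org/1999/xhtml',null,null)
doc.innerHTML = text;
// Trim, then collapse runs of whitespace into single spaces.
text = doc.textContent.replace(/^\s*/, '').replace(/\s*$/, '').replace(/\s{2,}/g, ' ');
var textNodes = text ? text.split(/\s+/) : []; // an empty document has zero words, not one
alert("Wordcount: " + textNodes.length);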
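To make the difference with wc concrete, here is a rough sketch that runs outside Komodo. The function names (naiveWordCount, stripTags, textWordCount) are made up for this illustration, and stripTags is only a crude regex stand-in for the textContent step in the macro, not a real HTML parser.

```javascript
// Count whitespace-separated tokens, markup included, roughly as `wc -w` would.
function naiveWordCount(source) {
  const trimmed = source.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Crude stand-in for doc.textContent (assumption: no DOM is available here).
// Each tag is replaced by a space, so "foo<b>bar</b>" becomes two words,
// whereas textContent would yield "foobar". Entities are not decoded.
function stripTags(source) {
  return source.replace(/<[^>]*>/g, " ");
}

// Count only the words left once the markup is gone.
function textWordCount(source) {
  return naiveWordCount(stripTags(source));
}

const sample = "<ul>\n  <li> one two </li>\n</ul>";
console.log(naiveWordCount(sample)); // 6: the four tags count as "words"
console.log(textWordCount(sample));  // 2: just "one" and "two"
```

The more whitespace there is around the tags, which is typical for hand-indented HTML, the further the naive count drifts from the real one.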
So, you’re wondering what more you can do? Well, mostly anything you could ever think of doing in a browser.

7 Comments

  1. Brandon

     /  2007-04-19

    Have you been entering these macros into the Komodo Extensibility Contest on support.activestate.com? The contest closes on Friday.

  2. Maybe not wc by itself, but…
    lynx -dump -nolist foo.html | wc -w
    🙂

  3. I just read this post via RSS and decided to find the solution myself – very similar to Andrew’s:

    links -dump http://www.grimstveit.no/ | wc -w

  4. Andrew, Jakob: Of course, this is entirely possible. There are a few differences, though (and I leave it to you to do this):
    1. Needs to work on selections.
    2. Needs to work on any generic XML.

  5. Anon

     /  2007-05-22

    javascript:alert("Wordcount: " + document.evaluate("normalize-space(/*)", document, null, XPathResult.STRING_TYPE, null).stringValue.split(" ").length)

  6. Anon: Whoah. That’s just elegant. I’m going to thwack myself for not thinking of it myself first.

  7. Anon

     /  2007-05-22

    Doesn’t quite meet spec as can’t do arbitrary selections, but could be improved to do relevant sub-blocks by passing them in as context node (document.body say, for html) and using that (.) rather than the document element (/*), or just matching directly (//*[@id='postcontent']).
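
The context-node idea from the last comment could look roughly like the sketch below. The function name countWordsIn and its parameters are hypothetical. The numeric constant 2 stands in for XPathResult.STRING_TYPE so the function does not depend on a browser global. An empty result is counted as zero words, which the one-liner above does not do.

```javascript
// XPathResult.STRING_TYPE is defined as 2 in the DOM XPath interface.
const STRING_TYPE = 2;

// doc: a Document; contextNode: any node inside it (hypothetical parameter names).
function countWordsIn(doc, contextNode) {
  // normalize-space(.) takes the string value of the context node,
  // trims it, and collapses internal whitespace runs to single spaces.
  const text = doc.evaluate("normalize-space(.)", contextNode, null, STRING_TYPE, null).stringValue;
  return text === "" ? 0 : text.split(" ").length;
}

// In a browser console, for example:
//   countWordsIn(document, document.body)
//   countWordsIn(document, document.getElementById("postcontent"))
```

This still works on whole elements only, so it does not cover arbitrary text selections.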