11 ways to valid RSS

While trying to work out how people actually use RSS 2.0, I have identified 11 different methods of specifying content in it. Some of them should be functionally equivalent to an XML parser, and some should not.
Ask the question _Should my aggregator do something sensible with this?_ and some of these even seem to be mutually incompatible.


h3. Content in the description element
I have so far identified five different variants of content in the @<description>@ element:
# “Plaintext as CDATA with HTML entities”:http://virtuelvis.com/download/rss-tests/desc-a.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fdesc-a.xml
# “HTML within CDATA”:http://virtuelvis.com/download/rss-tests/desc-b.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fdesc-b.xml
# “HTML escaped with entities”:http://virtuelvis.com/download/rss-tests/desc-c.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fdesc-c.xml
# “Plain text in CDATA”:http://virtuelvis.com/download/rss-tests/desc-d.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fdesc-d.xml
# “Plaintext with inline HTML using escaping”:http://virtuelvis.com/download/rss-tests/desc-e.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fdesc-e.xml
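
To illustrate why some of these variants should be functionally equivalent to an XML parser, here is a minimal Python sketch using the standard library's @xml.etree.ElementTree@. The two feed strings are made-up stand-ins, not the actual test files linked above; they carry the same HTML once inside CDATA and once escaped with entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal feeds (assumptions, not the linked test files).
FEED_CDATA = """<rss version="2.0"><channel><item>
<description><![CDATA[<p>Hello & welcome</p>]]></description>
</item></channel></rss>"""

FEED_ENTITIES = """<rss version="2.0"><channel><item>
<description>&lt;p&gt;Hello &amp; welcome&lt;/p&gt;</description>
</item></channel></rss>"""


def description_text(feed_xml):
    """Return the character data of the first item's <description>."""
    root = ET.fromstring(feed_xml)
    return root.find("./channel/item/description").text


cdata_text = description_text(FEED_CDATA)
entity_text = description_text(FEED_ENTITIES)

# Both spellings yield the same string once the XML layer is parsed away.
print(cdata_text == entity_text)
```

What the parser cannot tell you is whether the resulting string is meant as HTML or as plain text, and that is where the variants above start to differ.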
h3. <content:encoded>
I have encountered and identified two different ways of using @<content:encoded>@:
# “Using entities”:http://virtuelvis.com/download/rss-tests/content-f.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fcontent-f.xml
# “Using CDATA”:http://virtuelvis.com/download/rss-tests/content-g.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fcontent-g.xml
h3. XHTML content
Finally, I have encountered and identified four different ways in which people have specified XHTML content:
# “Using <xhtml:body>”:http://virtuelvis.com/download/rss-tests/xhtml-h.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fxhtml-h.xml
# “Using <xhtml:div>”:http://virtuelvis.com/download/rss-tests/xhtml-i.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fxhtml-i.xml
# “Using <body> with default namespace”:http://virtuelvis.com/download/rss-tests/xhtml-j.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fxhtml-j.xml
# “Using <div> with default namespace”:http://virtuelvis.com/download/rss-tests/xhtml-k.xml – “Validate”:http://feedvalidator.org/check.cgi?url=http%3A%2F%2Fvirtuelvis.com%2Fdownload%2Frss-tests%2Fxhtml-k.xml
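
As a rough illustration of the prefix versus default-namespace variants, the following Python sketch (standard library only) shows that a namespace-aware parser reports the same element names for both. The item fragments are hypothetical and much smaller than the linked test files.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical item fragments (assumptions, not the linked test files).
ITEM_PREFIXED = """<item xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xhtml:div><xhtml:p>Hello</xhtml:p></xhtml:div>
</item>"""

ITEM_DEFAULT_NS = """<item>
<div xmlns="http://www.w3.org/1999/xhtml"><p>Hello</p></div>
</item>"""


def child_tags(item_xml):
    """Return the namespace-qualified tags of every descendant of <item>."""
    root = ET.fromstring(item_xml)
    return [el.tag for el in root.iter() if el is not root]


prefixed = child_tags(ITEM_PREFIXED)
default_ns = child_tags(ITEM_DEFAULT_NS)
expected = [f"{{{XHTML_NS}}}div", f"{{{XHTML_NS}}}p"]

# The prefix is only a spelling; the qualified names are identical.
print(prefixed == default_ns == expected)
```

The @body@ and @div@ variants, on the other hand, are genuinely different elements, so an aggregator still has to decide whether it accepts both.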
h3. Download
“Download all feeds in zip file”:http://virtuelvis.com/download/rss-tests/rss-tests.zip for offline/private testing.
h3. Any more?
If you have seen other ways of specifying content in RSS 2.0 in actual use, feel free to point me to them, so I can construct minimal test cases from them.
h3. Update
What might not have been apparent when this entry was first posted is this: _This is not a matter of 11 different RSS formats. These 11 test documents are all RSS 2.0_.
h3. Conclusion
Really _Simple_ Syndication? _Simple_?!??

16 Comments

  1. RSS = Really Simple Syndication?

    Arve writes about how the RSS format has disintegrated into a mess of incompatible formats: 11 ways to valid RSS I think it’s apparent that…

  2. Jury Duty

    Back later. For now, learn some more about valid RSS feeds, or spend some time in the CSS Zen Garden….

  3. And people wonder why we need Atom? Really Simple Syndication my ass! Pfft!

  4. RSS incompatibilities

    RSS incompatibilities: I was pointed to Arve Bersvendsen’s research here by Jarle Bergersen… Arve did a taxonomy of 11 different types of RSS/RDF found in the wild, and reached this conclusion: “Asking the question ‘Should my aggregator do something se…

  5. Arve,
    I’ve posted comments about your post in my weblog. Your post points out that there are 5 ways to provide content in RSS not 11. Read my post on RSS follies if you wonder how I got that number.
    Asbjørn,
    Atom adds a lot more complexity to providing content than RSS does. I don’t even have to start using contrived examples like changing namespace prefixes or CDATA vs. escaped content to make the number of ways to provide content in Atom reach 11.
    # summary type=text/plain
    # summary type=text/html
    # summary type=application/xhtml+xml
    # content type=text/plain
    # content type=text/html
    # content type=application/xhtml+xml
    # summary type=text/plain mode=escaped
    # summary type=text/html mode=escaped
    # summary type=application/xhtml+xml mode=escaped
    # content type=text/plain mode=escaped
    # content type=text/html mode=escaped
    # content type=application/xhtml+xml mode=escaped
    That’s 12 without having to use claims like different namespace prefixes or escaped content vs. CDATA. Of course, I could go on much longer if I decided to use content from other MIME types and mode=base64, but I’m sure you get the point.
    (Ed. note: typographical/semantical edit performed to make the numbered list an actual list)

  6. Dare, to clear something up: No, I haven’t misunderstood XML at the most fundamental level, which is what you _should_ read into the sentence “Some of them should be functionally equivalent to an XML parser, and some should not.”
    The reason I mentioned every variant of escaping as well is that, whether you want it or not, _people are going to use piss-poor quasi-XML tools, or they’re going to go through regexp hell to use it_. Believing that people will create or parse RSS feeds the “one true way” is wishful thinking.
    Finally, for content in the description element, if we sideline the escaping for the purpose of this specific argument, and instead look at the semantics:
    bc.. <p>A paragraph with <a href="…">a link</a>.</p>

    <p>Another paragraph</p>

    p. Is _not_ the same as:
    bc.. A paragraph and <a href="…">a link</a>.
    Another paragraph
    p. And please, do not attempt to make this an Atom vs. RSS discussion.

  7. Dare, the problem isn’t whether you have a quadrazillion different ways to embed content in your XML feed (no matter what format), it is whether you can specify how you embed it or not.
    Atom provides a way to specify this, RSS doesn’t.

  8. Arve,
    So what’s your point? RSS is complex because you can’t process XML and HTML with regular expressions?
    Your original point was that processing content in RSS was difficult and I pointed out that things aren’t as bad as you claim if you use proper tools. You retort by claiming that people want to process RSS with improper tools? So what? That is an irrational argument. Removing a screw is hard if you have a hammer but easy with a screwdriver. What was your point again?
    Both the examples you show are embedded HTML in a description element. I assume you’re claiming that you should process newlines as @<p>@ tags. I don’t see why anyone would do that, you can only bend so far backwards for people. What happens if I mix wiki text with my HTML, such as *bold* does that also mean that you should parse the wiki-isms and the HTML?

  9. No, Dare, my point was that people who choose to use piss-poor tools for the job have a lot of extra work to do. My point was also that in any other place than Utopia, people will use these piss-poor tools to get the job done. Which means that the users of the piss-poor tools might actually benefit from knowing these eleven variants, whether they are equivalent to an XML parser or not. Please, do not read malice into where malice does not exist.
    As for the examples in my last comment: Both are in use, and while I agree with you that “both are HTML and should be treated as such”, that may not be the author’s intent. Is your goal to produce according to spec, or to provide human-readable content? Squished paragraphs are not in that category.
    BTW: There is a fairly pragmatic solution to this that should work fairly well: If there are no elements defined as block-level by HTML 4.01 in the @<description>@, but there are inline elements present, you treat double newlines as paragraphs, and any inline elements are treated as just that, inline elements.
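
A minimal Python sketch of the heuristic described above, using only the standard library. The function name, the sample string and the exact set of block-level elements are assumptions made for illustration; in particular, any tag not in the block list is simply treated as inline.

```python
import re
from html.parser import HTMLParser

# A subset of the elements HTML 4.01 defines as block-level (assumed list).
BLOCK_ELEMENTS = {
    "address", "blockquote", "div", "dl", "fieldset", "form",
    "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ol", "p", "pre",
    "table", "ul",
}


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen in a fragment."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def paragraphs_from_description(html):
    """Wrap double-newline-separated chunks in <p> when only inline markup is present."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    has_block = bool(collector.tags & BLOCK_ELEMENTS)
    has_inline = bool(collector.tags - BLOCK_ELEMENTS)
    if has_block or not has_inline:
        return html  # block-level HTML and tag-free text are left untouched
    chunks = [c.strip() for c in re.split(r"\n\s*\n", html) if c.strip()]
    return "".join(f"<p>{c}</p>" for c in chunks)


sample = 'A paragraph and <a href="http://example.com/">a link</a>.\n\nAnother paragraph'
print(paragraphs_from_description(sample))
```

Content that already uses block-level markup passes through unchanged, so feeds produced with proper tools should not notice the difference.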

  10. Arve,
    If you want to claim that there was no malicious intent in your blog post perhaps you should retract the line
    bq. Really Simple Syndication? Simple?!??
    from your post, as it is quite clear that you are exaggerating the complexity of RSS if the facts are looked at objectively. So far all you’ve stated is that it is complex if you use the wrong tool for the job, which is a fact of life regardless of whether you are working on software, hardware or some other aspect of human endeavor.

  11. Adam Fitzpatrick

     /  2004-03-17

    bq. Which means that the users of the piss-poor tools might actually benefit from knowing these eleven variants, whether they are equivalent to an XML parser or not. Please, do not read malice into where malice does not exist.
    Given your good intentions, wouldn’t it be a more rewarding use of your time and effort to encourage these people to move away from piss-poor tools, rather than helping them to make more work for themselves?

  12. bq. Given your good intentions, wouldn’t it be a more rewarding use of your time and effort to encourage these people to move away from piss-poor tools, rather than helping them to make more work for themselves?
    I seriously don’t expect people to switch from their tools, because of what I write on this blog, and how they spend their time programming is their problem, not mine.
    And to Dare:
    bq. from your post as it is quite clear that you are exaggerating the complexity of RSS if the facts are looked at objectively.
    Given the assumption that nothing I write in this blog will make people turn away from using regular expressions instead of the right tool, RSS is not simple.
    Finally: Instead of making this an “Is RSS evil?” or “Don’t use a screwdriver where a hammer is most suited” discussion, take it as an incentive to develop a set of best practices for RSS that will let even the screwdriver-owners produce and consume RSS feeds with the least amount of trouble. (The same goes for Atom, btw.)

  13. Arve,
    Processing XML is not simple, processing HTML is not simple, writing a C compiler is not simple. I don’t think coming up with a set of guidelines so that any college freshman can write a C compiler using regular expressions is a good idea especially when they can just use gcc.
    So why exactly is it a good idea to come up with “best practices” in the case of HTML and XML processing in RSS?
    I still don’t see your point.

  14. The idea is that the threshold should be lower for those who produce their personal aggregators using their stupid tools, and that the production of syndicated content should be easier for those using stupid tools.
    Those who use proper tools for the job won’t notice the difference anyway.

  15. Completely agree with Dare. Complex things such as XML require the proper tools.
    You’re trying to make something new, complex, and incredibly versatile and powerful sort of backwards compatible with tools that weren’t designed with it in mind. Should XML become regex-friendly then?! Of course NOT, IMO.

  16. 11 ways to valid RSS

    At virtuelvis, Arve Bersvendsen has identified 11 different methods of specifying content in RSS 2.0….