XHTML, validation, JavaScript and Wordpress
Most people don't really bother with XHTML validation. Indeed, I'd say that 99.98% of all webpages on the Web today have their DOCTYPE set automatically (be it by a site builder or a CMS) without the author even knowing about it.
Just browse sites around the Web, even those of big companies, and you'll see how rare it is to find a 100% valid HTML document.
In this article I'm gonna talk a bit about validation in general and about XHTML validation when JavaScript is involved, explain the problems I faced with my Wordpress plugin Hikari Email & URL Obfuscator when I tried to insert JavaScript inside posts, and show what I did to solve them. That's a lot to cover, so use the following Table of Contents to jump straight to the subject that concerns you, or read the whole article at your own pace.
- Introduction
- Let's talk about validation
- But, why is validation important at all?
- And where JavaScript comes in place?
- Wordpress doesn't like CDATA inside posts
- XHTML valid JavaScript inside Wordpress posts
- Conclusion
Let's talk about validation
One of the reasons HTML succeeded is the flexibility browsers have when parsing it. Its authors were smart and foresaw that the coming webdesigners wouldn't be software developers, and wouldn't know programming languages or how languages are parsed and compiled.
In case you don't know, programming languages have a very strict syntax. Write anything wrong and you'll face a compile error. Everybody who starts learning to develop software has a hard time getting their first program to run. Generally, webdesigners don't like this strictness, and HTML's flexibility helps them a lot.
But what do I mean by flexibility? I mean that when a browser finds an error in the document syntax, its parser doesn't throw an error in the user's face. It tries to figure out what the author meant and renders the page based on what it thinks the right syntax was. It tries to fix errors instead of warning about them.
But of course, software doesn't think. What really happens is that browser developers write error-recovery code. They know the most common errors and are prepared for them, and when every fixing attempt has been used and the syntax is still wrong, the parser simply tries to "close things".
HTML itself is very "open" about syntax strictness. For example, it allows some tags to be opened without being closed, or lets the closing be implicit. That was great in the beginning, but then the browsers war began, M$ Internet Explorer vs Netscape Navigator, with all its incompatibilities and proprietary extensions. Even worse, some genius people were thinking of using webpages as interfaces to software, starting what would become known as web applications.
And it was the needs of web apps that started the talks that later resulted in the creation of XHTML. As you should know, XHTML is "HTML over XML", a language based on XML. All HTML elements are present, but now they are built on XML, with all its syntax requirements.
With XHTML, <br> became <br />, <input checked> became <input checked="checked" />, and so forth.
But, why is validation important at all?
As I've said, we had the browsers war. It's not so evident nowadays, but its third period is still going on.
Today we have what we call Web Standards, and these standards define the syntax of our web languages (HTML, XHTML, CSS, etc.) in a standardized way, so that everybody can follow it. Of course, Microsoft and its Internet Explorer don't follow standards, but that's a problem for those who insist on using it, not ours.
Still today we design websites for browsers, not for languages. What I mean is that, in my dreams, in the perfect world, we wouldn't say "This website works in all browsers except Microsoft's", we'd say "This website is designed using XHTML 1.0 Strict and CSS 2.1, in this link you can see supported browsers".
And we'd have a community site where browsers would be validated against each version of each standard language, listing all the browsers that conform to that version.
Really, in the software development world this works! Develop something in ANSI C, and any conforming compiler is able to compile it. Want a special feature? Include a lib that has it in your source and it will compile. Of course some compilers try to lock in their users so that their code can't be compiled with any other compiler, but if you remain in the OpenSource world you'll be pretty much fine. I myself use Pidgin IM, whose UI is developed using GTK, and it works great on Windows.
Back to the Web world, we shouldn't really be forced to test our designs on every browser and report the supported ones. We should be allowed to just choose one browser that is 100% valid for the languages we wanna use, and test on it. If it works on that browser, any other browser that is also valid for that language will render the page the same way.
... will render, if our page is valid in that language! Yes, we can't complain that browsers don't render our pages properly if we don't design our pages right! HTML syntax fixing is there to help us, and it's better to have that help than to have our pages not rendered at all, but good designers and developers shouldn't need it.
And here I come to the side effect of HTML's flexibility. It's great at the end point, because pages are rendered even when they have syntax errors, but it's terrible during development, because errors are hidden and designers think everything is OK when it isn't. There should really be an option in browsers to disable error fixing and print HTML error messages instead.
It's great that so many websites full of errors are rendered fine, but there are so many websites full of errors precisely because browsers don't warn about them! Error fixing is an end-user feature. It should be a last resort, not something that lets webdesigners be careless and sloppy. The fact that browsers will fix your mess doesn't mean you can mess it all up.
Now, where the problem resides. There is a standards body, the W3C, that has the authority to define standards. Once a standard is defined, browsers, automatic builders and webdesigners start working to support it. While we're all talking inside the standard and the syntax is valid, browsers (should) know how to render each element with its attributes inside its context, automatic builders (should) know which HTML element to use for each input they receive from their users, and webdesigners (should) know which HTML element to use for each design they wanna do.
While webdesigners and automatic webpage builders create valid HTML documents, browsers (should) know exactly how to render them. Everybody is talking inside the standard, everybody can understand each other perfectly, and my dream world is fine.
But what happens when we design a valid HTML document, and a browser doesn't support that HTML version and renders it differently from what we wanted? What happens is our real world: browsers making a mess, and us having to step outside the standard to design our websites around their proprietary "features" and bugs.
And what happens when a webdesigner or an automatic webpage builder creates an HTML document with invalid code? Browsers try to fix the invalid code (if they are able to detect the error to begin with). But since the standard doesn't say how errors should be handled, each browser deals with them differently. Error-fixing code is always proprietary, since there is no standard for it, and so websites get more and more browser dependent.
Internet Explorer is great for HTML documents full of errors, and terrible for valid documents...
So, that's why validation is so important. Valid documents are rendered the same by all browsers that correctly support the language version used. If you don't follow the standards, you must support each version of each browser separately. And while most websites don't follow any standard at all, browsers will keep neglecting standards, because most people don't bother with them and browser devs are faced more often with the challenge of fixing non-standard errors than with correctly rendering standard elements. Error fixing should be the last measure, used as rarely as possible. It should be the exception, not the rule.
If everybody followed that, my dream world of websites supporting languages and not browsers would come true.
And where JavaScript comes in place?
I've talked in general about XHTML and about validation, now let's get a bit closer to the objective of this article.
OK, we are already designing valid XHTML webpages and everything is fine. But now we wanna add some dynamic behavior to our websites, and for that we use JavaScript.
In the past, during the browsers war, JavaScript was a proprietary feature. Indeed, Microsoft even today doesn't use it. What they use is JScript, their own language that supports parts of JavaScript. And this language was the main weapon during the Internet Explorer vs Netscape Navigator war. From its cradle it was designed not to be a standard, so that a script developed for one browser wouldn't run on any other!
But this is already in the past. Today we can pretty much develop JavaScript programs that run in most browsers. But when it comes to XHTML validation, we have a dilemma. After all, JavaScript is not HTML, and because of that, HTML parsers must be told in some way that this code isn't HTML and should be skipped.
In the past, during the said browsers war, JavaScript was part of what Microsoft called DHTML. Dynamic HTML was a commercial name that meant HTML dynamically generated on the client side, instead of "simply" served from the server side.
In those times, the fashion was having full HTML documents, or most parts of them, generated by JavaScript, and of course any browser that wasn't able to run that code would produce an error or simply show a blank page.
Time passed, Microsoft won the first stage of the war, developers grew up and evolved, and then Unobtrusive JavaScript emerged. With it, JavaScript lost its role as an HTML generator and took on the role of an HTML (and preferably CSS) controller.
Today, websites should be served, rendered, viewed and used even when JavaScript is unavailable or disabled, and JavaScript programs should be only an enhancement, available to those who wanna use JavaScript, and never a requirement. Yes, I said never, meaning never!
JavaScript stopped being used inline, in the middle of HTML documents, and moved into its own *.js files, which are included in the document header or at the end of its body. This way, after the browser finishes rendering the document, these scripts find by themselves the HTML elements they wanna control, attach themselves to them, and do their job. And these elements must work even without JavaScript, or otherwise not be rendered, remaining hidden.
This is how best practice says it must be done today. But there are still times when JavaScript must be added inline to documents.
One example is when Java Servlet/PHP/Ruby/etc. must pass data parameters for JavaScript to use, that is, when the server-side software must talk to the client side. In these cases, a JavaScript object is created inside the document, and when the code residing in the *.js files runs, it gets this object full of parameters and uses its data. This data can also be passed in a function call or by other methods, but what's important here is that it is included in the document using the following well-known syntax:
    <script type="text/javascript">
    customFunction("JavaScript code here, the least possible, preferably only parameters data to be gathered later by a *.js code");
    </script>
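To make the idea of passing parameters a bit more concrete, here is a minimal sketch of how code living in a *.js file might pick up data left behind by an inline script. The names `pluginParams` and `readParams`, and the fields inside the object, are hypothetical and invented only for this illustration; `globalThis` stands in for the browser's `window` so the sketch also runs outside a browser.

```javascript
// Hypothetical inline part: the server-side software would print this
// inside a <script> element, filling in the values.
globalThis.pluginParams = { linkId: "contact-link", user: "someone", domain: "example.com" };

// Hypothetical *.js part: runs later and reads whatever the page left for it.
function readParams(scope) {
  const params = scope.pluginParams;
  // Keep working even when the inline block is missing.
  if (!params) {
    return null;
  }
  return {
    linkId: params.linkId,
    address: params.user + "@" + params.domain,
  };
}

const result = readParams(globalThis);
console.log(result);
```

The inline part stays as small as possible, only data, and all the behavior remains in the external file.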
And here's where XHTML validation comes in. In the past, when JavaScript/DHTML was a new and exciting feature, only the newest browsers supported JavaScript. Older ones, not knowing what to do with that script text, just printed it in the page, which then became full of "gibberish" that of course bothered website visitors and wasn't wanted by webdesigners.
The solution for this issue was simple: old browsers only understand HTML, so let's surround JavaScript code with HTML comment tags. Old browsers will see it as a plain comment and skip it, while "JavaScript ready" browsers detect JavaScript as something different from HTML, and those comment tags have no effect when parsing the JavaScript code! And then we started seeing (well, it was meant not to be seen, but anyway) this:
    <script type="text/javascript">
    <!--
    customFunction("JavaScript code here, the least possible, preferably only parameters data to be gathered later by a *.js code");
    -->
    </script>
Great, everything was perfect (when we had the right browser...). And then time passed, XHTML came out, web apps became a reality, and things started requiring stuff to be done right. JavaScript is not an HTML comment, and so it shouldn't be treated as one.
It's better if user-agents (nowadays there is much more software using the Web, working as HTTP clients and parsing and rendering HTML, than just browsers, which are one example of user-agents), while parsing HTML, see JavaScript as "a block of text that has a meaning and a reason to be there, and should be left alone, because it's not HTML".
This approach is better than "HTML comment", because parsers may want to strip comments while converting plain text into software-understandable data, and if they do so our JavaScript may be removed too. Here comes what I said before about error fixing: real HTML comments, such as <!-- here starts left sidebar -->, are good to go, but "comments" such as <script type="text/javascript"><!-- customFunction("JavaScript code here, the least possible, preferably only parameters data to be gathered later by a *.js code"); --></script> are not. And if both are delimited using the same "tag", browsers must figure out by themselves what's inside those comment tags, or just leave all comments in place. Not good, not good.
A solution came out. Instead of being marked as an HTML comment, JavaScript started being marked as XML CDATA, < ![CDATA[ ]] >. XHTML is now XML after all!
CDATA means character data, and is used to mark real data that has importance and must be kept (comments, on the other hand, are meant just for developers and are discarded by compilers and interpreters in pretty much every language). It's just not XML markup, but may be used later by the software. Bingo!
But then we had another problem. Just as in the past old browsers didn't understand JavaScript and printed it in the page, when XHTML came out almost no browser understood it as XML. XHTML was just treated as plain HTML (and it still is by Internet Explorer, which even today doesn't understand XHTML...).
XHTML is XML, but it is compatible with HTML too, and software that understands HTML (such as Internet Explorer...) understands XHTML with ease. With a few exceptions, and CDATA is one of them. Browsers that don't understand XHTML don't know what to do when they find CDATA tags, and so we arrive at the problem:
XHTML parsers must be told that JavaScript is not XML and should be left alone instead of parsed, while HTML browsers (at that time) already knew what JavaScript was and didn't need any special measure such as the <!-- --> used before.
The new solution was to use JavaScript comments, not HTML ones, to hide what old browsers didn't understand. Remember, in the past old browsers understood HTML but not JavaScript, and HTML comment tags were used to hide JavaScript from them. Now old browsers didn't understand XML, but they understood JavaScript, so JavaScript comments were used to hide the XML CDATA tags from them. And here's what we use even today:
    <script type="text/javascript">
    /* < ![CDATA[ */
    customFunction("JavaScript code here, the least possible, preferably only parameters data to be gathered later by a *.js code");
    /* ]] > */
    </script>
Note that I've been separating the < and > from the rest of the tag, because in Wordpress I can't write it as is. And here's where Wordpress creates our newest problem, one that has been troubling plugin developers for 3 years now...
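The three delimiting styles seen so far can be put side by side in a small sketch. This is only an illustration: the function name `wrapInlineScript` and the style names are made up here, and the CDATA markers are written without the extra spaces, since this code lives in its own file and not inside a Wordpress post.

```javascript
// Wrap inline JavaScript using one of the three historical styles.
// The "style" values are names invented for this sketch.
function wrapInlineScript(code, style) {
  const open = '<script type="text/javascript">\n';
  const close = "\n</script>";
  if (style === "html-comment") {
    // Oldest style: hide the script from browsers that only know HTML.
    return open + "<!--\n" + code + "\n-->" + close;
  }
  if (style === "cdata") {
    // Plain XML style: tell XML parsers the content is character data.
    return open + "<![CDATA[\n" + code + "\n]]>" + close;
  }
  if (style === "commented-cdata") {
    // CDATA markers hidden from HTML-only browsers by JavaScript comments.
    return open + "/* <![CDATA[ */\n" + code + "\n/* ]]> */" + close;
  }
  throw new Error("unknown style: " + style);
}

console.log(wrapInlineScript("customFunction('data');", "commented-cdata"));
```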
Wordpress doesn't like CDATA inside posts
Today, we don't need to hide CDATA tags from old browsers anymore. I've not tested with IE6, but I believe that even it doesn't choke on those tags anymore. We just keep JavaScript comments around them as a precaution, but soon we should start removing them from our code.
Meanwhile, I've also read people saying that forthcoming browsers won't work with HTML comment tags inside scripts anymore, because they shouldn't be there anyway, since JavaScript is not a comment to be removed. I don't believe browsers will do that and stop processing JavaScript hidden by HTML comment tags, but being able to strip comments would surely clean things up and make parsing easier and less bug prone. So it's desirable, and that being so we should do what we can to make it possible, and old HTML comment tags, meant only to hide JavaScript from browsers as old as IE3, should be avoided.
But when we use Wordpress we face a problem: it doesn't like CDATA inside posts and pages. Widgets are OK, actions and filters outside posts are also OK, but anything that goes inside posts, be it from filters, shortcodes, or the post/page itself, is a no go.
The whole problem resides in the function the_content(), found in wordpress/wp-includes/post-template.php:
    <?php
    /**
     * Display the post content.
     *
     * @since 0.71
     *
     * @param string $more_link_text Optional. Content for when there is more text.
     * @param string $stripteaser Optional. Teaser content before the more text.
     */
    function the_content($more_link_text = null, $stripteaser = 0) {
        $content = get_the_content($more_link_text, $stripteaser);
        $content = apply_filters('the_content', $content);
        $content = str_replace(']]>', ']]&gt;', $content);
        echo $content;
    }
    ?>
In case you don't know, the_content() is the function used to print the content of Wordpress posts and pages. The function starts by getting the content from get_the_content(), applies the 'the_content' filter, and then everything goes downhill, when str_replace() is used to break the CDATA closing tag. &gt; is the HTML entity for >. HTML entities stand in for the special characters HTML uses to define tags, attributes, etc. These special characters are of course not printed in the rendered page, because they are used for markup, so when we are not using them as part of tags and want them to be seen in the browser, we use these entities.
The consequence of this replacement is that an XML CDATA section is opened (since, if nothing is wrong, each CDATA closing tag has an opening one somewhere) and never closed, and an XHTML document that would be valid becomes invalid and full of errors when parsed. All the errors are about opened tags not being closed: once a CDATA section is opened, everything that would otherwise be understood as XML tags is skipped by the parser, and since the CDATA section is never closed, all the tags opened before it are never closed as far as the parser can tell, and each of them is reported as an error.
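The effect of that replacement can be reproduced outside Wordpress. The sketch below is a JavaScript imitation of the PHP str_replace() call, not Wordpress code; the helper names `breakCdataClose` and `countCdata` and the sample post are made up for the demonstration. It just counts how many CDATA sections are opened and closed before and after the replacement.

```javascript
// JavaScript imitation of the PHP call str_replace(']]>', ']]&gt;', $content).
function breakCdataClose(content) {
  return content.split("]]>").join("]]&gt;");
}

// Count CDATA openings and closings in a piece of markup.
function countCdata(markup) {
  return {
    opened: markup.split("<![CDATA[").length - 1,
    closed: markup.split("]]>").length - 1,
  };
}

// A made-up post body containing a properly delimited inline script.
const post = '<p>Hi</p><script type="text/javascript">/* <![CDATA[ */ f(); /* ]]> */</script>';

const before = countCdata(post);
const after = countCdata(breakCdataClose(post));
console.log(before, after);
```

Before the replacement the opening and closing counts match. After it, the opening marker is still there but no closing marker is left, which is the unclosed section an XML parser complains about.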
But why does Wordpress do that? Why does it break CDATA closing tags? Looking at the code again, we can see it's not a bug: the str_replace() call is very clean and readable. Somebody put it there at some point in the past for a reason. Why?
This is a very old discussion. In the Wordpress trac (http://core.trac.wordpress.org/ticket/3670) we have a 3-year-old ticket (from 25/01/2007) reporting the bug, back in version 2.1 of Wordpress, and it was still being discussed in 2010, without any prospect of a change.
Many solutions have been proposed: remove it completely, move it before the filters are run so that plugins are not affected, move it into a filter so that it can be removed when needed... and nothing was actually done. It remains there, hardcoded, hopelessly sitting after the post content is queried and the filters are applied, breaking everything, everything!, that may have added a CDATA section before it.
But why the hell is it there? That's the funny part, nobody really knows! Whoever added that line seems not to be following Wordpress development anymore. I even suppose it has been there since b2, the blogging tool Wordpress was forked from. Whoever did it surely had a reason, but it wasn't documented and they are not available anymore to say why, and since nobody knows, the Wordpress authorities seem afraid to touch it and break something. The ugly fear of fixing a "bug" and creating an even bigger one.
Some people suppose it's there because, apparently, when Wordpress posts are served as RSS, they are marked as CDATA (I haven't verified whether they are), and a CDATA closing tag inside the post would break the RSS, which is why a post in RSS can't contain a CDATA closing tag. But if that's so, why break this tag in aaaaaall uses of the_content(), and not only when RSS is used? The RSS issue perfectly explains the replacement as it's done, but doesn't justify it. If that was the reason, the replacement should be moved to a proper place, or the whole JavaScript tag could even be removed from feeds using a regex, which would be a much better solution.
In my personal opinion, in the early days of Wordpress, when it was still beta and used by only a small number of people, when even fewer had access to its code, somebody noticed this RSS issue and simply added that str_replace() as a quick and ugly fix for their personal issue, hoping that somebody else would add a better one. But that never happened. Almost nobody notices it, because JavaScript inside posts is very rare, and JavaScript marked as CDATA even more so, and the ugly fix remained there.
Let me be clear about it: CDATA isn't meant only for JavaScript. It's used in XML for any kind of text block that may break the XML structure because it contains text that isn't part of the markup but may be understood as such. (A clear example is a JavaScript that prints some HTML, or even has HTML tags in a string.) But in XHTML, practically only JavaScript has the possibility of doing so (remember that special characters, when not used for markup in tags, should be converted to HTML entities!).
The JavaScript itself is not a problem and could be used in XHTML without being marked as CDATA, if we can ensure it won't contain anything that may break the XHTML structure. As I've said, CDATA is used exactly for those situations where a text is not part of the XML structure but may be mistaken for it. You can test it yourself, I've already seen it: JavaScript without text resembling HTML tags doesn't make an XHTML document invalid.
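A rough way to check whether a given script can go without CDATA is to look for the two characters that start markup in XML text, `<` and `&`. The sketch below does only that. The function name `needsCdata` is hypothetical, and the check is a simplification: it assumes the page is parsed as XML and ignores less common cases.

```javascript
// Rough check: does this script contain characters that an XML parser
// would try to read as markup ('<' starts a tag, '&' starts an entity)?
function needsCdata(code) {
  return code.includes("<") || code.includes("&");
}

console.log(needsCdata('customFunction("only data here");')); // no markup characters
console.log(needsCdata("if (a < b && c) { run(); }"));        // has '<' and '&'
console.log(needsCdata('document.write("<b>hi</b>");'));      // has an HTML tag in a string
```

Note that even a harmless comparison like `a < b` or a logical `&&` is enough to need CDATA, not only strings holding HTML tags.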
Also, XHTML wasn't originally conceived for websites, it was for web applications, because HTML was designed to be written by humans and not to be built by software. They wanted to use the Web as an interface for applications and needed HTML for that, but wanted a markup language with XML's advantages: closer to software development, better at storing more complex data, and easier for software to build.
It wasn't even meant to be served as text/html to begin with. XHTML was meant to always be served as application/xhtml+xml. But that doesn't mean we can't, and both browsers and servers, including server software like CMSs, should support it.
XHTML valid JavaScript inside Wordpress posts
As I've said, Unobtrusive JavaScript encourages us to leave JavaScript in its own file, separate from the HTML document, which only references it. JavaScript code should be used inline in HTML only to pass data to that code, and this should preferably be done in the header or at the end of the body.
But still, having JavaScript in the middle of the document isn't prohibited. It's not wrong, and when needed it's perfectly acceptable. All standards and design patterns allow it, and Wordpress can't block us from doing so.
So, what can we do about it?
The best solution would be for the person who added that code to show up and explain why it was added, so that we could find a better solution to the original problem. Since that won't happen, the next best thing would be for experienced Wordpress users to test everything possible without it and see what breaks, and once those bugs are found, develop a proper fix.
But considering the problem has been known for at least 3 years, and nobody who tried to solve it was able to, the next best solution is for as many people as possible to remove it and use Wordpress without it, hoping that somebody finds out what it's protecting.
Removing the breaking code is easy. The problem is that every time we update Wordpress we must go back there and remove it again. But that's what I've done, and with it we can use JavaScript and mark it up as it should be in XHTML:
    <?php
    /**
     * Display the post content.
     *
     * @since 0.71
     *
     * @param string $more_link_text Optional. Content for when there is more text.
     * @param string $stripteaser Optional. Teaser content before the more text.
     */
    function the_content($more_link_text = null, $stripteaser = 0) {
        $content = get_the_content($more_link_text, $stripteaser);
        $content = apply_filters('the_content', $content);
        // $content = str_replace(']]>', ']]&gt;', $content);
        echo $content;
    }
    ?>
If you don't wanna touch Wordpress files, or you are developing a plugin and of course can't ask all your users to do it, you may try to avoid adding JavaScript to posts. If you are doing it from a plugin using a filter, just store the script text and add it to the footer. You should really learn about Unobtrusive JavaScript if you haven't yet. There are a few ways for JavaScript to locate anything in the DOM and change it, and the easiest is using the id attribute.
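As a small illustration of locating an element by its id from an external script, here is a sketch. The helper `enhanceById`, the id `contact-link` and the `fakeDoc` object are all hypothetical; `fakeDoc` only stands in for the browser's `document` so the example can run outside a browser. In a real page you would pass `document` itself.

```javascript
// Attach behaviour to an element found by id; do nothing if it is absent.
// "doc" is anything with a getElementById method, normally the browser's document.
function enhanceById(doc, id, enhance) {
  const element = doc.getElementById(id);
  if (!element) {
    return false;
  }
  enhance(element);
  return true;
}

// Stand-in for a real document, so the sketch runs outside a browser.
const fakeDoc = {
  elements: { "contact-link": { href: "#" } },
  getElementById(id) {
    return this.elements[id] || null;
  },
};

const done = enhanceById(fakeDoc, "contact-link", (el) => {
  el.href = "mailto:someone@example.com";
});
console.log(done, fakeDoc.elements["contact-link"].href);
```

The post itself then only carries a plain element with an id, and no script at all has to pass through the post content.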
If you can't avoid adding your code to the post, the next solution is to not mark it up properly, which is sad. If you can ensure your code will never contain any text that resembles an HTML tag, you can just leave it without any delimiters. As far as I know that's OK for webpages served as text/html that are only meant for browsers. If you can't ensure that, just use <!-- --> HTML comment tags. I've never seen a report of them not working. For now they are just outdated/deprecated for their original purpose, and as far as I know they are 100% safe and work fine.
Conclusion
What I'm doing in my plugin is using an HTML comment when a script is added to a post (when the 'the_content' and 'the_excerpt' filters are used), and CDATA otherwise. I also have an option where users can choose the best solution for themselves. That way I implement all the solutions and only suggest which one to use, leaving the power to choose to them. But I feel this option is too advanced, I fear some users won't understand it at all, and it's something they shouldn't need to deal with.
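The decision described above can be sketched in a few lines. This is not the plugin's actual code, just an illustration of the rule: the function `chooseStyle`, the style names and the `override` argument (standing for the user option) are assumptions made for this example.

```javascript
// Filters whose output ends up inside the post body, where ']]>' gets broken.
const POST_FILTERS = ["the_content", "the_excerpt"];

// Pick the delimiter style for an inline script.
// "override" models the user option; the style names are hypothetical.
function chooseStyle(filterName, override) {
  if (override) {
    return override;
  }
  return POST_FILTERS.includes(filterName) ? "html-comment" : "cdata";
}

console.log(chooseStyle("the_content"));         // inside a post
console.log(chooseStyle("widget_text"));         // outside posts
console.log(chooseStyle("the_excerpt", "none")); // user decided otherwise
```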
And you? What do you think, what do you choose to do? Do you know a better solution, or know why that line is in Wordpress's the_content()? If you have anything that could help, please comment.