The problem with InnerHTML
Published by Julien Lecomte at 6:24 pm under Web Development
The innerHTML
property is extremely popular because it provides a simple way to completely replace the contents of an HTML element. Another way to do that is to use the DOM Level 2 API (removeChild
, createElement
, appendChild
) but using innerHTML
is by far the easiest and most efficient way to modify the DOM tree. However, innerHTML
has few problems of its own that you need to be aware of:
- Improper handling of the
innerHTML
property can enable script-injection attacks on Internet Explorer when the HTML string contains a script tag marked as deffered:<script defer>...<script>
- Setting
innerHTML
will destroy existing HTML elements that have event handlers attached to them, potentially creating a memory leak on some browsers.
There are a few other minor drawbacks worth mentioning:
- You don't get back a reference to the element(s) you just created, forcing you to add code to retrieve those references manually (using the DOM APIs…)
- You can't set the
innerHTML
property on all HTML elements on all browsers (for instance, Internet Explorer won't let you set theinnerHTML
property of a table row element)
I am more concerned with the security and memory issues associated with using the innerHTML
property. Obviously, this problem is nothing new, and very bright people have already figured out ways to work around some of these problems.
Douglas Crockford wrote a purge
function that takes care of breaking some circular references caused by attaching event handlers to HTML elements, allowing the garbage collector to release all the memory associated with these HTML elements.
Removing the script tags from the HTML string is not as easy as it seems. A regular expression should do the trick, although it's hard to know whether it covers all possible cases. Here is the one I came up with:
/<script[^>]*>((.|[\r\n])*?)<\\?\/script>/ig
Now, let's put these two techniques together in a single setInnerHTML
function (Update: Thanks to those who commented on this article. I fixed the errors/holes you mentioned, and also decided to bind the setInnerHTML
function to YAHOO.util.Dom
)
YAHOO.util.Dom.setInnerHTML = function (el, html ) {
el = YAHOO.util .Dom.get(el);
if (!el || typeof html !== 'string') {
return null;
}
// Break circular references.
(function (o) {
var a = o. attributes, i, l, n, c ;
if (a) {
l = a.length;
for (i = 0 ; i < l; i += 1 ) {
n = a[ i].name;
if ( typeof o[n] === 'function' ) {
o[n] = null;
}
}
}
a = o. childNodes;
if (a) {
l = a.length;
for (i = 0 ; i < l; i += 1 ) {
c = o. childNodes[i];
// Purge child nodes.
arguments .callee(c);
// Removes all listeners attached to the element via YUI's addListener.
YAHOO.util.Event.purgeElement (c);
}
}
})(el);
// Remove scripts from HTML string, and set innerHTML property
el.innerHTML = html.replace (/<script[^>]*>((.|[\r\n])*?)<\\?\/ script>/ig, "");
// Return a reference to the first child
return el.firstChild ;
};
Voila! Let me know if there is anything else that should be part of this function, or if I missed anything obvious in the regular expression.
Update: There are obviously many more ways to inject malicious code in a web page. The setInnerHTML
function barely normalizes the <script>
tag execution behavior across all A-grade browsers. If you are going to inject HTML code that cannot be trusted, make sure you sanitize it first on the server side. There are many libraries available for this.
Ref: http://www.julienlecomte.net/blog/2007/12/38/
Comments