The problem with InnerHTML

 

Published by Julien Lecomte at 6:24 pm under Web Development

The innerHTML property is extremely popular because it provides a simple way to completely replace the contents of an HTML element. Another way to do that is to use the DOM Level 2 API (removeChild , createElement, appendChild) but using innerHTML is by far the easiest and most efficient way to modify the DOM tree. However, innerHTML has few problems of its own that you need to be aware of:

  • Improper handling of the innerHTML property can enable script-injection attacks on Internet Explorer when the HTML string contains a script tag marked as deffered: <script defer>...<script>
  • Setting innerHTML will destroy existing HTML elements that have event handlers attached to them, potentially creating a memory leak on some browsers.

There are a few other minor drawbacks worth mentioning:

  • You don't get back a reference to the element(s) you just created, forcing you to add code to retrieve those references manually (using the DOM APIs…)
  • You can't set the innerHTML property on all HTML elements on all browsers (for instance, Internet Explorer won't let you set the innerHTML property of a table row element)

I am more concerned with the security and memory issues associated with using the innerHTML property. Obviously, this problem is nothing new, and very bright people have already figured out ways to work around some of these problems.

Douglas Crockford wrote a purge function that takes care of breaking some circular references caused by attaching event handlers to HTML elements, allowing the garbage collector to release all the memory associated with these HTML elements.

Removing the script tags from the HTML string is not as easy as it seems. A regular expression should do the trick, although it's hard to know whether it covers all possible cases. Here is the one I came up with:

/<script[^>]*>((.|[\r\n])*?)<\\?\/script>/ig

Now, let's put these two techniques together in a single setInnerHTML function (Update: Thanks to those who commented on this article. I fixed the errors/holes you mentioned, and also decided to bind the setInnerHTML function to YAHOO.util.Dom)

YAHOO.util.Dom.setInnerHTML  = function (el, html ) {
    el
= YAHOO.util .Dom.get(el);
   
if (!el || typeof html !== 'string') {
       
return null;
   
}

   
// Break circular references.
   
(function (o) {

       
var a = o. attributes, i, l, n, c ;
       
if (a) {
            l
= a.length;
           
for (i = 0 ; i < l; i += 1 ) {
                n
= a[ i].name;
               
if ( typeof o[n] === 'function' ) {
                    o
[n] = null;
               
}
           
}
       
}

        a
= o. childNodes;

       
if (a) {
            l
= a.length;
           
for (i = 0 ; i < l; i += 1 ) {
                c
= o. childNodes[i];

               
// Purge child nodes.
                arguments
.callee(c);

               
// Removes all listeners attached to the element via YUI's addListener.
                YAHOO
.util.Event.purgeElement (c);
           
}
       
}

   
})(el);

   
// Remove scripts from HTML string, and set innerHTML property
    el
.innerHTML = html.replace (/<script[^>]*>((.|[\r\n])*?)<\\?\/ script>/ig, "");

   
// Return a reference to the first child
   
return el.firstChild ;
};

Voila! Let me know if there is anything else that should be part of this function, or if I missed anything obvious in the regular expression.

Update: There are obviously many more ways to inject malicious code in a web page. The setInnerHTML function barely normalizes the <script> tag execution behavior across all A-grade browsers. If you are going to inject HTML code that cannot be trusted, make sure you sanitize it first on the server side. There are many libraries available for this.


Ref: http://www.julienlecomte.net/blog/2007/12/38/

Comments

Popular posts from this blog

DevOps - Key Concepts - Roles & Responsibilities - Tools & Technologies

Rewriting the DotNetNuke Url Rewriter Module

Trouble In the House of Google