Category Archives: 7990

Cleaning HTML With Regular Expressions

While participating in a forum discussion, the need to clean up HTML from “dangerous” constructs came up.

In the present case it was needed to remove SCRIPT, OBJECT, APPLET, EMBBED, FRAMESET, IFRAME, FORM, INPUT, BUTTON and TEXTAREA elements (as far as I can think of) from the HTML source. Every event attribute (ONEVENT) should also be removed keep all other attributes, though.

HTML is very loose and extremely hard to parse. Elements can be defined as a start tag (<element-name>) and an end tag (</element-name>) although some elements don’t require the end tag. If XHTML is being parsed, elements without an end tag require the tag to be terminated with /> instead of just >.

Attributes are not easier to parse. By definition, attribute values are required to be delimited by quotes () or double quotes (), but some browsers accept attribute values without any delimiter.

We could build a parser, but then it will become costly to add or remove elements or attributes. Using a regular expression to remove unwanted elements and attributes seems like the best option.

First, lets capture all unwanted elements with start and end tags. To capture these elements we must:

  • Capture the begin tag character followed by the element name (for which we will store its name – t): <(?<t>element-name)
  • Capture optional white spaces followed by any character: (\s+.*?)?
  • Capture the end tag character: >
  • Capture optional any characters: .*?
  • Capture the begin tag character followed by closing tag character, the element name (referenced by the name – t) and the end tag character: </\k<t>>
<(?<t>tag-name(\s+.*?)?>.*?</\k<t>>

To capture all unwanted element types, we end up with the following regular expression:

<(?<t>script|object|applet|embbed|frameset|iframe|form|textarea)(\s+.*?)?>.*?</\k<t>>

Next, lets capture all unwanted elements without an end tag. To capture these elements we must:

  • Capture the begin tag character followed by the element name: <element-name
  • Capture optional white spaces followed by any character: (\s+.*?)?
  • Capture an optional closing tag character: /?
  • Capture the end tag character: >
<tag-name(\s+.*?)?/?>

To capture all unwanted element types, we end up with the following regular expression:

<(script|object|applet|embbed|frameset|iframe|form|textarea|input|button)(\s+.*?)?/?>

To remove those unwanted elements from the source HTML, we can combine these two previous regular expressions into one and replace any match with an empty string:

Regex.Replace(
    sourceHtml,
    "|(<(?<t>script|object|applet|embbed|frameset|iframe|form|textarea)(\\s+.*?)?>.*?</\\k<t>>)"
        + "|(<(script|object|applet|embbed|frameset|iframe|form|input|button|textarea)(\\s+.*?)?/?>)"    ,
string.Empty);

And finally, the unwanted attributes. This one is trickier because we want to capture unwanted attributes inside an element’s start tag. To achieve that, we need to match an element’s opening tag and capture all attribute definitions. To capture these attributes we must:

  • Match but ignore the begin tag character followed by any element name: (?<=<\w+)
  • Match all:
    • Don’t capture mandatory with spaces: (?:\s+)
    • Capture attribute definition:
      • Capture mandatory attribute name: \w+
      • Capture mandatory equals sign: =
      • Capture value specification in one of the forms:
        • Capture double quoted value: “[^"]*”
        • Capture single quoted value: ‘[^']*’
        • Capture unquoted value: .*?
  • Match but ignore end tag: (?=/?>)
(?<=<\w+)((?:\s+)(\w+=(("[^"]*")|('[^']*')|(.*?)))*(?=/?>)

The problem with the previous regular expression is that it matches the start tag and captures the whole list of attributes and not each unwanted attribute by itself. This prevents us from from replacing each match with a fixed value (empty string).

To solve this, we have to name what we want to capture and use the Replace overload that uses a MatchEvaluator.

We could capture unwanted attributes as we did for the unwanted elements, but then we would need to remove them from the list of all the element’s attributes. Instead, we’ll capture the wanted attributes and build the list of attributes. To identify the wanted attributes, we’ll need to name them (a). The resulting code will be something like this:

Regex.Replace(
    sourceHtml,
    "((?<=<\\w+)((?:\\s+)((?:on\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))|(?<a>(?!on)\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))))*(?=/?>))",
    match =>
    {
        if (!match.Groups["a"].Success)
        {
            return string.Empty;
        }
        
        var attributesBuilder = new StringBuilder();
        
        foreach(Capture capture in match.Groups["a"].Captures)
        {
            attributesBuilder.Append(' ');
            attributesBuilder.Append(capture.Value);
        }
        
        return attributesBuilder.ToString();
    }
);

To avoid parsing the source HTML more than once, we can combine all the regular expressions into a single one.

Because we are still outputting only the wanted attributes, there’s no change to the match evaluator.

A few options (RegexOptions) will also be added to increase functionality and performance:

  • IgnoreCase: For case-insensitive matching.
  • CultureInvariant: For ignoring cultural differences in language.

  • Multiline: For multiline mode.

  • ExplicitCapture: For capturing only named captures.

  • Compiled: For compiling the regular expression into an assembly. Only if the regular expression is to be used many times.


The resulting code will be this:


Regex.Replace(
    sourceHtml,
    "(<(?<t>script|object|applet|embbed|frameset|iframe|form|textarea)(\\s+.*?)?>.*?</\\k<t>>)"
        + "|(<(script|object|applet|embbed|frameset|iframe|form|input|button|textarea)(\\s+.*?)?/?>)"
        + "|((?<=<\\w+)((?:\\s+)((?:on\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))|(?<a>(?!on)\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))))*(?=/?>))",
    match =>
    {
        if (!match.Groups["a"].Success)
        {
            return string.Empty;
        }
        
        var attributesBuilder = new StringBuilder();
        
        foreach(Capture capture in match.Groups["a"].Captures)
        {
            attributesBuilder.Append(' ');
            attributesBuilder.Append(capture.Value);
        }
        
        return attributesBuilder.ToString();
    },
    RegexOptions.IgnoreCase
        | RegexOptions.Multiline
        | RegexOptions.ExplicitCapture
        | RegexOptions.CultureInvariant
        | RegexOptions.Compiled
);

This was not extensively tested and there might be some wanted HTML remove and some unwanted HTML kept, but it’s probably very close to a good solution.

More On ASP.NET Validators And Validation Summary Rendering of Properties

On previous posts [^][^] I mentioned the size of ASP.NET validators and validation summary rendering and the fact that expando attributes are being used to add properties. Mohamed also mentions this issue.


Besides the fact that custom attributes aren’t XHTML conformant, Firefox differs from Internet Explorer in the way it handles these attributes.


On Internet Explorer, these attributes are converted in string properties of the HTML element. On Firefox, on the other hand, these attributes are only accessible through the attributes collection.


I wonder why I don’t like client-side JavaScript development.

The Cause Of ASP.NET Validators And Validation Summary Slowness

When building ASP.NET pages, if you use too many validators and validation summaries your pages can become very slow. Have you ever wondered why?


Lets build a simple page web page with a few validators. Something like this:


Web page with validation


The page is composed of:



ASP.NET renders the ValidationSummary as a DIV and each validator as a SPAN and uses expando attributes to add properties to those elements.


According to the documentation, expando attributes are set dynamically from JavaScript to preserve XHTML compatibility for the rendered control’s markup.


The problem is that all that JavaScript makes the HTML document larger and slower to execute than if the properties were rendered in HTML as attributes of the elements.


For such a small page, the difference in size approaches 2k bytes. If you add a few dozen validators to he page, the slowness is noticeable.


I’m all in favor of strict standards and standards compliance, but in this case, I wish XHTML would allow arbitrary attributes.