What’s wrong with ASP.NET? HTML encoding

The problem


Back when ASP.NET was first introduced, I had pretty high hopes that the new controls would offer support for automatic HTML encoding. Unfortunately, there was very little of this, and most of it was more than a bit lukewarm (more on this later). In some ways, things have improved a bit in version 2.0, but they’re considerably worse in others.


Before you read any further, you might want to ask yourself which ASP.NET controls perform HTML encoding for you and under what circumstances this is done. If the answer doesn’t leap to mind, you’ve perhaps got a first inkling that there might be a little problem with API consistency and/or the documentation. Then again, maybe you’ve never worried about HTML encoding in your web applications, in which case I’d strongly recommend that you read up on HTML injection and cross-site scripting. A good starting point might be CERT Advisory CA-2000-02.


We’ll look at which controls perform HTML encoding soon. First, we’re going to need to nail down some conceptual stuff because not all encoding is created equal. You may already be aware that HTML, URLs, and client-side script use different encodings. For the sake of simplicity, the remainder of this post will refer mainly to HTML encoding, although the other two forms of encoding do merit consideration as well.


There is more than one flavour of HTML encoding, even within ASP.NET. The first is exposed via the System.Web.HttpUtility.HtmlEncode methods. These encode the characters >, <, &, and ", as well as any characters with codes between 160 and 255, inclusive. The other main encoding flavour used by ASP.NET is “attribute” encoding, which is exposed via the System.Web.HttpUtility.HtmlAttributeEncode methods. In ASP.NET 1.1, these encode the & and " characters only. In ASP.NET 2.0, these encode the characters <, &, and ".
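To make the difference between the two flavours concrete, here is a minimal Python sketch that models the character sets listed above. It is not the .NET implementation: the function names are mine, and the choice of decimal numeric entities for the 160 to 255 range is an assumption made for illustration.

```python
def html_encode(text: str) -> str:
    """Model of the 'full' HTML encoding as described in this post."""
    named = {"<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;"}
    out = []
    for ch in text:
        if ch in named:
            out.append(named[ch])
        elif 160 <= ord(ch) <= 255:
            # Assumption: characters 160-255 become decimal numeric entities.
            out.append(f"&#{ord(ch)};")
        else:
            # Everything else, including characters above 255, passes through.
            out.append(ch)
    return "".join(out)


def attribute_encode(text: str, version: str = "2.0") -> str:
    """Model of 'attribute' encoding as described for ASP.NET 1.1 and 2.0."""
    named = {"&": "&amp;", '"': "&quot;"}
    if version == "2.0":
        named["<"] = "&lt;"  # 2.0 additionally encodes the < character
    return "".join(named.get(ch, ch) for ch in text)


sample = '<a href="x">'
print(html_encode(sample))
print(attribute_encode(sample, "1.1"))
print(attribute_encode(sample, "2.0"))
```

Note that neither attribute variant touches > or the single quote, which is where the trouble described next comes from.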


Attribute encoding ought to be a superset of full HTML encoding that also encodes the single quote character in case that’s what happens to be wrapping the attribute. However, as you may have noticed from the above, the ASP.NET version of attribute encoding is even wimpier than its full encoding brother. To make matters worse, the full HTML encoding implemented by ASP.NET is no great shakes in the first place. Security isn’t the only reason for HTML encoding, and failure to encode everything outside the low ASCII range can hurt page readability when client browsers don’t apply the correct code page (which happens more often than you might think, whether it’s the client’s or the server’s fault).
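The single quote gap is easy to demonstrate. The sketch below is a Python model, not ASP.NET code: `attribute_encode_2_0` mimics the 2.0 character set described above, and `render_single_quoted` is a hypothetical renderer that wraps the attribute in single quotes.

```python
def attribute_encode_2_0(text: str) -> str:
    # Model of the 2.0 behaviour described above: only <, & and " are encoded.
    table = {"<": "&lt;", "&": "&amp;", '"': "&quot;"}
    return "".join(table.get(ch, ch) for ch in text)


def stricter_attribute_encode(text: str) -> str:
    # Same as above, but also encodes > and the single quote.
    table = {"<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;", "'": "&#39;"}
    return "".join(table.get(ch, ch) for ch in text)


def render_single_quoted(value: str, encode) -> str:
    # Hypothetical renderer that delimits the attribute with single quotes.
    return f"<input type='text' value='{encode(value)}'>"


hostile = "x' onfocus='alert(1)"
weak = render_single_quoted(hostile, attribute_encode_2_0)
strict = render_single_quoted(hostile, stricter_attribute_encode)
print(weak)    # the value escapes its attribute and adds an onfocus handler
print(strict)  # the quotes are encoded, so the value stays inside the attribute
```

With the weaker encoder the injected `onfocus` ends up as a separate attribute; with the stricter one it remains inert text inside `value`.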


Now that we know what kinds of HTML encoding are available in ASP.NET, let’s take a look at the encoding support offered by the built-in ASP.NET controls.  The following table covers some of the more commonly used controls and properties.  (There are, of course, many other controls and properties that one might wish to see encoded, but I’ve tried to keep the list down to things that most folks are likely to use reasonably frequently.)



| Control | ASP.NET 1.1 | ASP.NET 2.0 |
|---|---|---|
| Literal | None | None by default. HTML encoded if Mode property is set to LiteralMode.Encode. |
| Label | None | Same as ASP.NET 1.1 |
| Button | Text is attribute encoded. | Same as ASP.NET 1.1 |
| LinkButton | None | Same as ASP.NET 1.1 |
| ImageButton | Image URL is attribute encoded. | Image URL is URL path encoded then attribute encoded. |
| HyperLink | Text is not encoded. NavigateUrl is attribute encoded. | Text is not encoded. NavigateUrl is URL path encoded (unless it uses the javascript: protocol) then attribute encoded. |
| TextBox | Single-line text box (input type="text") is attribute encoded. Multi-line text box (textarea) is HTML encoded. | Same as ASP.NET 1.1 |
| DropDownList and ListBox | Option values are attribute encoded. Option display texts are HTML encoded. | Same as ASP.NET 1.1 |
| CheckBox and CheckBoxList | Value is not used. Display text is not encoded. | Same as ASP.NET 1.1 |
| RadioButton and RadioButtonList | Value is attribute encoded. Display text is not encoded. | Same as ASP.NET 1.1 |
| Table | None | Same as ASP.NET 1.1 |
| DataGrid | None for text columns. Hyperlink columns follow the pattern for HyperLink controls. | Same as ASP.NET 1.1 |
| Validators (BaseValidator subclasses) and ValidationSummary | Validator display text is not encoded. For client script, the validator error message and validation summary header text are attribute encoded. When rendering “populated” validators and the validation summary controls from the server, no encoding is applied. | Validator display text is not encoded. For client script, the validator error message and validation summary header text are javascript encoded (blacklisting approach). When rendering “populated” validators and the validation summary controls from the server, no encoding is applied. |
| HiddenField | N/A | Value is attribute encoded. |
| GridView and DetailsView | N/A | Text fields HTML encode if their HtmlEncode property is set to true. (This is the default, which is also used for auto-generated columns.) However, the null display text for text fields is not encoded even if the field’s HtmlEncode is set to true. Hyperlink fields follow the pattern for HyperLink controls. |


Assuming you’ve actually taken the time to read the above, you might have noticed that there are five basic patterns of encoding usage:


  1. No encoding ever applied.
  2. HTML and/or attribute encoding, as appropriate (with a bit of additional URL and/or javascript encoding applied when appropriate), applied all the time.
  3. Attribute encoding applied for attributes, but no encoding applied for other text.
  4. Optional encoding set via a boolean property, defaulting to applying the encoding.
  5. Optional encoding set via an enumerated property, defaulting to not applying the encoding.

If this strikes you as perhaps a wee bit inconsistent, you wouldn’t be alone. Wouldn’t it be great to see a consistent approach that telegraphs well and acts as a pit of success? If all the controls performed HTML encoding by default but allowed overriding when necessary (preferably via a single approach), the vast majority of developers writing for ASP.NET would end up generating more secure, more reliable applications with considerably less effort.


Workarounds


While we’re all waiting around for the ASP.NET team to eventually provide reasonable built-in support for HTML encoding, what can we do to ensure that our apps are protected from both HTML injection and character mis-rendering? A good starting point would be to fully encode all data (i.e.: anything not 100% known at compile time, and even some stuff that is) that will be pushed to the client browser. Unfortunately, as was already mentioned above, the built-in encoding scheme leaves a little something to be desired. Luckily, the ACE team folks at Microsoft have been working on a couple of tools that take a more robust approach to HTML (and URL and script) encoding. Rather than blacklisting a fixed set of potentially problematic characters for encoding, they whitelist a set of known safe characters (low ASCII a-z, A-Z, 0-9, space, period, comma, dash, and underscore for HTML encoding) and encode everything else. This quite nicely takes care of both security and appearance issues, and you may wish to seriously consider using this approach rather than calling System.Web.HttpUtility.HtmlEncode to perform your HTML encoding.
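For the curious, the whitelisting idea fits in a few lines. This is only a Python sketch of the approach, not the ACE team’s code: the safe set is the one listed above, while the function name and the use of decimal numeric entities for everything else are my own assumptions.

```python
import string

# Safe set as listed above: a-z, A-Z, 0-9, space, period, comma, dash, underscore.
SAFE_CHARS = set(string.ascii_letters + string.digits + " .,-_")


def whitelist_html_encode(text: str) -> str:
    """Encode every character that is not explicitly known to be safe."""
    # Assumption: unsafe characters are emitted as decimal numeric entities.
    return "".join(
        ch if ch in SAFE_CHARS else f"&#{ord(ch)};"
        for ch in text
    )


print(whitelist_html_encode("Hello, world_1-2."))
print(whitelist_html_encode("a > b"))
print(whitelist_html_encode("caf\u00e9's"))
```

Because the default is to encode, characters that a blacklist forgets about (the single quote, or anything outside low ASCII) get handled without any special cases.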


Regardless of which HTML encoding approach you select, you’re quickly going to run into a bit of trouble with double encoding if you simply start assigning pre-encoded text to control properties (e.g.: someTextBox.Text = HttpUtility.HtmlEncode(someString)). When dealing with malicious input, this is pretty much a non-issue. However, not all data that ought to be encoded is malicious, and you usually wouldn’t want users seeing stuff like a &gt; b rather than a > b. Unfortunately, if we want to avoid double encoding in the set of controls that perform non-overrideable encoding (including attribute encoding), we need to use custom controls. To make matters worse, it can require rather a lot of work to subclass most of the controls in order to override the encoding behaviour. In quite a few cases, simply starting from scratch would probably make more sense than trying to subclass the built-in controls. Also, even for those controls where double encoding wouldn’t be an issue (e.g.: Label, CheckBox), it’s probably worth considering using custom controls anyway since the pain of authoring the custom control isn’t likely to outweigh the cumulative effort of all the manual encoding calls you might make across all your projects.
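The double encoding problem can be reproduced outside ASP.NET. In this Python sketch, `render_textbox` is a hypothetical stand-in for a control that always encodes its value when rendering, and the standard library’s `html.escape` plays the role of the manual encoding call.

```python
import html


def render_textbox(value: str) -> str:
    # Stand-in for a control with non-overrideable encoding: it always
    # encodes whatever it is given, whether or not it was already encoded.
    return f'<input type="text" value="{html.escape(value, quote=True)}">'


raw = "a > b"
pre_encoded = html.escape(raw)          # the "defensive" manual encoding call

once = render_textbox(raw)              # encoded exactly once
twice = render_textbox(pre_encoded)     # encoded a second time by the control

print(once)
print(twice)

# What the browser would show after decoding the attribute value once:
shown = html.unescape("a &amp;gt; b")
print(shown)
```

The second rendering carries `&amp;gt;`, so after the browser decodes it the user is looking at the literal text `a &gt; b`.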


Don’t like these workarounds? Maybe it’s time to start complaining.

8 thoughts on “What’s wrong with ASP.NET? HTML encoding”

  1. I’m running into the same problem with a script I’m designing. This is very aggravating and I hope they do fix this in 3.0.

  2. Errata : (for ASP.NET 2.0)
    single-line TextBox value is attribute encoded.
    multi-line TextBox value is html encoded.

  3. Thanks for the heads-up about the single-line text box encoding. Given that the double-encoding example in the “Workarounds” section depends on attribute-encoding in a single-line textbox, I must have been completely asleep at the wheel when I populated that line of the table. It’s fixed now…

  4. I just thought I had better check, and was also shocked that label does not encode. And none of the examples mention it. And this is 2009!

    Inconsistent is even worse than none. Thanks for the table.

    I think we need the PHP hack — never accept odd characters in input, and hope all input comes through such a filter. PHP abandoned that some years ago for good reason.

    I hope LINQ does a better job of SQL injection attacks.

  5. It seems like there should be a parameter to Eval() where you can specify whether you want to encode what you are binding to.
