Building World Class Software at uptime

Saturday, June 28, 2008

Santizing HTML with regular expressions

Jeff Atwood argues for the sanity of this regular expression:

var whitelist =
@"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>|
</?s>|</?strike>|</?blockquote>|</?sub>|</?super>|
</?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>|
</?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>";


This is perfectly good "dumb code"; as long as there are unit tests for the containing method that make it clear what the intent is, I'd accept this as production code. It's unlikely that the list of allowable tags will change much over time, so maintenance of the regex isn't really an issue.

That said, if the whitelist is expected to change, I'd point out that the regexp has a strong smell of code duplication: nearly all the tags listed have the same form: . I'd probably start with an array of allowable tags, and construct the regexp from those:
String constructRegex() {
String[] containers = { "p", "b", "strong", "i", "em", "s", "strike", "blockquote", "sub", "super", "h1", "h2", "h3", "pre", "code", "ul", "ol", "li" };
String[] emptyTags = { "br", "hr" };

List components = new ArrayList();
for (String tag : containers) {
components.add("</?" + tag + ">");
}
for (String tag : emptyTags) {
components.add("<" + tag + "\\s?/?>");
}
components.add("</a>");
components.add("<a[^>]+>");
components.add("<img[^>]+/?>");

return StringUtils.join("|", components);
}


The advantage of this is that is says things Once And Only Once. The disadvantage is that it's a bit less readable for people who are already fluent in regular expression syntax.

Where does this all leave us? That there isn't a single right answer. A lot of decisions in programming come down to taste, and knowing your audience.

I'll be saying more about this in a future post, especially as it relates to unit testing.



Thursday, June 26, 2008

Not ready for Agile




Monday, June 23, 2008

How to get better at programming

A great article from Jeff Atwood called The Ultimate Code Kata. If you're at all interested in getting better at programming (rather than getting better at Java/C++/PHP/whatever), read and act.