Santizing HTML with regular expressions
Jeff Atwood argues for the sanity of this regular expression:
This is perfectly good "dumb code"; as long as there are unit tests for the containing method that make it clear what the intent is, I'd accept this as production code. It's unlikely that the list of allowable tags will change much over time, so maintenance of the regex isn't really an issue.
That said, if the whitelist is expected to change, I'd point out that the regexp has a strong smell of code duplication: nearly all the tags listed have the same form: . I'd probably start with an array of allowable tags, and construct the regexp from those:
The advantage of this is that is says things Once And Only Once. The disadvantage is that it's a bit less readable for people who are already fluent in regular expression syntax.
Where does this all leave us? That there isn't a single right answer. A lot of decisions in programming come down to taste, and knowing your audience.
I'll be saying more about this in a future post, especially as it relates to unit testing.
var whitelist =
@"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>|
</?s>|</?strike>|</?blockquote>|</?sub>|</?super>|
</?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>|
</?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>";
This is perfectly good "dumb code"; as long as there are unit tests for the containing method that make it clear what the intent is, I'd accept this as production code. It's unlikely that the list of allowable tags will change much over time, so maintenance of the regex isn't really an issue.
That said, if the whitelist is expected to change, I'd point out that the regexp has a strong smell of code duplication: nearly all the tags listed have the same form: . I'd probably start with an array of allowable tags, and construct the regexp from those:
String constructRegex() {
String[] containers = { "p", "b", "strong", "i", "em", "s", "strike", "blockquote", "sub", "super", "h1", "h2", "h3", "pre", "code", "ul", "ol", "li" };
String[] emptyTags = { "br", "hr" };
List components = new ArrayList();
for (String tag : containers) {
components.add("</?" + tag + ">");
}
for (String tag : emptyTags) {
components.add("<" + tag + "\\s?/?>");
}
components.add("</a>");
components.add("<a[^>]+>");
components.add("<img[^>]+/?>");
return StringUtils.join("|", components);
}
The advantage of this is that is says things Once And Only Once. The disadvantage is that it's a bit less readable for people who are already fluent in regular expression syntax.
Where does this all leave us? That there isn't a single right answer. A lot of decisions in programming come down to taste, and knowing your audience.
I'll be saying more about this in a future post, especially as it relates to unit testing.




0 Comments:
Post a Comment
<< Home