Building World Class Software at uptime

Saturday, June 28, 2008

Sanitizing HTML with regular expressions

Jeff Atwood argues for the sanity of this regular expression:

var whitelist =
@"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>|
</?s>|</?strike>|</?blockquote>|</?sub>|</?super>|
</?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>|
</?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>";


This is perfectly good "dumb code"; as long as there are unit tests for the containing method that make it clear what the intent is, I'd accept this as production code. It's unlikely that the list of allowable tags will change much over time, so maintenance of the regex isn't really an issue.

That said, if the whitelist is expected to change, I'd point out that the regexp has a strong smell of code duplication: nearly all the tags listed have the same form: </?tag>. I'd probably start with an array of allowable tags, and construct the regexp from those:
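import java.util.ArrayList;
import java.util.List;

String constructRegex() {
    // Tags that appear only as <tag> or </tag>
    String[] containers = { "p", "b", "strong", "i", "em", "s", "strike", "blockquote", "sub", "super", "h1", "h2", "h3", "pre", "code", "ul", "ol", "li" };
    // Tags that may self-close, e.g. <br> or <br />
    String[] emptyTags = { "br", "hr" };

    List<String> components = new ArrayList<String>();
    for (String tag : containers) {
        components.add("</?" + tag + ">");
    }
    for (String tag : emptyTags) {
        components.add("<" + tag + "\\s?/?>");
    }
    // Anchors and images carry attributes, so they keep their own patterns
    components.add("</a>");
    components.add("<a[^>]+>");
    components.add("<img[^>]+/?>");

    // String.join needs Java 8+; on older JDKs, Commons Lang's
    // StringUtils.join(components, "|") does the same job
    return String.join("|", components);
}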


The advantage of this is that it says things Once And Only Once. The disadvantage is that it's a bit less readable for people who are already fluent in regular expression syntax.
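Whichever version you pick, the unit tests are what pin down the intent. Here's a minimal sketch of what that could look like, assuming Java 8 or later. The class name WhitelistDemo and the isAllowed and check helpers are made up for illustration, and the checks only cover single tags, not whole documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WhitelistDemo {
    // Builds the same alternation as constructRegex() in the post.
    static String constructRegex() {
        String[] containers = { "p", "b", "strong", "i", "em", "s", "strike", "blockquote",
                "sub", "super", "h1", "h2", "h3", "pre", "code", "ul", "ol", "li" };
        String[] emptyTags = { "br", "hr" };

        List<String> components = new ArrayList<>();
        for (String tag : containers) {
            components.add("</?" + tag + ">");
        }
        for (String tag : emptyTags) {
            components.add("<" + tag + "\\s?/?>");
        }
        components.add("</a>");
        components.add("<a[^>]+>");
        components.add("<img[^>]+/?>");
        return String.join("|", components);
    }

    private static final Pattern WHITELIST = Pattern.compile(constructRegex());

    // Hypothetical helper: true only when the entire string is one whitelisted tag.
    static boolean isAllowed(String tag) {
        return WHITELIST.matcher(tag).matches();
    }

    // Hypothetical stand-in for a test framework's assertion.
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        // Plain container tags, opening and closing
        check(isAllowed("<p>"), "<p> should be allowed");
        check(isAllowed("</strong>"), "</strong> should be allowed");
        check(isAllowed("<h2>"), "<h2> should be allowed");

        // Empty tags, with and without the self-closing slash
        check(isAllowed("<br>"), "<br> should be allowed");
        check(isAllowed("<br />"), "<br /> should be allowed");

        // Tags that carry attributes
        check(isAllowed("<a href=\"http://example.com\">"), "links should be allowed");
        check(isAllowed("<img src=\"a.png\" />"), "images should be allowed");

        // Anything not on the list is rejected
        check(!isAllowed("<script>"), "<script> should be rejected");
        check(!isAllowed("<h4>"), "<h4> should be rejected");
        check(!isAllowed("<p onclick=\"x()\">"), "attributes on <p> should be rejected");

        System.out.println("all whitelist checks passed");
    }
}
```

Using matches() rather than find() is a choice I made for the sketch, so that each check names exactly one tag and nothing else can sneak through around it.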

Where does this all leave us? With no single right answer. A lot of decisions in programming come down to taste, and knowing your audience.

I'll be saying more about this in a future post, especially as it relates to unit testing.



Thursday, June 26, 2008

Not ready for Agile




Monday, June 23, 2008

How to get better at programming

A great article from Jeff Atwood called The Ultimate Code Kata. If you're at all interested in getting better at programming (rather than getting better at Java/C++/PHP/whatever), read and act.



Friday, May 23, 2008

Learning C is important, but you can still get a job without it

In the second Stack Overflow podcast, Joel Spolsky argues that all developers should know C, and Jeff Atwood disagrees. Bloggers are up in arms; most of them agree with Joel, and Alastair Rankine even uses it to argue that Jeff has jumped the shark.

Here's my take: No, Virginia, you don't have to know C to be a professional programmer, but it definitely helps. You don't have to know Ruby, Java, Prolog, C# or Lisp either, but they will all teach you things that you can apply to whatever you happen to be doing for your day job.

Here's a random list of things a well rounded programmer should be comfortable with:
  • Macros (C- and Lisp-style) and their pros and cons.
  • At least two assembly languages (x86 and SPARC for preference): pipelines, caches, scheduling and calling conventions.
  • At least one dynamic language such as Ruby or Lisp.
  • Data structures: hash tables, red-black trees, B-trees, graphs, linked lists.
  • Complexity theory: Big-O notation and how to reason about algorithmic efficiency.
  • NP-complete problems and methods for solving them (branch-and-bound search; simulated annealing; genetic algorithms).
  • Memory management: C-style (malloc/free), garbage collection (mark and sweep, and generational).
  • Running a business - if you've founded a startup, even if it failed, you'll have learned a lot from it.
  • Compilers (to machine code and to bytecode). You should definitely implement an optimizing compiler at least once. Extra points if you bootstrap it.
  • Efficient bytecode interpretation and just-in-time compilation (plus the specific bytecode formats of at least two languages).
  • Relational database theory and practice: data normalization, optimizers and ORM.
  • Threads in Java and in C; patterns for synchronization and locking.
  • Game development - sprites, 3D rendering; maze generation and search; game AI. A good place to start is Geoff Howland's article on GameDev.net.
  • Web site development: HTML, HTTP and forms of RPC such as SOAP, REST, XML-RPC.
    • Server-side web frameworks such as Ruby on Rails, Django, PHP and Struts, and their pros and cons.
  • Data presentation and techniques for communicating trends. Edward Tufte is the undisputed king of the realm.
  • Unix internals: inodes, memory management, disk management, kernel-space versus user-space. Ideally, you should implement a miniature operating system at least once in your life.
  • Linux: at least make your own distro from scratch.
  • Network protocols: SMTP, HTTP, and BitTorrent at minimum.
Is anything above necessary for being a professional programmer? Absolutely not. But the insights you get from learning them are important in ways you won't appreciate until you try.





Unit testing: not just for testing

I've said it before and I'll say it again: unit testing is not a testing technique; it's a design technique.

Got that? The point of the technique is to reduce coupling between modules, which promotes code reuse and means you don't get nasty surprises when you come to make changes to it later. The fact that it also inoculates your code against future breakage is mostly incidental.

Vidar Hokstad gets it. If you can't instantiate business objects without connecting to a database, your code is too tightly coupled and it's going to drag you down.
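To make that concrete, here's a minimal sketch of the kind of decoupling I mean. Every name in it (Customer, CustomerStore, InMemoryCustomerStore, DiscountService) is hypothetical and the discount rule is invented; the point is only that the business object can be created and exercised without a database in sight, because the persistence code hides behind an interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// A business object that knows nothing about persistence (hypothetical example).
class Customer {
    private final String name;
    private final int loyaltyPoints;

    Customer(String name, int loyaltyPoints) {
        this.name = name;
        this.loyaltyPoints = loyaltyPoints;
    }

    String name() {
        return name;
    }

    // Invented business rule: 100 points or more earns a discount.
    boolean qualifiesForDiscount() {
        return loyaltyPoints >= 100;
    }
}

// The only thing the business logic knows about storage.
interface CustomerStore {
    Optional<Customer> findByName(String name);
}

// A map-backed store for tests; a JDBC-backed one would implement the same interface.
class InMemoryCustomerStore implements CustomerStore {
    private final Map<String, Customer> customers = new HashMap<>();

    void add(Customer customer) {
        customers.put(customer.name(), customer);
    }

    @Override
    public Optional<Customer> findByName(String name) {
        return Optional.ofNullable(customers.get(name));
    }
}

// The store is passed in, so a test can hand over the in-memory version.
class DiscountService {
    private final CustomerStore store;

    DiscountService(CustomerStore store) {
        this.store = store;
    }

    // Unknown customers are treated as not eligible.
    boolean isEligible(String name) {
        return store.findByName(name)
                .map(Customer::qualifiesForDiscount)
                .orElse(false);
    }
}
```

A test then needs nothing more than new DiscountService(store) with an InMemoryCustomerStore holding a couple of customers. If writing that line is painful in your code base, that pain is the design feedback.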

Over the years, I've heard a lot of excuses for why code isn't unit tested, and nearly all of them boil down to "it's too hard". Michael Feathers wrote a whole book on this, which turned out to be a 450-page euphemism for "try harder then". There's always a way, and your code will be better for you spending the time to find it.





Friday, May 16, 2008

Gordon Ramsay and programmers' egos

If you've never watched Ramsay's Kitchen Nightmares (the British version, not the Fox travesty), you should. Gordon Ramsay visits troubled restaurants and tries to turn them around - sometimes succeeding, sometimes not.

As you watch the episodes, you start to see a few common themes:
  • Chefs cooking for an imagined audience that doesn't really exist
  • Overcomplicated food
  • Egos getting in the way of a quality product
All of this can be translated pretty easily into software:
  • Programmers adding features that make the product harder to use
  • Software that is so over-designed that it has lost its flexibility
  • Beloved design patterns that don't apply to the task at hand.
Today we're getting ready to release up.time 5. I started here at uptime when it was back at version 3, and quickly realised that my first priority had to be simplifying the code base. As the product had evolved over the years, techniques that people had thought were really cool at the time were just getting in the way; had they used direct, obvious methods, the code would have been much easier to change as the product's needs changed. A few months and a lot of hair-tearing later, we'd managed to simplify most of the main areas, and by now, though I say it myself, the code base is really quite good.

Ramsay's advice for pretty much all the restaurants he visits is this: Stick to simple dishes with good, fresh ingredients. He applies this rule to small "hole in the wall" places right up to French restaurants with Michelin stars. And the software equivalent of this looks something like:
  • Keep the design obvious so that it can change later on;
  • Focus on end-user features ("ingredients"), not programming techniques.
That is: if you're building a whizzy new database library before there's an easy way for customers to get your product up and running, there's something wrong - and that something is that your ego is getting in the way. Don't let it happen to you!





Friday, May 9, 2008

Monitor This!

In case you were wondering what the inside of one of these looks like:

There are pictures over at Royal Pingdom. Very tasty.
