Behind the Code: Avoiding Spaghetti HTML

Spaghetti code never looks this artistic. (Photo: papisc/Flickr)

Fair warning: this post delves pretty deeply into the technical issues of rendering HTML on a complex, data-driven website, so your eyes may start to glaze over. Still, I think Web developers might find the way we do it interesting.

OK, you're still reading. Here we go. First, some background.

NPR is a data-driven website. That means that the programmers write the code, in our case PHP, that can take editorial content and turn it into the HTML that makes up the site. The editorial staff uses our homegrown CMS, which we call Seamus, to input stories that are then stored in a MySQL database. After a story is published to the site, the PHP code takes over, reading the content and putting the story's text and assets (like images) into HTML form.

There are two main ways to go about writing the code for this kind of website.

HTML Templates

You can start with the HTML templates, and insert little chunks of code between the HTML to display the content. This makes it easy to have nice, clean, valid HTML that's a snap to maintain, but it makes it hard to do anything that requires complicated logic. The templates get more and more complicated to handle all possible cases and, soon, the page becomes a nightmare of spaghetti code.


Code Generation

You can start with the code, and have functions or classes which output HTML depending on the logic of the page. This makes it a lot easier to have interesting display logic, but now the HTML is separated into tiny little chunks throughout the code.

It's a real chore to keep track of start and end tags, and the whole site might look bad because there's one obscure case where a div tag isn't closed. It's also very difficult to move page elements around because the HTML for them is scattered in so many places.
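To make the problem concrete, here is a small hypothetical sketch of the code-generation style. The function names and the $story array are invented for illustration and are not taken from NPR's code. The opening div is written in one function and closed in another, and one rarely hit branch forgets to close it.

```php
<?php
// Hypothetical example: each function emits a fragment of the page,
// so the opening and closing tags of one element live far apart.

function renderStoryStart(array $story): string
{
    return '<div class="story"><h1>' . htmlspecialchars($story['title']) . '</h1>';
}

function renderStoryBody(array $story): string
{
    if ($story['text'] === '') {
        // Obscure case: this early return skips the closing div tag.
        return '<p>No text available.</p>';
    }
    return '<p>' . htmlspecialchars($story['text']) . '</p></div>';
}

function renderStory(array $story): string
{
    return renderStoryStart($story) . renderStoryBody($story);
}

// Compare the number of opening and closing div tags in a string.
function divsBalanced(string $html): bool
{
    return substr_count($html, '<div') === substr_count($html, '</div>');
}

$normal = renderStory(['title' => 'Hello', 'text' => 'World']);
$empty  = renderStory(['title' => 'Hello', 'text' => '']);

var_dump(divsBalanced($normal)); // the common case is balanced
var_dump(divsBalanced($empty));  // the rare case leaves a div open
```

Nothing in the code itself flags the second case. The mistake only shows up when a story with no text reaches the page.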

Our Solution

How can you write code that supports the complex logic we need, lets you easily change the look and feel of the page, and creates valid HTML that won't mess up the formatting of the site?

We wrote a class we call BaseHtmlNode. It's actually pretty simple. Each HTML tag roughly corresponds to one instance of BaseHtmlNode.

To create a new link, we would write:

$linkElm = new BaseHtmlNode('a', 'href=""', 'To NPR');

To print the HTML for this tag, we would write:

print $linkElm->toString();

And we would get:

<a href="">To NPR</a>

As you can see, by putting all the information for this tag into this object, we no longer have to worry about closing tags. The class itself will take care of that for us.
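A minimal sketch of how such a class might handle its own closing tag is shown here. This is an illustration only, not the real BaseHtmlNode: the class name SketchHtmlNode is made up, the constructor arguments (tag, attribute string, text) are inferred from the example above, and the text is printed without any escaping, which the real class may well handle differently.

```php
<?php
// A stand-in for BaseHtmlNode, written only to illustrate the idea.
class SketchHtmlNode
{
    private $tag;
    private $attributes;
    private $text;

    public function __construct(string $tag, string $attributes = '', string $text = '')
    {
        $this->tag = $tag;
        $this->attributes = $attributes;
        $this->text = $text;
    }

    public function toString(): string
    {
        // Put a single space between the tag name and its attributes,
        // and only when there are attributes to print.
        $attrs = $this->attributes === '' ? '' : ' ' . $this->attributes;

        // The closing tag is built from the same tag name as the opening
        // tag, so it cannot be forgotten or mismatched.
        return '<' . $this->tag . $attrs . '>' . $this->text . '</' . $this->tag . '>';
    }
}

$linkElm = new SketchHtmlNode('a', 'href=""', 'To NPR');
print $linkElm->toString(); // <a href="">To NPR</a>
```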

We can start putting these HTML nodes together by adding children to an existing node.

For example:

$divElm = new BaseHtmlNode('div');

$divElm->addChild('p', '', 'This is a paragraph in a div tag');

And when the div element is printed, it also prints its children.

print $divElm->toString();

prints out:

<div><p>This is a paragraph in a div tag</p></div>


That's basically all there is to BaseHtmlNode. We've added some extra methods for convenience, such as addBr(), addDiv(), and addComment(), but they're just special-case wrappers for addChild().
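The following sketch shows one way children and the convenience wrappers could fit together. It is a guess at the shape, not the real class: SketchHtmlNode is a made-up name, the wrapper signatures are assumptions, addChild() is assumed to return the new child, and the '!--' pseudo-tag for comments is purely an illustrative trick.

```php
<?php
// A stand-in for BaseHtmlNode; every signature here is an assumption.
class SketchHtmlNode
{
    private $tag;
    private $attributes;
    private $text;
    private $children = [];

    public function __construct(string $tag, string $attributes = '', string $text = '')
    {
        $this->tag = $tag;
        $this->attributes = $attributes;
        $this->text = $text;
    }

    // Create a child node, remember it, and return it so that
    // callers can keep nesting.
    public function addChild(string $tag, string $attributes = '', string $text = ''): SketchHtmlNode
    {
        $child = new SketchHtmlNode($tag, $attributes, $text);
        $this->children[] = $child;
        return $child;
    }

    // The wrappers only fill in arguments for addChild().
    public function addDiv(string $attributes = ''): SketchHtmlNode
    {
        return $this->addChild('div', $attributes);
    }

    public function addBr(): SketchHtmlNode
    {
        return $this->addChild('br');
    }

    public function addComment(string $text): SketchHtmlNode
    {
        return $this->addChild('!--', '', $text);
    }

    public function toString(): string
    {
        if ($this->tag === '!--') {
            return '<!-- ' . $this->text . ' -->';
        }
        $attrs = $this->attributes === '' ? '' : ' ' . $this->attributes;
        if ($this->tag === 'br') {
            // XHTML-style self-closing tag for an element with no content.
            return '<br' . $attrs . ' />';
        }
        $html = '<' . $this->tag . $attrs . '>' . $this->text;
        foreach ($this->children as $child) {
            // Each child prints itself, including its own closing tag.
            $html .= $child->toString();
        }
        return $html . '</' . $this->tag . '>';
    }
}

$divElm = new SketchHtmlNode('div');
$divElm->addChild('p', '', 'This is a paragraph in a div tag');
$divElm->addBr();
$divElm->addComment('end of story');
print $divElm->toString();
```

Because every node prints its own children inside its own tags, the nesting in the output always matches the nesting in the object tree.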


The first benefit is correctness. BaseHtmlNode produces good HTML that won't break the page: all elements are properly closed, the attributes are properly spaced and quoted, and the HTML is as close to XHTML-compliant as we can get.

Also, BaseHtmlNode gives us more flexibility. Each element keeps track of its own children, so we can move large portions of HTML around just by moving a single line of code. We can iterate through several different designs during the development process. This helps us to focus on the user experience instead of worrying about the underlying technical plumbing.
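As a rough illustration of moving a page element with a single line, the sketch below builds two page sections with hypothetical helper functions (addHeadline and addSidebar are invented names) on top of a made-up stand-in class, SketchHtmlNode, whose addChild() is assumed to return the new child.

```php
<?php
// A compact stand-in for BaseHtmlNode, for illustration only.
class SketchHtmlNode
{
    private $tag;
    private $attributes;
    private $text;
    private $children = [];

    public function __construct(string $tag, string $attributes = '', string $text = '')
    {
        $this->tag = $tag;
        $this->attributes = $attributes;
        $this->text = $text;
    }

    public function addChild(string $tag, string $attributes = '', string $text = ''): SketchHtmlNode
    {
        $child = new SketchHtmlNode($tag, $attributes, $text);
        $this->children[] = $child;
        return $child;
    }

    public function toString(): string
    {
        $attrs = $this->attributes === '' ? '' : ' ' . $this->attributes;
        $html = '<' . $this->tag . $attrs . '>' . $this->text;
        foreach ($this->children as $child) {
            $html .= $child->toString();
        }
        return $html . '</' . $this->tag . '>';
    }
}

// Each helper attaches one complete section to whatever parent it is given.
function addHeadline(SketchHtmlNode $parent): void
{
    $box = $parent->addChild('div', 'class="headline"');
    $box->addChild('h1', '', 'Story title');
}

function addSidebar(SketchHtmlNode $parent): void
{
    $box = $parent->addChild('div', 'class="sidebar"');
    $box->addChild('p', '', 'Related links');
}

$page = new SketchHtmlNode('div', 'id="page"');
// Swapping the next two lines moves the whole sidebar above the headline.
addHeadline($page);
addSidebar($page);
print $page->toString();
```

The start and end tags of each section travel together with the helper call, so reordering the calls cannot leave a tag unclosed.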

Since we've been using BaseHtmlNode, the NPR website has gone through some big changes, but the dev team has been able to keep up.