Example: Optimising a Web Page Generated by Microsoft Word

Here is a web page (HTML document) created using Microsoft Word.

The contents are just the words "Hello World":

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="Hello%20World_files/filelist.xml">
<title>Hello World</title>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>User</o:Author>
  <o:Template>Normal</o:Template>
  <o:LastAuthor>User</o:LastAuthor>
  <o:Revision>1</o:Revision>
  <o:TotalTime>1</o:TotalTime>
  <o:Created>2007-07-16T14:36:00Z</o:Created>
  <o:LastSaved>2007-07-16T14:37:00Z</o:LastSaved>
  <o:Pages>1</o:Pages>
  <o:Words>1</o:Words>
  <o:Characters>11</o:Characters>
  <o:Lines>1</o:Lines>
  <o:Paragraphs>1</o:Paragraphs>
  <o:CharactersWithSpaces>11</o:CharactersWithSpaces>
  <o:Version>10.2625</o:Version>
 </o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:GrammarState>Clean</w:GrammarState>
  <w:Compatibility>
   <w:BreakWrappedTables/>
   <w:SnapToGridInCell/>
   <w:WrapTextWithPunct/>
   <w:UseAsianBreakRules/>
  </w:Compatibility>
  <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
 </w:WordDocument>
</xml><![endif]-->
<style>
<!--
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
        {mso-style-parent:"";
        margin:0cm;
        margin-bottom:.0001pt;
        mso-pagination:widow-orphan;
        font-size:12.0pt;
        font-family:"Times New Roman";
        mso-fareast-font-family:"Times New Roman";}
@page Section1
        {size:612.0pt 792.0pt;
        margin:72.0pt 90.0pt 72.0pt 90.0pt;
        mso-header-margin:35.4pt;
        mso-footer-margin:35.4pt;
        mso-paper-source:0;}
div.Section1
        {page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
 /* Style Definitions */
 table.MsoNormalTable
        {mso-style-name:"Table Normal";
        mso-tstyle-rowband-size:0;
        mso-tstyle-colband-size:0;
        mso-style-noshow:yes;
        mso-style-parent:"";
        mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
        mso-para-margin:0cm;
        mso-para-margin-bottom:.0001pt;
        mso-pagination:widow-orphan;
        font-size:10.0pt;
        font-family:"Times New Roman";}
</style>
<![endif]-->
</head>

<body lang=EN-US style='tab-interval:36.0pt'>

<div class=Section1>

<p class=MsoNormal>Hello World</p>

</div>

</body>

</html>

Most of this code is only used when loading the document back into Microsoft Word.

We remove all of that and make a few more optimisations:

  • Remove the formatting information that we never asked for. We left Word's default font and size unchanged, which suggests we have no preference, so the browser's default font and size will do just as well.
  • Remove the advertisement that the page was created by Microsoft Word (the ProgId, Generator and Originator meta tags), as we had to work so hard to fix it afterwards.
  • Remove the charset from the Content-Type meta tag. windows-1252 is a Windows-specific character set, and this page contains only plain ASCII characters, which are encoded identically in windows-1252, ISO-8859-1 and UTF-8.
  • Add quotes around the unquoted attribute values. HTML allows them to be omitted here, but the W3C "recommends using quotation marks even when it is possible to eliminate them."

Having done that, we are left with:

<html xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html">
<title>Hello World</title>
</head>
<body lang="EN-US">
<p>Hello World</p>
</body>
</html>
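
The clean-up steps listed above can also be automated. What follows is a minimal sketch in Python, not the method used to produce the page above. The function name clean_word_html and the abbreviated sample input are hypothetical, and the regular expressions assume markup as simple as this example; a real HTML parser would be more robust for anything larger.

```python
import re


def clean_word_html(html: str) -> str:
    """Strip Word-only markup from a simple Word-generated page (sketch)."""
    # Conditional comments hold the Word-only XML and extra style blocks.
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html, flags=re.DOTALL)
    # Drop the remaining <style> block (formatting we never asked for).
    html = re.sub(r"<style\b.*?</style>", "", html, flags=re.DOTALL | re.IGNORECASE)
    # Drop the meta tags that advertise Word, and the File-List link.
    html = re.sub(r"""<meta\s+name=["']?(?:ProgId|Generator|Originator)\b[^>]*>""",
                  "", html, flags=re.IGNORECASE)
    html = re.sub(r"<link\s+rel=File-List[^>]*>", "", html, flags=re.IGNORECASE)
    # Drop the Office namespace declarations (xmlns:o, xmlns:w).
    html = re.sub(r'\s+xmlns:\w+="[^"]*"', "", html)
    # Tags may now span several lines; collapse whitespace inside each tag.
    html = re.sub(r"<[^<>]+>", lambda m: re.sub(r"\s+", " ", m.group(0)), html)
    # Drop class and inline style attributes, then the <div> wrapper.
    html = re.sub(r"""\s+(?:class|style)=(?:"[^"]*"|'[^']*'|[^\s>]+)""", "", html)
    html = re.sub(r"</?div\b[^>]*>", "", html)
    # Drop the charset parameter from the Content-Type value.
    html = re.sub(r";\s*charset=[\w-]+", "", html)
    # Quote any attribute value that is still unquoted.
    html = re.sub(r"""(\s[\w-]+)=([^\s"'>]+)""", r'\1="\2"', html)
    # Remove the blank lines left behind by the deletions.
    lines = [line.rstrip() for line in html.splitlines()]
    return "\n".join(line for line in lines if line)


# Hypothetical, abbreviated stand-in for the Word output shown earlier.
sample = """<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 10">
<!--[if gte mso 9]><xml>
 <o:DocumentProperties><o:Author>User</o:Author></o:DocumentProperties>
</xml><![endif]-->
<style>
<!--
 p.MsoNormal {font-size:12.0pt;}
-->
</style>
<title>Hello World</title>
</head>
<body lang=EN-US style='tab-interval:36.0pt'>
<div class=Section1>
<p class=MsoNormal>Hello World</p>
</div>
</body>
</html>"""

print(clean_word_html(sample))
```

Because the patterns are applied to the whole file, body text that happens to look like an attribute (for example the words "class=x" in a paragraph) would also be altered, so the output should be reviewed by hand.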

The original version was 2,454 bytes long. Our replacement is 204 bytes, a saving of nearly 92%. Admittedly, the document conveys only 11 bytes of useful text, but we have cut the HTML overhead from 2,443 bytes to 193.
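
The size figures can be checked with a few lines of Python. This is only a sketch: it assumes the reduced file was saved with Windows (CRLF) line endings and no trailing newline, which is the assumption under which the nine lines above add up to 204 bytes. The original size is taken from the text as given.

```python
# The reduced page, one entry per line.
reduced_lines = [
    '<html xmlns="http://www.w3.org/TR/REC-html40">',
    '<head>',
    '<meta http-equiv="Content-Type" content="text/html">',
    '<title>Hello World</title>',
    '</head>',
    '<body lang="EN-US">',
    '<p>Hello World</p>',
    '</body>',
    '</html>',
]

# Assumption: CRLF between lines, nothing after the final </html>.
reduced = "\r\n".join(reduced_lines).encode("ascii")

ORIGINAL_SIZE = 2454              # bytes, as stated in the text
USEFUL_TEXT = len(b"Hello World")  # the only content a reader sees

reduced_size = len(reduced)
saving = 1 - reduced_size / ORIGINAL_SIZE
overhead = reduced_size - USEFUL_TEXT

print(reduced_size)        # 204
print(f"{saving:.1%}")     # 91.7%
print(overhead)            # 193
```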