WP Interop: The stylesheet
This is a work in progress, transplanted from http://ptsefton.com where it used to live. It will grow into the reference guide to the WP Interop project stylesheet, a set of style names for use in MS Word and OpenOffice.org to allow interchange of documents and good-quality XHTML output into a content management (or at least a blogging) system.
There is one big change from previous versions: the characters '#' for number and '*' for bullet in style names have been replaced with 'n' and 'b'.
General notes
This set of style names is:
- Very terse - this is so you can type the entire name of a style very quickly, and so styles will show up in MS Word at the left of the screen.
- As regular as possible, so that they can be easily parsed by machines.
- Not based on the built in styles of any word processor, as these are usually too long, not consistent between versions and certainly not between packages, and more trouble to parse.
Overall document structure
Word processors have some notion of document structure but the ones we are dealing with don't do nested sections like XHTML (with div elements), so we will use headings to infer structure.
Paragraphs will be in 'p' style.
The names for headings are pretty simple.
h1, h2, h3, h4, h5
And for those who like the technical or legalistic numbered approach:
h1n, h2n, h3n, h4n, h5n
(At this stage, in OpenOffice.org you need to manually add these to the document outline, as there seems to be a bug that prevents the outline configuration from saving properly, at least in some versions.)
Applying h3n gives you a third-level heading numbered like so:
1.2.5 This is the result of applying style h3n
Where possible, the system will add <div> elements to create a structure. Things that would make this impossible include stuff like headings in tables, which can't reasonably be allowed to play in the main document structure.
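The way headings imply nested sections can be sketched in a few lines of Python. This is only an illustration, not the project's converter: the `Section` class, the `build_sections` function and the (style, text) input format are all assumptions made for the example, and headings inside tables are not handled.

```python
import re
from dataclasses import dataclass, field

# Matches h1..h5 and the numbered variants h1n..h5n.
HEADING = re.compile(r"^h([1-5])n?$")


@dataclass
class Section:
    """Hypothetical container standing in for one <div>."""
    level: int
    title: str
    body: list = field(default_factory=list)      # paragraph texts
    children: list = field(default_factory=list)  # nested Section objects


def build_sections(paragraphs):
    """Turn a flat list of (style, text) pairs into a tree of sections."""
    root = Section(0, "")
    stack = [root]
    for style, text in paragraphs:
        match = HEADING.match(style)
        if match:
            level = int(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(level, text)
            stack[-1].children.append(section)
            stack.append(section)
        else:
            # Anything that is not a heading belongs to the open section.
            stack[-1].body.append(text)
    return root
```

Each `Section` would become one div in the output, with the heading as its first child.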
It is also possible to extend the outline numbering down to the paragraph level, so you can have things like 'p1n' and 'p2n'. At Standards Australia we used this (they still do) so that you could have sub-parts numbered (a-z) with blocks of indented text, further nested parts numbered (i, ii, iii ...) and so on, like ordered lists in XHTML, but we still need general-purpose lists. The basic WpInterop stylesheet will not have p1n styles to start with.
Block-quotes will be 'bq1' - 'bq6', so you will be able to nest quotes, and nest quotes inside lists etc.
Lists
- li1b = List item with a bullet, level 1
- li2b = List item with a bullet, level 2
- li1n = Numbered list item, first level
- li1i = List item numbered with lowercase roman numerals, level 1
- li1a = List item numbered with lowercase letters, level 1
- li1p = 'Continued' paragraph in a list item
Quiz:
Q. What's this? li5A
A. List item in a list numbered with uppercase Alpha, embedded in four other lists.
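Because the names are regular, a machine can answer the quiz with one regular expression. The sketch below is an assumption about how such a parser might look, not part of the stylesheet: the function name and the descriptive labels are invented here, and the 'I' code for uppercase roman is inferred by analogy with 'A' rather than stated above.

```python
import re

# "li", a nesting level, then a one-letter code for the kind of item.
LIST_STYLE = re.compile(r"li([1-9])([bniIaAp])")

KINDS = {
    "b": "bullet",
    "n": "decimal",
    "i": "lower-roman",
    "I": "upper-roman",  # assumed by analogy with 'A'
    "a": "lower-alpha",
    "A": "upper-alpha",
    "p": "continuation paragraph",
}


def parse_list_style(name):
    """Return (level, kind) for a list style name, or None if it is not one."""
    match = LIST_STYLE.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1)), KINDS[match.group(2)]
```

With this, `parse_list_style("li5A")` gives level 5 and an uppercase-alpha list, while a heading style such as "h1" gives None.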
Why do we have to do this trick with styles?
Because word processors are not to be trusted to get this right given arbitrary input. True, there are some circumstances under which Word and OO.o will output reasonably well-nested lists using their default save as HTML (or XHTML), but they are not the same circumstances in both applications, and the interfaces in both cases give you little clue as to where lists start and end or what is going on. At least with styles you can tell the authors 'use the styles', and warn them off all the bullets and numbering dialogues.
The conversion to XHTML will have to normalise lists that skip levels, so the over-enthusiastic use of nesting like this:
{li1b} First item, outer list
{li3b} First item, embedded-too-deep list
{li3b} Second item, embedded-too-deep list
{li1b} Second item, outer list
Will end up as:
<ul>
<li>First item, outer list
<ul>
<li>First item, embedded-too-deep list</li>
<li>Second item, embedded-too-deep list</li>
</ul>
</li>
<li>Second item, outer list</li>
</ul>
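One way to do this normalisation is to keep a stack of the levels the author actually used and treat the stack depth as the real nesting depth. The following is a minimal sketch under that assumption; the function names are hypothetical, it handles bulleted items only, and it is not the converter's actual code.

```python
from html import escape


def normalise_levels(levels):
    """Map author-supplied levels to depths that never skip a level."""
    stack, depths = [], []
    for level in levels:
        # Leave any lists that are deeper than this item.
        while stack and stack[-1] > level:
            stack.pop()
        # Open exactly one new list if this item is deeper than the open one.
        if not stack or stack[-1] < level:
            stack.append(level)
        depths.append(len(stack))
    return depths


def render_bullets(items):
    """Render (level, text) pairs as well-formed nested <ul> markup."""
    depths = normalise_levels([level for level, _ in items])
    out, current = [], 0
    for depth, (_, text) in zip(depths, items):
        if depth > current:
            out.append("<ul>")  # depth is never more than current + 1
        else:
            out.append("</li>")
            for _ in range(current - depth):
                out.extend(["</ul>", "</li>"])
        out.append("<li>" + escape(text))
        current = depth
    for _ in range(current):
        out.extend(["</li>", "</ul>"])
    return "\n".join(out)
```

For the example above, the levels 1, 3, 3, 1 normalise to depths 1, 2, 2, 1, so the level 3 items become a single list nested directly inside the first outer item.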
Sub-paragraph Emphasis styles
Here we will provide all of the XHTML inline elements, pulled straight from the XHTML strict DTD, but with an 'i-' prefix so all the character styles are identifiable:
- i-em, emphasis
- i-strong, strong emphasis
- i-dfn, definitional
- i-code, program code
- i-samp, sample
- i-kbd, something user would type
- i-var, variable
- i-cite, citation
- i-abbr, abbreviation
- i-acronym, acronym
- i-q, inlined quote
- i-sub, subscript
- i-sup, superscript
- i-tt, fixed pitch font
- i-i, italic font
- i-b, bold font
- i-big, bigger font
- i-small, smaller font
There will have to be some mechanism to supply detail for some of the attributes such as 'cite' on a 'q'-for-quote element, probably using another style.
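The prefix convention for character styles makes the mapping to XHTML mechanical: drop the prefix and use what is left as the element name. A rough sketch, assuming a hypothetical `render_run` function that receives one styled run of text at a time and ignoring attributes such as 'cite' entirely:

```python
from html import escape

# The inline element names listed above.
INLINE_ELEMENTS = {
    "em", "strong", "dfn", "code", "samp", "kbd", "var", "cite", "abbr",
    "acronym", "q", "sub", "sup", "tt", "i", "b", "big", "small",
}


def render_run(style, text):
    """Wrap text in the element named by an 'i-' character style."""
    name = style[2:] if style.startswith("i-") else None
    if name in INLINE_ELEMENTS:
        return f"<{name}>{escape(text)}</{name}>"
    # Unknown or missing character style: emit the text unwrapped.
    return escape(text)
```

Unknown styles fall through as plain escaped text, which is one possible policy; a real converter might prefer to warn the author instead.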
Extensions
At one point I had ideas about setting up lists of domain-specific styles for use in various contexts, so there would be a set of semantics for recipes (/blog/2004/06/02/recipe_schema) that would be somehow agreed. But that's not going to happen any time soon. However, you can add semantics to any style with a dash. When converted to XHTML the rule will be to apply a class to the element based on your extension (normalised to lower case).
- Style 'h2-Ingredients' will give you <h2 class='wpi-ingredients'> and an enclosing div that runs down to the next h2.
- A bunch of items styled with li1b-ingredients would result in an enclosing <ul class='wpi-ingredients'>
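The extension rule amounts to splitting the style name at the dash and lower-casing what follows. In the sketch below the function name is hypothetical, and treating the leading 'i-' of character styles as part of the base name rather than as an extension is an assumption the text above does not spell out.

```python
def split_extension(style_name):
    """Split a style name into (base style, CSS class or None)."""
    # Assumption: the 'i-' prefix of character styles is not an extension.
    inline = style_name.startswith("i-")
    head = style_name[2:] if inline else style_name
    base, _, extension = head.partition("-")
    if inline:
        base = "i-" + base
    css_class = "wpi-" + extension.lower() if extension else None
    return base, css_class
```

So 'h2-Ingredients' splits into the base style 'h2' and the class 'wpi-ingredients', while a plain 'p' has no class at all.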
This behaviour will open up a lot of opportunities for XPath search, as described by Jon Udell, and for extracting metadata from items so they can be shared with others, also described by Jon Udell.
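As a small illustration of the kind of search this allows, the sketch below uses the limited XPath support in Python's standard xml.etree.ElementTree module to pull the ingredients out of a converted document. The sample markup and the `find_items` function are made up for the example, and the sample has no XML namespace, which real XHTML output would have.

```python
import xml.etree.ElementTree as ET

# Hypothetical converter output for a recipe.
SAMPLE = """<div>
<h2 class="wpi-ingredients">Ingredients</h2>
<ul class="wpi-ingredients">
<li>2 eggs</li>
<li>1 cup flour</li>
</ul>
</div>"""


def find_items(xhtml, css_class):
    """Return the text of every <li> in lists carrying the given class."""
    root = ET.fromstring(xhtml)
    path = f".//ul[@class='{css_class}']/li"
    return [li.text for li in root.findall(path)]
```

Calling `find_items(SAMPLE, "wpi-ingredients")` returns the two list items as plain strings.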
