Beginning ThML

Loutzenhiser's picture

There are a number of books on Project Gutenberg (and elsewhere) that would make good additions to the CCEL. They have to be converted to ThML (the CCEL's XML-based markup language) before they can be added. In fact, there are some books already at the CCEL that still haven't been converted to ThML.

This process requires some knowledge of XML, but it isn't too difficult for someone who can use a text editor and an XML parser (or an XML editor). It also helps to be able to run a couple of Perl scripts that are already written.

This work is the main bottleneck in adding books to the CCEL.

hplantin's picture

Brief overview of ThML markup process

To create a ThML document, I normally go through the following steps.

1. Create the ThML.head section with metadata such as the author's name, title, subject, call number, etc. This can be done by editing the head of an existing ThML document or by running the "mkmeta" Perl script, which helps create one for you.

2. Get the HTML version of the book if possible. Run it through HTML Tidy to convert it to XHTML and get all the formatting into the stylesheet. (Google HTML Tidy to download it.)

3. Combine the ThML.head section and the XHTML document into one big file. Edit it to add <div1>, <div2>, etc. tags as appropriate. Use a parser to make sure the result is valid. E.g. at the start of Chapter 1, you might add

<div1 title="Chapter 1. The First Chapter">

4. If you have page images, add <pb> tags at the start of each page. E.g. at the start of page xii, you'd add

<pb n="xii"/>

5. Run the Perl script cleanThML to check for some additional requirements and find more errors.

Then, at the CCEL, the book has to be "finalized" and installed with a number of additional perl scripts.
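
Before running cleanThML, steps 3 and 4 can be sanity-checked with a few lines of Python. This is only a minimal sketch, not one of the CCEL scripts: it checks that the file is well-formed XML (which is weaker than validating it against the ThML DTD) and lists the division titles and page numbers it finds. The <ThML> and <ThML.body> wrapper elements in the sample are assumptions for illustration, as are the function names.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a combined file. The <ThML> and <ThML.body> wrappers
# are assumed here; only ThML.head, div1/div2 and pb come from the steps above.
SAMPLE = """<ThML>
  <ThML.head><!-- metadata from step 1 --></ThML.head>
  <ThML.body>
    <div1 title="Preface">
      <pb n="xii"/>
      <p>Preface text.</p>
    </div1>
    <div1 title="Chapter 1. The First Chapter">
      <pb n="1"/>
      <p>Chapter text.</p>
    </div1>
  </ThML.body>
</ThML>"""


def is_well_formed(thml_text):
    """Return True if the text parses as XML (well-formed, not DTD-valid)."""
    try:
        ET.fromstring(thml_text)
        return True
    except ET.ParseError:
        return False


def summarize(thml_text):
    """Return (division titles, page numbers), each in document order."""
    root = ET.fromstring(thml_text)  # raises ET.ParseError on broken markup
    titles = [el.get("title") for el in root.iter() if el.tag in ("div1", "div2")]
    pages = [el.get("n") for el in root.iter("pb")]
    return titles, pages


if __name__ == "__main__":
    titles, pages = summarize(SAMPLE)
    print("Divisions:", titles)
    print("Pages:", pages)
```

To check a real file, read it with open(path, encoding="utf-8").read() and pass the text to summarize. A missing chapter title or a gap in the page numbers usually points to a tag that was skipped.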

There are other bits of markup that may also be helpful. For example, if the book is a commentary or collection of sermons, <scripCom> elements should be added.
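
For a collection of sermons, one way to spot sections that still lack a <scripCom> element is a short Python check like the one below. It is a rough sketch under assumptions: the type and passage attributes in the sample, the choice of <div2> as the sermon-level division, and the function name are all illustrative, so check the ThML documentation for the exact form.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; the scripCom attributes shown are assumptions.
SAMPLE = """<div1 title="Sermons">
  <div2 title="Sermon 1">
    <scripCom type="Sermon" passage="John 3:16"/>
    <p>Sermon text.</p>
  </div2>
  <div2 title="Sermon 2">
    <p>Sermon text with no passage marked yet.</p>
  </div2>
</div1>"""


def divisions_without_scripcom(xml_text, div_tag="div2"):
    """Return titles of div_tag elements that contain no scripCom element."""
    root = ET.fromstring(xml_text)
    return [
        div.get("title")
        for div in root.iter(div_tag)
        if div.find(".//scripCom") is None  # search all descendants
    ]


if __name__ == "__main__":
    print(divisions_without_scripcom(SAMPLE))
```

Not every division needs a <scripCom> (a preface, for instance), so the output is a list to review rather than a list of errors.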

The time required to mark up an HTML book in ThML varies a lot depending on the size of the book and how much markup is required. It might be anything from less than an hour to many hours.

Harry Plantinga
CCEL Director
