johanley/anne-of-green-gables-fmt

Anne of Green Gables: formatted

Formatted versions of my transcription of Anne of Green Gables.

This repo was created mainly for producing items for Project Gutenberg. The PG ebook was deployed on 2021-01-22. Jacqueline Jeremy of Distributed Proofreaders assisted me with creating the PG ebook.

The PG (HTML) version of the transcription is here.

This repo has a Creative Commons Zero v1.0 Universal license. That's roughly equivalent to public domain.

Inputs to Project Gutenberg

Project Gutenberg has two main formats:

  • a single plain text file (UTF-8, .txt extension), with standards on formatting
  • an XHTML file (with an .html extension), from which the .epub and .mobi formats are generated. The general standards are few; they focus on identifying chapters and ensuring basic XHTML/CSS validity.

Project Gutenberg (PG) works pretty closely with Distributed Proofreaders (DP). DP has a lot of good documentation about the required formats and standards (links below). The general idea is to follow their recommendations, and to use validators before submitting files to PG.

Main links

After you submit files to PG, they will be examined closely by PG whitewashers, who prepare the text for final publication. File names are formatted

like_this_or-that-01.txt

all in lower case.
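
The naming convention above can be checked mechanically. This is a minimal sketch, assuming the pattern means lower-case ASCII letters, digits, underscores and hyphens, followed by a single lower-case extension. The class name PgFileName and the exact regular expression are illustrative guesses, not a published PG rule.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: tests a file name against an assumed reading of the naming convention. */
final class PgFileName {
    // Assumption: lower-case ASCII letters, digits, '_' and '-', then one dot and a lower-case extension.
    private static final Pattern NAME = Pattern.compile("[a-z0-9_-]+\\.[a-z0-9]+");

    static boolean looksValid(String fileName) {
        return NAME.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksValid("like_this_or-that-01.txt")); // true
        System.out.println(looksValid("Anne Of Green Gables.TXT")); // false
    }
}
```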

When you upload, you zip together all of the formats you have into a single zip file. There are validators that you can use to help you through the process.
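
Bundling the formats into one archive needs nothing beyond the JDK. The following is a rough sketch using java.util.zip; the class name PgZip and the entry names in main are placeholders, and the archive is built in memory only to keep the example self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: packs several named files into a single zip archive. */
final class PgZip {
    /** Returns the bytes of a zip archive containing one entry per map key. */
    static byte[] zip(Map<String, byte[]> files) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue());
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Placeholder names and contents, for illustration only.
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("book.txt", "plain text version".getBytes(StandardCharsets.UTF_8));
        files.put("book.html", "<html></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(zip(files).length + " bytes in the archive");
    }
}
```

In practice the bytes would come from the files on disk and the result would be written out with java.nio.file.Files.write.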

Distributed Proofreaders acts as the main input into Project Gutenberg. They have tons of information on what is required:

Three main parts

There are 3 main parts (wrapped later by PG boilerplate):

  • front matter
  • main body (38 chapters, in this case)
    • plain text
    • poetry
    • letters
    • illustrations
  • transcriber's notes, at the end

Process

Make a branch called 1908-LCP-4th-pg off the uncorrected branch, to hold the desired text. Apply a small number of corrections to that branch.

After that, the text needs to be formatted for PG. My formatting is only semi-automated:

  • create a template file to hold the title page, transcriber notes and so on, entered manually
  • run a script (Java, in Eclipse, against the correct branch) to inject the bulk of the text into the template file (with line wrap at 72)
  • apply manual formatting for some items
  • zip the result and send it to PG

Manual edits

  • THE END text; classes: mt3, center
  • the name of the novel, at the start of chapter 1; classes: center, p180
  • poetry: copy-paste divs with classes; careful with quotes; careful with no-indent on the following para
  • correspondence; careful with quotes; careful with no-indent on the following para; classes: mb0, center; mt0, center-right4, smcap
  • placement of illustrations: put near the corresponding text

Poetry/Correspondence

These are special places where I need to format the text manually.

P: poetry, L: correspondence (a letter).

  • dedication: P
  • Ch 02: P little birds sang
  • Ch 07: L (odd, starts as normal text) Gracious Heavenly Father
  • Ch 11: P Midian
  • Ch 17: 2P, 2L; when twilight drops; shorn of Brutus
  • Ch 18: P nothing but death
  • Ch 19: P not a sister
  • Ch 24: P heart farewell (not special, inline!)
  • Ch 29: P stubborn spearsmen
  • Ch 31: P hills peeped
  • Ch 32: L (embedded in normal text, long)
  • Ch 33: P one moonbeam

Text requirements/recommendations

  • UTF-8 encoding, .txt extension
  • end of line is CR-LF (Windows style)
  • byte-order mark (BOM) removed (they strip it out if present; don't worry about it)
  • line width 60-70, max 75 (recommend 72)
  • italic like _this_, bold like =this=
  • using em-dash is OK
  • using curly quotes is OK
  • no tab characters allowed
  • no spaces at end of lines
  • no extra spaces between words
  • use ligatures when the source copy-text does
  • match the copy-text closely, including errors; they can be noted in the transcriber's notes
  • it's ok to end a line in Mr. or Mrs.; no need for a non-breaking space - example
  • 4 empty lines: at very top (gap after PG boilerplate)
  • 4 empty lines: at the top of a chapter
  • 4 empty lines: between frontispiece and title page
  • 2 empty lines between chapter headings and chapter body
  • hyphens at the end of the line: compare the spelling of the same word as found elsewhere in the text
  • transcriber's note at the end; this helps to flag items to PG whitewashers
  • poetry: indent by 1-4 spaces, so that line wrapping by tools is turned off
  • blockquotes: treat as poetry
  • letters/correspondence: these are the trickiest items
  • front matter: seems to be some freedom there, as long as it's reasonable
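
Several of the mechanical rules in this list (no BOM, CR-LF line ends, no tabs, no trailing spaces, no doubled spaces, maximum width 75) are easy to check before submitting. The sketch below is one possible checker, not a PG tool. The class name PgTextCheck and the wording of the messages are made up, and it assumes that leading indentation (as used for poetry) should not count as "extra spaces between words".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical checker for a few of the plain-text rules listed above. */
final class PgTextCheck {
    static final int MAX_WIDTH = 75;

    /** Returns one message per problem found; an empty list means no problems were detected. */
    static List<String> check(String text) {
        List<String> problems = new ArrayList<>();
        if (text.startsWith("\uFEFF")) {
            problems.add("byte-order mark present");
        }
        // Split on LF and keep trailing empty strings, so every line but the last should end in CR.
        String[] lines = text.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            int number = i + 1;
            boolean isLast = (i == lines.length - 1);
            if (!isLast) {
                if (line.endsWith("\r")) {
                    line = line.substring(0, line.length() - 1);
                } else {
                    problems.add("line " + number + ": end of line is not CR-LF");
                }
            }
            if (line.indexOf('\t') >= 0) {
                problems.add("line " + number + ": tab character");
            }
            if (line.endsWith(" ")) {
                problems.add("line " + number + ": space at end of line");
            }
            // Assumption: indentation at the start of a line is allowed, so trim before looking for doubles.
            if (line.trim().contains("  ")) {
                problems.add("line " + number + ": extra spaces between words");
            }
            if (line.codePointCount(0, line.length()) > MAX_WIDTH) {
                problems.add("line " + number + ": longer than " + MAX_WIDTH + " characters");
            }
        }
        return problems;
    }
}
```

Calling PgTextCheck.check on the whole file content returns an empty list when none of these particular problems is present; it says nothing about the remaining rules, which still need a human eye.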

Line width and line wrapping are a bit tricky, so take care with them. (Java has a BreakIterator class, but its behaviour is a bit quirky.)
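
For illustration, here is a simple greedy wrap that splits on whitespace instead of using BreakIterator. It is a sketch of the general technique, not the script used for this repo. The class name PgLineWrap is invented, a word longer than the limit is left unbroken on its own line, and an em-dash joining two words is not treated as a break point.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: greedy word wrap on whitespace. */
final class PgLineWrap {
    /** Fills each line with as many whitespace-separated words as fit within maxWidth. */
    static List<String> wrap(String paragraph, int maxWidth) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : paragraph.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for an empty paragraph
            }
            // Start a new line when adding a space plus this word would exceed the limit.
            if (current.length() > 0 && current.length() + 1 + word.length() > maxWidth) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String paragraph = "Mrs. Rachel Lynde lived just where the Avonlea main road "
            + "dipped down into a little hollow, fringed with alders and ladies' eardrops";
        // Join with CR-LF, the line ending required for the plain text file.
        System.out.print(String.join("\r\n", wrap(paragraph, 72)) + "\r\n");
    }
}
```

Lengths are counted in Java chars, which matches the visible width only for text without combining marks or characters outside the Basic Multilingual Plane.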

In my case, most items are auto-generated. Poetry and letters are two items that are handled manually, by editing the generated output.

HTML requirements/recommendations

  • UTF-8 encoding
  • XHTML 1.0 Strict or 1.1 (epub uses XHTML)
  • modern example
  • CSS 2.1 or below; CSS 3 can also be used if needed; CSS appears to be embedded, not in a separate .css file
  • the handheld media type is used in the CSS @media rule; it is valid in CSS 2.1 but deprecated in CSS 3.
  • use W3C validators for markup and CSS; use HTML Tidy; remove unused styles
  • use PG's ebookmaker converter to convert your book and review the result carefully (checks epub/mobi formats)
  • image file formats: .jpg (.png for line drawings)
  • fonts: don't require a specific font; set font-family only
  • don't use: <br> (except in poems?), &nbsp;, or empty tags to control spacing; use CSS margin, padding
  • title: H1
  • chapter: H2
  • images are placed in a conventional images directory beside the html
  • cover image: no larger than 256K, width-height from 650x1000 to 5000x5000. Name is cover.jpg.
  • inline image: no larger than 256K, up to 5000x5000; state the width-height explicitly for all images.
  • example of title-tag: The Project Gutenberg eBook of Alice's Adventures in Wonderland, by Lewis Carroll
  • metadata or comments about the text can be placed in the header, or in HTML comments
  • page numbers are optional; some people put them in comments
  • no external links allowed
  • css: div.chapter, p.poem, p.letter
  • the text flow uses max width similar to plain text files

The PG (HTML) version of the transcription is deployed here.