Unicode

MARC-8 and Unicode

"MARC (ISO 2709)" records can be encoded in one of two character coding schemes: MARC-8 or UCS/Unicode.

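Use yaz-marcdump to convert the encoding of MARC records. Specify the source encoding with option -f and the target encoding with option -t; option -o sets the output format. With option -l you can set the character coding scheme in MARC leader position 09. In the example below, 9=97 writes the character a (decimal character code 97), which marks the record as UCS/Unicode.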

$ yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 marc21.raw > marc21.utf8.raw

Converting from UTF-8 to MARC-8 is not recommended, because it can be lossy: characters that have no MARC-8 equivalent cannot be preserved.
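
Before converting, it can help to check which coding scheme a record declares. The following is a minimal Python sketch, not part of yaz-marcdump; the function name `coding_scheme` and the sample leaders are made up for illustration. It relies on the MARC 21 convention that leader position 09 is a blank for MARC-8 and `a` for UCS/Unicode.

```python
LEADER_LENGTH = 24  # every ISO 2709 record starts with a 24-byte leader


def coding_scheme(record: bytes) -> str:
    """Return the coding scheme declared in leader position 09 of one record."""
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than the 24-byte leader")
    flag = record[9:10]  # leader position 09 (offsets are 0-based)
    if flag == b"a":
        return "UCS/Unicode"
    if flag == b" ":
        return "MARC-8"
    return "unknown"


# Hypothetical leaders for illustration, not taken from real records.
print(coding_scheme(b"00714cam a2200205 a 4500"))  # UCS/Unicode
print(coding_scheme(b"00714cam  2200205 a 4500"))  # MARC-8
```

The leader only states what the record claims to be; it does not guarantee that the field data is actually encoded that way.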

Unicode normalization

Unicode provides single code points for many characters that can also be represented as a sequence of a base character and one or more combining characters, e.g. German umlauts:

| Composed/NFC | Decomposed/NFD |
| --- | --- |
| ä (Latin Small Letter A with Diaeresis U+00E4) | a (Latin Small Letter A U+0061) + ◌̈ (Combining Diaeresis U+0308) |
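
The difference between the two forms is easy to inspect in code. This is a small sketch using Python's standard `unicodedata` module (a choice made for this illustration; the document itself only uses command-line tools):

```python
import unicodedata

composed = "\u00e4"      # ä as a single code point
decomposed = "a\u0308"   # a followed by the combining diaeresis

# Both strings render the same, but they differ as code point sequences.
print(composed == decomposed)  # False

nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0061', 'U+0308']
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+00E4']
```

A plain string comparison therefore fails to match values that differ only in normalization form, which is the usual reason to normalize data before searching or deduplicating.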

With the command-line utility uconv (part of ICU) you can convert data between Unicode normalization forms by using its transliteration option -x:

$ uconv -x NFC marc21.nfd.xml > marc21.nfc.xml
$ uconv -x NFD marc21.nfc.xml > marc21.nfd.xml
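
Where uconv is not available, roughly the same result can be obtained with a few lines of Python. This is a sketch under assumptions: the helper name `normalize_file` is hypothetical, the input is assumed to be UTF-8, and the whole file is read into memory, so it is only suitable for files of moderate size.

```python
import unicodedata


def normalize_file(source: str, target: str, form: str = "NFC") -> None:
    """Read a UTF-8 text file, normalize it to `form`, and write the result."""
    # newline="" keeps the original line endings untouched.
    with open(source, encoding="utf-8", newline="") as src:
        text = src.read()
    with open(target, "w", encoding="utf-8", newline="") as dst:
        dst.write(unicodedata.normalize(form, text))
```

Calling `normalize_file("marc21.nfd.xml", "marc21.nfc.xml", "NFC")` would correspond to the first uconv command above.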

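You should only normalize "MARC XML" data. Normalizing "MARC (ISO 2709)" data corrupts the records: the leader and the directory store the record length, field lengths and field start positions in bytes, and normalization changes the byte length of the field data without updating these values.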
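
The change in length is easy to verify. The following Python sketch uses a made-up field value (it is not taken from a real record) and compares its UTF-8 byte length in both normalization forms:

```python
import unicodedata

value = "Universit\u00e4t"  # hypothetical field content, here in NFC

nfc_bytes = unicodedata.normalize("NFC", value).encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", value).encode("utf-8")

# ä is 2 bytes in NFC (U+00E4) but 3 bytes in NFD (U+0061 + U+0308).
print(len(nfc_bytes))  # 12
print(len(nfd_bytes))  # 13
```

A directory entry written for the 12-byte value would be off by one byte after NFD normalization, and every following field would start at the wrong offset.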