detect metadata right from epub files #4

IzzySoft · 2015-04-30T19:06:36Z

This issue has been raised with ticket:17 in the original tracker, and was transferred here slightly altered:

It would be nice if minicalope could fetch the title, author, lang, etc. right from the epub file, instead of taking the dir/filename or relying on additional *.data files that will be hard for the user to maintain in the long run.

This might be tricky to do in PHP, so an alternative idea could be to allow the user to use a 'backend' that will parse the epub file and return metadata in the format expected by minicalope.

The referenced "ticket" includes a patch, relying on a backend written in C and also attached.


IzzySoft · 2015-04-30T20:41:37Z

I strongly advise against using such a feature in automatic runs, especially when unsupervised:

epub description might contain "invalid HTML" (e.g. missing closing tags for lists), which then would break OPDS (while working fine in HTML)
the same author might turn up in many different spellings

For the latter, an example: Bertha von Suttner. Most of the time (in my experience, from running checks against the ~7,000 books in the German catalog on ebooks.qumran.org), she turns up as either "Bertha Suttner" or "Suttner, Bertha". That makes two entries for the same author. But she has a title, so she might also turn up as "Bertha von Suttner", "Suttner, Bertha von" and even "von Suttner, Bertha", making 5 different variants. Now, her title really would be "Freifrau von Suttner". And her full name is "Bertha Sophia Felicita Freifrau von Suttner" (and if you think that's already the most complicated name, check Ida Marie Louise Sophie Friederike Gustave Gräfin von Hahn😇). Unsupervised automated runs would leave all possible combinations, making "books by author X" quite … well, a broken concept.
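To make the problem concrete, here is a small Python sketch (not part of miniCalOPe, which is written in PHP). It assumes a catalog that groups books by the raw author string; the `books_by_author` mapping and the `book-N` identifiers are purely illustrative.

```python
from collections import defaultdict

# The five spellings mentioned above, as they might come out of different epubs.
variants = [
    "Bertha Suttner",
    "Suttner, Bertha",
    "Bertha von Suttner",
    "Suttner, Bertha von",
    "von Suttner, Bertha",
]

# Hypothetical catalog index: one key per distinct author string.
books_by_author = defaultdict(list)
for number, name in enumerate(variants):
    books_by_author[name].append(f"book-{number}")

# Every spelling becomes its own "author", each holding a single book.
print(len(books_by_author))  # prints 5
```

This is why the extracted author field should be reviewed before it ends up in the catalog.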

So what I plan in a first run is:

creating a class for reading epub metadata (done and in testing currently: class.epub.php)
creating a class extending this, taking care for creating the .desc and .data files for a given book (next on my schedule; class.epubdesc.php)
creating a simple script making use of the two, and including it within e.g. the doc/ directory (script already exists and is tested by me for the past couple of weeks; needs rework incl. splitting-out the class.epubdesc.php)

That way you can at least have all the metadata extracted semi-automatically (e.g. epubmeta book.epub would create the .desc and .data in the same place), and you can check (and fix/extend) the created files.
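For readers who want to see what reading metadata from an epub involves, below is a minimal Python sketch. It is not the class.epub.php mentioned above, and the function name `read_epub_metadata` as well as the keys of the returned dict are assumptions made for illustration. It relies only on the standard epub layout: a zip archive whose META-INF/container.xml points to the package (.opf) file, which in turn carries Dublin Core elements such as dc:title and dc:creator.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the epub container and package documents.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_metadata(path_or_file):
    """Return title, authors, language and description found in an epub."""
    with zipfile.ZipFile(path_or_file) as archive:
        # container.xml tells us where the package (.opf) document lives.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        package = ET.fromstring(archive.read(rootfile.attrib["full-path"]))

    metadata = package.find("opf:metadata", NS)

    def texts(tag):
        # Collect the non-empty text of all <dc:tag> elements.
        found = metadata.findall("dc:" + tag, NS)
        return [e.text.strip() for e in found if e.text and e.text.strip()]

    def first(tag):
        values = texts(tag)
        return values[0] if values else None

    return {
        "title": first("title"),
        "author": texts("creator"),  # an epub may list several creators
        "lang": first("language"),
        "desc": first("description"),
    }
```

Calling `read_epub_metadata("book.epub")` returns a plain dict that can be inspected and corrected by hand before anything is written to disk, which matches the semi-automatic workflow described above.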

This is the next feature I have planned (of course, bug-fixes have higher priority, if bugs pop up 😉)

- class epub analyzes and extracts metadata from .epub files
- class epubdesc extends that, creates .data + .desc files, extracts cover
- script doc/epubmeta for those who want to use it :)

IzzySoft · 2015-05-02T21:08:26Z

Implemented as described above 😇


IzzySoft · 2015-05-26T06:38:56Z

This feature has now been added for Metadata (by default, the .data files). As outlined above, there might be a few issues, depending on who built the .epub and how they set up the metadata. I will list the possible fields and their pitfalls here:

author: see above
isbn: safe. This is either an ISBN, or not present at all.
publisher: in my experience, this often holds more than just the publisher, usually also the place and year of publication. It is up to you whether you want that.
rating: not sure. Rarely found in epubs.
series: Might not be the one you wish to file it under
series_index: ditto
tag: probably not one of those you are using to file your books, but you might wish to try
title: should be pretty safe, but no guarantees
uri: also pretty safe (and rarely used)


IzzySoft · 2015-05-26T19:15:07Z

5cd0535 completed this task, so I'll close the issue now. Some remarks on extracting the book description that you should be aware of:

though TOC is always present in .epub files, it's not always really useful (even if it fills the page)
a book description may be available. If it is, it might contain HTML tags which might break the XML for the OPDS part (make sure to have $skip_broken_xml set to TRUE if you care for OPDS – otherwise OPDS users might be unable to access such a book)
whether the head is useful or not is your decision. Doesn't usually break anything, but you never know how the metadata are set up (believe me, there are strange things around).
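To illustrate what "HTML tags which might break the XML" means in practice, here is a small Python sketch. It is an assumption-laden illustration, not how miniCalOPe or its $skip_broken_xml option works internally: the helper `is_wellformed_fragment` is hypothetical and simply checks whether a description parses as XML once wrapped in a single root element.

```python
import xml.etree.ElementTree as ET


def is_wellformed_fragment(html_fragment):
    """Return True if the fragment parses as XML inside one wrapper element."""
    try:
        ET.fromstring("<div>" + html_fragment + "</div>")
        return True
    except ET.ParseError:
        # Unclosed tags, undefined entities such as &nbsp;, and similar issues.
        return False


# A properly closed list is fine for both HTML and XML consumers.
print(is_wellformed_fragment("<ul><li>one</li><li>two</li></ul>"))  # prints True

# Browsers tolerate missing </li> tags, an XML parser does not.
print(is_wellformed_fragment("<ul><li>one<li>two</ul>"))  # prints False
```

A description that fails such a check may still render in a browser, but it is the kind of content that can make an OPDS feed unreadable for clients.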

You can always check ebooks manually using the doc/epubmeta script, which extracts the full load of available values. Now enjoy!


IzzySoft added type:enhancement status:in-progress affects:scan-scripts labels Apr 30, 2015

IzzySoft self-assigned this Apr 30, 2015

IzzySoft added a commit that referenced this issue May 20, 2015

start implementation on data extraction (see issue #4)

b1d759e

- global config switch to enable/disable this altogether
- config switch for cover extraction

IzzySoft added a commit that referenced this issue May 25, 2015

code reorg in preparation for #4

bdc56bb

IzzySoft added a commit that referenced this issue May 26, 2015

added possibility to auto-extract Metadata from epub files (see issue #4)

a3b10c3

- class epubdesc adjusted so we can pick which fields to extract
- corresponding config option extract2data

IzzySoft added a commit that referenced this issue May 26, 2015

added possibility to auto-extract book description from epub files (see issue #4)

d3fc306

- class epubdesc adjusted so we can pick which parts to extract
- corresponding config option extract2desc

IzzySoft added a commit that referenced this issue May 26, 2015

more diversification for book desc extraction (see issue #4)

5cd0535

- now permitting for 'all','desc','toc'

IzzySoft closed this as completed May 26, 2015

IzzySoft added status:fixed and removed status:in-progress labels May 26, 2015

IzzySoft added a commit that referenced this issue May 27, 2015

language specific terms in book desc headers (PS to #4)

a180fbd

- setup as CSV in lang/ebookterms.{lang}
- only specified terms are overwritten, so partial files are possible

IzzySoft added a commit that referenced this issue May 27, 2015

making sure data extracted from epub is used in the same run (PS to #4)

e909931
