This repository has been archived by the owner on Aug 1, 2019. It is now read-only.

detect metadata right from epub files #4

Closed
IzzySoft opened this issue Apr 30, 2015 · 4 comments


@IzzySoft
Owner

This issue has been raised with ticket:17 in the original tracker, and was transferred here slightly altered:

it would be nice if minicalope could fetch the title, author, lang, etc right from the epub file, instead of taking the dir/filename or relying on additional *.data files that will be hard to maintain for the user in the long run.

This might be tricky to do in PHP, so an alternative idea could be to allow the user to use a 'backend' that will parse the epub file and return metadata in the format expected by minicalope.

The referenced "ticket" includes a patch, relying on a backend written in C and also attached.

@IzzySoft
Owner Author

I strongly advise against using such a feature in automatic runs, especially when unsupervised:

  • epub description might contain "invalid HTML" (e.g. missing closing tags for lists), which then would break OPDS (while working fine in HTML)
  • the same author might turn up in many different spellings

For the latter, an example: Bertha von Suttner. In my experience (from running checks against the ~7,000 books in the German catalog on ebooks.qumran.org), she mostly turns up as either "Bertha Suttner" or "Suttner, Bertha", which already makes two entries for the same author. But she held a title, so she might also turn up as "Bertha von Suttner", "Suttner, Bertha von" and even "von Suttner, Bertha", making 5 different variants. Now, her title really would be "Freifrau von Suttner", and her full name is "Bertha Sophia Felicita Freifrau von Suttner" (and if you think that's already the most complicated name, check Ida Marie Louise Sophie Friederike Gustave Gräfin von Hahn 😇). Unsupervised automated runs would leave all possible combinations in the catalog, making "books by author X" quite … well, a broken concept.
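
To make the variant problem concrete, here is a minimal Python sketch (not part of minicalope; the `PARTICLES` set and the `author_key` helper are hypothetical) that reduces each spelling to a crude comparison key. It collapses the five short variants into one key, but the full name still yields a different key, which is why a manual check remains necessary.

```python
import re

# Hypothetical list of particles and titles to ignore; far from complete.
PARTICLES = {"von", "van", "de", "freifrau", "gräfin"}

def author_key(name: str) -> str:
    """Build a crude comparison key: lower-cased name parts, particles dropped, sorted."""
    parts = re.split(r"[,\s]+", name.strip().lower())
    return " ".join(sorted(p for p in parts if p and p not in PARTICLES))

variants = [
    "Bertha Suttner",
    "Suttner, Bertha",
    "Bertha von Suttner",
    "Suttner, Bertha von",
    "von Suttner, Bertha",
]

keys = {author_key(v) for v in variants}
print(keys)  # {'bertha suttner'}

# The full name carries extra given names, so it does not match the key above.
full_key = author_key("Bertha Sophia Felicita Freifrau von Suttner")
print(full_key)  # bertha felicita sophia suttner
```

Sorting the name parts is a deliberate simplification: it makes "Last, First" and "First Last" compare equal, at the cost of possible collisions between different authors.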

So what I plan as a first step is:

  • creating a class for reading epub metadata (done and currently in testing: class.epub.php)
  • creating a class extending this, which takes care of creating the .desc and .data files for a given book (next on my schedule; class.epubdesc.php)
  • creating a simple script that makes use of the two, and including it in e.g. the doc/ directory (the script already exists and I have been testing it for the past couple of weeks; it needs rework, including splitting out class.epubdesc.php)

That way you can at least have all the metadata extracted semi-automatically (e.g. epubmeta book.epub would create the .desc and .data in the same place), and you can check (and fix/extend) the created files.
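
For readers unfamiliar with where this metadata lives: an .epub is a ZIP archive whose `META-INF/container.xml` points to a package (OPF) file, and that file's `metadata` element holds Dublin Core entries such as `dc:title`, `dc:creator` and `dc:language`. The Python sketch below illustrates that lookup only; it is not how class.epub.php is implemented, and `read_epub_metadata`, `make_sample_epub` and the in-memory sample book are assumptions made for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def read_epub_metadata(source):
    """Return title, author and lang lists from an EPUB (path or file-like object)."""
    with zipfile.ZipFile(source) as z:
        # container.xml names the package file via the rootfile's full-path attribute.
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(z.read(opf_path))
    meta = package.find("opf:metadata", NS)

    def texts(tag):
        return [e.text.strip() for e in meta.findall(f"dc:{tag}", NS) if e.text]

    return {"title": texts("title"), "author": texts("creator"), "lang": texts("language")}

# Minimal sample files, just enough for the lookup above (not a complete EPUB).
CONTAINER = """<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Die Waffen nieder!</dc:title>
    <dc:creator>Suttner, Bertha von</dc:creator>
    <dc:language>de</dc:language>
  </metadata>
</package>"""

def make_sample_epub():
    """Build the sample archive in memory so the sketch runs without a real book."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("mimetype", "application/epub+zip")
        z.writestr("META-INF/container.xml", CONTAINER)
        z.writestr("content.opf", OPF)
    buf.seek(0)
    return buf

sample_meta = read_epub_metadata(make_sample_epub())
print(sample_meta)
```

Each field comes back as a list because an OPF file may carry several `dc:creator` or `dc:language` entries; whatever spelling the epub's creator chose is returned unchanged, which is exactly why the generated files should be reviewed.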

This is the next feature I have planned (of course, bug-fixes have higher priority, if bugs pop up 😉)

IzzySoft added a commit that referenced this issue May 2, 2015
- class epub analyzes and extracts metadata from .epub files
- class epubdesc extends that, creates .data + .desc files, extracts cover
- script doc/epubmeta for those who want to use it :)
@IzzySoft
Owner Author

IzzySoft commented May 2, 2015

Implemented as described above 😇

IzzySoft added a commit that referenced this issue May 20, 2015
- global config switch to enable/disable this altogether
- config switch for cover extraction
IzzySoft added a commit that referenced this issue May 25, 2015
IzzySoft added a commit that referenced this issue May 26, 2015
)

- class epubdesc adjusted so we can pick which fields to extract
- corresponding config option extract2data
@IzzySoft
Owner Author

This feature has now been added for metadata (by default, the .data files). As outlined above, there might be a few issues, depending on who built the .epub and how they set up the metadata. I will outline the possible fields and their pitfalls here:

  • author: see above
  • isbn: safe. This is either an ISBN, or not present at all.
  • publisher: in my experience, this often holds more than just the publisher, usually also the place and year of publication. Up to you whether you want that.
  • rating: not sure. Rarely found in epubs.
  • series: Might not be the one you wish to file it under
  • series_index: ditto
  • tag: probably not one of those you are using to file your books, but you might wish to try
  • title: should be pretty safe, but no guarantees
  • uri: also pretty safe (and rarely used)

IzzySoft added a commit that referenced this issue May 26, 2015
…ee issue #4)

- class epubdesc adjusted so we can pick which parts to extract
- corresponding config option extract2desc
IzzySoft added a commit that referenced this issue May 26, 2015
- now permitting for 'all','desc','toc'
@IzzySoft
Owner Author

5cd0535 completed this task, so I'll close the issue now. Some remarks on extracting the book description you should be aware of:

  • though TOC is always present in .epub files, it's not always really useful (even if it fills the page)
  • a book description may be available. If it is, it might contain HTML tags which might break the XML for the OPDS part (make sure to have $skip_broken_xml set to TRUE if you care for OPDS – otherwise OPDS users might be unable to access such a book)
  • whether the head is useful or not is your decision. Doesn't usually break anything, but you never know how the metadata are set up (believe me, there are strange things around).
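
To see why an extracted description can break the OPDS XML, the following Python sketch checks whether a fragment is well-formed once wrapped in a single root element. It is only an illustration of the failure mode; how minicalope itself detects broken XML for `$skip_broken_xml` is not shown in this thread, and `is_well_formed` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment: str) -> bool:
    """True if the fragment parses as XML when wrapped in one root element."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False

# Browsers tolerate unclosed list items; an XML parser does not.
print(is_well_formed("<p>A novel in two volumes.</p>"))  # True
print(is_well_formed("<ul><li>one<li>two</ul>"))         # False
# HTML-only entities are undefined in plain XML and fail as well.
print(is_well_formed("Volume&nbsp;1"))                   # False
```

Running a check like this over a generated .desc file before publishing it is one way to spot descriptions that need manual repair.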

You can always check ebooks manually using the doc/epubmeta script, which extracts the full load of available values. Now enjoy!

IzzySoft added a commit that referenced this issue May 27, 2015
- setup as CSV in lang/ebookterms.{lang}
- only specified terms are overwritten, so partial files are possible