This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

ENAMDICT/JMnedict clutter #2111

Open
stephenmk opened this issue Apr 5, 2022 · 6 comments

@stephenmk
Contributor

stephenmk commented Apr 5, 2022

The current version of the JMnedict dictionary for Yomichan is somewhat notorious for cluttering users' workspaces with terms. For example, a search for ひろこ pulls up over 30 term-reading pairs.

I'm wondering if everyone would be on board with an update to this dictionary in which many of these personal name terms are consolidated.

For example, a search for ろはん would bring up a single term containing all of the relevant kanji forms in the glossary (24 of them) instead of 24 individual term-reading pairs. A search for 紗子 would bring up a single term with all possible readings (6 of them) in its glossary instead of 6 different terms.

If this sounds good, I can begin working on updating yomichan-import to produce this new version of the dictionary.

(This issue is technically with yomichan-import, but I'm posting here because it's the more active repo.)


Here's a list of codes currently used in JMnedict. I'm thinking "fem", "given", "masc", "surname", and "unclass" are the relevant categories that should be consolidated, and possibly also "person" and "oth" depending on how they look after I do some research.

So if a particular name belongs to more than one of those categories, the consolidated term would have one "sense" for each category (with the appropriate tag), and each sense would contain a gloss with a semicolon-delimited list of the relevant readings or kanji forms of the name.

JMnedict code table

| code | description |
| --- | --- |
| char | "character" |
| company | "company name" |
| creat | "creature" |
| dei | "deity" |
| doc | "document" |
| ev | "event" |
| fem | "female given name or forename" |
| fict | "fiction" |
| given | "given name or forename, gender not specified" |
| group | "group" |
| leg | "legend" |
| masc | "male given name or forename" |
| myth | "mythology" |
| obj | "object" |
| organization | "organization name" |
| oth | "other" |
| person | "full name of a particular person" |
| place | "place name" |
| product | "product name" |
| relig | "religion" |
| serv | "service" |
| station | "railway station" |
| surname | "family or surname" |
| unclass | "unclassified name" |
| work | "work of art, literature, music, etc. name" |

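
The consolidation described above could be sketched roughly as follows. This is only an illustration in Python, not the yomichan-import implementation (which is written in Go): the function name `consolidate_by_reading`, the `(kanji, reading, tag)` tuple layout, and the sample rows are all assumptions made up for the example, and the tag assignments in the sample are not taken from JMnedict.

```python
from collections import defaultdict

# Categories proposed for consolidation in the post above.
GENERIC_TAGS = {"fem", "given", "masc", "surname", "unclass"}


def consolidate_by_reading(entries):
    """Group (kanji, reading, tag) triples into one term per reading.

    Returns {reading: {tag: "kanji1; kanji2; ..."}}, where each tag
    stands for one sense of the consolidated term.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for kanji, reading, tag in entries:
        if tag not in GENERIC_TAGS:
            continue  # places, companies, etc. are left for separate handling
        forms = grouped[reading][tag]
        if kanji not in forms:  # drop exact duplicates, keep first-seen order
            forms.append(kanji)
    return {
        reading: {tag: "; ".join(forms) for tag, forms in senses.items()}
        for reading, senses in grouped.items()
    }


# Made-up sample rows for illustration only (not real JMnedict data).
sample = [
    ("春香", "はるか", "fem"),
    ("遥", "はるか", "fem"),
    ("遥", "はるか", "given"),
    ("春香", "はるか", "fem"),      # duplicate row
    ("東京", "とうきょう", "place"),  # not a generic name category
]
result = consolidate_by_reading(sample)
```

With this sample, the reading はるか ends up as a single term with two senses ("fem" and "given"), and the place name is skipped. The same grouping keyed on the kanji form instead of the reading would give the kanji-to-readings direction.
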
@Thermospore
Contributor

Personally, I put JMnedict in a different profile and assigned it a different hotkey so it wouldn't clutter up my main dictionaries.

@stephenmk
Contributor Author

I have also disabled it in my main profile, but I wish I didn't have to. So it's my hope that this update would solve that problem.

@stephenmk
Contributor Author

Here are some mockups of what I'm imagining:

Query for 伊勢原八幡台石器時代住居跡

[Image: ise2]

Query for いせはらはちまんだいせっきじだいじゅうきょあと

[Image: ise]

(The only definition for 伊勢原八幡台石器時代住居跡 in JMdictDB is "Iseharahachimandaisekkijidaijuukyoato")

Query for はるか

[Image: haruka_kana]

Query for 春香

[Image: haruka_kanji]


What I'm discovering is that JMnedict contains two kinds of entries: those with glosses that merely transcribe the name into Latin characters (generally generic name entries -- given names, surnames, etc.), and those that have more details (specific people, famous places, brands, etc.). The former category represents the overwhelming majority of entries. I want to consolidate those entries (as pictured above) while leaving the other entries in the same format as they are in the current Yomichan dictionary.

It might also be worthwhile to split these two categories into two dictionary files. I imagine more people would be interested in a lightweight dictionary file with these more specific entries.

So the only challenge here is devising a way to determine whether or not an entry's gloss is merely a transcription of the corresponding kana. I tried using this golang library, but it doesn't seem robust enough to handle many situations (characters with macrons like ō, ん written as n', アイ written as "ay", etc.). So I'd need to implement a new comparison tool.
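
One possible shape for such a comparison tool, sketched in Python rather than Go: convert the kana to rough Hepburn romaji, then reduce both the romaji and the gloss to a loose "skeleton" (lowercase, diacritics and punctuation removed, long vowels collapsed) before comparing. Everything here is an assumption for illustration: the names `to_romaji`, `skeleton`, and `is_transcription` are hypothetical, the kana table is deliberately small, and irregular spellings such as アイ written as "ay" are not handled.

```python
import unicodedata

# Build a small hiragana -> Hepburn table row by row (illustrative, not exhaustive).
_ROWS = {
    "": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと",
    "n": "なにぬねの", "h": "はひふへほ", "m": "まみむめも", "r": "らりるれろ",
    "g": "がぎぐげご", "z": "ざじずぜぞ", "d": "だぢづでど", "b": "ばびぶべぼ",
    "p": "ぱぴぷぺぽ",
}
KANA = {k: c + v for c, row in _ROWS.items() for k, v in zip(row, "aiueo")}
KANA.update({
    "や": "ya", "ゆ": "yu", "よ": "yo", "わ": "wa", "を": "o", "ん": "n",
    "し": "shi", "ち": "chi", "つ": "tsu", "ふ": "fu",
    "じ": "ji", "ぢ": "ji", "づ": "zu",
})
SMALL_Y = {"ゃ": "a", "ゅ": "u", "ょ": "o"}


def to_romaji(reading):
    """Convert a kana string to rough Hepburn romaji, or None if unsupported."""
    out, double = [], False
    for ch in reading:
        if "ァ" <= ch <= "ヶ":  # fold katakana onto hiragana
            ch = chr(ord(ch) - 0x60)
        if ch == "っ":
            double = True  # double the next consonant
        elif ch == "ー" and out:
            out.append(out[-1][-1])  # long-vowel mark repeats the last letter
        elif ch in SMALL_Y and out and out[-1].endswith("i"):
            stem = out[-1][:-1]
            glide = "" if stem.endswith(("sh", "ch", "j")) else "y"
            out[-1] = stem + glide + SMALL_Y[ch]  # e.g. ki + ょ -> kyo
        elif ch in KANA:
            syl = KANA[ch]
            if double and syl[0] not in "aiueo":
                syl = ("t" if syl.startswith("ch") else syl[0]) + syl
            out.append(syl)
            double = False
        else:
            return None  # character outside the small table
    return "".join(out)


def skeleton(text):
    """Lowercase, strip diacritics and non-letters, collapse long vowels."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    letters = "".join(c for c in decomposed if "a" <= c <= "z")
    for long_form, short_form in (("ou", "o"), ("oo", "o"), ("uu", "u")):
        letters = letters.replace(long_form, short_form)
    return letters


def is_transcription(gloss, reading):
    """True if the gloss looks like a plain romanization of the reading."""
    romaji = to_romaji(reading)
    return romaji is not None and skeleton(gloss) == skeleton(romaji)
```

Under these assumptions, `is_transcription("Shin'ichi", "しんいち")` and `is_transcription("Tōkyō", "とうきょう")` both hold, because the apostrophe and macrons disappear in the skeleton, while a gloss with extra descriptive words would not match. Collapsing long vowels on both sides is lossy, so this errs toward treating near-matches as transcriptions.
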

Thoughts? Does this sound interesting to anyone?

@MarvNC

MarvNC commented Jul 21, 2022

I've been interested in this dictionary for a while. Might you have a test version or something available to try? I wouldn't really mind having some random Latin-character names, as it seems it would still be a huge improvement in the amount of clutter.

@stephenmk
Contributor Author

stephenmk commented Jul 22, 2022

The code I made for this is quite a mess, so I haven't published it anywhere.

As I explained in my post above, my first prototype contained three kinds of entries:

  1. The same sort of normal entries that you can find in the current version of JMnedict for Yomichan. I.e., kanji headwords, kana readings, and English-language glosses. These are usually entries for specific people, companies, and organizations.
  2. Kanji-to-kana lookups for generic names.
  3. Kana-to-kanji lookups for generic names.

I'm not so sure about how useful this third category is. Most of these entries look like a giant mess of kanji.

Example: よしたけ

[Image: yoshitake]
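
To make the second and third categories concrete, here is a minimal sketch of how both lookup directions could be derived from the same list of generic-name pairs. It is an assumption-laden illustration, not the prototype's code: `build_lookups` is a hypothetical name, and the sample pairs are invented for the example rather than taken from JMnedict.

```python
from collections import defaultdict


def build_lookups(pairs):
    """Build kanji-to-kana and kana-to-kanji lookups from (kanji, kana) pairs."""
    kanji_to_kana = defaultdict(list)
    kana_to_kanji = defaultdict(list)
    for kanji, kana in pairs:
        if kana not in kanji_to_kana[kanji]:
            kanji_to_kana[kanji].append(kana)
        if kanji not in kana_to_kanji[kana]:
            kana_to_kanji[kana].append(kanji)
    return dict(kanji_to_kana), dict(kana_to_kanji)


# Invented sample pairs for illustration only.
pairs = [
    ("義武", "よしたけ"),
    ("吉武", "よしたけ"),
    ("吉武", "よしたけ"),  # duplicate row
    ("紗子", "さこ"),
    ("紗子", "すずこ"),
]
kanji_to_kana, kana_to_kanji = build_lookups(pairs)
```

Because a common reading can map to dozens of kanji forms, the kana-to-kanji side tends to produce the long lists shown in the よしたけ example, which is also why including it makes the dictionary noticeably larger.
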

I've uploaded two versions of the test dictionary: one which contains these kana-to-kanji lookups, and one which does not. The former is about 50% larger than the latter, but it doesn't take too much longer to import into Yomichan (in a clean environment with no other dictionaries installed, anyway). So maybe it's not so bad. Let me know what you think.

Full version with kana-to-kanji lookups: jmnedict_2022_07_22_with_kana.zip

Smaller version without the kana lookups: jmnedict_2022_07_22.zip

@MarvNC

MarvNC commented Jul 26, 2022

I've been using the full version for a few days, and it works great. No complaints really; it reduces clutter by a lot. I'm not sure if the kana lookups help, but they don't hurt to have. Thanks for creating these dictionaries!
