ENAMDICT/JMnedict clutter #2111

stephenmk · 2022-04-05T17:07:14Z

The current version of the JMnedict dictionary for yomichan is somewhat notorious for cluttering users' workspaces with terms. For example, a search for ひろこ pulls up over 30 term-reading pairs.

I'm wondering if everyone would be on-board with an update to this dictionary in which many of these personal name terms are consolidated.

For example, a search for ろはん would bring up a single term containing all of the relevant kanji forms in its glossary (24 of them) instead of 24 individual term-reading pairs. A search for 紗子 would bring up a term with all possible readings (6 of them) in its glossary instead of 6 different terms.

If this sounds good, I can begin working on updating yomichan-import to produce this new version of the dictionary.

(This issue is technically with yomichan-import, but I'm posting here because it's the more active repo.)

Here's a list of codes currently used in JMnedict. I'm thinking "fem", "given", "masc", "surname", and "unclass" are the relevant categories that should be consolidated, and possibly also "person" and "oth" depending on how they look after I do some research.

So if a particular name belongs to more than one of those categories, the consolidated term would have one "sense" for each category (with the appropriate tag), and each sense would contain a gloss with a semicolon-delimited list of the relevant readings or kanji forms of the name.
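To make the proposed shape concrete, here is a rough Python sketch of the kanji-to-readings direction. It is only an illustration of the idea, not how yomichan-import works: the function name `consolidate_by_kanji`, the `(kanji, reading, tag)` input format, and the sample rows are all assumptions made up for this example, and the sample rows are not taken from JMnedict.

```python
from collections import defaultdict

# Categories proposed for consolidation in this post.
GENERIC_TAGS = ("surname", "given", "fem", "masc", "unclass")


def consolidate_by_kanji(entries):
    """Group (kanji, reading, tag) triples into one term per kanji form.

    Returns {kanji: [(tag, gloss), ...]} with one sense per category and
    the readings joined by "; " in the gloss. Entries whose tag is not in
    GENERIC_TAGS are ignored here (they would keep their current format).
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for kanji, reading, tag in entries:
        if tag in GENERIC_TAGS and reading not in grouped[kanji][tag]:
            grouped[kanji][tag].append(reading)
    return {
        kanji: [(tag, "; ".join(by_tag[tag])) for tag in GENERIC_TAGS if tag in by_tag]
        for kanji, by_tag in grouped.items()
    }


if __name__ == "__main__":
    # Hypothetical input rows, for illustration only.
    sample = [
        ("春香", "はるか", "fem"),
        ("春香", "はるこ", "fem"),
        ("春香", "はるか", "given"),
    ]
    for kanji, senses in consolidate_by_kanji(sample).items():
        for tag, gloss in senses:
            print(kanji, tag, gloss)
```

The kana-to-kanji direction would be the same grouping with the roles of `kanji` and `reading` swapped.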

JMnedict code table

code	description
char	"character"
company	"company name"
creat	"creature"
dei	"deity"
doc	"document"
ev	"event"
fem	"female given name or forename"
fict	"fiction"
given	"given name or forename, gender not specified"
group	"group"
leg	"legend"
masc	"male given name or forename"
myth	"mythology"
obj	"object"
organization	"organization name"
oth	"other"
person	"full name of a particular person"
place	"place name"
product	"product name"
relig	"religion"
serv	"service"
station	"railway station"
surname	"family or surname"
unclass	"unclassified name"
work	"work of art, literature, music, etc. name"

Thermospore · 2022-04-05T17:11:47Z

Personally I put jmnedict in a different profile and assigned it a different hotkey, so it wouldn't clutter up my main dictionaries

stephenmk · 2022-04-05T17:15:45Z

I have also disabled it in my main profile, but I wish I didn't have to. So it's my hope that this update would solve that problem.

stephenmk · 2022-04-07T00:02:51Z

Here are some mockups of what I'm imagining:

Query for 伊勢原八幡台石器時代住居跡

Query for いせはらはちまんだいせっきじだいじゅうきょあと

(The only definition for 伊勢原八幡台石器時代住居跡 in JMdictDB is "Iseharahachimandaisekkijidaijuukyoato")

Query for はるか

Query for 春香

What I'm discovering is that JMnedict contains two kinds of entries: those with glosses that merely transcribe the name into Latin characters (generally generic name entries -- given names, surnames, etc.), and those that have more details (specific people, famous places, brands, etc.). The former category represents the overwhelming majority of entries. I want to consolidate those entries (as pictured above) while leaving the other entries in the same format as they are in the current yomichan dictionary.

It might also be worthwhile to split these two categories into two dictionary files. I imagine more people would be interested in a lightweight dictionary file with these more specific entries.

So the only challenge here is devising a way to determine whether an entry's gloss is merely a transcription of the corresponding kana. I tried using this golang library, but it doesn't seem robust enough to handle many situations (characters with macrons like ō, ん written as n', アイ written as "ay", etc.). So I'd need to implement a new comparison tool.
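One possible shape for such a comparison tool, sketched in Python purely as an illustration: romanize the kana with a small Hepburn-style table, then normalize both the gloss and the romanized reading lossily (strip macrons, apostrophes, and spaces, and collapse long vowels) before comparing. Every name here (`kana_to_romaji`, `normalize`, `is_transcription`) is hypothetical. The sketch does not handle spellings like "ay" for アイ, "m" for ん before b/m/p, or small vowel kana such as ぁ, so it would misclassify some entries.

```python
import re
import unicodedata

_ROWS = {
    "": "あいうえお", "k": "かきくけこ", "g": "がぎぐげご",
    "s": "さしすせそ", "z": "ざじずぜぞ", "t": "たちつてと",
    "d": "だぢづでど", "n": "なにぬねの", "h": "はひふへほ",
    "b": "ばびぶべぼ", "p": "ぱぴぷぺぽ", "m": "まみむめも",
    "r": "らりるれろ",
}
_KANA = {k: cons + v for cons, row in _ROWS.items() for k, v in zip(row, "aiueo")}
# Hepburn irregulars plus the kana that do not fit the five-vowel rows.
_KANA.update({
    "し": "shi", "ち": "chi", "つ": "tsu", "ふ": "fu", "じ": "ji",
    "ぢ": "ji", "づ": "zu", "や": "ya", "ゆ": "yu", "よ": "yo",
    "わ": "wa", "を": "o", "ん": "n",
})
_SMALL = {"ゃ": "a", "ゅ": "u", "ょ": "o"}


def kana_to_romaji(kana):
    """Rough Hepburn-style romanization; returns None on unsupported characters."""
    out = []
    double_next = False
    for ch in kana:
        # Fold katakana into hiragana (the two blocks are offset by 0x60).
        if "ァ" <= ch <= "ヶ":
            ch = chr(ord(ch) - 0x60)
        if ch == "ー":
            continue  # long-vowel mark; vowel length is discarded by normalize()
        if ch == "っ":
            double_next = True
            continue
        if ch in _SMALL and out and out[-1].endswith("i"):
            # きゃ -> kya, but しゃ/ちゃ/じゃ -> sha/cha/ja (no "y" glide).
            glide = "" if out[-1].endswith(("shi", "chi", "ji")) else "y"
            out[-1] = out[-1][:-1] + glide + _SMALL[ch]
            continue
        syllable = _KANA.get(ch)
        if syllable is None:
            return None
        if double_next:
            syllable = syllable[0] + syllable
            double_next = False
        out.append(syllable)
    return "".join(out)


def normalize(romaji):
    """Lossy normalization applied to both sides before comparing."""
    # Decompose accented letters (ō -> o + combining macron), then keep only a-z.
    text = unicodedata.normalize("NFD", romaji.lower())
    text = re.sub(r"[^a-z]", "", text)
    text = text.replace("tch", "cch")  # Hepburn writes っち as "tchi"
    text = text.replace("ou", "o")
    return re.sub(r"([aiueo])\1+", r"\1", text)


def is_transcription(gloss, reading):
    romaji = kana_to_romaji(reading)
    return romaji is not None and normalize(gloss) == normalize(romaji)
```

Because the normalization throws away vowel length, it can only say "this gloss looks like a romanization of this reading"; it cannot check that the romanization is spelled correctly. Returning `None` for unsupported kana keeps such entries in the regular (unconsolidated) format rather than guessing.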

Thoughts? Does this sound interesting to anyone?

MarvNC · 2022-07-21T18:37:11Z

I've been interested in this dictionary for a while. Might you have a testing version or something available to try? I wouldn't really mind having some random Latin-character names, as it seems it would still be a huge improvement in the amount of clutter.

stephenmk · 2022-07-22T06:41:05Z

The code I made for this is quite a mess, so I haven't published it anywhere.

As I explained in my post above, my first prototype contained three kinds of entries:

1. The same sort of normal entries that you can find in the current version of JMnedict for Yomichan. I.e., kanji headwords, kana readings, and English-language glosses. These are usually entries for specific people, companies, and organizations.
2. Kanji-to-kana lookups for generic names.
3. Kana-to-kanji lookups for generic names.

I'm not so sure about how useful this third category is. Most of these entries look like a giant mess of kanji.

Example: よしたけ

I've uploaded two versions of the test dictionary: one which contains these kana-to-kanji lookups, and one which does not. The former is about 50% larger than the latter, but it doesn't take too much longer to import into Yomichan (in a clean environment with no other dictionaries installed, anyway). So maybe it's not so bad. Let me know what you think.

Full version with kana-to-kanji lookups: jmnedict_2022_07_22_with_kana.zip

Smaller version without the kana lookups: jmnedict_2022_07_22.zip

MarvNC · 2022-07-26T20:35:45Z

I've been using the full version for a few days, it works great. No complaints really, it reduces clutter by a lot. I'm not sure if the kana lookups help but they don't hurt to have. Thanks for creating these dictionaries!

stephenmk mentioned this issue Aug 1, 2022

Transliteration gloss type for JMnedict JMdictProject/JMdictIssues#73

Closed

stephenmk mentioned this issue Feb 2, 2023

New version of JMnedict (the proper name dictionary) FooSoft/yomichan-import#41

Merged

ShiroiKuma0 mentioned this issue Jul 8, 2023

JMNedict not importing - even slow import arianneorpilla/jidoujisho#272

Closed
