Dictionary import format 辞書形式

od-dict/1 · the contract for importing custom dictionaries

The app can load your own dictionaries. One dictionary is one JSON file in the od-dict/1 format. You produce it by converting an existing dictionary (for example a Yomitan dictionary) — by hand or with an LLM — and the app builds its own database from it.

The format is a superset of jmdict-simplified (scriptin/jmdict-simplified): a valid jmdict-simplified word is already valid here, so almost no conversion is needed for that source.

This page is a readable summary. The full, code-verified contract lives in the repository: docs/od-dict-format.md.

1. The whole file

{
  "format": "od-dict/1",          // required, exact string
  "source": "custom",             // required; an id for this dictionary (§2)
  "metadata": {                   // optional; for display/provenance only
    "title": "My Dictionary",
    "license": "CC BY-SA 4.0",
    "url": "https://…"
  },
  "tags": { "v1": "Ichidan verb" }, // optional; NOT consumed (§6)
  "words": [
    {
      "id": "1259420",            // required; UNIQUE within this file (§3)
      "kanji": ["食べる"],         // optional; omit for kana-only words
      "kana":  ["たべる"],         // at least one form (kanji or kana) must exist
      "sense": [                  // required; ≥1
        {
          "partOfSpeech": ["v1"], // optional; coded POS ONLY, never free text (§4)
          "misc": ["transitive"], // optional; free labels, shown verbatim
          "gloss": [              // the translations
            { "lang": "en", "text": "to eat" },
            "пожирать"            // bare string = { lang: "ru", origin: "original" }
          ],
          "examples": [ ["ご飯を食べる", "to eat a meal"] ],  // optional; [japanese, translation]
          "references": []        // optional; cross-links (§5)
        }
      ]
    }
  ]
}

2. Top level

field	req.	meaning
`format`	yes	Exactly `"od-dict/1"`.
`source`	yes	An id string for the whole dictionary; use a stable slug such as `"custom"` or `"jitendex"`. It is the namespace for `id`s and cross-references. An unknown source is simply ranked after the built-in dictionaries; it is never rejected.
`metadata`	no	Free-form object (title, license, url). Stored for display only; the builder does not read it.
`tags`	no	A `code → label` map for jmdict-simplified compatibility. Not consumed (§6). Put labels you want shown directly in `misc`.
`words`	yes	The array of entries.

Security: the design intends to force source = "custom" at import so a file can't claim to be "jmdict". That is not implemented yet — treat source as advisory until the validating importer lands.
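
The top-level rules in the table can be checked before handing a file to the app. The following is a minimal sketch, not the app's importer: the function name `check_top_level` and the wording of the messages are hypothetical, and requiring `source` to be non-empty is an assumption rather than something the contract states.

```python
REQUIRED_FORMAT = "od-dict/1"

def check_top_level(doc: dict) -> list[str]:
    """Return a list of problems in the top-level object (empty list = OK)."""
    problems = []
    if doc.get("format") != REQUIRED_FORMAT:
        problems.append('format must be exactly "od-dict/1"')
    source = doc.get("source")
    # Assumption: an empty string is not a usable id.
    if not isinstance(source, str) or not source:
        problems.append("source must be a non-empty string")
    if not isinstance(doc.get("words"), list):
        problems.append("words must be an array")
    # metadata and tags are optional; only their type is checked here.
    for key in ("metadata", "tags"):
        if key in doc and not isinstance(doc[key], dict):
            problems.append(f"{key} must be an object when present")
    return problems
```

Load the file with `json.load` first; note that the annotated examples on this page contain `//` comments, which a strict JSON parser will not accept.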

3. Words and identity

{
  "id": "1259420",
  "homograph": "II",          // optional; disambiguator when head+reading collide
  "label": "colloquial",      // optional; entry-level tag shown on the card
  "isExpression": true,       // optional; multi-word phrase
  "common": true,             // optional; a high-frequency entry
  "kanji": ["食べる", "喰べる"],  // 0…n written forms
  "kana":  ["たべる"],          // 0…n readings
  "sense": [ /* … */ ]
}

Identity is (source, id), and it is UNIQUE.

id must be unique within the file. The table is written with a plain INSERT against UNIQUE(source, source_entry_id), so a duplicate id throws and aborts the import batch — it is not silently ignored. Use your source's own id; if it has none, synthesise a stable one (a counter, or the headword).
kanji/kana items may be a bare string (the normal case) or a jmdict-simplified object { "text": "…" }; only text is read, the rest is accepted but dropped. A bare string is preferred.
A word needs at least one form. For Japanese, give kana; omit kanji for kana-only words.
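
Because a duplicate id aborts the import, it is worth checking the three rules above before importing. A minimal sketch, assuming the words have already been parsed into Python dicts; `form_text` and `check_words` are hypothetical helper names, not part of the app.

```python
def form_text(item):
    """A form is a bare string or a jmdict-simplified object; only text is read."""
    return item if isinstance(item, str) else item["text"]

def check_words(words):
    """Return a list of problems: missing or duplicate ids, no forms, no senses."""
    problems = []
    seen = set()
    for index, word in enumerate(words):
        word_id = word.get("id")
        if word_id is None:
            problems.append(f"word {index}: missing id")
        elif word_id in seen:
            problems.append(f"word {index}: duplicate id {word_id!r}")
        seen.add(word_id)
        forms = [form_text(f) for f in word.get("kanji", []) + word.get("kana", [])]
        if not forms:
            problems.append(f"word {index}: needs at least one kanji or kana form")
        if not word.get("sense"):
            problems.append(f"word {index}: needs at least one sense")
    return problems
```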

Grouping. The app shows one card per (head + reading) group, merging across dictionaries. In your file, emit one word per entry — collect all written forms into kanji, all readings into kana, all meanings into sense. If your source splits one entry across rows (Yomitan does, via a shared sequence), merge them into one word first, or you get several cards for one entry.
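
The merge step can be sketched as follows. The row shape used here (`sequence`, `term`, `reading`, `glosses`) is a hypothetical, already-parsed simplification and not Yomitan's actual file layout. The sketch also assumes that a row whose term equals its reading is kana-only, that glosses are English, and that identical senses should be collapsed.

```python
def merge_rows(rows):
    """Group rows that share a sequence number into one od-dict/1 word each."""
    words = {}
    for row in rows:
        word = words.setdefault(row["sequence"], {
            "id": str(row["sequence"]), "kanji": [], "kana": [], "sense": [],
        })
        # Assumption: term == reading means there is no written (kanji) form.
        if row["term"] != row["reading"] and row["term"] not in word["kanji"]:
            word["kanji"].append(row["term"])
        if row["reading"] not in word["kana"]:
            word["kana"].append(row["reading"])
        sense = {"gloss": [{"lang": "en", "text": g} for g in row["glosses"]]}
        if sense not in word["sense"]:  # skip senses repeated across rows
            word["sense"].append(sense)
    return list(words.values())
```

Words come out in the order their sequence number first appears, since Python dicts preserve insertion order.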

4. `partOfSpeech` — the one field that can silently break search

This is the only field where a wrong value removes the word from search with no error. POS does not affect finding a word by its dictionary form or reading — 食べる and たべる are always found. POS is used in exactly one place: the deinflection gate — finding a word from an inflected surface the user typed (食べた, 静かな).

entry POS	verdict	effect
absent / not a conjugation class	`weak`	still found, ranked a little lower
a conjugation class that matches	`confirmed`	found, ranked normally
a conjugation class that mismatches	`rejected`	dropped — not found from that inflection

If you are not sure of the conjugation class, omit partOfSpeech entirely. Absent is safe (weak), wrong is fatal (rejected). Omit beats guess.

Codes only, never free text. Put readable labels ("verb", "honorific") in misc, where they are shown verbatim.
Only the conjugation classes matter. Other codes (n, exp, vt, adv …) are ignored by search — harmless but useless.
The Yomitan trap: never copy the collapsed v5. The engine has no v5 class — it needs the specific one (v5k, v5r …). A bare "v5" matches nothing → rejected, i.e. worse than untagged. Take the specific code from definitionTags, or expand by the dictionary form's final kana, or omit.

The conjugation classes the engine understands:

v1                              ichidan (-ru):       食べる, 見る
v5u v5k v5g v5s v5t v5n v5b v5m v5r   godan, by the dictionary-form final kana
vk                              来る (kuru)
vs vs-i vs-s                    する verbs
adj-i                           い-adjectives:        高い, 良い
adj-na                          な-adjectives:        静か, 綺麗
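
To make the verdict table concrete, here is a toy model of the gate. It is a sketch of the rules as stated on this page, not the engine's code: `gate_verdict` is a hypothetical name, `required_class` stands for the class implied by the inflection the user typed, matching is exact (jmdict subclasses are not handled), and the collapsed `v5` is hard-coded as a claim that never matches because the text says it is rejected.

```python
CONJUGATION_CLASSES = {
    "v1",
    "v5u", "v5k", "v5g", "v5s", "v5t", "v5n", "v5b", "v5m", "v5r",
    "vk", "vs", "vs-i", "vs-s", "adj-i", "adj-na",
}
NEVER_MATCHES = {"v5"}  # collapsed Yomitan code: a claim with no matching class

def gate_verdict(entry_pos, required_class):
    """Return "weak", "confirmed" or "rejected" for one deinflection candidate."""
    claims = [c for c in entry_pos
              if c in CONJUGATION_CLASSES or c in NEVER_MATCHES]
    if not claims:
        return "weak"  # absent, or only non-conjugation codes such as n, vt
    return "confirmed" if required_class in claims else "rejected"
```

In this model an untagged entry is still found from 書いた as `weak`, while the same entry tagged `["v5"]` is `rejected`.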

5. Glosses, languages and cross-references

"гулять"                                  // bare string = { "lang": "ru", "origin": "original" }
{ "lang": "en", "text": "to take a walk" }
{ "lang": "ru", "text": "то же, что {0}", "origin": "original" }

A bare string is Russian (lang: "ru"). For any other language use the object form with lang.
origin is "original" (default) or "machine" (AI-produced); display only.
How lang affects search: a gloss with lang: "ru" goes in the Russian search channel; every other lang goes in the English channel. The channel is chosen by the script the user typed: Cyrillic → ru, Latin → en. So a Japanese-English dictionary's glosses must be lang: "en" to be reachable by an English search. A monolingual Japanese gloss (lang: "ja") lands in the en channel and is effectively display-only — still found by headword and reading, but not by its definition text.
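
The defaults and channel rules above can be sketched as follows. The function names are hypothetical, the object form is assumed to always carry `lang`, and detecting the query script from its first Cyrillic or Latin letter is an assumption about a detail this page does not specify.

```python
def normalize_gloss(gloss):
    """Expand a bare string to the object form; origin defaults to "original"."""
    if isinstance(gloss, str):
        return {"lang": "ru", "text": gloss, "origin": "original"}
    return {
        "lang": gloss["lang"],
        "text": gloss["text"],
        "origin": gloss.get("origin", "original"),
    }

def gloss_channel(gloss):
    """Russian glosses go to the ru channel; every other language goes to en."""
    return "ru" if normalize_gloss(gloss)["lang"] == "ru" else "en"

def query_channel(query):
    """Pick a channel from the typed script: Cyrillic -> ru, Latin -> en."""
    for ch in query:
        if "\u0400" <= ch <= "\u04FF":  # basic Cyrillic block
            return "ru"
        if ch.isascii() and ch.isalpha():
            return "en"
    return None  # neither script, for example a Japanese query
```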

Cross-references — ICU placeholders {0}, {1} … in the gloss text, resolved against a references array:

{
  "gloss": [ { "lang": "en", "text": "polite form of {0}" } ],
  "references": [ { "label": "", "to": "1234567", "text": "言う" } ]
}

{0} is replaced in place by references[0].text as a tappable link.
to is the id of the target word in this same file; the link opens { source: <this dictionary>, id: to }. A to that matches nothing simply opens nothing.
label is a relation marker ("see", "cf.") for trailing references (no {n}).
A literal { in gloss text must be written '{' (ICU quoting).
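
Placeholder resolution can be sketched like this. It is a simplification under stated assumptions: it returns plain text instead of a tappable link, leaves an out-of-range placeholder untouched, and does not implement ICU apostrophe quoting. The names `render_gloss` and `trailing_references` are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{(\d+)\}")

def render_gloss(text, references):
    """Replace each {n} with references[n]["text"]."""
    def substitute(match):
        index = int(match.group(1))
        if index < len(references):
            return references[index]["text"]
        return match.group(0)  # no such reference: keep the placeholder
    return PLACEHOLDER.sub(substitute, text)

def trailing_references(text, references):
    """References that no {n} in the text points at; shown after the gloss."""
    used = {int(n) for n in PLACEHOLDER.findall(text)}
    return [ref for i, ref in enumerate(references) if i not in used]
```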

6. What the format does not carry

These are added by the app from its own shared data, keyed by surface/reading, so every dictionary agrees — you do not provide them: pitch accent, furigana, romaji, frequency / "common" ranking, audio, JLPT/WaniKani levels, kanji breakdown. Also not consumed (though accepted for compatibility): the top-level tags map and per-form/per-gloss extras. The card renders misc/field/dialect as free strings, verbatim — no code expansion — so put readable labels there directly.

7. Minimal examples

The smallest word — kana only with one gloss:

{ "id": "w1", "kana": ["ありがとう"], "sense": [ { "gloss": [ { "lang": "en", "text": "thank you" } ] } ] }

// (a) JA→EN godan. POS is the SPECIFIC class v5k, not Yomitan's "v5".
{ "id": "1578850", "kanji": ["書く"], "kana": ["かく"],
  "sense": [ { "partOfSpeech": ["v5k", "vt"],
    "gloss": [ { "lang": "en", "text": "to write; to compose; to pen" } ] } ] }

// (b) JA→EN with an inline cross-reference.
{ "id": "敷衍", "kanji": ["敷衍", "敷延"], "kana": ["ふえん"],
  "sense": [ { "gloss": [ { "lang": "en", "text": "amplification (cf. {0})" } ],
    "references": [ { "label": "", "to": "演繹", "text": "演繹" } ] } ] }

// (c) kana-only な-adjective. adj-na IS a deinflection class, so tag it.
{ "id": "1000230", "kana": ["きれい"],
  "sense": [ { "partOfSpeech": ["adj-na"], "gloss": [ { "lang": "en", "text": "pretty; clean; neat" } ] } ] }

8. POS allowlist

Emit a code into partOfSpeech only if it is one of these (or a jmdict subclass starting with one). Anything else: put the human label in misc and omit it here.

v1
v5u v5k v5g v5s v5t v5n v5b v5m v5r
vk
vs vs-i vs-s
adj-i
adj-na

Collapsed Yomitan "v5" → expand by the dictionary-form final kana: う→v5u く→v5k ぐ→v5g す→v5s つ→v5t ぬ→v5n ぶ→v5b む→v5m る→v5r. Unsure → omit.
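
A converter can apply the allowlist and the v5 expansion in one pass. This is a minimal sketch: `clean_pos` is a hypothetical helper, it takes the final kana from the kana reading of the dictionary form, and it reads "a jmdict subclass starting with one" literally as a prefix test.

```python
ALLOWED = (
    "v1",
    "v5u", "v5k", "v5g", "v5s", "v5t", "v5n", "v5b", "v5m", "v5r",
    "vk", "vs", "vs-i", "vs-s", "adj-i", "adj-na",
)
V5_BY_FINAL_KANA = {
    "う": "v5u", "く": "v5k", "ぐ": "v5g", "す": "v5s", "つ": "v5t",
    "ぬ": "v5n", "ぶ": "v5b", "む": "v5m", "る": "v5r",
}

def clean_pos(codes, reading):
    """Keep only codes that are safe to emit into partOfSpeech."""
    kept = []
    for code in codes:
        if code == "v5":
            # Expand the collapsed code by the final kana; unsure means omit.
            code = V5_BY_FINAL_KANA.get(reading[-1:])
            if code is None:
                continue
        # Literal reading of the rule: an allowed code or one starting with it.
        if code.startswith(ALLOWED) and code not in kept:
            kept.append(code)
    return kept
```

Codes that are dropped here can go into misc as readable labels instead.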
