Mapping a language to a font charset

That’s interesting, but it must be a bit tricky in practice (caveat: this is not my area of expertise).

  • A given language tag may map to more than one script:
iex> Cldr.Locale.language_data["ja"]
%{primary: %{scripts: [:Jpan], territories: [:JP]}}
iex> Cldr.Locale.language_data["zh"]
%{
  primary: %{scripts: [:Hans, :Hant], territories: [:CN, :HK, :MO, :SG, :TW]},
  secondary: %{scripts: [:Bopo, :Phag], territories: [:ID, :MY, :TH, :US, :VN]}
}
  • A given script can map to a non-contiguous set of codepoints:
iex> Unicode.Script.scripts.hangul  
[
  {4352, 4607},
  {12334, 12335},
  {12593, 12686},
  ...
  {65490, 65495},
  {65498, 65500}
]
  • Some languages, like Japanese, use multiple scripts. For Japanese: hiragana, katakana and kanji (Han in Unicode speak, which has 94,215 code points!):
iex> Unicode.Script.scripts.katakana
[
  {12449, 12538},
  ...
  {110592, 110592},
  {110880, 110882},
  {110948, 110951}
]

So mapping a language code to the right scripts to the right code points is not a trivial exercise. Possible for sure, but it is hard to see this being a general-purpose solution.
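To make the chain concrete, here is a minimal sketch of the two lookups described above (language to scripts, then scripts to code point ranges). The `LanguageRanges` module and its `@script_names` table are hypothetical and not part of any library. The sample data is copied from the iex output above, so the katakana list holds only the ranges that were shown and the result is incomplete by construction.

```elixir
defmodule LanguageRanges do
  # Hypothetical helper for illustration; not part of ex_cldr or the unicode package.

  # A few ISO 15924 codes expanded to the Unicode script names they combine.
  # Only the codes that appear in the examples above are covered.
  @script_names %{
    :Jpan => [:han, :hiragana, :katakana],
    :Kore => [:hangul, :han],
    :Hans => [:han],
    :Hant => [:han]
  }

  # Step 1: language code -> Unicode script names (primary scripts only).
  def scripts_for(language, language_data) do
    language_data
    |> get_in([language, :primary, :scripts])
    |> List.wrap()
    |> Enum.flat_map(&Map.get(@script_names, &1, []))
    |> Enum.uniq()
  end

  # Step 2: script names -> one sorted list of {first, last} code point ranges.
  # Scripts missing from script_ranges contribute nothing.
  def ranges_for(language, language_data, script_ranges) do
    language
    |> scripts_for(language_data)
    |> Enum.flat_map(&Map.get(script_ranges, &1, []))
    |> Enum.sort()
  end
end

# Sample data copied from the output shown above (truncated there, so truncated here).
language_data = %{
  "ja" => %{primary: %{scripts: [:Jpan], territories: [:JP]}},
  "zh" => %{primary: %{scripts: [:Hans, :Hant], territories: [:CN, :HK, :MO, :SG, :TW]}}
}

script_ranges = %{
  katakana: [{12449, 12538}, {110592, 110592}, {110880, 110882}, {110948, 110951}]
}

IO.inspect(LanguageRanges.scripts_for("zh", language_data))
IO.inspect(LanguageRanges.ranges_for("ja", language_data, script_ranges))
```

Even in this toy form, the composite codes (`:Jpan`, `:Hans`, `:Hant`) need a hand-maintained table, which is part of why a general-purpose version looks hard.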


I suspect it won’t be that straightforward. Awesome library, btw!
I tried working out how long the list would be if I had to put all the ranges in CSS, and it is arguably okay (especially with gzip/brotli) unless the script is common or arabic.

[
  common: 2363,
  arabic: 781,
  ethiopic: 474,
  latin: 471,
  greek: 450,
  inherited: 385,
  han: 276,
  grantha: 223,
  tamil: 204,
  hangul: 180,
  katakana: 179,
  ...
]
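For reference, here is one way such a length could be measured. It is a sketch under assumptions: the CSS attribute in question is taken to be `unicode-range`, the `UnicodeRange` module is hypothetical, and the exact formatting behind the figures above is not known, so the numbers need not match.

```elixir
defmodule UnicodeRange do
  # Hypothetical helper: turns {first, last} code point tuples into the
  # comma-separated value of a CSS unicode-range descriptor.
  def to_css(ranges) do
    Enum.map_join(ranges, ",", &format_range/1)
  end

  # A single code point is written as U+XXXX, a span as U+XXXX-YYYY.
  defp format_range({first, first}), do: "U+" <> hex(first)
  defp format_range({first, last}), do: "U+" <> hex(first) <> "-" <> hex(last)

  defp hex(code_point), do: Integer.to_string(code_point, 16)
end

# The first three Hangul ranges from the output earlier in the thread.
hangul_excerpt = [{4352, 4607}, {12334, 12335}, {12593, 12686}]
css_value = UnicodeRange.to_css(hangul_excerpt)

IO.puts(css_value)
# prints "U+1100-11FF,U+302E-302F,U+3131-318E"

# Character count of the value, before any gzip/brotli compression.
IO.inspect(String.length(css_value))
```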

So mapping a language code to the right scripts to the right code points is not a trivial exercise.

Indeed. However, unless OP is creating their own font subsets, it is not necessary to check the relationship between scripts and codepoints (the Noto variants did that for us). The list of glyphs available in each font file can be used instead to generate the list of codepoint ranges, which I imagine could be automated with a FontForge script. Then I remembered where I first saw this CSS attribute: Google Fonts. The CSS files they provide already include codepoint ranges for the fonts they offer.
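As a rough sketch of how those ranges could be read back out of such a stylesheet: the code below assumes each `@font-face` block is preceded by a comment naming its subset and contains a `unicode-range` declaration. The `FontCss` module is hypothetical, and the sample CSS is made up for illustration rather than copied from Google Fonts.

```elixir
defmodule FontCss do
  # Hypothetical helper: returns %{subset_name => [range, ...]} for every
  # "/* subset */ @font-face { ... unicode-range: ...; }" block it finds.
  def unicode_ranges(css) do
    ~r"/\*\s*([\w-]+)\s*\*/\s*@font-face\s*\{[^}]*?unicode-range:\s*([^;]+);"
    |> Regex.scan(css, capture: :all_but_first)
    |> Map.new(fn [subset, ranges] ->
      {subset, String.split(ranges, ~r/\s*,\s*/, trim: true)}
    end)
  end
end

# Made-up sample in the assumed layout; font name and URLs are placeholders.
css = """
/* greek */
@font-face {
  font-family: 'Example Sans';
  src: url(example-greek.woff2) format('woff2');
  unicode-range: U+0370-03FF;
}
/* latin */
@font-face {
  font-family: 'Example Sans';
  src: url(example-latin.woff2) format('woff2');
  unicode-range: U+0000-00FF, U+0131, U+0152-0153;
}
"""

IO.inspect(FontCss.unicode_ranges(css))
```

A regular expression is enough for a sketch like this; a real stylesheet with nested rules or unusual formatting would be better served by a proper CSS parser.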


Oh, that seems straightforward indeed. Just need a mapping from the ISO 639 language code and/or ISO 3166 territory code to the Google Fonts language, and it seems good to go!
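Such a mapping might look roughly like the sketch below. Everything in it is an assumption: the `FontSubset` module is hypothetical, the subset names are my guess at what Google Fonts calls them and should be checked against the actual CSS, and the lookup goes through the script codes in the language data shown earlier rather than through territories.

```elixir
defmodule FontSubset do
  # Hypothetical table: ISO 15924 script code => assumed Google Fonts subset name.
  @subsets %{
    :Latn => "latin",
    :Cyrl => "cyrillic",
    :Grek => "greek",
    :Jpan => "japanese",
    :Kore => "korean",
    :Hans => "chinese-simplified",
    :Hant => "chinese-traditional"
  }

  # Returns the subset names for a language's primary scripts,
  # skipping any script that has no entry in the table.
  def for_language(language, language_data) do
    scripts = get_in(language_data, [language, :primary, :scripts]) || []

    Enum.flat_map(scripts, fn script ->
      case Map.fetch(@subsets, script) do
        {:ok, subset} -> [subset]
        :error -> []
      end
    end)
  end
end

# Language data in the shape shown earlier in the thread.
language_data = %{
  "ja" => %{primary: %{scripts: [:Jpan], territories: [:JP]}},
  "zh" => %{primary: %{scripts: [:Hans, :Hant], territories: [:CN, :HK, :MO, :SG, :TW]}}
}

IO.inspect(FontSubset.for_language("ja", language_data))
IO.inspect(FontSubset.for_language("zh", language_data))
```

For a language like Chinese, the territory would still be needed to choose between the simplified and traditional subsets, which is where the territory code comes in.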

Was a fun excursion!
