Products, display names, and user-generated content rarely fit in the 94 displayable characters in the ASCII table for a worldwide audience. While you've migrated storage of such content to UTF-8 or Punycode as a hack, actually displaying it is another matter.
Support for multiple fonts
Microsoft, Apple, and Google implement font substitution in their text rendering APIs so that when you write "Hello 世界" with only Arial.ttf loaded, the remaining characters are drawn from another font as if nothing unusual occurred. The styles of the fonts may clash, but that is better than seeing tofu □□□□□ □□□ □□□□□□□!
These three also maintain the most impactful software in history: the web browser. As long as the encoding is understood on both sides, you can be sure that glyphs will render with a fallback when the WOFF file used on the page lacks a grapheme (e => 0x65) or grapheme cluster (🇺🇸 => 0xf09f 87ba + 0xf09f 87b8). Whether a customer's name is bolded in a div or drawn as text in a canvas, font substitution is there to preserve the consistent delivery of symbols across languages, cultures, and time periods.
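To make the byte values in the parentheses concrete, here is a small Python sketch that lists the codepoints and UTF-8 bytes behind both examples. It uses only the standard library; the helper name `describe` is made up for illustration.

```python
def describe(text: str) -> dict:
    """Return the codepoints and UTF-8 bytes that make up `text`."""
    return {
        "codepoints": [f"U+{ord(ch):04X}" for ch in text],
        # Group the hex output four bytes at a time for readability.
        "utf8": text.encode("utf-8").hex(" ", 4),
    }


print(describe("e"))
# {'codepoints': ['U+0065'], 'utf8': '65'}

# The US flag is two regional indicator codepoints drawn as one cluster.
print(describe("\U0001F1FA\U0001F1F8"))
# {'codepoints': ['U+1F1FA', 'U+1F1F8'], 'utf8': 'f09f87ba f09f87b8'}
```

A font has to cover both regional indicators (and know how to combine them) before the flag shows up as a single glyph.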
What if a symbol is not available in the primary font and you have an ideal second font for other glyphs? Merging all possible fonts to cover every grapheme that could be seen is not reasonable. Browsers let you list multiple font families in a single style for this reason. Coupled with unicode-range, the browser will only download the font files necessary to render the page and fall back to system font substitution when the available fonts cannot render the remaining graphemes. On the web, Noto Color Emoji is split into nine or so files. When the US flag grapheme cluster is rendered, the browser does not need the font data for a frog or a coffee.
@font-face {
  font-family: 'Noto Color Emoji';
  font-style: normal;
  font-weight: 400;
  font-display: swap;
  src: url(https://fonts.gstatic.com/s/notocoloremoji/v38/Yq6P-KqIXTD0t4D9z1ESnKM3-HpFabsE4tq3luCC7p-aXxcn.1.woff2) format('woff2');
  unicode-range: U+200d, U+2620, U+26a7, U+fe0f, U+1f308, U+1f38c, U+1f3c1, U+1f3f3-1f3f4, U+1f6a9, U+e0062-e0063, U+e0065, U+e0067, U+e006c, U+e006e, U+e0073-e0074, U+e0077, U+e007f;
}
Font splitting and substitution allow native and web applications to use memory and bandwidth resources more efficiently. For this to work, the font engine necessarily needs to support multiple fonts at once to correctly implement a consistent baseline, along with many other details that only font authors and font engine creators have an interest in.
Without Font Substitution
Rendering multilingual text gets a lot harder when you’re outside the safety net of native OS rendered applications and web browsers. Think game engines, embedded systems, or serverless wasm workloads like @levischuck/render-html. If you're lucky like me with satori, the tools you use will natively support multiple fonts. If you're unlucky and the tool you use merely wraps the FreeType API like the Pillow ImageFont module, then you'll have to segment text and perform baseline alignment and line wrapping yourself, or integrate with a library like libraqm.
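As a rough idea of what "segment text" means here, the sketch below splits a string into runs that can each be drawn with a single font. The `COVERAGE` table and its family names are hypothetical stand-ins; real coverage would come from each font's cmap. It works per codepoint, so it ignores grapheme clusters and combining marks, and it does nothing about baseline alignment or line wrapping.

```python
from itertools import groupby

# Hypothetical coverage table: family -> codepoints it can draw.
COVERAGE = {
    "Latin Font": set(range(0x20, 0x7F)),
    "CJK Font": set(range(0x4E00, 0xA000)),
}


def pick_family(ch, coverage):
    """Return the first family that covers `ch`, or None (tofu)."""
    cp = ord(ch)
    for family, codepoints in coverage.items():
        if cp in codepoints:
            return family
    return None


def segment(text, coverage=COVERAGE):
    """Split `text` into (family, run) pairs of consecutive characters."""
    return [
        (family, "".join(chars))
        for family, chars in groupby(text, key=lambda ch: pick_family(ch, coverage))
    ]


print(segment("Hello 世界"))
# [('Latin Font', 'Hello '), ('CJK Font', '世界')]
```

Each run would then be measured and drawn with its own font object, which is where the baseline and wrapping work begins.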
Assuming that you have a solution to sequence multiple fonts while looking great 🌟, the next issue is knowing which fonts to load into memory before the expensive task of fetching fonts from disk or over the network.
Take a little inspiration from the CSS @font-face at-rule and its unicode-range property and we can design a small schema that can help us identify which fonts to load by iterating over each codepoint.
Here's what Google serves for Noto Sans Thai:
@font-face {
  font-family: 'Noto Sans Thai';
  font-style: normal;
  font-weight: 100 900;
  font-stretch: 100%;
  font-display: swap;
  src: url(https://fonts.gstatic.com/s/notosansthai/v29/iJWQBXeUZi_OHPqn4wq6hQ2_hbJ1xyN9wd43SofNWcdfKI2hX2g.woff2) format('woff2');
  unicode-range: U+02D7, U+0303, U+0331, U+0E01-0E5B, U+200C-200D, U+25CC;
}
You'll see several individual codepoints and a few ranges in these CSS responses. However, repeating this verbatim with many fonts loaded may surprise you later. U+02D7 is "Modifier Letter Minus Sign." How many other fonts do you suppose have the exact same symbol?
/* latin-ext */
@font-face {
  font-family: 'Noto Sans KR';
  font-style: normal;
  font-weight: 100 900;
  font-display: swap;
  src: url(https://fonts.gstatic.com/s/notosanskr/v39/PbykFmXiEBPT4ITbgNA5CgmG337t0JM.woff2) format('woff2');
  unicode-range: U+0100-02BA, U+02BD-02C5, U+02C7-02CC, U+02CE-02D7, U+02DD-02FF, U+0304, U+0308, U+0329, U+1D00-1DBF, U+1E00-1E9F, U+1EF2-1EFF, U+2020, U+20A0-20AB, U+20AD-20C0, U+2113, U+2C60-2C7F, U+A720-A7FF;
}
Just about every Noto font contains a duplicate of the latin and latin-ext glyph set, and so each has U+02D7. Any loader you prepare must take care not to pull in every available font merely because the ranges in your data overlap.
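The overlap is easy to check by hand. Below is a minimal Python sketch that parses a `unicode-range` value into numeric ranges and asks which families claim U+02D7. The function names are my own, the Noto Sans KR string is only the first few entries of the latin-ext block above, and wildcard ranges such as `U+4??` are not handled.

```python
def parse_unicode_range(value: str) -> list[tuple[int, int]]:
    """Parse a CSS unicode-range value into inclusive (start, end) pairs.

    Handles single codepoints (U+02D7) and ranges (U+0E01-0E5B) only.
    """
    ranges = []
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        body = token[2:]  # drop the leading "U+"
        start, _, end = body.partition("-")
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges


def covers(ranges, codepoint):
    """True when any (start, end) pair contains the codepoint."""
    return any(start <= codepoint <= end for start, end in ranges)


tables = {
    "Noto Sans Thai": parse_unicode_range(
        "U+02D7, U+0303, U+0331, U+0E01-0E5B, U+200C-200D, U+25CC"
    ),
    # First few entries of the latin-ext block only.
    "Noto Sans KR": parse_unicode_range(
        "U+0100-02BA, U+02BD-02C5, U+02C7-02CC, U+02CE-02D7, U+02DD-02FF"
    ),
}

print([family for family, ranges in tables.items() if covers(ranges, 0x02D7)])
# ['Noto Sans Thai', 'Noto Sans KR']
```

A loader that trusted these ranges verbatim would fetch both families for a single modifier letter.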
Deduplicating codepoint ranges
For every language you plan to support (and more if you care about Linear B), download the matching fonts and run a script like this against their static regular TTF files. FontTools in Python is the best casual font toolchain you could ask for.
import json
import sys

from fontTools.ttLib import TTFont


def get_font_data(font_path):
    font = TTFont(font_path)

    # Extract family name from the 'name' table (ID 1)
    family = ""
    for record in font['name'].names:
        if record.nameID == 1:
            family = record.toUnicode()
            break

    # Extract all codepoints from the best unicode cmap table.
    # getBestCmap() returns None when the font has no unicode cmap.
    cmap = font.getBestCmap() or {}
    codepoints = sorted(cmap.keys())
    if not codepoints:
        return {"ranges": [], "family": family}

    # Group codepoints into contiguous ranges
    ranges = []
    start = end = codepoints[0]
    for i in range(1, len(codepoints)):
        if codepoints[i] == end + 1:
            end = codepoints[i]
        else:
            ranges.append(f"{start:04X}-{end:04X}")
            start = end = codepoints[i]
    ranges.append(f"{start:04X}-{end:04X}")

    return {
        "ranges": ranges,
        "family": family
    }


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python ranges.py <path_to_ttf>")
        sys.exit(1)
    data = get_font_data(sys.argv[1])
    print(json.dumps(data, indent=4))
This script outputs the same structure as what I export in my library @levischuck/tiny-font-ranges. As described before, we see that inside each TTF file, the Latin set 0020-007E and plenty more are embedded. If you need the performance, you can reduce each font file after the dedupe stage with a technique like the one I show in Converting Fonts to WOFF2 for the web, though I find this step to be unnecessary at my scale.
python3 ranges.py Noto_Sans_Symbols/static/NotoSansSymbols-Regular.ttf
{
    "ranges": [
        "0000-0000",
        "000D-000D",
        "0020-007E",
        ...
        "1F54F-1F54F",
        "1F610-1F610",
        "1F700-1F773"
    ],
    "family": "Noto Sans Symbols"
}
Once you have a set of fonts mapped with data like so, it is relatively straightforward to find overlaps and to restructure all ranges without said overlaps.
export const SYMBOLS1: { ranges: string[]; family: string } = {
  "ranges": [
    "20DD-20E0",
    // ...
    "1F546-1F549",
    "1F54F-1F54F",
    "1F700-1F773"
  ],
  "family": "Noto Sans Symbols"
};
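One way to get from the raw script output to overlap-free data is sketched below in Python. It is not necessarily how my library's data was produced. It assumes the fonts arrive in priority order, so an earlier font keeps a contested codepoint and later fonts lose it. The family name "Base Latin" and the trimmed range lists are placeholders.

```python
def parse(ranges):
    """Expand ["0020-007E", ...] into a set of codepoints."""
    cps = set()
    for item in ranges:
        start, end = (int(part, 16) for part in item.split("-"))
        cps.update(range(start, end + 1))
    return cps


def to_ranges(codepoints):
    """Collapse codepoints back into contiguous "START-END" strings."""
    out = []
    start = end = None
    for cp in sorted(codepoints):
        if end is not None and cp == end + 1:
            end = cp
            continue
        if start is not None:
            out.append(f"{start:04X}-{end:04X}")
        start = end = cp
    if start is not None:
        out.append(f"{start:04X}-{end:04X}")
    return out


def dedupe(fonts):
    """Earlier fonts keep their codepoints; later fonts lose overlaps."""
    claimed = set()
    result = []
    for font in fonts:
        own = parse(font["ranges"]) - claimed
        claimed |= own
        result.append({"family": font["family"], "ranges": to_ranges(own)})
    return result


fonts = [
    {"family": "Base Latin", "ranges": ["0020-007E"]},  # placeholder family
    {"family": "Noto Sans Symbols", "ranges": ["0020-007E", "20DD-20E0", "1F700-1F773"]},
]
print(dedupe(fonts)[1])
# {'family': 'Noto Sans Symbols', 'ranges': ['20DD-20E0', '1F700-1F773']}
```

Expanding every range into a set is wasteful for large CJK fonts, but it keeps the logic easy to follow for a one-off preprocessing step.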
Removing overlaps introduces a sharp edge. Suppose you intend to support only a proper (strict) subset of these fonts, and all of the following hold:
- A codepoint (like ⪔) appears in your data but is not covered by the ranges of the fonts in your subset.
- The mapping data was deduplicated against the full set of fonts, so the codepoint is recorded only under a font you left out.
- The glyph technically exists in a font that is in your subset, but its entry was dropped there because it overlapped with the font you left out.
Then you may find that the glyph (e.g. ⪔) sometimes renders and sometimes doesn't, depending on whether another font that happens to contain that glyph is pulled in by coincidence.
Determining which fonts to fetch and use
Once all your fonts are mapped, the ranges they support are recorded, and a way to draw text with multiple fonts (left as an exercise to the reader) is accessible, all that remains is looping over the input codepoints to find a matching font.
import { scanTextForFontRanges, WITHOUT_LATIN } from "@levischuck/tiny-font-ranges";

const extra = scanTextForFontRanges("(╯°□°)╯︵ ┻━┻", WITHOUT_LATIN);
// => [ 'Noto Sans JP', 'Noto Sans KR' ]
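For readers outside of TypeScript, the loop itself might look something like this in Python. This is a naive sketch and not the library's implementation; the function name, the `FONTS` table, and its family names are invented for the example, and it assumes the ranges were already deduplicated.

```python
def scan_text_for_families(text, fonts):
    """Return the families (in first-seen order) needed to draw `text`."""
    parsed = [
        (
            font["family"],
            [tuple(int(part, 16) for part in r.split("-")) for r in font["ranges"]],
        )
        for font in fonts
    ]
    needed = []
    for ch in text:
        cp = ord(ch)
        for family, ranges in parsed:
            if any(start <= cp <= end for start, end in ranges):
                if family not in needed:
                    needed.append(family)
                break  # deduplicated data: at most one family per codepoint
    return needed


# Hypothetical, already-deduplicated mapping data.
FONTS = [
    {"family": "Box Font", "ranges": ["2500-257F"]},
    {"family": "Thai Font", "ranges": ["0E01-0E5B"]},
]

print(scan_text_for_families("┻━┻ สวัสดี", FONTS))
# ['Box Font', 'Thai Font']
```

Codepoints that no font claims, like the space above, are simply skipped and left to whatever base font is always loaded.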
From there, it's a matter of importing the fonts by name and adding them to the font-family at the root of the DOM, if you have one 😉.
At the time of publication, the scan algorithm in @levischuck/tiny-font-ranges is simple to understand but naïve. A faster lookup structure would be a Trie with a state on which leaves have already been visited.
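I have not built the Trie, but a simpler alternative gives a feel for the speedup: since deduplicated ranges never overlap, they can be flattened into one sorted list and searched with binary search. The Python sketch below shows that technique, not the Trie described above, and all names in it are hypothetical.

```python
from bisect import bisect_right


def build_index(fonts):
    """Flatten non-overlapping ranges into a sorted (start, end, family) list."""
    rows = sorted(
        (int(start, 16), int(end, 16), font["family"])
        for font in fonts
        for start, end in (r.split("-") for r in font["ranges"])
    )
    starts = [row[0] for row in rows]
    return starts, rows


def lookup(index, codepoint):
    """Return the family that owns `codepoint`, or None."""
    starts, rows = index
    # Find the last range whose start is <= codepoint.
    i = bisect_right(starts, codepoint) - 1
    if i >= 0 and rows[i][1] >= codepoint:
        return rows[i][2]
    return None


index = build_index([
    {"family": "Box Font", "ranges": ["2500-257F"]},
    {"family": "Thai Font", "ranges": ["0E01-0E5B"]},
])
print(lookup(index, 0x253B))  # Box Font
print(lookup(index, 0x0041))  # None
```

Each lookup costs O(log r) for r ranges, and the index only needs to be rebuilt when the font pool changes.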
Footnotes
- Chrome (and Chrome derivatives like Edge) makes up 83% of the market share, while Safari takes nearly 15%. Firefox counts too... at 2.2% (source: statcounter).
- For geo-political reasons, Segoe UI Emoji, the emoji font available on Windows, does not include country flags. If you can see the flag on this blog from Windows, it is because I added Noto Color Emoji.
- Rendering emojis next to ASCII sequences in Pillow is terribly frustrating. Once you experience this frustration, you'll appreciate an emoji once in a while (when and only when it is written by a person, not by ChatGPT).
- This algorithm is O(n^2), which isn't ideal for anything in real-time. It is sufficient for local execution when adding more fonts to the pool, though.