Font Range Detection with Noto

Products, display names, and user-generated content rarely fit in the 94 displayable characters of the ASCII table when your audience is worldwide. While you've migrated storage of such content to UTF-8 (or Punycode, as a hack), actually displaying it is another matter.

Support for multiple fonts

Microsoft, Apple, and Google implement font substitution in their text rendering APIs so that when you write "Hello 世界" with only Arial.ttf loaded, the characters Arial lacks are drawn from another font as if nothing unusual occurred. The styling of the fonts may clash, but that is better than seeing tofu □□□□□ □□□ □□□□□□□!

These three also maintain the most impactful software in history: the web browser.^[1] As long as the encoding is understood on both sides, you can be sure that glyphs will render with a fallback when the WOFF used on the page lacks a grapheme (e => 0x65) or grapheme cluster (🇺🇸 => 0xf09f87ba + 0xf09f87b8).^[2] Whether a customer's name is bolded in a div or drawn as text in a canvas, font substitution is there to preserve the consistent delivery of symbols across languages, cultures, and time periods.
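To make the byte sequences above concrete, here is a small Python sketch that lists each code point in a string next to its UTF-8 encoding. The `describe` helper is a name made up for this illustration.

```python
def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, UTF-8 bytes as hex) for every code point in text."""
    return [(f"U+{ord(ch):04X}", ch.encode("utf-8").hex()) for ch in text]


# A single grapheme backed by one code point.
print(describe("e"))
# => [('U+0065', '65')]

# The US flag is a grapheme cluster of two regional indicator code points.
print(describe("\U0001F1FA\U0001F1F8"))
# => [('U+1F1FA', 'f09f87ba'), ('U+1F1F8', 'f09f87b8')]
```

The flag is one visible symbol but two code points, so a font has to cover both (and know how to combine them) before the cluster renders.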

What if a symbol is not available in the primary font and you have an ideal second font for other glyphs? Merging all possible fonts to cover every grapheme that could be seen is not reasonable. Browsers give the option to style multiple font families at once for this reason. Coupled with unicode-range, the browser will only download the font files necessary to render the page and fall back to system font substitution when the available fonts cannot render the remaining graphemes. Noto Color Emoji is split into nine or so files on the web. When the US flag grapheme cluster is rendered, the browser does not need the font data for a frog or a coffee.

https://fonts.googleapis.com/css2?family=Noto+Color+Emoji (css)

@font-face {
  font-family: 'Noto Color Emoji';
  font-style: normal;
  font-weight: 400;
  font-display: swap;
  src: url(https://fonts.gstatic.com/s/notocoloremoji/v38/Yq6P-KqIXTD0t4D9z1ESnKM3-HpFabsE4tq3luCC7p-aXxcn.1.woff2) format('woff2');
  unicode-range: U+200d, U+2620, U+26a7, U+fe0f, U+1f308, U+1f38c, U+1f3c1, U+1f3f3-1f3f4, U+1f6a9, U+e0062-e0063, U+e0065, U+e0067, U+e006c, U+e006e, U+e0073-e0074, U+e0077, U+e007f;
}
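Outside a browser you can apply the same decision yourself. The following is a minimal sketch, assuming you have the unicode-range value as a string; `parse_unicode_range` and `needs_file` are hypothetical helper names, and the sample value is only the first part of the descriptor shown above.

```python
def parse_unicode_range(value: str) -> list[tuple[int, int]]:
    """Parse a CSS unicode-range value into inclusive (start, end) pairs."""
    spans = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        if not part.upper().startswith("U+"):
            raise ValueError(f"not a unicode-range token: {part!r}")
        body = part[2:]
        if "?" in body:
            # Wildcard form, e.g. U+4?? covers U+400 through U+4FF.
            start, end = body.replace("?", "0"), body.replace("?", "F")
        elif "-" in body:
            start, end = body.split("-", 1)
        else:
            start = end = body
        spans.append((int(start, 16), int(end, 16)))
    return spans


def needs_file(text: str, spans: list[tuple[int, int]]) -> bool:
    """True when any code point in text falls inside one of the spans."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in spans)


# Abbreviated copy of the unicode-range value from the CSS above.
FLAG_SUBSET = parse_unicode_range(
    "U+200d, U+2620, U+26a7, U+fe0f, U+1f308, U+1f38c, U+1f3c1, U+1f3f3-1f3f4, U+1f6a9"
)

print(needs_file("\U0001F3F4", FLAG_SUBSET))  # black flag => True
print(needs_file("\U0001F438", FLAG_SUBSET))  # frog => False
```

A page that only shows the frog never triggers a download of this particular file.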

Font splitting and substitution allow native and web applications to use memory and bandwidth resources more efficiently. For this to work, the font engine necessarily needs to support multiple fonts at once to correctly implement a consistent baseline, along with many other details that only font authors and font engine creators have an interest in.

Without Font Substitution

Rendering multilingual text gets a lot harder when you’re outside the safety net of native OS rendered applications and web browsers. Think game engines, embedded systems, or serverless wasm workloads like @levischuck/render-html. If you're lucky like me with satori, the tools you use will natively support multiple fonts. If you're unlucky and the tool you use merely wraps the FreeType API like the Pillow ImageFont module, then you'll have to segment text and perform baseline alignment and line wrapping yourself, or integrate with a library like libraqm.

Assuming that you have a solution to sequence multiple fonts while looking great 🌟^[3], the next issue is knowing which fonts to load into memory before the expensive task of fetching fonts from disk or over the network.

Take a little inspiration from the CSS @font-face at-rule and its unicode-range property and we can design a small schema that can help us identify which fonts to load by iterating over each codepoint.

Here's what Google serves for Noto Sans Thai:

https://fonts.googleapis.com/css2?family=Noto+Sans+Thai:wght@100..900&display=swap (css)

@font-face {
  font-family: 'Noto Sans Thai';
  font-style: normal;
  font-weight: 100 900;
  font-stretch: 100%;
  font-display: swap;
  src: url(https://fonts.gstatic.com/s/notosansthai/v29/iJWQBXeUZi_OHPqn4wq6hQ2_hbJ1xyN9wd43SofNWcdfKI2hX2g.woff2) format('woff2');
  unicode-range: U+02D7, U+0303, U+0331, U+0E01-0E5B, U+200C-200D, U+25CC;
}

You'll see several individual codepoints and a few ranges in these CSS responses. However, repeating this verbatim with many fonts loaded may surprise you later. U+02D7 is "Modifier Letter Minus Sign." How many other fonts do you suppose have that exact symbol?

https://fonts.googleapis.com/css2?family=Noto+Sans+KR:wght@100..900&display=swap (css)

/* latin-ext */
@font-face {
  font-family: 'Noto Sans KR';
  font-style: normal;
  font-weight: 100 900;
  font-display: swap;
  src: url(https://fonts.gstatic.com/s/notosanskr/v39/PbykFmXiEBPT4ITbgNA5CgmG337t0JM.woff2) format('woff2');
  unicode-range: U+0100-02BA, U+02BD-02C5, U+02C7-02CC, U+02CE-02D7, U+02DD-02FF, U+0304, U+0308, U+0329, U+1D00-1DBF, U+1E00-1E9F, U+1EF2-1EFF, U+2020, U+20A0-20AB, U+20AD-20C0, U+2113, U+2C60-2C7F, U+A720-A7FF;
}

Just about every Noto font contains a duplicate of the latin and latin-ext glyph sets, and so each has U+02D7. Any loader you prepare must take care not to pull in every available font merely because your range data contains overlapping codepoints.
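A short Python sketch shows how the overlap bites a naive loader. The Thai ranges are copied from the CSS above; the Korean latin-ext ranges are abbreviated to the first few entries. `naive_fonts_for` is a hypothetical name, not part of any library.

```python
# Ranges copied from the two CSS responses above (KR latin-ext abbreviated).
COVERAGE = {
    "Noto Sans Thai": [
        (0x02D7, 0x02D7), (0x0303, 0x0303), (0x0331, 0x0331),
        (0x0E01, 0x0E5B), (0x200C, 0x200D), (0x25CC, 0x25CC),
    ],
    "Noto Sans KR": [
        (0x0100, 0x02BA), (0x02BD, 0x02C5), (0x02C7, 0x02CC),
        (0x02CE, 0x02D7), (0x02DD, 0x02FF),
    ],
}


def naive_fonts_for(text: str) -> list[str]:
    """Select every font whose ranges contain at least one code point of text."""
    return sorted(
        family
        for family, spans in COVERAGE.items()
        if any(lo <= ord(ch) <= hi for ch in text for lo, hi in spans)
    )


# Thai text selects only the Thai font, as hoped.
print(naive_fonts_for("\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"))
# => ['Noto Sans Thai']

# A lone U+02D7 selects both fonts, because both declare it.
print(naive_fonts_for("\u02d7"))
# => ['Noto Sans KR', 'Noto Sans Thai']
```

With dozens of fonts registered, one shared codepoint is enough to fetch all of them.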

Deduplicating codepoint ranges

For every language you plan to support (and more if you care about Linear B), download the matching fonts and run a script like this against their static regular TTF files. fontTools for Python is the best casual font toolchain you could ask for.

ranges.py (python)

import json
import sys

from fontTools.ttLib import TTFont


def get_font_data(font_path):
    font = TTFont(font_path)

    # Extract the family name from the 'name' table (nameID 1 is the font family)
    family = ""
    for record in font["name"].names:
        if record.nameID == 1:
            family = record.toUnicode()
            break

    # Extract all codepoints from the best Unicode cmap subtable.
    # getBestCmap() returns None when the font has no Unicode cmap.
    cmap = font.getBestCmap() or {}
    codepoints = sorted(cmap.keys())
    if not codepoints:
        return {"ranges": [], "family": family}

    # Group codepoints into contiguous ranges; a lone codepoint becomes "XXXX-XXXX"
    ranges = []
    start = end = codepoints[0]
    for codepoint in codepoints[1:]:
        if codepoint == end + 1:
            end = codepoint
        else:
            ranges.append(f"{start:04X}-{end:04X}")
            start = end = codepoint
    ranges.append(f"{start:04X}-{end:04X}")

    return {"ranges": ranges, "family": family}


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python3 ranges.py <path_to_ttf>")
        sys.exit(1)
    data = get_font_data(sys.argv[1])
    print(json.dumps(data, indent=4))

This script outputs the same structure as what I export in my library @levischuck/tiny-font-ranges. As described before, we see that inside each TTF file, the Latin set 0020-007E and plenty more are embedded. If you need the performance, you can reduce each font file after the dedupe stage with a technique like the one I show in Converting Fonts to WOFF2 for the web, though I find this step unnecessary at my scale.

example (bash)

python3 ranges.py Noto_Sans_Symbols/static/NotoSansSymbols-Regular.ttf
{
    "ranges": [
        "0000-0000",
        "000D-000D",
        "0020-007E",
        ...
        "1F54F-1F54F",
        "1F610-1F610",
        "1F700-1F773"
    ],
    "family": "Noto Sans Symbols"
}

Once you have a set of fonts mapped with data like so, it is relatively straightforward to find overlaps and to restructure all ranges without said overlaps.^[4]
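One way to remove the overlaps is sketched below in Python. It is a set-based take that assumes the fonts are listed in priority order, so the first font to claim a codepoint keeps it; it is not necessarily the algorithm the footnote refers to. The `dedupe` function and the sample input are hypothetical, and the input is not a real cmap dump.

```python
def expand(ranges: list[str]) -> set[int]:
    """Turn ["0020-007E", ...] into a set of integer codepoints."""
    codepoints: set[int] = set()
    for item in ranges:
        lo, hi = item.split("-")
        codepoints.update(range(int(lo, 16), int(hi, 16) + 1))
    return codepoints


def collapse(codepoints: set[int]) -> list[str]:
    """Group codepoints back into contiguous "XXXX-XXXX" range strings."""
    ordered = sorted(codepoints)
    if not ordered:
        return []
    ranges = []
    start = end = ordered[0]
    for codepoint in ordered[1:]:
        if codepoint == end + 1:
            end = codepoint
        else:
            ranges.append(f"{start:04X}-{end:04X}")
            start = end = codepoint
    ranges.append(f"{start:04X}-{end:04X}")
    return ranges


def dedupe(fonts: list[dict]) -> list[dict]:
    """Keep each codepoint only in the first font (by list order) that has it."""
    claimed: set[int] = set()
    result = []
    for font in fonts:
        own = expand(font["ranges"]) - claimed
        claimed |= own
        result.append({"ranges": collapse(own), "family": font["family"]})
    return result


# Illustrative input only.
fonts = [
    {"ranges": ["0020-007E", "02D7-02D7"], "family": "Noto Sans"},
    {"ranges": ["0020-007E", "02D7-02D7", "0E01-0E5B"], "family": "Noto Sans Thai"},
]
print(dedupe(fonts))
```

Expanding every range into a set is memory-hungry for large CJK fonts, but it keeps the logic easy to audit when run once offline.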

symbol.ts (typescript)

export const SYMBOLS1: { ranges: string[]; family: string } = {
  "ranges": [
    "20DD-20E0",
    // ...
    "1F546-1F549",
    "1F54F-1F54F",
    "1F700-1F773"
  ],
  "family": "Noto Sans Symbols"
};

Warning

Removing overlaps introduces a sharp edge. Suppose you intend to support only a strict subset of these fonts, and all of these conditions hold:

A codepoint (like ⪔) appears in your data but is not covered by the mapped ranges of the fonts in your subset
It was covered in the full set of ranges, so its overlapping copies were omitted from the mapping data
It technically exists in a font that is in your subset, but that font's mapping omits it because it overlapped with a font you removed

Then you may find that the glyph (e.g. ⪔) sometimes renders and sometimes doesn't, depending on whether another font that happens to contain that glyph is pulled in by coincidence.
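The sharp edge is easier to see in a toy model. In this Python sketch, "Font A" and "Font B" are hypothetical fonts: both files contain ⪔, but the deduplicated mapping assigns it to Font A only, and the application ships only Font B.

```python
SYMBOL = ord("⪔")
THAI_KO_KAI = 0x0E01

# What the deduplicated mapping says each font is responsible for.
DEDUPED = {"Font A": {SYMBOL}, "Font B": {THAI_KO_KAI}}
# What each font file actually contains.
ON_DISK = {"Font A": {SYMBOL}, "Font B": {SYMBOL, THAI_KO_KAI}}


def fonts_to_load(text: str, enabled: list[str]) -> set[str]:
    """Pick the enabled fonts whose deduplicated mapping matches the text."""
    return {f for f in enabled if any(ord(ch) in DEDUPED[f] for ch in text)}


def renders(text: str, enabled: list[str]) -> bool:
    """True when every code point has a glyph in one of the loaded fonts."""
    loaded = fonts_to_load(text, enabled)
    available = set().union(*(ON_DISK[f] for f in loaded))
    return all(ord(ch) in available for ch in text)


# Only Font B is shipped. Alone, the symbol selects no font and turns to tofu.
print(renders("⪔", ["Font B"]))  # => False
# Next to Thai text, Font B gets loaded and the symbol renders by coincidence.
print(renders("⪔\u0e01", ["Font B"]))  # => True
```

If you ship a subset, regenerating the deduplicated ranges from only the fonts you ship avoids this inconsistency.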

Determining which fonts to fetch and use

Once all your fonts are mapped, the ranges they support are recorded, and a way to draw text with multiple fonts (left as an exercise to the reader) is accessible, all that remains is looping over the input codepoints to find a matching font.

example.ts (typescript)

import { scanTextForFontRanges, WITHOUT_LATIN } from "@levischuck/tiny-font-ranges";
const extra = scanTextForFontRanges("(╯°□°）╯︵ ┻━┻", WITHOUT_LATIN);
// => [ 'Noto Sans JP', 'Noto Sans KR' ]

From there, it's a matter of importing the fonts by name and adding them to the font-family at the root of the DOM, if you have one 😉.

Note

At the time of publication, the scan algorithm in @levischuck/tiny-font-ranges is simple to understand but naïve. A faster lookup structure would be a trie that keeps state on which leaves have already been visited.
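As a point of comparison, here is a Python sketch of a different speed-up than the trie: binary search over sorted ranges. It is not how @levischuck/tiny-font-ranges works, it relies on the ranges having been deduplicated so that no two overlap, and `build_index` and `scan` are hypothetical names. The sample ranges are taken from the excerpts in this post.

```python
from bisect import bisect_right


def build_index(fonts: list[dict]) -> list[tuple[int, int, str]]:
    """Flatten deduplicated fonts into a sorted list of (start, end, family)."""
    index = []
    for font in fonts:
        for item in font["ranges"]:
            lo, hi = item.split("-")
            index.append((int(lo, 16), int(hi, 16), font["family"]))
    index.sort()
    return index


def scan(text: str, index: list[tuple[int, int, str]]) -> list[str]:
    """Return the families needed for text, in order of first appearance."""
    starts = [start for start, _, _ in index]
    found: list[str] = []
    for ch in text:
        codepoint = ord(ch)
        # Rightmost range whose start is <= codepoint, if any.
        position = bisect_right(starts, codepoint) - 1
        if position < 0:
            continue
        _, end, family = index[position]
        if codepoint <= end and family not in found:
            found.append(family)
    return found


fonts = [
    {"ranges": ["0E01-0E5B"], "family": "Noto Sans Thai"},
    {"ranges": ["20DD-20E0", "1F700-1F773"], "family": "Noto Sans Symbols"},
]
index = build_index(fonts)
print(scan("\u0e01 \U0001F700", index))
# => ['Noto Sans Thai', 'Noto Sans Symbols']
```

Each lookup costs O(log r) for r ranges, so scanning stays cheap even with many fonts registered.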

Footnotes

^[1] Chrome (and Chrome derivatives like Edge) makes up 83% of the market share, while Safari takes nearly 15%. Firefox counts too... at 2.2% (source: statcounter).
^[2] For geo-political reasons, Segoe UI Emoji, the emoji font available on Windows, does not include country flags. You can see the flag on this blog with Windows because I added Noto Color Emoji.
^[3] Rendering emojis next to ASCII sequences in Pillow is terribly frustrating. Once you experience this frustration, you'll appreciate an emoji once in a while (when and only when it is written by a person, not by ChatGPT).
^[4] This algorithm is O(n^2), which isn't ideal for anything in real-time. It is sufficient for local execution when adding more fonts to the pool, though.
