Holy fuck, the GPT-2 tokenizer, which is used by nearly all AI models that deal with text or prompting, is really, really bad at non-English languages. In English, even uncommon words like “amplification” or “infographic” take up a single token, and words like “socialization” and “enterprising” take two; in Polish, one token usually covers just two or three letters, and there are common six-letter words that take four tokens. This matters because more tokens mean slower text generation, higher pricing, and shorter supported prompt lengths.
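You can check these numbers yourself. Below is a minimal sketch using the tiktoken library’s “gpt2” encoding; the word list is mine, purely illustrative, and counts can differ by one depending on whether a word is encoded with a leading space, since the tokenizer handles mid-sentence words differently.

```python
# Minimal sketch: compare GPT-2 token counts for English vs. Polish words.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Illustrative word list, not from the original post.
words = ["amplification", "infographic", "socialization", "enterprising",
         "często", "właśnie", "dlatego"]

for word in words:
    token_ids = enc.encode(word)
    # decode_single_token_bytes shows the raw byte chunk behind each token;
    # Polish words often split mid-character, so we display bytes, not text.
    pieces = [enc.decode_single_token_bytes(t) for t in token_ids]
    print(f"{word}: {len(token_ids)} tokens -> {pieces}")
```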
