Holy fuck, the GPT-2 tokenizer, which GPT-3 and nearly every other model you might prompt with text reuse, is really, really bad with non-English languages. In English, even uncommon words like “amplification” or “infographic” take up one token, while words like “socialization” and “enterprising” take two; in Polish, one token usually covers only two or three letters, and there are common 6-letter words that take 4 tokens. This matters because more tokens mean slower text generation, higher pricing, and shorter effective prompt lengths.
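If you want to check this yourself, here’s a minimal sketch using the tiktoken library (`pip install tiktoken`), which ships the GPT-2 BPE vocabulary. The Polish words below are my own picks of ordinary short words, not examples from any benchmark, and the printed counts are whatever the actual vocabulary produces:

```python
# Quick check of how GPT-2's BPE splits English vs. Polish words.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# English words from the post, plus a few common 6-letter Polish words
# (chosen here for illustration: "source", "east", "priest").
words = ["amplification", "infographic", "socialization", "enterprising",
         "źródło", "wschód", "ksiądz"]

for w in words:
    ids = enc.encode(w)
    pieces = [enc.decode([i]) for i in ids]  # the subword chunks themselves
    print(f"{w!r}: {len(ids)} token(s) -> {pieces}")
```

Running it makes the asymmetry obvious: the English words come out in one or two chunks, while the Polish ones get shredded into byte-pair fragments a couple of letters long.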
