Holy fuck, the GPT-2 tokenizer, which is used by nearly all AI models that deal with text or prompting, is really, really bad at non-English languages. In English, even uncommon words like “amplification” or “infographic” take up a single token, and words like “socialization” and “enterprising” take two; in Polish, one token usually covers just two or three letters, and there are common six-letter words that take four tokens. This matters because more tokens mean slower text generation, higher pricing, and shorter supported prompt lengths.
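You can check these numbers yourself. Below is a minimal sketch using the tiktoken library’s “gpt2” encoding; the word list is mine, purely illustrative, and counts can differ by one depending on whether a word is encoded with a leading space, since the tokenizer handles mid-sentence words differently.

```python
# Minimal sketch: compare GPT-2 token counts for English vs. Polish words.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Illustrative word list, not from the original post.
words = ["amplification", "infographic", "socialization", "enterprising",
         "często", "właśnie", "dlatego"]

for word in words:
    token_ids = enc.encode(word)
    # decode_single_token_bytes shows the raw byte chunk behind each token;
    # Polish words often split mid-character, so we display bytes, not text.
    pieces = [enc.decode_single_token_bytes(t) for t in token_ids]
    print(f"{word}: {len(token_ids)} tokens -> {pieces}")
```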
