Arabic Tokenizers Leaderboard

What is the best tokenizer for Arabic?

The Total Number of Tokens in this leaderboard is computed from the Arabic section of the rasaif-translations dataset (chosen because it represents Fusha Arabic text in a small, concentrated sample).

A tokenizer that scores well on this leaderboard should parse Arabic efficiently across its different dialects and forms.

Updates/Notes:

  1. New datasets have been added to the evaluation (e.g. arabic-quotes, Moroccan_Arabic_Wikipedia_20230101_nobots).
  2. Fertility Score is calculated by dividing the total number of tokens by the total number of words in the dataset (Lower is better).
  3. Tokenize Tashkeel is an indicator of whether the tokenizer maintains the tashkeel (diacritics) when tokenizing or not (✅ for yes, ❌ for no).
  4. Vocab Size is the total number of tokens in the tokenizer's vocabulary (e.g. 10000 tokens).
  5. Tokenizer Class is the class of the tokenizer (e.g. BertTokenizer or GPT2Tokenizer).
  6. Total Number of Tokens is the total number of tokens in the dataset after tokenization (Lower is better).
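The fertility score in point 2 can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code; `toy_tokenize` is a hypothetical stand-in for a real tokenizer, splitting each word into 2-character pieces.

```python
def fertility_score(tokenize, texts):
    """Total tokens divided by total whitespace-delimited words (lower is better)."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Hypothetical stand-in tokenizer: split each word into 2-character pieces.
def toy_tokenize(text):
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]

texts = ["السلام عليكم", "كيف الحال"]
score = fertility_score(toy_tokenize, texts)  # 11 tokens / 4 words = 2.75
```

A real evaluation would pass the dataset's texts and a Hugging Face tokenizer's `tokenize` method in place of the toy one.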

Note: Press Refresh to get the latest data available in the leaderboard (The initial state may be deceiving).
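The Tokenize Tashkeel indicator can be approximated with a round-trip check: encode the text, decode it back, and compare the surviving diacritics. A minimal sketch, assuming this check (the `keeps_tashkeel` helper and the toy encoders are hypothetical, not the leaderboard's implementation):

```python
# Arabic tashkeel combining marks: fathatan..sukun (U+064B–U+0652).
TASHKEEL = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def keeps_tashkeel(encode, decode, text):
    """True if every diacritic in `text` survives the encode/decode round trip."""
    round_trip = decode(encode(text))
    return [c for c in text if c in TASHKEEL] == [c for c in round_trip if c in TASHKEEL]

# An identity "tokenizer" trivially preserves tashkeel; one that strips
# diacritics does not.
encode_id, decode_id = list, "".join
strip_tashkeel = lambda toks: [c for c in toks if c not in TASHKEEL]

text = "بِسْمِ اللَّهِ"
keeps_tashkeel(encode_id, decode_id, text)                       # True  -> ✅
keeps_tashkeel(lambda t: strip_tashkeel(list(t)), "".join, text)  # False -> ❌
```

For a real tokenizer, `encode`/`decode` would be the tokenizer's own methods.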

| 👳 Tokenize Tashkeel | 📛 Models | 🪺 Fertility Score | ➕ Total Number of Tokens | 📘 Vocab Size | Tokenizer Class |
|---|---|---|---|---|---|
| ❌ | FreedomIntelligence/AceGPT-v1.5-13B-Chat | 1.614 | 3485116 | 200000 | PreTrainedTokenizerFast |

[Plot: Models Fertility Score (lower is better)]