Arabic Tokenizers Leaderboard
What is the best tokenizer for Arabic?
The Total Number of Tokens in this leaderboard is the number of tokens each tokenizer produces on the Arabic section of the rasaif-translations dataset. This dataset was chosen because it represents Arabic Fusha text in a small and concentrated manner.
A tokenizer that ranks well in this leaderboard should tokenize Arabic efficiently across its different dialects and forms.
Updates/Notes:
- New datasets have been added to the evaluation (e.g. arabic-quotes, Moroccan_Arabic_Wikipedia_20230101_nobots).
Columns:
- Fertility Score is calculated by dividing the total number of tokens by the total number of words in the dataset (lower is better).
- Tokenize Tashkeel is an indicator of whether the tokenizer maintains the tashkeel when tokenizing or not (✅ for yes, ❌ for no).
- Vocab Size is the total number of tokens in the tokenizer's vocabulary (e.g. 10000 tokens).
- Tokenizer Class is the class of the tokenizer (e.g. BertTokenizer or GPT2Tokenizer).
- Total Number of Tokens is the total number of tokens in the dataset after tokenization (lower is better).
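As a rough illustration of the two computed columns, the sketch below shows one way the Fertility Score and the Tokenize Tashkeel flag could be derived. It is not the leaderboard's actual code. The function names (`fertility_score`, `keeps_tashkeel`, `char_tokenize`, `strip_tashkeel`), the whitespace-based word count, and the encode-then-decode round-trip check are all assumptions made for this example.

```python
from typing import Callable, Iterable, List

# Arabic diacritics (tashkeel): fathatan through sukun, U+064B..U+0652.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def fertility_score(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Total number of tokens divided by total number of words (lower is better)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        # Assumption: a "word" is a whitespace-separated chunk of text.
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("dataset contains no words")
    return total_tokens / total_words


def keeps_tashkeel(roundtrip: Callable[[str], str], sample: str) -> bool:
    """True if every diacritic in `sample` survives an encode-then-decode round trip.

    Assumption: this round-trip test is a stand-in for however the
    leaderboard actually decides the Tokenize Tashkeel column.
    """
    expected = [ch for ch in sample if ch in TASHKEEL]
    actual = [ch for ch in roundtrip(sample) if ch in TASHKEEL]
    return bool(expected) and expected == actual


# Hypothetical toy helpers, used only to exercise the functions above.
def char_tokenize(text: str) -> List[str]:
    """Character-level tokenizer that drops whitespace."""
    return [ch for ch in text if not ch.isspace()]


def strip_tashkeel(text: str) -> str:
    """Simulates a tokenizer whose normalizer removes diacritics."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


# "kataba" written with a fatha on each of its three letters.
SAMPLE = "\u0643\u064E\u062A\u064E\u0628\u064E"
toy_fertility = fertility_score(["ab cd", "efg"], char_tokenize)  # 7 tokens / 3 words
```

With a Hugging Face tokenizer `tok`, `tokenize` could plausibly be `tok.tokenize` and `roundtrip` could be `lambda s: tok.decode(tok.encode(s, add_special_tokens=False))`. Whether that matches the leaderboard's own procedure is not stated in the source.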
Note: Press Refresh to get the latest data available in the leaderboard (the initial state may be misleading).
| Tokenize Tashkeel | Models | Fertility Score | Total Number of Tokens | Vocab Size | Tokenizer Class |
|---|---|---|---|---|---|
| โ | FreedomIntelligence/AceGPT-v1.5-13B-Chat | 1.614 | 3485116 | 200000 | PreTrainedTokenizerFast |
Models Fertility Score (Lower is better)