Arabic Tokenizers Leaderboard
What is the best tokenizer for Arabic?
The Total Number of Tokens in this leaderboard is based on the total number of tokens obtained from the Arabic section of the rasaif-translations dataset (this dataset was chosen because it represents Arabic Fusha text in a small and concentrated manner).
A tokenizer that ranks well on this leaderboard should be efficient at parsing Arabic across its different dialects and forms.
Updates/Notes:
- New datasets have been added to the evaluation (e.g. arabic-quotes, Moroccan_Arabic_Wikipedia_20230101_nobots).
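For illustration, here is a minimal sketch of how such a total token count could be computed with the Hugging Face libraries. The dataset repository id, split name, and the `arabic` column name are assumptions made for this example, not necessarily the leaderboard's exact setup.

```python
# Sketch: count the total number of tokens a tokenizer produces over the
# Arabic side of the rasaif-translations dataset. The repo id, split, and
# "arabic" column name below are assumptions for illustration only.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("MohamedRashad/rasaif-translations", split="train")  # assumed repo id/split
tokenizer = AutoTokenizer.from_pretrained("FreedomIntelligence/AceGPT-v1.5-13B-Chat")

total_tokens = sum(
    len(tokenizer(text, add_special_tokens=False)["input_ids"])
    for text in dataset["arabic"]  # assumed column name
)
print(f"Total Number of Tokens: {total_tokens}")
```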
Fertility Score is calculated by dividing the total number of tokens by the total number of words in the dataset (lower is better).
Tokenize Tashkeel is an indicator of whether the tokenizer maintains the tashkeel when tokenizing or not (✅ for yes, ❌ for no).
Vocab Size is the total number of tokens in the tokenizer's vocabulary (e.g. 10000 tokens).
Tokenizer Class is the class of the tokenizer (e.g. BertTokenizer or GPT2Tokenizer).
Total Number of Tokens is the total number of tokens in the dataset after tokenization (lower is better).
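As a rough sketch of how the Fertility Score and the Tokenize Tashkeel indicator could be computed: the word count below is a simple whitespace split and the tashkeel check is an encode/decode round-trip; both are illustrative assumptions, not necessarily the leaderboard's exact method.

```python
# Sketch of the two metrics described above, assuming whitespace word
# splitting and a decode round-trip as the tashkeel check (illustrative
# choices, not necessarily the leaderboard's exact implementation).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FreedomIntelligence/AceGPT-v1.5-13B-Chat")

texts = ["قُلْ هُوَ اللَّهُ أَحَدٌ"]  # any Arabic text containing tashkeel

total_tokens = sum(len(tokenizer(t, add_special_tokens=False)["input_ids"]) for t in texts)
total_words = sum(len(t.split()) for t in texts)
fertility_score = total_tokens / total_words  # lower is better

# Tashkeel check: encode then decode, and see whether diacritics survive.
tashkeel = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")  # fathatan .. sukun
decoded = tokenizer.decode(tokenizer(texts[0], add_special_tokens=False)["input_ids"])
keeps_tashkeel = any(ch in tashkeel for ch in decoded)

print(f"Fertility Score: {fertility_score:.3f}")
print("Tokenize Tashkeel:", "✅" if keeps_tashkeel else "❌")
```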
Note: Press Refresh to get the latest data available in the leaderboard (the initial state may be misleading).
| Tokenize Tashkeel | Models | Fertility Score | Total Number of Tokens | Vocab Size | Tokenizer Class |
|---|---|---|---|---|---|
| โ | FreedomIntelligence/AceGPT-v1.5-13B-Chat | 1.614 | 3485116 | 200000 | PreTrainedTokenizerFast |
Chart: Models Fertility Score (lower is better), with a model selector.