Arabic Tokenizers Leaderboard

What is the best tokenizer for Arabic?

The Total Number of Tokens in this leaderboard is computed from the Arabic section of the rasaif-translations dataset (chosen because it represents Fusha Arabic text in a small, concentrated sample).

A tokenizer that scores well on this leaderboard should parse Arabic efficiently across its different dialects and forms.

Updates/Notes:

  1. New datasets have been added to the evaluation (e.g. arabic-quotes, Moroccan_Arabic_Wikipedia_20230101_nobots).
  2. Fertility Score is calculated by dividing the total number of tokens by the total number of words in the dataset (Lower is better).
  3. Tokenize Tashkeel is an indicator of whether the tokenizer maintains the tashkeel (diacritics) when tokenizing or not (✅ for yes, ❌ for no).
  4. Vocab Size is the total number of tokens in the tokenizer's vocabulary (e.g. 10000 tokens).
  5. Tokenizer Class is the class of the tokenizer (e.g. BertTokenizer or GPT2Tokenizer).
  6. Total Number of Tokens is the total number of tokens in the dataset after tokenization (Lower is better).
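The fertility score in point 2 can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code; `toy_tokenize` is a hypothetical stand-in for a real tokenizer, splitting each word into 2-character pieces.

```python
def fertility_score(tokenize, texts):
    """Total tokens divided by total whitespace-delimited words (lower is better)."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Hypothetical stand-in tokenizer: split each word into 2-character pieces.
def toy_tokenize(text):
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]

texts = ["السلام عليكم", "كيف الحال"]
score = fertility_score(toy_tokenize, texts)  # 11 tokens / 4 words = 2.75
```

A real evaluation would pass the dataset's texts and a Hugging Face tokenizer's `tokenize` method in place of the toy one.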

Note: Press Refresh to get the latest data available in the leaderboard (The initial state may be deceiving).
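The Tokenize Tashkeel indicator can be approximated with a round-trip check: encode the text, decode it back, and compare the surviving diacritics. A minimal sketch, assuming this check (the `keeps_tashkeel` helper and the toy encoders are hypothetical, not the leaderboard's implementation):

```python
# Arabic tashkeel combining marks: fathatan..sukun (U+064B–U+0652).
TASHKEEL = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def keeps_tashkeel(encode, decode, text):
    """True if every diacritic in `text` survives the encode/decode round trip."""
    round_trip = decode(encode(text))
    return [c for c in text if c in TASHKEEL] == [c for c in round_trip if c in TASHKEEL]

# An identity "tokenizer" trivially preserves tashkeel; one that strips
# diacritics does not.
encode_id, decode_id = list, "".join
strip_tashkeel = lambda toks: [c for c in toks if c not in TASHKEEL]

text = "بِسْمِ اللَّهِ"
keeps_tashkeel(encode_id, decode_id, text)                       # True  -> ✅
keeps_tashkeel(lambda t: strip_tashkeel(list(t)), "".join, text)  # False -> ❌
```

For a real tokenizer, `encode`/`decode` would be the tokenizer's own methods.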

| 👳 Tokenize Tashkeel | 📛 Models | 🪺 Fertility Score | ➕ Total Number of Tokens | 📘 Vocab Size | Tokenizer Class |
|---|---|---|---|---|---|
| ❌ | FreedomIntelligence/AceGPT-v1.5-13B-Chat | 1.614 | 3485116 | 200000 | PreTrainedTokenizerFast |

[Plot: Models Fertility Score (lower is better)]