Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking

1Koç University, KUIS AI Center, 2Koç University, Department of Computer Engineering


Abstract

Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations, with a special focus on Turkish. We conduct an in-depth analysis to evaluate the impact of training strategies, model choices, and data availability on the performance of LLMs designed for underrepresented languages. Our approach includes two methodologies: (i) adapting existing LLMs originally pretrained in English to understand Turkish, and (ii) developing a model from the ground up using Turkish pretraining data, both supplemented with supervised fine-tuning on a novel Turkish instruction-tuning dataset aimed at enhancing reasoning capabilities. The relative performance of these methods is evaluated through a new leaderboard for Turkish LLMs, featuring benchmarks that assess different reasoning and knowledge skills. Furthermore, we conduct experiments on data and model scaling, both during pretraining and fine-tuning, emphasizing the capacity for knowledge transfer across languages and addressing the challenge of catastrophic forgetting encountered during fine-tuning on a different language. Our goal is to offer a detailed guide for advancing the LLM framework in low-resource linguistic contexts, thereby making natural language processing (NLP) benefits more globally accessible.

Advancing Turkish Large Language Models (LLMs)

Our contributions are as follows:

  • Hamza LLMs. We release the Hamza LLM series: Hamza-small, Hamza-medium, Hamza-large, and Hamza-xlarge. Notably, Hamza-xlarge, with 1.3B parameters, is the first and largest open-source, scientifically vetted Turkish LLM trained on 300B tokens. We also introduce HamzaMistral and HamzaGPT2-xl, adapted from Mistral 7B and GPT2-xl, respectively.
  • Fine-Tuning vs. From-Scratch Training. Our analysis explores two distinct methodologies for developing Turkish LLMs in resource- and compute-constrained environments: (i) extending pretrained models (Mistral 7B and GPT2-xl) with Turkish-only data (referred to as HamzaMistral and HamzaGPT2-xl), and (ii) constructing a model from scratch, following the GPT2 approach. This paper thoroughly discusses the merits and drawbacks of these strategies.
  • Turkish LLM Benchmarking. We have curated new Turkish evaluation datasets, TruthfulQA-TR and ARC-TR, carefully validating every sample with multiple annotators, releasing the meticulously cleaned datasets, and launching a leaderboard to catalyze ongoing advancements in Turkish LLMs.
  • Open Source Community. Committing to open science principles, we make all source codes, model checkpoints, and datasets open-source and publicly accessible.

By detailing the development of specialized datasets and methodologies, we offer a comprehensive guide for building LLMs for languages with limited resources. Additionally, our contributions substantially enrich the field by providing critical resources that will support future research in Turkish language processing and the broader area of Natural Language Processing (NLP) for under-resourced languages.

Method 1: Further Training a Base Model

We aim to enhance base LLMs with Turkish linguistic capabilities. After a detailed evaluation based on perplexity, we selected an LLM that was not specifically trained on Turkish data during its initial pretraining phase. We then subjected it to further training on Turkish-only data, using the next-token prediction objective in an autoregressive manner. Essentially, this process can be regarded as a continuation of the LLM's pretraining phase, this time on a Turkish-specific portion of our dataset.
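
Concretely, continued pretraining minimizes the standard autoregressive negative log-likelihood over the Turkish corpus; written out (notation added here for clarity), for a token sequence x_1, ..., x_T the objective is

    \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{<t}\right),

where p_theta denotes the language model and x_{<t} the tokens preceding position t.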

Selecting Base Model. For the successful development of an advanced Turkish LLM at the 7-billion-parameter scale, choosing the most suitable base model is essential. To this end, we selected Mistral 7B as one of our base models, owing to its recent success across various tasks. Additionally, we opted for GPT2-xlarge, since our Hamza model is trained from scratch on the GPT2 architecture. This selection allows for a meaningful comparison between models trained from scratch and those initially pretrained in English and then continually pretrained within the same architectural setup.

Dataset. In order to inject Turkish into the Mistral and GPT-2 base LLMs, we followed a strategy of incremental continued pretraining on Turkish-specific segments of our dataset. Beginning with an initial 100MB of pure Turkish data, we progressively expanded the training corpus, culminating in a model trained on 5GB of data. This volume aligns closely with the dataset size used for GPT, ensuring a comprehensive and effective adaptation of the model to Turkish linguistic nuances.
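
As a rough sketch of how such fixed-size Turkish slices can be assembled (assuming the Hugging Face datasets library and the same CulturaX-TR source we use later for from-scratch pretraining; the slicing logic itself is illustrative, not our exact preprocessing):

    # Accumulate streamed Turkish documents until roughly `size_mb` MB of raw text is collected.
    from datasets import load_dataset

    def take_turkish_slice(size_mb: int) -> list[str]:
        stream = load_dataset("uonlp/CulturaX", "tr", split="train", streaming=True)
        budget = size_mb * 1024 * 1024
        docs = []
        for example in stream:
            docs.append(example["text"])
            budget -= len(example["text"].encode("utf-8"))
            if budget <= 0:
                break
        return docs

    stage_one = take_turkish_slice(100)   # the initial 100MB stage; later stages grow toward 5GB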

Training. As a continual learning approach, we conducted a series of experiments by progressively enlarging the pretraining corpus size and halting upon observing convergence. The models are initialized with the pretrained weights of Mistral-7B and GPT2-xlarge and then further trained on segments of our text corpus with a causal language modeling objective. Throughout our continued pretraining experiments, we employed LoRA and updated only the additional bottleneck adapter weights while freezing the original model weights, making training cost-efficient and avoiding catastrophic forgetting of the models' previous capabilities. For LoRA training, we used r=32 and alpha=32, along with a dropout rate of 0.05, applying LoRA exclusively to the projection layers. We used the AdamW optimizer and a cosine scheduler with a learning rate of 0.0001. Based on our experiments, we opted for a batch size of 1 and avoided gradient accumulation due to its significant impact on convergence. To simplify the execution of our experiments and ensure the reproducibility of our results, we used the LLaMA-Factory repository exclusively for our LoRA-based continued pretraining experiments.
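
A minimal sketch of this adapter setup with Hugging Face PEFT is shown below; the projection-layer module names follow Mistral's naming and are our assumption, and this is not the exact LLaMA-Factory configuration we used.

    # LoRA-based continued pretraining setup: only adapter weights are trained.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base_model = "mistralai/Mistral-7B-v0.1"          # or "gpt2-xl"
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

    lora_config = LoraConfig(
        r=32,                                          # rank, as reported above
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # projection layers (assumed names)
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()                 # original weights stay frozen

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)  # step count is an assumption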

Method 2: Pretraining from Scratch

In our second approach to developing a Turkish base LLM, we adopted the most straightforward method: training from scratch on Turkish-only datasets. We follow a framework similar to GPT2 in both training procedure and architectural settings; however, we use a pretraining corpus nearly double the size of GPT2's.



Pretraining Data. The construction of a robust LLM hinges on the aggregation and processing of high-quality text data. To develop Hamza, we used the Turkish split of CulturaX, which is built through a meticulous data-curation process that aggregates open-source data from mC4 and OSCAR. Our pretraining data consists of 128 parquet files of roughly 1.4GB each, totaling almost 179.2GB, and contains 129,486,207,634 (~130B) training tokens. Further details on data gathering, structure, and preparation can be found in the CulturaX work.
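
For instance, the downloaded parquet shards can be loaded into a single training split with the Hugging Face datasets library (the local path below is a placeholder; the column layout follows CulturaX):

    # Load the 128 local CulturaX-TR parquet shards as one training split.
    from datasets import load_dataset

    data_files = "culturax_tr/*.parquet"     # placeholder path to the ~179.2GB of shards
    pretrain_data = load_dataset("parquet", data_files=data_files, split="train")
    print(pretrain_data)                     # CulturaX records expose a "text" field, among others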

Architecture. To develop an inaugural Turkish base model, we followed prior work, establishing a solid model for Turkish language modeling akin to earlier studies on other languages. Our approach led to the creation of four variants of Hamza, following GPT-2: Hamza-small (124M parameters), Hamza-medium (354M parameters), Hamza-large (772M parameters), and our largest model, Hamza-xlarge (1.3B parameters). The architectural specifications of these models are given in the table above.
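
For reference, the GPT-2-style shape settings behind these variants look roughly like the sketch below; the Hamza-xlarge numbers are our own approximation based on its 1.3B parameter count and CroissantLLM-inspired design, so treat them as illustrative rather than the exact released configuration.

    # Approximate depth/width settings for the four Hamza variants (illustrative).
    from transformers import GPT2Config

    hamza_shapes = {
        "Hamza-small":  dict(n_layer=12, n_head=12, n_embd=768),    # ~124M parameters
        "Hamza-medium": dict(n_layer=24, n_head=16, n_embd=1024),   # ~354M parameters
        "Hamza-large":  dict(n_layer=36, n_head=20, n_embd=1280),   # ~772M parameters
        "Hamza-xlarge": dict(n_layer=24, n_head=16, n_embd=2048),   # ~1.3B parameters (assumed shape)
    }

    configs = {
        name: GPT2Config(n_positions=1024, **shape)                 # 1024-token context window
        for name, shape in hamza_shapes.items()
    }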

Optimizer. During training, the AdamW optimizer is used with hyperparameters beta1=0.9 and beta2=0.95. A cosine learning rate schedule is employed, decaying the learning rate to 10% of its maximum value. Additionally, we apply a weight decay of 0.1 and clip the gradient norm to 1.0 to prevent overfitting. The training process includes 2,000 warm-up steps. We used a learning rate of 0.0006 and a batch size of 491,520 tokens for our smallest model, Hamza-small, and varied the learning rate and batch size according to model size.
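
A minimal PyTorch sketch of this optimization setup follows (Hamza-small values); the total step count is an assumption used only to size the cosine schedule, and since stock cosine schedulers decay to zero, the 10% floor is implemented with a small lambda.

    # AdamW + warmup + cosine decay to 10% of the peak learning rate.
    import math
    import torch
    from transformers import GPT2Config, GPT2LMHeadModel

    model = GPT2LMHeadModel(GPT2Config(n_layer=12, n_head=12, n_embd=768, n_positions=1024))

    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)

    warmup_steps = 2_000
    total_steps = 610_000        # assumption: ~300B tokens / ~491,520-token batches

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:                                   # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return 0.1 + 0.9 * cosine                                 # decay to 10% of the peak

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # In the training loop, gradients are clipped before each optimizer step:
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)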

Training. Our from-scratch Hamza models are built on the GPT2 architecture and incorporate the flash-attention mechanism for efficient training. The hyperparameters of the models follow the scaling principles set by GPT2, except for the largest variant, Hamza-xlarge, which is inspired by the recent CroissantLLM. All model versions were trained on 300 billion tokens with a uniform batch size of roughly 500,000 tokens, and the learning rate was tuned for each model variant. We standardized the context window across all models at 1024 tokens and did not employ dropout during training. All training runs were conducted in half-precision (fp16), utilizing both tensor and data parallelism across eight A100 GPUs, each with 80GB of memory.
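
As an illustration of the distributed half-precision setup, a Hugging Face Trainer configuration of the following shape reproduces the fp16 and data-parallel aspects. Our actual runs use a custom script with flash-attention integrated directly, and the per-device batch and accumulation values below are assumptions chosen so that 8 GPUs with 1024-token sequences give a roughly 500k-token global batch.

    # fp16 data-parallel training sketch; launch with: torchrun --nproc_per_node=8 train.py
    # `model` and `tokenized_dataset` are assumed to come from the configuration and data sketches above.
    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="hamza-small",
        fp16=True,                       # half-precision training
        per_device_train_batch_size=8,   # 8 sequences x 1024 tokens per GPU (assumption)
        gradient_accumulation_steps=8,   # 8 x 8 x 8 GPUs x 1024 tokens ≈ 524k-token global batch
        max_grad_norm=1.0,
        logging_steps=100,
        report_to="none",
    )

    trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_dataset)
    # trainer.train()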

Benchmarks and Results

Evaluating the accuracy of Turkish LLMs on various tasks can be challenging due to concerns about dataset quality. Many reasoning datasets have been directly machine-translated from English without validation, resulting in biased and inaccurate results. To address this, we introduce two Turkish datasets: TruthfulQA-TR, which assesses a model’s tendency to reproduce common falsehoods, and ARC-TR, a set of grade-school science questions. We used state-of-the-art tools for translation and validated the samples with multiple annotators, cleaning them as needed. We also evaluate the Bits-Per-Character (BPC) rate of each model and report it in detail.



Bits-Per-Character (BPC) Evaluations. Auto-regressive language models are trained by optimizing the Negative Log-Likelihood (NLL) of the data in the training set, and their effectiveness is then measured on unseen test data. The most common metric for evaluating these models is perplexity (PPL), which measures the uncertainty of an LLM in predicting the next token in a sequence and is derived by exponentiating the average NLL. However, since different tokenizers divide a sentence into differing numbers of tokens, NLL and PPL may produce incomparable results for models utilizing different tokenizers. To tackle this, we use Bits-Per-Character (BPC), another metric derived from NLL that evaluates the performance of LLMs at the character level. Consequently, our comparisons mainly rely on BPC, which normalizes away the impact of tokenization differences. For the BPC evaluation, we utilized the test set of the trnews-64 corpus, comprising 5,000 samples.
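
Concretely, if a model assigns a total negative log-likelihood NLL (in nats) to its tokenization of a test corpus containing N_char characters, BPC can be written as (notation ours):

    \mathrm{BPC} = \frac{\mathrm{NLL}}{N_{\text{char}} \cdot \ln 2}
                 = \frac{-\sum_{t} \log p_{\theta}\left(x_t \mid x_{<t}\right)}{N_{\text{char}} \cdot \ln 2}

Dividing by the number of characters rather than the number of tokens is what makes models with different tokenizers directly comparable.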

Prompting and Few-Shot. Evaluating the reasoning capabilities of LLMs on downstream Question Answering (QA) tasks is crucial to assessing their abilities. However, finding such datasets in languages other than English is challenging due to the limited availability of benchmarks in these languages. To bridge this gap, we developed the TruthfulQA-TR and ARC-TR Turkish question-answering datasets, designed to evaluate the ability of LLMs to generate truthful and accurate responses to questions. To build the Turkish versions of the original TruthfulQA Multiple Choice (MC) dataset and the ARC (AI2 Reasoning Challenge) dataset, we translated each example using the DeepL Machine Translation (MT) framework via its Python API. After translation, each sample was reviewed for errors or superficial translations. We used the test sets of TruthfulQA-MC2 and ARC-Challenge for evaluation. Our experiments followed the same prompting settings as the LLM-Leaderboard.
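
A minimal sketch of such a translation call with the official deepl Python package is shown below; the authentication key is a placeholder, and batching over the full datasets and the subsequent annotator review are omitted.

    # Translate one benchmark question to Turkish with the official DeepL Python client.
    import deepl

    translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")   # placeholder key

    question_en = "Which gas do green plants absorb from the air during photosynthesis?"
    result = translator.translate_text(question_en, source_lang="EN", target_lang="TR")
    print(result.text)                                     # Turkish output, later reviewed by annotators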

Results. We present the BPC results of different models evaluated on trnews-64 in the last column of the results table above, including our models together with various open-source multilingual and Turkish LLMs. Looking at the BPC results, we observe a wide range of values across the models. Lower BPC values indicate better performance in terms of compression, suggesting that the model represents the text more efficiently. The most favorable outcomes are attained by the pretrained Kanarya-2b and Hamza-xlarge models. The adapted models, originally pretrained on English but extended to Turkish, also yielded promising results below 1 BPC, whereas the multilingual models performed comparatively worse. Our prompting and few-shot evaluations were conducted on the newly established Turkish benchmarks: ARC-TR in a 25-shot setting and TruthfulQA-TR, adhering to the same settings as outlined by the LLM-Leaderboard. On ARC-TR, Google's Gemma 7B model leads with an accuracy of 46.16 even though it is not specifically tuned for Turkish, closely followed by Sambalingo-tr with 44.37. On TruthfulQA-TR, Trendyol's DPO model emerges as the top performer with an accuracy of 50.11, while Mistral-7b-chat-v2 secures second place with 48.34. The accuracy scores for ARC-TR range from 24 to 47, and for TruthfulQA-TR from 33 to 50. These results underscore the necessity for substantial improvements in these models to reach the proficiency levels observed on English benchmarks.

Case Studies

We investigate three main questions during experiments:

  • (i) Enhancing Non-English Models: Fine-Tuning vs. From-Scratch Training. The analysis of Turkish language models, specifically comparing models trained from scratch, those continually pretrained from GPT2-xl, and those adapted from Mistral 7B, reveals insightful trends. According to the table, the Mistral 7B-adapted model exhibits superior performance on Turkish question-answering tasks compared to the other methods. Moreover, training from scratch surpasses continued pretraining within the same model architecture, underscoring the significance of the base language model when undertaking continued pretraining. This is evidenced by the discrepancy in accuracy between models adapted from Mistral 7B and those from GPT2. Therefore, applying continued pretraining to a strong base language model emerges as the most effective strategy for low-resource languages, considering both data scarcity and hardware constraints.


  • (ii) Effect of Supervised Fine-Tuning: Assessing Model Performance with the Proposed IT Dataset. Supervised Fine-Tuning (SFT) plays a crucial role in enhancing the reasoning capabilities of LLMs, as highlighted in existing research. In this context, we introduced a novel Turkish IT Dataset, meticulously crafted from the ground up and inspired by Self-Instruct and Alpaca (see the formatting sketch after this list). By fine-tuning our largest model, Hamza-xlarge, on this bespoke Turkish IT Dataset, we observed an improvement in model performance across downstream benchmarks. This improvement underscores the effectiveness of SFT when applied to our tailored IT dataset, slightly bolstering the model's reasoning proficiency.


  • (iii) Retention after Fine-Tuning: Will Models Forget English-Learned Skills When Fine-Tuning on Another Language? According to the table and plots, further pretraining of base English language models such as GPT2 and Mistral results in a decrease in accuracy on the English downstream tasks TruthfulQA and ARC, proportional to the number of samples used during continued pretraining, compared to their original base scores before fine-tuning on Turkish. This indicates catastrophic forgetting: the models lose prior knowledge upon being fine-tuned on a smaller language dataset, as evidenced by a decline in baseline accuracy compared to the versions not further trained, even after applying techniques like LoRA. One direction for future work could be to include some English data alongside the Turkish data in each batch during continued pretraining.
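
Below is the formatting sketch referenced in case study (ii): a rough, Alpaca-style template for rendering Turkish IT-dataset records into SFT training strings. The Turkish template wording and the record field names are our assumptions for illustration, not the exact released prompt.

    # Alpaca-style instruction formatting for supervised fine-tuning on the Turkish IT dataset.
    PROMPT_TEMPLATE = (
        "Aşağıda bir görevi tanımlayan bir talimat bulunmaktadır. "
        "Görevi uygun şekilde tamamlayan bir yanıt yazın.\n\n"
        "### Talimat:\n{instruction}\n\n### Girdi:\n{input}\n\n### Yanıt:\n{output}"
    )

    def format_example(example: dict) -> str:
        """Render one instruction-tuning record into a single causal-LM training string."""
        return PROMPT_TEMPLATE.format(
            instruction=example.get("instruction", ""),
            input=example.get("input", ""),
            output=example.get("output", ""),
        )

    sample = {
        "instruction": "Verilen cümleyi İngilizceye çevirin.",
        "input": "Bugün hava çok güzel.",
        "output": "The weather is very nice today.",
    }
    print(format_example(sample))   # formatted strings are tokenized and used for SFT of Hamza-xlarge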


Conclusion

Our work advances the development of Turkish LLMs, presenting a new series of models both trained from scratch (Hamza) and adapted from other base LLMs (HamzaMistral and HamzaGPT2-xl), together with a new Instruction Tuning dataset and a meticulously crafted Turkish LLM Leaderboard. In our analysis, we observed that the base LLMs exhibited catastrophic forgetting of their primary-language knowledge during continued pretraining. Additionally, through the creation of a novel Turkish LLM evaluation benchmark, we have identified a significant performance gap between current Turkish LLMs and their English counterparts, underscoring the need for further improvements in Turkish language modeling. Our fully open-source work and detailed observations play a pivotal role in the field of Turkish language modeling, providing insights into construction methodologies and offering a comparative framework for evaluating performance, thereby paving the way for future advancements.

BibTeX


        @misc{acikgoz2024bridging,
            title={Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking}, 
            author={Emre Can Acikgoz and Mete Erdogan and Deniz Yuret},
            year={2024},
            eprint={2405.04685},
            archivePrefix={arXiv},
            primaryClass={cs.CL}
        }
        

Acknowledgements

This work is supported in part by the KUIS AI Center. The numerical calculations reported in this paper were fully/partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources). Last but not least, we also acknowledge VSB – Technical University of Ostrava, IT4Innovations National Supercomputing Center, Czech Republic, for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking and hosted by CSC (Finland) and the LUMI consortium, through the Ministry of Education, Youth and Sports of the Czech Republic via e-INFRA CZ (grant ID: 90254).