Add Slovenian language support (#653)

---------

Co-authored-by: sspanak <doftor.livain@gmail.com>
Matjaž Finžgar 2025-03-03 13:38:09 +00:00 committed by GitHub
parent 04231b7707
commit ffae563b95
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
4 changed files with 1052665 additions and 0 deletions

@@ -0,0 +1,22 @@
Slovenian wordlist 1 by: Fran Ramovš Institute of Slovenian Language - ZRC SAZU
Version: ZIP exported on 2025-02-25
Source: http://bos.zrc-sazu.si/besede_en.html
License: Public domain source from ZRC SAZU, zrc@zrc-sazu.si
Slovenian wordlist 2 by: Aleksander Simonic and Uros Lotric
Version: 2012-11-30
Source: https://www.winedt.org/dict.html
License: https://www.winedt.org/dict/Slovenian.html
Frequencies obtained from LatinIME dictionaries:
Source: https://android.googlesource.com/platform/packages/inputmethods/LatinIME
Version: 66093bf509ea92fa31d796326d5f30a8d9582ffe (2023-12-21)
License: https://android.googlesource.com/platform/packages/inputmethods/LatinIME/+/refs/heads/main/NOTICE
Slovenian wordlist 3 by: CC-100
Version: 2020
Source: https://data.statmt.org/cc-100/
References (PDF links are available in the source URL):
- Unsupervised Cross-lingual Representation Learning at Scale, Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), p. 8440-8451, July 2020.
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
Remark: Only the words that appear at least 4 times were used.
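
The remark above describes a minimum-occurrence cutoff applied to the CC-100 word list. The commit does not show how that cutoff was applied, so the following is only a minimal sketch of such a filter. The function name, the whitespace tokenization, and the lowercasing are assumptions made for illustration, not the tooling actually used.

```python
from collections import Counter

# Threshold taken from the remark: keep words seen at least 4 times.
MIN_OCCURRENCES = 4


def filter_by_frequency(lines, min_count=MIN_OCCURRENCES):
    """Count whitespace-separated words and keep the frequent ones.

    Hypothetical helper: tokenization and lowercasing are assumptions,
    the source only states the threshold.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return {word: n for word, n in counts.items() if n >= min_count}


# Tiny made-up sample, not real corpus data.
sample = ["hiša miza hiša", "hiša miza", "hiša okno"]
kept = filter_by_frequency(sample)
```

Here only "hiša" reaches four occurrences, so "miza" and "okno" would be dropped. A real pipeline would likely also need to strip punctuation and reject non-Slovenian tokens, which this sketch does not attempt.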