Dimo Karaivanov 2025-07-01 12:54:47 +03:00 committed by sspanak
parent b114370e91
commit b0cb1ffad9
8 changed files with 1470591 additions and 1 deletions

@@ -0,0 +1,29 @@
Slovak wordlists and frequencies by: id.psycho
Source: https://p.brm.sk/sk_wordlist/
Version: 2013-04-11
License: Public domain
Lists used for spell checking, validation and capitalization:
1. Slovak Hunspell dictionary by: sk-spell
Source: https://github.com/sk-spell/hunspell-sk
Version: 64b1afbe98fed61506acdfba67a9bfd4b07023e0 (2025-05-16)
License: Mozilla Public License 2.0 (https://github.com/sk-spell/hunspell-sk/blob/master/LICENSE)
2. Slovak wordlists by: Wortschatz Leipzig @ Uni Leipzig
Source: https://wortschatz.uni-leipzig.de/en/download/
Lists used: Newscrawl, Web, Wikipedia
Version: 2025-05-22
License: CC-BY
Reference:
> D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
> In: Proceedings of the 8th International Language Resources and Evaluation (LREC'12), 2012
> http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf
3. Slovak wordlist by: CC-100
Version: 2020
Source: https://data.statmt.org/cc-100/
References (PDF links are available in the source URL):
- Unsupervised Cross-lingual Representation Learning at Scale, Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), p. 8440-8451, July 2020.
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
Remark: Only the words that appear at least 10 times were used.
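The frequency cutoff in the remark above can be illustrated with a minimal sketch. It assumes the corpus has already been split into a flat list of word tokens; the function name `filter_by_min_count`, the constant `MIN_COUNT` and the sample data are hypothetical and are not taken from the actual tooling used to build this wordlist.

```python
from collections import Counter

# Threshold stated in the remark: keep words seen at least 10 times.
MIN_COUNT = 10


def filter_by_min_count(tokens, min_count=MIN_COUNT):
    """Count token occurrences and keep only words at or above min_count.

    `tokens` is assumed to be an iterable of already-tokenized words.
    Returns a dict mapping each kept word to its occurrence count.
    """
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical sample: two words reach the threshold, one falls short.
sample = ["a"] * 12 + ["že"] * 10 + ["xyzzy"] * 9
kept = filter_by_min_count(sample)
print(sorted(kept.items()))
```

A word that occurs exactly 10 times is kept, since the remark says "at least 10 times"; rarer words are dropped, which tends to remove typos and one-off tokens from crawled text.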