Latvian (#737)
parent cad27907c7
commit 52532a6d26
4 changed files with 1192092 additions and 0 deletions
14
app/languages/definitions/Latvian.yml
Normal file
@@ -0,0 +1,14 @@
locale: lv-LV
dictionaryFile: lv-utf8.csv
abcString: abc
layout:
- [SPECIAL] # 0
- [PUNCTUATION] # 1
- [a, ā, b, c, č] # 2
- [d, e, ē, f] # 3
- [g, ģ, h, i, ī] # 4
- [j, k, ķ, l, ļ] # 5
- [m, n, ņ, o] # 6
- [p, q, r, s, š] # 7
- [t, u, ū, v] # 8
- [w, x, y, z, ž] # 9
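The layout above assigns each Latvian letter to a keypad digit. As a rough illustration only, the following Python sketch shows how such a mapping could turn a word into its key sequence. The names `LAYOUT`, `LETTER_TO_DIGIT`, and `word_to_digits` are hypothetical and not part of this commit, and how the application actually consumes `Latvian.yml` is an assumption here.

```python
import unicodedata

# Letter groups copied from the layout in Latvian.yml; the dict key is the keypad digit.
# Keys 0 and 1 (SPECIAL, PUNCTUATION) are placeholders and are left out of this sketch.
LAYOUT = {
    2: ["a", "ā", "b", "c", "č"],
    3: ["d", "e", "ē", "f"],
    4: ["g", "ģ", "h", "i", "ī"],
    5: ["j", "k", "ķ", "l", "ļ"],
    6: ["m", "n", "ņ", "o"],
    7: ["p", "q", "r", "s", "š"],
    8: ["t", "u", "ū", "v"],
    9: ["w", "x", "y", "z", "ž"],
}

# Inverted lookup table: letter -> digit (as a string).
LETTER_TO_DIGIT = {
    letter: str(digit) for digit, letters in LAYOUT.items() for letter in letters
}


def word_to_digits(word: str) -> str:
    """Return the digit sequence for a word (hypothetical helper).

    The input is NFC-normalized so that letters typed as base letter plus
    combining mark still match the precomposed letters in the layout.
    Raises ValueError for any character that is not in the layout.
    """
    normalized = unicodedata.normalize("NFC", word).lower()
    digits = []
    for ch in normalized:
        if ch not in LETTER_TO_DIGIT:
            raise ValueError(f"letter {ch!r} is not in the layout")
        digits.append(LETTER_TO_DIGIT[ch])
    return "".join(digits)


if __name__ == "__main__":
    print(word_to_digits("Rīga"))  # 7442
```

Several words can share one digit sequence, which is presumably why the dictionary file carries word frequencies for ranking candidates.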
1192057
app/languages/dictionaries/lv-utf8.csv
Normal file
File diff suppressed because it is too large
21
docs/dictionaries/lvWordlistReadme.txt
Normal file
@@ -0,0 +1,21 @@
Latvian wordlist 1 from: Tēzaurs.lv
Source: https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/119
Version: 2025 (Winter Edition)
License: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/

Latvian wordlist 2 by: Eymen Efe Altun
Source: https://github.com/eymenefealtun/all-words-in-all-languages
Version: edc173a7554731fe644319796a167eae8a5a8eaf (2024-11-28)

Latvian word list 3 by: CC-100;
Version: 2020
Source: https://data.statmt.org/cc-100/
References (PDF links are available in the source URL):
- Unsupervised Cross-lingual Representation Learning at Scale, Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), p. 8440-8451, July 2020.
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, Proceedings of the 12th Language Resources and Evaluation Conference (LREC), p. 4003-4012, May 2020.
Remark: Only the 4-letter or longer words that appear at least 2 times were used.

Word frequencies from LatinIME dictionaries:
Source: https://android.googlesource.com/platform/packages/inputmethods/LatinIME
Version: 66093bf509ea92fa31d796326d5f30a8d9582ffe (2023-12-21)
License: https://android.googlesource.com/platform/packages/inputmethods/LatinIME/+/refs/heads/main/NOTICE
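The remark on word list 3 in the readme above says that only words of 4 or more letters appearing at least 2 times were used. A minimal Python sketch of such a filter could look like the following. The function name `filter_words`, its parameters, and the pre-tokenized input are all assumptions made for illustration; the tooling actually used to process CC-100 for this commit is not shown in the source.

```python
from collections import Counter
from typing import Dict, Iterable


def filter_words(
    tokens: Iterable[str], min_length: int = 4, min_count: int = 2
) -> Dict[str, int]:
    """Keep words that are long enough and frequent enough (hypothetical helper).

    tokens: an iterable of already-tokenized words (tokenization is out of scope).
    Returns a dict mapping each kept word to its occurrence count.
    """
    counts = Counter(tokens)
    return {
        word: count
        for word, count in counts.items()
        if len(word) >= min_length and count >= min_count
    }


if __name__ == "__main__":
    sample = ["māja", "māja", "un", "un", "koks", "saule", "saule", "saule"]
    print(filter_words(sample))
```

In the sample, "un" is dropped for being too short and "koks" for appearing only once, while "māja" and "saule" are kept with their counts.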
BIN
downloads/lv-utf8.zip
Normal file
Binary file not shown.