* added missing words to the Bulgarian dictionary * English dictionary update * removed repeating words from the Italian and Bulgarian dictionaries * fixed incorrectly broken words and removed repeating ones from the Ukrainian dictionary * Russian dictionary update * documentation update * made it possible to type words with apostrophes (Dutch, English and Ukrainian)
753 lines
30 KiB
Text
753 lines
30 KiB
Text
Custom wordlist generated from http://app.aspell.net/create using SCOWL
|
|
|
|
with parameters (words with 2-3 letters):
|
|
diacritic: strip
|
|
max_size: 50
|
|
max_variant: 0
|
|
special: <none>
|
|
spelling: US
|
|
|
|
with parameters (words with 4 or more letters):
|
|
diacritic: strip
|
|
max_size: 70
|
|
max_variant: 2
|
|
special: hacker
|
|
spelling: US GBz
|
|
|
|
Using Git Commit From: Mon Dec 7 20:14:35 2020 -0500 [5ef55f9]
|
|
|
|
=====
|
|
|
|
Spell Checking Oriented Word Lists (SCOWL)
|
|
|
|
Mon Dec 7 20:14:35 2020 -0500 [5ef55f9]
|
|
by Kevin Atkinson (kevina@gnu.org)
|
|
|
|
The SCOWL is a collection of word lists split up in various sizes, and
|
|
other categories, intended to be suitable for use in spell checkers.
|
|
However, I am sure it will have numerous other uses as well.
|
|
|
|
The latest version can be found at http://wordlist.aspell.net/.
|
|
|
|
The directory final/ contains the actual word lists broken up into
|
|
various sizes and categories. The r/ directory contains Readmes from
|
|
the various sources used to create this package.
|
|
|
|
The misc/ contains a small list of taboo words, see the README file
|
|
for more info. The speller/ directory contains scripts for creating
|
|
spelling dictionaries for Aspell and Hunspell.
|
|
|
|
The other directories contain the necessary information to recreate the
|
|
word lists from the raw data. Unless you are interested in improving the
|
|
words lists you should not need to worry about what's here. See the
|
|
section on recreating the words lists for more information on what's
|
|
there.
|
|
|
|
Except for the special word lists the files follow the following
|
|
naming convention:
|
|
<spelling category>-<sub-category>.<size>
|
|
Where the spelling category is one of
|
|
english, american, british, british_z, canadian, australian
|
|
variant_1, variant_2, variant_3,
|
|
british_variant_1, british_variant_2,
|
|
canadian_variant_1, canadian_variant_2,
|
|
australian_variant_1, australian_variant_2
|
|
Sub-category is one of
|
|
abbreviations, contractions, proper-names, upper, words
|
|
And size is one of
|
|
10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large),
|
|
80 (huge), 95 (insane)
|
|
The special word lists follow are in the following format:
|
|
special-<description>.<size>
|
|
Where description is one of:
|
|
roman-numerals, hacker
|
|
|
|
The perl script "mk-list" can be used to create a word list of the
|
|
desired size, its usage is:
|
|
./mk-list [-f] [-v#] <spelling categories> <size>
|
|
where <spelling categories> is one of the above spelling categories
|
|
(the english and special categories are automatically included as well
|
|
as all sub-categories) and <size> is the desired size. The
|
|
"-v" option can be used to also include the appropriate
|
|
variants file up to level '#'. The normal output will be a sorted
|
|
word list. If you rather see what files will be included, use the
|
|
"-f" option.
|
|
|
|
When manually combining the words lists the "english" spelling
|
|
category should be used as well as one of "american", "british",
|
|
"british_z" (british with ize spelling), "canadian" or "australian".
|
|
Great care has been taken so that only one spelling for any particular
|
|
word is included in the main list (with some minor exceptions). When
|
|
two variants were considered equal I randomly picked one for inclusion
|
|
in the main word list. Unfortunately this means that my choice in how
|
|
to spell a word may not match your choice. If this is the case you
|
|
can try including one of the "variant_1" spelling categories which
|
|
includes most variants which are considered almost equal. The
|
|
"variant_1" spelling category corresponds mostly to American variants,
|
|
while the "british_variant_1", "canadian_variant_1" and
|
|
"australian_variant_1" are for British, Canadian and Australian
|
|
variants, respectively. The "variant_2" spelling categories include
|
|
variants which are also generally considered acceptable, and
|
|
"variant_3" contains variants which are seldom used and may not even
|
|
be considered correct. There is no "british_variant_3",
|
|
"canadian_variant_3" or "australian_variant_3" spelling category since
|
|
the distinction would be almost meaningless.
|
|
|
|
The "abbreviation" category includes abbreviations and acronyms which
|
|
are not also normal words. The "contractions" category should be self
|
|
explanatory. The "upper" category includes upper case words and proper
|
|
names which are common enough to appear in a typical dictionary. The
|
|
"proper-names" category includes all the additional uppercase words.
|
|
Finally the "words" category contains all the normal English words.
|
|
|
|
To give you an idea of what the words in the various sizes look like
|
|
here is a sample of 25 random words found only in that size:
|
|
|
|
10: blow convert delete enables flow hot individual job maintains occurred
|
|
pointless political population provided quits recovering results settles
|
|
simultaneous situation source tickets uncertain uses why
|
|
|
|
20: additions advertisement akin applicants appoints celebrated contracts
|
|
crime degradation discriminate enforcing escapes fabric funeral
|
|
genetically inconsistencies initialized innovative lodge lurking
|
|
photographic punches tiring trumpet wary
|
|
|
|
35: bagel brewed bushel charting commutative consigning dabbed displacements
|
|
fatties flotillas flung gunshots harrow hull hungriest kangaroos math
|
|
memoirs negatives nonresident rampages ranchers submissive subtractions
|
|
tipped
|
|
|
|
40: astrologers bedraggles buzzword cupcakes eyeglass gridlock grungy
|
|
hairpiece hallucinates hotcakes inebriated leakier nymphomania papergirls
|
|
patchier patrolman predisposed reshuffled sasses snowmobiling
|
|
southeasterly teargas testiest topographer wimpy
|
|
|
|
50: apiaries besmirching boozier caducei communicant drainpipe ductile
|
|
exigencies gammas grouted harbinger hyphenations licentiate lynxes
|
|
maidenly malingerer palmettos pinwheeled prepackage propellant scrimmaged
|
|
sculleries senselessly unscrambled viburnums
|
|
|
|
55: bloodstock bodge bruiting bumbag carthorse clumpy dandifying etiolated
|
|
fleabite guestrooms marge moi overdeveloped owlishly perisher plebes
|
|
pseudy pukka putzes sangria splodges stocktaking subspecies tiebreaks
|
|
touchpapers
|
|
|
|
60: autobiographic cytologist fellowman footraces gypsters hardihood
|
|
headshrinker homo interfile nonoperational nonsupporting outdraw
|
|
profligately readopted revetments semanticist stagnantly tapper thanes
|
|
thetas uncloaking uncross versifiers wasabi xylene
|
|
|
|
70: biltongs bookcraft bouilli bouse bronchiole cirrostrati coenurus
|
|
desorption feculence hackbuts heterolysis hylophagous ichthyosaur
|
|
iguanodon jillion lapidated mistranslating pullulating redd shylock skink
|
|
storaxes thalluses vermiculations voiture
|
|
|
|
80: cellulolytic chomper costrels ditheistic doddard dwarfest fellwalkers
|
|
fernless gammoners gasolinic introductive labrets macaber
|
|
perspicaciousness pharmacodynamics pitchwomen pleuritical protore
|
|
repurifies ristras rolamite rumping sedimenting smithereening tolans
|
|
|
|
95: amherstite appropinquations arsefoot assur commodate craspedia cutitis
|
|
disciferous endeavourments endocondensation glyoxalase hatherlite
|
|
interreticular interspicular lipothymy prieved reconvergence rousette
|
|
septerium superdonation tenaim topepo trachelitis transgeneses
|
|
ultraenthusiastic
|
|
|
|
|
|
And here is a count on the number of words in each spelling category
|
|
(american + english spelling category):
|
|
|
|
Size Words Names Running Total %
|
|
10 4,425 13 4,438 0.7
|
|
20 8,126 0 12,564 1.9
|
|
35 37,260 220 50,044 7.6
|
|
40 6,858 489 57,391 8.7
|
|
50 25,289 18,683 101,363 15.4
|
|
55 6,487 0 107,850 16.4
|
|
60 14,551 850 123,251 18.7
|
|
70 35,294 7,897 166,442 25.3
|
|
80 144,158 33,368 343,968 52.3
|
|
95 227,633 86,630 658,231 100.0
|
|
|
|
|
|
(The "Words" column does not include the name count.)
|
|
|
|
Size 35 is the recommended small size, 50 the medium and 70 the large.
|
|
Sizes 70 and below contain words found in most dictionaries while the
|
|
80 size contains all the strange and unusual words people like to use
|
|
in word games such as Scrabble (TM). While a lot of the words in the
|
|
80 size are not used very often, they are all generally considered
|
|
valid words in the English language. The 95 contains just about every
|
|
English word in existence and then some. Many of the words at the 95
|
|
level will probably not be considered valid English words by most
|
|
people.
|
|
|
|
For spell checking I recommend using size 60. This size is the
|
|
largest size that I am fairly confident does not contain any
|
|
misspellings or invalid words. In addition an effort is made to
|
|
exclude valid yet problematic words (such as "calender") from the 60
|
|
size that are likely to be a misspelling of a more common word. The
|
|
70 size is reasonable for those wanting a larger list and don't mind a
|
|
few errors. The 80 or larger sizes are not reasonable for spell
|
|
checking.
|
|
|
|
Accents are present on certain words such as café in iso8859-1 format.
|
|
|
|
CHANGES:
|
|
|
|
From Version 2019.10.06 to 2020.12.07
|
|
|
|
Various new words.
|
|
|
|
Variant cleanups.
|
|
|
|
Bump irregardless, froward (+ derivatives) and perpend to level 70.
|
|
|
|
From Version 2018.04.16 to 2019.10.06
|
|
|
|
Various new words.
|
|
|
|
Remove compare's and fail's.
|
|
|
|
From Version 2017.08.24 to 2018.04.16
|
|
|
|
Various new words.
|
|
|
|
Fix build problems on macOS.
|
|
|
|
From Version 2017.01.22 to 2017.08.24
|
|
|
|
Various new words.
|
|
|
|
From Version 2016.11.20 to 2017.01.22
|
|
|
|
Various new words.
|
|
|
|
From Version 2016.06.26 to 2016.11.20
|
|
|
|
New Australian spelling category thanks to the work of Benjamin
|
|
Titze (btitze@protonmail.ch)
|
|
|
|
Various new words.
|
|
|
|
From Version 2016.01.19 to 2016.06.26
|
|
|
|
Various new words.
|
|
|
|
Updated to Version 6.0.2 of 12dicts
|
|
|
|
Other minor changes.
|
|
|
|
From Version 2015.08.24 to 2016.01.19
|
|
|
|
Various new words.
|
|
|
|
Clarified README to indicate why the 60 size is the preferred size
|
|
for spell checking.
|
|
|
|
Remove some very uncommon possessive forms.
|
|
|
|
Change "SET UTF8" to "SET UTF-8" in hunspell affix file.
|
|
|
|
From Version 2015.05.18 to 2015.08.24 (Aug 24, 2015)
|
|
|
|
Various new words.
|
|
|
|
From Version 2015.04.24 to 2015.05.18 (May 18, 2015)
|
|
|
|
Added some new words found to have a high frequency in the COCA
|
|
corpus. (http://corpus.byu.edu/coca/).
|
|
|
|
Fix en spelling suggestions for 'alot' and 'exersize' in hunspell
|
|
dictionary (upstreamed from the changes made in Firefox).
|
|
|
|
From Version 2015.02.15 to 2015.04.24 (April 24, 2015)
|
|
|
|
Added some new words.
|
|
|
|
Convert hunspell dictionary to UTF-8 in order to handle smart
|
|
quotes correctly.
|
|
|
|
From Version 2015.01.28 to 2015.02.15 (February 15, 2015)
|
|
|
|
Added a large number of neologisms (newly invented words)
|
|
such as "selfie" and "smartwatch" thanks to Alan Beale.
|
|
|
|
Various other new words.
|
|
|
|
Clean up the special-hacker category by removing some words that
|
|
didn't exist in the Google Book's Corpus (1980 - 2008) and
|
|
originated from the "Unofficial Jargon File Word Lists".
|
|
|
|
From Version 2014.11.17 to 2015.01.28 (January 28, 2015)
|
|
|
|
Various new words, many from analyzing the Google Book's Corpus
|
|
(1980 - 2008). See http://app.aspell.net/lookup-freq.
|
|
|
|
Moved some uncommon words that can easily hide a misspelling of a
|
|
more common word to level 70. (calender, adrenalin and Joesph)
|
|
|
|
Removed several -er and -est forms from adjectives that were so
|
|
uncommon that they were not found anywhere is the Google Book's
|
|
Corpus (1980 - 2008).
|
|
|
|
From Version 2014.08.11.1 to 2014.11.17 (November 17, 2014)
|
|
|
|
Various new words.
|
|
|
|
Fix typo in Hunspell readme.
|
|
|
|
From Version 2014.08.11 to 2014.08.11.1 (August 13, 2014)
|
|
|
|
Forgot to mention this important change from 7.1 to 2014.08.11:
|
|
|
|
Shifted the variant levels up by one: variant_0 is now variant_1,
|
|
variant_1 is now variant_2, and variant_2 is now variant_3.
|
|
|
|
Other minor fixes in this README.
|
|
|
|
No changes to the contents of the lists.
|
|
|
|
From Revision 7.1 to Version 2014.08.11 (August 11, 2014)
|
|
|
|
Added some missing possessive forms.
|
|
|
|
Added some new words and proper names.
|
|
|
|
Clean up the categories (words, upper, proper-names etc) so that they
|
|
are more accurate.
|
|
|
|
Convert documentation to UTF-8. For now, the wordlist are still in
|
|
ISO-8859-1 to prevent compatibility problems.
|
|
|
|
Add schema and scripts for creating a SQLite database from SCOWL.
|
|
Add some utility and library functions using them. This database is
|
|
used by the new web app's (http://app.aspell.net/lookup & create).
|
|
|
|
Enhance speller/make-hunspell-dict. The biggest improvement is that
|
|
it that it now generates several more dictionaries in addition to
|
|
the official ones. These additional dictionaries are ones for
|
|
British English and larger dictionaries that include up to SCOWL
|
|
size 70.
|
|
|
|
From Revision 7 to 7.1 (January 6, 2011)
|
|
|
|
Updated to revision 5.1 of Varcon which corrected several errors.
|
|
|
|
Fixed various problems with the variant processing which corrected a
|
|
few more errors.
|
|
|
|
Added several now common proper names and some other words now
|
|
in common use.
|
|
|
|
Include misc/ and speller/ directory which were in SVN but left
|
|
out of the release tarball.
|
|
|
|
Other minor fixes, including some fixes to the taboo word lists.
|
|
|
|
From Revision 6 to 7 (December 27, 2010)
|
|
|
|
Updated to revision 5.0 of Varcon which corrected many errors,
|
|
especially in the British and Canadian spelling categories. Also
|
|
added new spelling categories for the British and Canadian spelling
|
|
variants and separated them out from the main variant_* categories.
|
|
|
|
Moved Moby names lists (3897male.nam 4946fema.len 21986na.mes) to 95
|
|
level since they contain too many errors and rare names.
|
|
|
|
Moved frequently class 0 from Brian Kelk's Wordlist from
|
|
level 60 to 70, and also filter it with level 80 due to, too many
|
|
misspellings.
|
|
|
|
Many other minor fixes.
|
|
|
|
From Revision 5 to 6 (August 10, 2004)
|
|
|
|
Updated to version 4.0 of the 12dicts package.
|
|
|
|
Included the 3esl, 2of4brif, and 5desk list from the new 12dicts
|
|
package. The 3esl was included in the 40 size, the 2of4brif in the
|
|
55 size and the 5desk in the 70 size.
|
|
|
|
Removed the Ispell word list as it was a source of too many errors.
|
|
This eliminated the 65 size.
|
|
|
|
Removed clause 4 from the Ispell copyright with permission of Geoff
|
|
Kuenning.
|
|
|
|
Updated to version 4.1 of VarCon.
|
|
|
|
Added the "british_z" spelling category which is British using the
|
|
"ize" spelling.
|
|
|
|
From Revision 4a to 5 (January 3, 2002)
|
|
|
|
Added variants that were not really spelling variants (such as
|
|
forwards) back into the main list.
|
|
|
|
Fixed a bug which caused variants of words to incorrectly appear in
|
|
the non-variant lists.
|
|
|
|
Moved rarely used inflections of a word into higher number lists.
|
|
|
|
Added other inflections of a words based on the following criteria
|
|
If the word is in the base form: only include that word.
|
|
If the word is in a plural form: include the base word and the plural
|
|
If the word is a verb form (other than plural): include all verb forms
|
|
If the word is an ad* form: include all ad* forms
|
|
If the word is in a possessive form: also include the non-possessive
|
|
|
|
Updated to the latest version of many of the source dictionaries.
|
|
|
|
Removed the DEC Word List due to the questionable licence and
|
|
because removing it will not seriously decrease the quality of SCOWL
|
|
(there are a few less proper names).
|
|
|
|
From Revision 4 to 4a (April 4, 2001)
|
|
|
|
Reran the scripts on a never version of AGID (3a) which fixes a bug
|
|
which caused some common words to be improperly marked as variants.
|
|
|
|
From Revision 3 to 4 (January 28, 2001)
|
|
|
|
Split the variant "spelling category" up into 3 different levels.
|
|
|
|
Added words in the Ispell word list at the 65 level.
|
|
|
|
Other changes due to using more recent versions of various sources
|
|
included a more accurate version of AGID thanks to the work of
|
|
Alan Beale
|
|
|
|
From Revision 2 to 3 (August 18, 2000)
|
|
|
|
Renamed special-unix-terms to special-hacker and added a large
|
|
number of commonly used words within the hacker (not cracker)
|
|
community.
|
|
|
|
Added a couple more signature words including "newbie".
|
|
|
|
Minor changes due to changes in the inflection database.
|
|
|
|
From Revision 1 to 2 (August 5, 2000)
|
|
|
|
Moved the male and female name lists from the mwords package and the
|
|
DEC name lists form the 50 level to the 60 level and moved Alan's
|
|
name list from the 60 level to the 50 level. Also added the top
|
|
1000 male, female, and last names from the 1990 Census report to the
|
|
50 level. This reduced the number of names in the 50 level from
|
|
17,000 to 7,000.
|
|
|
|
Added a large number of Uppercase words to the 50 level.
|
|
|
|
Properly accented the possessive form of some words.
|
|
|
|
Minor other changes due to changes in my raw data files which have
|
|
not been released yet. Email if you are interested in these files.
|
|
|
|
COPYRIGHT, SOURCES, and CREDITS:
|
|
|
|
The collective work is Copyright 2000-2018 by Kevin Atkinson as well
|
|
as any of the copyrights mentioned below:
|
|
|
|
Copyright 2000-2018 by Kevin Atkinson
|
|
|
|
Permission to use, copy, modify, distribute and sell these word
|
|
lists, the associated scripts, the output created from the scripts,
|
|
and its documentation for any purpose is hereby granted without fee,
|
|
provided that the above copyright notice appears in all copies and
|
|
that both that copyright notice and this permission notice appear in
|
|
supporting documentation. Kevin Atkinson makes no representations
|
|
about the suitability of this array for any purpose. It is provided
|
|
"as is" without express or implied warranty.
|
|
|
|
Alan Beale <biljir@pobox.com> also deserves special credit as he has,
|
|
in addition to providing the 12Dicts package and being a major
|
|
contributor to the ENABLE word list, given me an incredible amount of
|
|
feedback and created a number of special lists (those found in the
|
|
Supplement) in order to help improve the overall quality of SCOWL.
|
|
|
|
The 10 level includes the 1000 most common English words (according to
|
|
the Moby (TM) Words II [MWords] package), a subset of the 1000 most
|
|
common words on the Internet (again, according to Moby Words II), and
|
|
frequently class 16 from Brian Kelk's "UK English Wordlist
|
|
with Frequency Classification".
|
|
|
|
The MWords package was explicitly placed in the public domain:
|
|
|
|
The Moby lexicon project is complete and has
|
|
been place into the public domain. Use, sell,
|
|
rework, excerpt and use in any way on any platform.
|
|
|
|
Placing this material on internal or public servers is
|
|
also encouraged. The compiler is not aware of any
|
|
export restrictions so freely distribute world-wide.
|
|
|
|
You can verify the public domain status by contacting
|
|
|
|
Grady Ward
|
|
3449 Martha Ct.
|
|
Arcata, CA 95521-4884
|
|
|
|
grady@netcom.com
|
|
grady@northcoast.com
|
|
|
|
The "UK English Wordlist With Frequency Classification" is also in the
|
|
Public Domain:
|
|
|
|
Date: Sat, 08 Jul 2000 20:27:21 +0100
|
|
From: Brian Kelk <Brian.Kelk@cl.cam.ac.uk>
|
|
|
|
> I was wondering what the copyright status of your "UK English
|
|
> Wordlist With Frequency Classification" word list as it seems to
|
|
> be lacking any copyright notice.
|
|
|
|
There were many many sources in total, but any text marked
|
|
"copyright" was avoided. Locally-written documentation was one
|
|
source. An earlier version of the list resided in a filespace called
|
|
PUBLIC on the University mainframe, because it was considered public
|
|
domain.
|
|
|
|
Date: Tue, 11 Jul 2000 19:31:34 +0100
|
|
|
|
> So are you saying your word list is also in the public domain?
|
|
|
|
That is the intention.
|
|
|
|
The 20 level includes frequency classes 7-15 from Brian's word list.
|
|
|
|
The 35 level includes frequency classes 2-6 and words appearing in at
|
|
least 11 of 12 dictionaries as indicated in the 12Dicts package. All
|
|
words from the 12Dicts package have had likely inflections added via
|
|
my inflection database.
|
|
|
|
The 12Dicts package and Supplement is in the Public Domain.
|
|
|
|
The WordNet database, which was used in the creation of the
|
|
Inflections database, is under the following copyright:
|
|
|
|
This software and database is being provided to you, the LICENSEE,
|
|
by Princeton University under the following license. By obtaining,
|
|
using and/or copying this software and database, you agree that you
|
|
have read, understood, and will comply with these terms and
|
|
conditions.:
|
|
|
|
Permission to use, copy, modify and distribute this software and
|
|
database and its documentation for any purpose and without fee or
|
|
royalty is hereby granted, provided that you agree to comply with
|
|
the following copyright notice and statements, including the
|
|
disclaimer, and that the same appear on ALL copies of the software,
|
|
database and documentation, including modifications that you make
|
|
for internal use or for distribution.
|
|
|
|
WordNet 1.6 Copyright 1997 by Princeton University. All rights
|
|
reserved.
|
|
|
|
THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
|
|
UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
|
|
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
|
|
UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
|
|
ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE
|
|
LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY
|
|
THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
|
|
|
|
The name of Princeton University or Princeton may not be used in
|
|
advertising or publicity pertaining to distribution of the software
|
|
and/or database. Title to copyright in this software, database and
|
|
any associated documentation shall at all times remain with
|
|
Princeton University and LICENSEE agrees to preserve same.
|
|
|
|
The 40 level includes words from Alan's 3esl list found in version 4.0
|
|
of his 12dicts package. Like his other stuff the 3esl list is also in the
|
|
public domain.
|
|
|
|
The 50 level includes Brian's frequency class 1, words appearing
|
|
in at least 5 of 12 of the dictionaries as indicated in the 12Dicts
|
|
package, and uppercase words in at least 4 of the previous 12
|
|
dictionaries. A decent number of proper names is also included: The
|
|
top 1000 male, female, and Last names from the 1990 Census report; a
|
|
list of names sent to me by Alan Beale; and a few names that I added
|
|
myself. Finally a small list of abbreviations not commonly found in
|
|
other word lists is included.
|
|
|
|
The name files form the Census report is a government document which I
|
|
don't think can be copyrighted.
|
|
|
|
The file special-jargon.50 uses common.lst and word.lst from the
|
|
"Unofficial Jargon File Word Lists" which is derived from "The Jargon
|
|
File". All of which is in the Public Domain. This file also contain
|
|
a few extra UNIX terms which are found in the file "unix-terms" in the
|
|
special/ directory.
|
|
|
|
The 55 level includes words from Alan's 2of4brif list found in version
|
|
4.0 of his 12dicts package. Like his other stuff the 2of4brif is also
|
|
in the public domain.
|
|
|
|
The 60 level includes all words appearing in at least 2 of the 12
|
|
dictionaries as indicated by the 12Dicts package.
|
|
|
|
The 70 level includes Brian's frequency class 0 and the 74,550 common
|
|
dictionary words from the MWords package. The common dictionary words,
|
|
like those from the 12Dicts package, have had all likely inflections
|
|
added. The 70 level also included the 5desk list from version 4.0 of
|
|
the 12Dics package which is in the public domain.
|
|
|
|
The 80 level includes the ENABLE word list, all the lists in the
|
|
ENABLE supplement package (except for ABLE), the "UK Advanced Cryptics
|
|
Dictionary" (UKACD), the list of signature words from the YAWL package,
|
|
and the 10,196 places list from the MWords package.
|
|
|
|
The ENABLE package, mainted by M\Cooper <thegrendel@theriver.com>,
|
|
is in the Public Domain:
|
|
|
|
The ENABLE master word list, WORD.LST, is herewith formally released
|
|
into the Public Domain. Anyone is free to use it or distribute it in
|
|
any manner they see fit. No fee or registration is required for its
|
|
use nor are "contributions" solicited (if you feel you absolutely
|
|
must contribute something for your own peace of mind, the authors of
|
|
the ENABLE list ask that you make a donation on their behalf to your
|
|
favorite charity). This word list is our gift to the Scrabble
|
|
community, as an alternate to "official" word lists. Game designers
|
|
may feel free to incorporate the WORD.LST into their games. Please
|
|
mention the source and credit us as originators of the list. Note
|
|
that if you, as a game designer, use the WORD.LST in your product,
|
|
you may still copyright and protect your product, but you may *not*
|
|
legally copyright or in any way restrict redistribution of the
|
|
WORD.LST portion of your product. This *may* under law restrict your
|
|
rights to restrict your users' rights, but that is only fair.
|
|
|
|
UKACD, by J Ross Beresford <ross@bryson.demon.co.uk>, is under the
|
|
following copyright:
|
|
|
|
Copyright (c) J Ross Beresford 1993-1999. All Rights Reserved.
|
|
|
|
The following restriction is placed on the use of this publication:
|
|
if The UK Advanced Cryptics Dictionary is used in a software package
|
|
or redistributed in any form, the copyright notice must be
|
|
prominently displayed and the text of this document must be included
|
|
verbatim.
|
|
|
|
There are no other restrictions: I would like to see the list
|
|
distributed as widely as possible.
|
|
|
|
The 95 level includes the 354,984 single words, 256,772 compound
|
|
words, 4,946 female names and the 3,897 male names, and 21,986 names
|
|
from the MWords package, ABLE.LST from the ENABLE Supplement, and some
|
|
additional words found in my part-of-speech database that were not
|
|
found anywhere else.
|
|
|
|
Accent information was taken from UKACD.
|
|
|
|
The VarCon package was used to create the American, British, Canadian,
|
|
and Australian word list. It is under the following copyright:
|
|
|
|
Copyright 2000-2016 by Kevin Atkinson
|
|
|
|
Permission to use, copy, modify, distribute and sell this array, the
|
|
associated software, and its documentation for any purpose is hereby
|
|
granted without fee, provided that the above copyright notice appears
|
|
in all copies and that both that copyright notice and this permission
|
|
notice appear in supporting documentation. Kevin Atkinson makes no
|
|
representations about the suitability of this array for any
|
|
purpose. It is provided "as is" without express or implied warranty.
|
|
|
|
Copyright 2016 by Benjamin Titze
|
|
|
|
Permission to use, copy, modify, distribute and sell this array, the
|
|
associated software, and its documentation for any purpose is hereby
|
|
granted without fee, provided that the above copyright notice appears
|
|
in all copies and that both that copyright notice and this permission
|
|
notice appear in supporting documentation. Benjamin Titze makes no
|
|
representations about the suitability of this array for any
|
|
purpose. It is provided "as is" without express or implied warranty.
|
|
|
|
Since the original words lists come from the Ispell distribution:
|
|
|
|
Copyright 1993, Geoff Kuenning, Granada Hills, CA
|
|
All rights reserved.
|
|
|
|
Redistribution and use in source and binary forms, with or without
|
|
modification, are permitted provided that the following conditions
|
|
are met:
|
|
|
|
1. Redistributions of source code must retain the above copyright
|
|
notice, this list of conditions and the following disclaimer.
|
|
2. Redistributions in binary form must reproduce the above copyright
|
|
notice, this list of conditions and the following disclaimer in the
|
|
documentation and/or other materials provided with the distribution.
|
|
3. All modifications to the source code must be clearly marked as
|
|
such. Binary redistributions based on modified source code
|
|
must be clearly marked as modified versions in the documentation
|
|
and/or other materials provided with the distribution.
|
|
(clause 4 removed with permission from Geoff Kuenning)
|
|
5. The name of Geoff Kuenning may not be used to endorse or promote
|
|
products derived from this software without specific prior
|
|
written permission.
|
|
|
|
THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND
|
|
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
ARE DISCLAIMED. IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE LIABLE
|
|
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
SUCH DAMAGE.
|
|
|
|
|
|
The variant word lists were created from a list of variants found in
|
|
the 12dicts supplement package as well as a list of variants I created
|
|
myself.
|
|
|
|
The Readmes for the various packages used can be found in the
|
|
appropriate directory under the r/ directory.
|
|
|
|
FUTURE PLANS:
|
|
|
|
The process of "sort"s, "comm"s, and Perl scripts to combine the many
|
|
word lists and separate out the variant information is inexact and
|
|
error prone. The whole things needs to be rewritten to deal with
|
|
words in terms of lemmas. When the exact lemma is not known a best
|
|
guess should be made. I'm not sure what form this should be in. I
|
|
originally thought this should be some sort of database, but maybe I
|
|
should just slurp all that data into memory and process it in one
|
|
giant perl script. With the amount of memory available these days (at
|
|
least 2 GB, often 4 GB or more) this should not really be a problem.
|
|
|
|
In addition, there is a very nice frequency analyze of the BNC corpus
|
|
done by Adam Kilgarriff. Unlike Brian's word lists the BNC lists
|
|
include part of speech information. I plan on somehow using these
|
|
lists as Adam Kilgarriff has given me the OK to use it in SCOWL.
|
|
These lists will greatly reduce the problem of inflected forms of a
|
|
word appearing at different levels due to the part-of-speech
|
|
information.
|
|
|
|
There is frequency information for some other corpus such as COCA
|
|
(Corpus of Contemporary American English) and ANS (American National
|
|
Corpus) which I might also be able to use. The former will require
|
|
permission, and the latter is of questionable quality.
|
|
|
|
RECREATING THE WORD LISTS:
|
|
|
|
In order to recreate the word lists you need a modern version of Perl,
|
|
bash, the traditional set of shell utilities, a system that supports
|
|
symbolic links, and quite possibly GNU Make. The easiest way to
|
|
recreate the word lists is to checkout the corresponding Git version
|
|
(see the version string at the start of the file) and simply type
|
|
"make" (see http://wordlist.aspell.net). You can try to download all
|
|
the pieces manually, but this method is not no longer tested nor
|
|
supported.
|
|
|
|
The src/ directory contains the numerous scripts used in the creation
|
|
of the final product.
|
|
|
|
The r/ directory contains the raw data used to create the final
|
|
product. If you checkout from Git this directory should be populated
|
|
automatically for you. If you insist on doing it the hard way see the
|
|
README file in the r/ directory for more information.
|
|
|
|
The l/ directory contains symbolic links used by the actual scripts.
|
|
|
|
Finally, the working/ directory is where all the intermittent files go
|
|
that are not specific to one source.
|