feat: improve word splitting options (#846)
To support dictionary words with -, _, and digits, the word splitting algorithm was improved to try multiple
permutations.

* fix: support `-` in dictionary words.
* dev: Improve support for mixed case spell checking.
* dev: Have the tool generate word lists with formatting
* dev: map the word to lower case before testing ignore words.
* dev: Do not use the code splitter when loading word lists
* Remove support for code splitter in word lists.
* dev: ignore case when looking for repeated characters
* dev: add a more extensive word splitter
* dev: reduce calls to has word
* If we need support for Non-BMP, it can be added later.
* Use isFound to match validator
* dev: use splitter with text validator
* move regex escape into its own file
* Use native unicode support
* support custom word breaks
* dev: ignore trailing endings
* Update snapshots
* Remove dependency upon cspell-util-bundle
* Remove cspell-util-bundle
* Remove support for Node 10
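
The "try multiple permutations" idea from the summary can be illustrated with a minimal sketch. This is not the cspell implementation: `splitWord`, `findBreaks`, and `HasWord` are hypothetical names, and only `-` and `_` are treated as optional break characters. Each break is either used as a split point or kept as part of the word, and the first combination whose pieces are all dictionary words is returned.

```typescript
// Hypothetical sketch, not the cspell implementation.
type HasWord = (word: string) => boolean;

/** Positions of the optional break characters (`-` or `_`) in the text. */
function findBreaks(text: string): number[] {
    const breaks: number[] = [];
    for (let i = 0; i < text.length; i++) {
        if (text[i] === '-' || text[i] === '_') breaks.push(i);
    }
    return breaks;
}

/**
 * Try every combination of "split here" / "keep joined" over the optional
 * breaks. Mask 0 keeps the text whole, so an exact dictionary entry is tried first.
 * The search is exponential in the number of breaks, which is acceptable only
 * for short identifiers.
 */
function splitWord(text: string, hasWord: HasWord): string[] | undefined {
    const breaks = findBreaks(text);
    const combos = 1 << breaks.length;
    for (let mask = 0; mask < combos; mask++) {
        const segments: string[] = [];
        let start = 0;
        breaks.forEach((pos, j) => {
            if (mask & (1 << j)) {
                // Use this break as a split point and drop the break character.
                segments.push(text.slice(start, pos));
                start = pos + 1;
            }
        });
        segments.push(text.slice(start));
        if (segments.every(hasWord)) return segments;
    }
    return undefined;
}

// Usage: `well-known` is a dictionary word that itself contains `-`.
const words = new Set(['well-known', 'word', 'snake', 'case']);
console.log(splitWord('well-known_word', (w) => words.has(w)));
// prints [ 'well-known', 'word' ]
```

Splitting only on `_` succeeds here because the `-` is kept inside `well-known`, which a splitter that always breaks on `-` would miss.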
Jason3S committed Jan 20, 2021
1 parent 83589d4 commit b4dc108
Showing 70 changed files with 10,950 additions and 7,880 deletions.
1 change: 0 additions & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
@@ -34,7 +34,6 @@ _Issue with supporting library?_
- [ ] cspell-io -- thin file i/o library
- [ ] cspell-trie-lib - trie lib
- [ ] cspell-trie2-lib - trie lib alternate format
- [ ] cspell-util-bundle - util bundle to reduce install size

**OS:**

2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -12,9 +12,9 @@ jobs:
strategy:
matrix:
node-version:
- 10.x
- 12.x
- 14.x
- 15.x

os:
- ubuntu-latest
1 change: 0 additions & 1 deletion README.md
@@ -11,4 +11,3 @@ The cspell mono-repo, a spell checker for code.
- [cspell-tools](packages/cspell-tools) -- tool used to compile dictionaries.
- [cspell-trie-lib](packages/cspell-trie-lib) -- trie data structure used to store words.
- [cspell-trie](packages/cspell-trie) -- trie data tool used to store words.
- [cspell-util-bundle](packages/cspell-util-bundle) -- webpack bundle used to reduce the size of the distributed package.
5 changes: 3 additions & 2 deletions cspell-dict.txt
@@ -1,9 +1,10 @@
DAWG
WORDCHARS
backreference
backreferences
bitjson
codecov
coverallsapp
DAWG
deserialize
deserializer
deserializers
@@ -20,9 +21,9 @@ popd
pushd
repo
repos
retryable
serializers
streetsidesoftware
submodule
tsdk
WORDCHARS
xregexp
@@ -3,7 +3,7 @@ Repository: Azure/azure-rest-api-specs
Url: "https://github.com/Azure/azure-rest-api-specs.git"
Args: ["--config=cSpell.json","**/*.{md,ts,js}"]
Lines:
CSpell: Files checked: 1467, Issues found: 2387 in 613 files
CSpell: Files checked: 1467, Issues found: 2384 in 612 files
exit code: 1
./Azure/azure-rest-api-specs/README.md:20:179 - Unknown word (Dataplane)
./Azure/azure-rest-api-specs/README.md:30:134 - Unknown word (dataplane)
@@ -15,7 +15,6 @@ Lines:
./Azure/azure-rest-api-specs/documentation/Semantic-and-Model-Violations-Reference.md:597:45 - Unknown word (exmaple)
./Azure/azure-rest-api-specs/documentation/Semantic-and-Model-Violations-Reference.md:623:43 - Unknown word (requried)
./Azure/azure-rest-api-specs/documentation/Semantic-and-Model-Violations-Reference.md:662:39 - Unknown word (resouce)
./Azure/azure-rest-api-specs/documentation/SwaggerValidationTools.md:3:46 - Unknown word (FTEs)
./Azure/azure-rest-api-specs/documentation/ci-fix.md:9:76 - Unknown word (supress)
./Azure/azure-rest-api-specs/documentation/code-gen/configure-go-sdk.md:114:12 - Unknown word (yourservicename)
./Azure/azure-rest-api-specs/documentation/code-gen/configure-go-sdk.md:290:92 - Unknown word (onever)
@@ -82,7 +81,6 @@ Lines:
./Azure/azure-rest-api-specs/specification/appconfiguration/resource-manager/readme.md:141:50 - Unknown word (proxied)
./Azure/azure-rest-api-specs/specification/applicationinsights/data-plane/Monitor.Exporters/readme.md:29:39 - Unknown word (schemaregistry)
./Azure/azure-rest-api-specs/specification/applicationinsights/resource-manager/readme.azureresourceschema.md:31:51 - Unknown word (livetoken)
./Azure/azure-rest-api-specs/specification/applicationinsights/resource-manager/readme.cli.md:22:32 - Unknown word (Ikeys)
./Azure/azure-rest-api-specs/specification/applicationinsights/resource-manager/readme.cli.md:27:22 - Unknown word (Gorup)
./Azure/azure-rest-api-specs/specification/applicationinsights/resource-manager/readme.cli.md:41:13 - Unknown word (operatoin)
./Azure/azure-rest-api-specs/specification/applicationinsights/resource-manager/readme.cli.md:8:17 - Unknown word (Powershell's)
11 changes: 2 additions & 9 deletions integration-tests/snapshots/TheAlgorithms/Python/snapshot.txt
@@ -3,7 +3,7 @@ Repository: TheAlgorithms/Python
Url: "https://github.com/TheAlgorithms/Python.git"
Args: ["**/*.{md,py}"]
Lines:
CSpell: Files checked: 695, Issues found: 2484 in 393 files
CSpell: Files checked: 695, Issues found: 2470 in 391 files
exit code: 1
./TheAlgorithms/Python/CONTRIBUTING.md:118:33 - Unknown word (pytest)
./TheAlgorithms/Python/CONTRIBUTING.md:121:14 - Unknown word (doctest)
@@ -129,14 +129,14 @@ Lines:
./TheAlgorithms/Python/ciphers/a1z26.py:12:17 - Unknown word (myname)
./TheAlgorithms/Python/ciphers/affine_cipher.py:60:35 - Unknown word (Mpyx)
./TheAlgorithms/Python/ciphers/affine_cipher.py:7:42 - Unknown word (ABCDEFGHIJKLMNOPQRSTUVWXYZ)
./TheAlgorithms/Python/ciphers/affine_cipher.py:83:15 - Unknown word (Ofkey)
./TheAlgorithms/Python/ciphers/affine_cipher.py:8:9 - Unknown word (abcdefghijklmnopqrstuvwxyz)
./TheAlgorithms/Python/ciphers/atbash.py:42:10 - Unknown word (timeit)
./TheAlgorithms/Python/ciphers/atbash.py:5:5 - Unknown word (atbash)
./TheAlgorithms/Python/ciphers/atbash.py:64:23 - Unknown word (ABCDEFGH)
./TheAlgorithms/Python/ciphers/atbash.py:7:22 - Unknown word (ABCDEFG)
./TheAlgorithms/Python/ciphers/atbash.py:8:6 - Unknown word (ZYXWVUT)
./TheAlgorithms/Python/ciphers/base64_cipher.py:43:24 - Unknown word (BQUFBQUFBQUFB)
./TheAlgorithms/Python/ciphers/base64_cipher.py:5:24 - Unknown word (AÅᐃ𐀏)
./TheAlgorithms/Python/ciphers/base64_cipher.py:8:86 - Unknown word (QUFB)
./TheAlgorithms/Python/ciphers/beaufort_cipher.py:17:6 - Unknown word (SECRETSECRETSECRE)
./TheAlgorithms/Python/ciphers/beaufort_cipher.py:2:15 - Unknown word (Radadiya)
@@ -365,7 +365,6 @@ Lines:
./TheAlgorithms/Python/data_structures/heap/skew_heap.py:54:5 - Unknown word (Visualisation)
./TheAlgorithms/Python/data_structures/linked_list/deque_doubly.py:103:8 - Unknown word (Equeu)
./TheAlgorithms/Python/data_structures/linked_list/deque_doubly.py:2:14 - Unknown word (Deque)
./TheAlgorithms/Python/data_structures/linked_list/deque_doubly.py:87:8 - Unknown word (Eque)
./TheAlgorithms/Python/data_structures/linked_list/from_sequence.py:1:13 - Unknown word (Prorgam)
./TheAlgorithms/Python/data_structures/linked_list/skip_list.py:372:17 - Unknown word (doesnt)
./TheAlgorithms/Python/data_structures/queue/queue_on_list.py:14:8 - Unknown word (Enqueues)
@@ -391,7 +390,6 @@ Lines:
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:221:34 - Unknown word (CDVI)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:339:9 - Unknown word (ndvi)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:47:15 - Unknown word (ARVI)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:480:11 - Unknown word (Iself)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:48:15 - Unknown word (CCCI)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:51:15 - Unknown word (NDVI)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:52:15 - Unknown word (BNDVI)
@@ -402,7 +400,6 @@ Lines:
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:57:15 - Unknown word (RBNDVI)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:58:15 - Unknown word (PNDVI)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:60:15 - Unknown word (BWDRVI)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:61:16 - Unknown word (Igreen)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:64:15 - Unknown word (CTVI)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:65:15 - Unknown word (GDVI)
./TheAlgorithms/Python/digital_image_processing/index_calculation.py:67:15 - Unknown word (GEMI)
@@ -440,7 +437,6 @@ Lines:
./TheAlgorithms/Python/dynamic_programming/max_sub_array.py:93:9 - Unknown word (ylabel)
./TheAlgorithms/Python/dynamic_programming/optimal_binary_search_tree.py:100:5 - Unknown word (freqs)
./TheAlgorithms/Python/dynamic_programming/optimal_binary_search_tree.py:102:51 - Unknown word (which's)
./TheAlgorithms/Python/dynamic_programming/optimal_binary_search_tree.py:14:3 - Unknown word (BSTs)
./TheAlgorithms/Python/dynamic_programming/optimal_binary_search_tree.py:72:22 - Unknown word (CLRS)
./TheAlgorithms/Python/dynamic_programming/subset_generation.py:43:31 - Unknown word (Ambuj)
./TheAlgorithms/Python/dynamic_programming/subset_generation.py:43:37 - Unknown word (sahu)
@@ -627,7 +623,6 @@ Lines:
./TheAlgorithms/Python/maths/kadanes.py:61:37 - Unknown word (sepatated)
./TheAlgorithms/Python/maths/krishnamurthy_number.py:26:5 - Unknown word (krishnamurthy)
./TheAlgorithms/Python/maths/krishnamurthy_number.py:45:51 - Unknown word (Krisnamurthy)
./TheAlgorithms/Python/maths/kth_lexicographic_permutation.py:3:11 - Unknown word (k'th)
./TheAlgorithms/Python/maths/largest_of_very_large_numbers.py:1:11 - Unknown word (Abhijeeth)
./TheAlgorithms/Python/maths/least_common_multiple.py:17:12 - Unknown word (mult)
./TheAlgorithms/Python/maths/line_length.py:46:24 - Unknown word (hypot)
@@ -642,7 +637,6 @@ Lines:
./TheAlgorithms/Python/maths/qr_decomposition.py:37:24 - Unknown word (triu)
./TheAlgorithms/Python/maths/quadratic_equations_complex_numbers.py:27:35 - Unknown word (imag)
./TheAlgorithms/Python/maths/quadratic_equations_complex_numbers.py:3:6 - Unknown word (cmath)
./TheAlgorithms/Python/maths/radix2_fft.py:113:20 - Unknown word (DFTs)
./TheAlgorithms/Python/maths/radix2_fft.py:5:8 - Unknown word (mpmath)
./TheAlgorithms/Python/maths/radix2_fft.py:91:14 - Unknown word (ncol)
./TheAlgorithms/Python/maths/relu.py:17:5 - Unknown word (relu)
@@ -683,7 +677,6 @@ Lines:
./TheAlgorithms/Python/other/dijkstra_bankers_algorithm.py:80:27 - Unknown word (alloc)
./TheAlgorithms/Python/other/doomsday.py:42:5 - Unknown word (centurian)
./TheAlgorithms/Python/other/euclidean_gcd.py:20:26 - Unknown word (euclicedan)
./TheAlgorithms/Python/other/fischer_yates_shuffle.py:11:6 - Unknown word (Yshuffle)
./TheAlgorithms/Python/other/frequency_finder.py:32:1 - Unknown word (ETAOIN)
./TheAlgorithms/Python/other/frequency_finder.py:32:11 - Unknown word (ETAOINSHRDLCUMWFGYPBVKJXQZ)
./TheAlgorithms/Python/other/gauss_easter.py:24:5 - Unknown word (metonic)
10 changes: 1 addition & 9 deletions integration-tests/snapshots/alexiosc/megistos/snapshot.txt
@@ -3,7 +3,7 @@ Repository: alexiosc/megistos
Url: "https://github.com/alexiosc/megistos.git"
Args: ["--config=../../../../config/repositories/alexiosc/megistos/cspell.json","**/*.{md,c,h,html}"]
Lines:
CSpell: Files checked: 789, Issues found: 31149 in 754 files
CSpell: Files checked: 789, Issues found: 31123 in 754 files
exit code: 1
./alexiosc/megistos/doc/examples/mod_dialog.c:18:13 - Unknown word (margv)
./alexiosc/megistos/doc/examples/mod_dialog.c:9:29 - Unknown word (FOOVT)
@@ -102,7 +102,6 @@ Lines:
./alexiosc/megistos/doc/html/bbsinclude_8h-source.html:43:51 - Unknown word (libtool's)
./alexiosc/megistos/doc/html/bbsinclude_8h-source.html:43:61 - Unknown word (libltdl)
./alexiosc/megistos/doc/html/bbsinclude_8h-source.html:50:64 - Unknown word (overwitten)
./alexiosc/megistos/doc/html/bbsinclude_8h-source.html:54:52 - Unknown word (d'oh)
./alexiosc/megistos/doc/html/bbsinclude_8h-source.html:64:47 - Unknown word (bbsconfig)
./alexiosc/megistos/doc/html/bbsmod_8h-source.html:38:159 - Unknown word (userdel)
./alexiosc/megistos/doc/html/bbsmod_8h-source.html:86:97 - Unknown word (isbot)
@@ -718,7 +717,6 @@ Lines:
./alexiosc/megistos/doc/html/group__dialog.html:142:144 - Unknown word (heartedly)
./alexiosc/megistos/doc/html/group__dialog.html:16:57 - Unknown word (minimise)
./alexiosc/megistos/doc/html/group__dialog.html:170:108 - Unknown word (internationalisation)
./alexiosc/megistos/doc/html/group__dialog.html:48:308 - Unknown word (Rred)
./alexiosc/megistos/doc/html/group__dialog.html:79:51 - Unknown word (submenu)
./alexiosc/megistos/doc/html/group__dialog.html:85:125 - Unknown word (Yessi)
./alexiosc/megistos/doc/html/group__dialog.html:94:253 - Unknown word (Borland's)
@@ -900,7 +898,6 @@ Lines:
./alexiosc/megistos/doc/html/structmessage__t.html:100:227 - Unknown word (achment)
./alexiosc/megistos/doc/html/structmessage__t.html:97:336 - Unknown word (tokenised)
./alexiosc/megistos/doc/html/structmonitor.html:18:83 - Unknown word (inpuit)
./alexiosc/megistos/doc/html/structonlinerec__t.html:121:124 - Unknown word (Rring)
./alexiosc/megistos/doc/html/structonlinerec__t.html:84:83 - Unknown word (telecons)
./alexiosc/megistos/doc/html/structsysvar.html:151:54 - Unknown word (centralised)
./alexiosc/megistos/doc/html/structsysvar.html:33:66 - Unknown word (dialup)
@@ -958,7 +955,6 @@ Lines:
./alexiosc/megistos/intl/localcharset.c:226:26 - Unknown word (HANYU)
./alexiosc/megistos/intl/localcharset.c:226:6 - Unknown word (DECHANYU)
./alexiosc/megistos/intl/localcharset.c:227:6 - Unknown word (DECHANZI)
./alexiosc/megistos/intl/localcharset.c:234:46 - Unknown word (DLL's)
./alexiosc/megistos/intl/localcharset.c:326:9 - Unknown word (cplen)
./alexiosc/megistos/intl/localealias.c:138:15 - Unknown word (nmap)
./alexiosc/megistos/intl/localealias.c:243:34 - Unknown word (BYCALLER)
@@ -1108,7 +1104,6 @@ Lines:
./alexiosc/megistos/libltdl/ltdl.c:1438:5 - Unknown word (bedl)
./alexiosc/megistos/libltdl/ltdl.c:1647:6 - Unknown word (lerno)
./alexiosc/megistos/libltdl/ltdl.c:1656:34 - Unknown word (nsmodule)
./alexiosc/megistos/libltdl/ltdl.c:1721:14 - Unknown word (Slookup)
./alexiosc/megistos/libltdl/ltdl.c:1789:30 - Unknown word (ofirc)
./alexiosc/megistos/libltdl/ltdl.c:1870:14 - Unknown word (nssym)
./alexiosc/megistos/libltdl/ltdl.c:1924:8 - Unknown word (DLPREOPEN)
@@ -3588,7 +3583,6 @@ Lines:
./alexiosc/megistos/src/modules/mailer/download.c:125:1 - Unknown word (mkfiles)
./alexiosc/megistos/src/modules/mailer/download.c:238:41 - Unknown word (chgdnl)
./alexiosc/megistos/src/modules/mailer/download.c:262:11 - Unknown word (stpncnf)
./alexiosc/megistos/src/modules/mailer/download.c:55:58 - Unknown word (OLRs)
./alexiosc/megistos/src/modules/mailer/download.c:65:26 - Unknown word (archiver)
./alexiosc/megistos/src/modules/mailer/mailer.c:101:9 - Unknown word (auddnl)
./alexiosc/megistos/src/modules/mailer/mailer.c:98:9 - Unknown word (defgrk)
@@ -3607,7 +3601,6 @@ Lines:
./alexiosc/megistos/src/modules/remsys/channels.c:300:12 - Unknown word (RBNO)
./alexiosc/megistos/src/modules/remsys/channels.c:303:16 - Unknown word (RNBO)
./alexiosc/megistos/src/modules/remsys/channels.c:68:10 - Unknown word (rsys)
./alexiosc/megistos/src/modules/remsys/channels.c:85:49 - Unknown word (SIGSEGVs)
./alexiosc/megistos/src/modules/remsys/channels.c:92:23 - Unknown word (blad)
./alexiosc/megistos/src/modules/remsys/classed/classed.c:92:9 - Unknown word (NDEL)
./alexiosc/megistos/src/modules/remsys/classed/classed.c:94:9 - Unknown word (AVONLY)
@@ -3878,7 +3871,6 @@ Lines:
./alexiosc/megistos/src/system/bbslogin/bbslogin.c:333:1 - Unknown word (mkinjoth)
./alexiosc/megistos/src/system/bbslogin/bbslogin.c:381:32 - Unknown word (EMUPTY)
./alexiosc/megistos/src/system/bbslogin/bbslogin.c:415:22 - Unknown word (elligible)
./alexiosc/megistos/src/system/bbslogin/bbslogin.c:46:64 - Unknown word (UIDs)
./alexiosc/megistos/src/system/bbslogin/bbslogin.c:472:2 - Unknown word (setutent)
./alexiosc/megistos/src/system/bbslogin/bbslogin.c:486:1 - Unknown word (notifybbsd)
./alexiosc/megistos/src/system/bbslogin/bbslogin.c:501:22 - Unknown word (userf)