Spellcheck with aspell
Author: akeil
Date: 2014-05-10
Version: 1
aspell [1] is an open-source spell checker that is used from the command line. It works well with text-based formats such as reStructuredText or Markdown (or plain text, of course).
aspell has filters to spellcheck e-mails or HTML documents.
There are also Python bindings for aspell [2].
Invoke aspell like this to start an interactive session in the terminal:
$ aspell check document.txt
By default, aspell uses its built-in dictionary and an additional user dictionary located at ~/.aspell.en.pws (for an English localization). The user dictionary holds any words that were selected as "add to dictionary" during an aspell session and it can be edited manually.
If the file to be checked contains terms that should not be treated as misspellings but should also not go into the global or user dictionary (technical or domain-specific terms, abbreviations, etc.), an additional dictionary can be given on the command line.
To check using an additional dictionary:
$ aspell --add-extra-dicts="$PWD/project-dict.pws" check document.txt
The path to the dictionary must be absolute; $PWD is expanded by the shell before aspell sees the argument.
If the file is modified during the aspell session, a backup with the original content is created. To run without backing up the uncorrected original:
$ aspell -x check document.txt
To print the misspelled words in a file without correcting anything, use list:
$ aspell list < document.txt
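aspell check works on one file at a time, so aspell check directory/*.txt does not work when the pattern matches several files. To spellcheck multiple files, use find (from paulbradley [3]):
$ find . -name "*.txt" -exec aspell check {} \;
The quoted pattern is matched by find itself, which also descends into subdirectories.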
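For a quick overview before starting the interactive check, the list command can be run over several files in a loop. The following is only a sketch: the function name report_misspelled and the SPELL_LIST variable are made up for this example (SPELL_LIST is not an aspell feature, it just allows swapping in another command that reads text on stdin and prints the unknown words).

```shell
#!/bin/bash
# Print "FILE: WORD" for every word reported in the *.txt files of a
# directory. Non-interactive; nothing is corrected.
#
# SPELL_LIST is a hypothetical hook for this sketch. It defaults to
# `aspell list`.
report_misspelled() {
    local dir=$1 file word
    for file in "$dir"/*.txt; do
        [ -f "$file" ] || continue   # the pattern matched nothing
        ${SPELL_LIST:-aspell list} < "$file" \
            | sort -u \
            | while IFS= read -r word; do
                printf '%s: %s\n' "$file" "$word"
            done
    done
}
```

Calling report_misspelled docs prints one line per file and unknown word for docs/*.txt; files without unknown words produce no output.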
Interesting Command Line Options
- -c FILE, check FILE
- specify a file to be checked
- list
- Generate a list of misspelled words. Does not correct anything.
- clean
- Clean up a wordlist and output a new wordlist where every line is a valid word.
- --add-extra-dicts FILE
- Use FILE as an additional wordlist. The absolute path must be specified and the path may not contain shell variables.
- -b, --backup
- Create a backup-file (.bak) with the uncorrected file.
- -x, --dont-backup
- Do not create a backup (.bak) file.
- -l LANGUAGE, --lang LANGUAGE
- Specify the language to use with the two-letter language code (e.g. "en") optionally followed by the country code (e.g. "en-US").
Install
On a Debian-based system, install with:
# apt-get install aspell
On Arch Linux:
# pacman -S aspell
Configure
Relevant configuration files for aspell are:
- /etc/aspell.conf
- The system-wide configuration file.
- ~/.aspell.conf
- The per-user configuration file.
- ~/.aspell.en.pws
- The user-specific wordlist (for the en locale).
Use additional wordlists by specifying a path in the config file:
# ~/.aspell.conf
# -------------------------------------------------------
add-extra-dicts /home/USERNAME/.aspell-tech-terms.en.pws
add-extra-dicts /home/USERNAME/.aspell-names.en.pws
Use multiple lines of add-extra-dicts to add several wordlists. The wordlists will be used in every invocation of aspell.
Note
The paths must be absolute. Shell variables/expansions like ~, /path/* or $HOME will not work.
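Since the paths have to be written out in full, the config lines can be generated instead of typed. A minimal sketch, assuming all extra wordlists live in one directory; the function name conf_lines is invented for this example.

```shell
#!/bin/bash
# Print one "add-extra-dicts /absolute/path" line for every *.pws file
# in the given directory. The directory itself may be given as a
# relative path; it is resolved before the lines are printed.
conf_lines() {
    local dir file
    dir=$(cd "$1" && pwd -P) || return 1   # resolve to an absolute path
    for file in "$dir"/*.pws; do
        [ -f "$file" ] || continue         # no wordlists found
        printf 'add-extra-dicts %s\n' "$file"
    done
}
```

The output can be reviewed and then appended to the config file, for example with conf_lines wordlists >> ~/.aspell.conf. Here the shell resolves ~ before the file is opened, so the restriction on expansions does not apply.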
Wordlists
aspell wordlists are simple text files with one word per line.
The first line must be:
personal_ws-1.1 en 10
Here en is your locale and 10 is the number of words in the wordlist. The number does not have to be exact but should be close to the actual number of words in the file.
The conventional file extension is .pws but any filename can be used.
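To see whether a hand-edited wordlist still has a usable header, the first line can be compared with the rest of the file. This is a rough sketch and not part of aspell; check_wordlist is a name made up here, and any fields after the count are ignored.

```shell
#!/bin/bash
# Verify that FILE starts with a "personal_ws-1.1 LANG COUNT" header
# and print the count from the header next to the actual number of
# non-empty word lines.
check_wordlist() {
    local file=$1 header lang hint actual
    # read the first line; tolerate a missing trailing newline
    IFS= read -r header < "$file" || [ -n "$header" ] || return 1
    case $header in
        "personal_ws-1.1 "*) ;;
        *) echo "$file: missing personal_ws-1.1 header" >&2
           return 1 ;;
    esac
    # split the header into its fields; extra fields are discarded
    read -r _ lang hint _ <<< "$header"
    actual=$(tail -n +2 "$file" | grep -c .)
    printf 'lang=%s header=%s actual=%s\n' "$lang" "$hint" "$actual"
}
```

For a file with the header shown above and two words, check_wordlist prints lang=en header=10 actual=2.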
Scripts
A Spellcheck Command
A small shell script to invoke aspell and use an additional per-file wordlist and/or a project specific wordlist.
The script looks for a wordlist with the same name as the file to be checked, and for any wordlists in the current directory. This makes it possible to use a list of allowed words without adding them to any of the globally used wordlists.
Usage:
$ spellcheck document.txt
Code
#!/bin/bash
set -o nounset
set -o errexit

# Commands -----------------------------------------------
ASPELL=/usr/bin/aspell
BASENAME=/usr/bin/basename
DIRNAME=/usr/bin/dirname
READLINK=/bin/readlink
# not named PWD: that would overwrite the shell's own $PWD variable
PWD_CMD=/bin/pwd

# Script -------------------------------------------------
# collect the options in an array so that paths
# containing spaces survive word splitting
extra=()

# file specific dictionary
# if there is a file with the same name
# but ending in `.pws`
# use it as an extra dict
directory=$($DIRNAME "$1")
filename=$($BASENAME "$1")
name="${filename%.*}"
dictpath="$directory/$name.pws"
if [ -f "$dictpath" ]
then
    # aspell needs an absolute path;
    # readlink -f also handles a $1 that is already absolute
    extra+=("--add-extra-dicts=$($READLINK -f "$dictpath")")
fi

# extra dicts from the current directory
# all files in the working directory
# ending in `.pws` are used as extra dicts
# return nothing if *.pws does not match anything
shopt -s nullglob
for filename in *.pws; do
    extra+=("--add-extra-dicts=$($PWD_CMD)/${filename}")
done

# the ${extra[@]+...} form avoids an "unbound variable" error
# for an empty array under nounset in older bash versions
$ASPELL ${extra[@]+"${extra[@]}"} check "$1"
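The lookup of the per-file wordlist relies on dirname, basename and the ${name%.*} parameter expansion. Isolated into a small function (dict_for is a name chosen for this sketch only), the mapping from document to wordlist is easier to try out:

```shell
#!/bin/bash
# Print the path of the per-file wordlist that belongs to FILE:
# same directory, same name, last extension replaced by .pws.
dict_for() {
    local directory filename
    directory=$(dirname "$1")
    filename=$(basename "$1")
    # ${filename%.*} removes the shortest suffix starting with a dot;
    # a name without a dot is left unchanged
    printf '%s/%s.pws\n' "$directory" "${filename%.*}"
}
```

dict_for docs/report.txt prints docs/report.pws, and dict_for notes.txt prints ./notes.pws because dirname returns "." for a bare filename.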
Create a Dictionary from Misspelled Words
Say you have a file which contains multiple domain-specific terms. You want to spellcheck it regularly and you want to exclude these terms from the spellcheck for this file only.
Keeping it separate from your general dictionary is especially useful if these "special terms" are similar to typical typos.
To easily generate this dictionary, produce a list of words that aspell considers misspelled, and edit that list so that it contains the "special terms" only.
#!/bin/bash
# generates an *aspell* "dictionary"
# from a list of misspelled words found by aspell.
set -o errexit

# Configuration ----------------------------------------------------
HEADER=personal_ws-1.1
# not named LANG: that would change the locale seen by child processes;
# the language is passed to aspell explicitly instead
DICT_LANG=en

# Commands ---------------------------------------------------------
ASPELL=/usr/bin/aspell
CAT=/bin/cat
ECHO=/bin/echo
MKTEMP=/bin/mktemp
READLINK=/bin/readlink
RM=/bin/rm
SORT=/usr/bin/sort
UNIQ=/usr/bin/uniq
WC=/usr/bin/wc

# Script -----------------------------------------------------------
if [ -z "$1" ]; then
    $ECHO "Error: no source specified."
    exit 1
fi

# normalize paths
# aspell will not resolve shell variables or perform expansion
src=$($READLINK -fn "$1")
if [ ! -r "$src" ]; then
    $ECHO "Error: '$src' does not exist or is not readable"
    exit 1
fi

if [ -z "$2" ]; then
    # derived from the normalized source, so it is absolute as well
    dest="$src.pws"
    $ECHO "Destination not specified, using '$dest'"
else
    # use `readlink` to get absolute path.
    dest=$($READLINK -fn "$2")
fi

# a unique temp file instead of a fixed, predictable name
templist=$($MKTEMP)

# generate a wordlist from words considered misspelled by aspell
# remove duplicates with `sort | uniq`
$ASPELL --lang="$DICT_LANG" --rem-extra-dicts "$dest" list < "$src" \
    | $ASPELL clean \
    | $SORT \
    | $UNIQ > "$templist"

# generate the header for the pws file with the number of words
# reading from stdin makes `wc -l` print the line count only
wordcount=$($WC -l < "$templist")
$ECHO "$HEADER $DICT_LANG $wordcount" > "$dest"
$CAT "$templist" >> "$dest"
$RM "$templist"
$ECHO "Generated dictionary at '$dest'."
The command aspell list < document.txt generates a list of words that aspell considers misspelled. The command aspell clean takes that list and cleans it up so that every line is a valid entry for an aspell wordlist (this might not be necessary, as aspell list probably does not return invalid words).
sort and uniq are used to remove duplicate entries and sort alphabetically. The result is written to a temporary file.
The last part of the script generates the header line for the wordlist, using wc -l to count the number of words (lines, actually). The header is written to the dictionary file and then the list of words is appended with cat.
To just update the header of a wordlist file:
$ echo personal_ws-1.1 en `tail -n +2 wordlist.pws | wc -l` > /tmp/wordlist.pws
$ tail -n +2 wordlist.pws >> /tmp/wordlist.pws
$ mv /tmp/wordlist.pws wordlist.pws
One could additionally add a sort | uniq pipe to the tail-calls. This would remove duplicates.
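Header update and duplicate removal can be combined into one step. The function below is a sketch under a few assumptions: the name normalize_wordlist is invented here, the language is taken from the second field of the existing header, and sorting is done byte-wise with LC_ALL=C so that the result does not depend on the current locale.

```shell
#!/bin/bash
# Rewrite a wordlist in place: keep the language from the old header,
# drop duplicate and empty lines, and write a header that carries the
# actual number of words.
normalize_wordlist() {
    local file=$1 words lang count
    words=$(mktemp) || return 1
    # second field of the header is the language code
    lang=$(head -n 1 "$file" | cut -d ' ' -f 2)
    # everything after the header, without empty lines, sorted, unique
    tail -n +2 "$file" | grep . | LC_ALL=C sort -u > "$words"
    # arithmetic expansion strips padding that some wc versions print
    count=$(( $(wc -l < "$words") ))
    {
        printf 'personal_ws-1.1 %s %d\n' "$lang" "$count"
        cat "$words"
    } > "$file"
    rm -f "$words"
}
```

The file is only overwritten after the word list has been copied to the temporary file, so the original content is not read and written at the same time.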
Create a Wordlist from Radicale Address Book
Add the names of your contacts to the list of correctly spelled words.
This assumes that the address book is available over the network as CardDAV - for example via Radicale.
This script fetches an address book from a CardDAV server and filters out the names and nicknames of all contacts.
It writes them to a wordlist named ~/.aspell-names.en.pws, which should be included via add-extra-dicts in ~/.aspell.conf.
#!/bin/bash
set -o errexit
set -o nounset

user="username"
pass="password"
host="hostname"
port=5232
url="https://$host:$port/$user/contacts.vcf"

# check if the address book is available
# the test sits inside `if` so that errexit does not abort
# the script before the error message is printed
if ! curl --silent --insecure \
    --basic \
    --user "$user:$pass" \
    --head \
    --output /dev/null \
    --header "Accept: text/vcard" \
    --write-out "%{http_code}" \
    "$url" \
    | grep --silent -E "^2[0-9]{2}$"
then
    echo "No access to address book at $url."
    exit 1
fi

# Write the complete address book to a temp file
curl --silent --insecure \
    --basic --user "$user:$pass" \
    --request GET \
    --header "Accept: text/vcard" \
    "$url" \
    > /tmp/vcards.vcf

# Reduce to the fields we are interested in:
# N:Name;Part;More;parts;;
# FN:Full Name
# NICKNAME: Nicky
# vCard lines end in CRLF, so strip the carriage return first
sed -r \
    -e 's/\r$//' \
    -e '/^(FN|N|NICKNAME):.+$/!d' \
    -e 's/^(N|FN|NICKNAME)://' \
    -e 's/;+/\n/g' \
    /tmp/vcards.vcf \
    > /tmp/names

# will filter the list and include only
# names that would be regarded as spelling errors
~/scripts/mk-aspell-dict.sh /tmp/names /home/USERNAME/.aspell-names.en.pws

rm /tmp/vcards.vcf
rm /tmp/names
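The name extraction can be tried without a server by feeding a vCard on stdin. The sketch below follows the same idea as the sed call in the script, but it is not identical: vcard_names is a name made up for this example, and tr is used for the line endings and for splitting at semicolons so that the filter does not depend on GNU sed extensions.

```shell
#!/bin/bash
# Read vCard data on stdin and print the contents of the N, FN and
# NICKNAME fields, one name part per line.
vcard_names() {
    tr -d '\r' \
        | sed -E -e '/^(FN|N|NICKNAME):.+$/!d' \
                 -e 's/^(FN|N|NICKNAME)://' \
        | tr -s ';' '\n' \
        | sed -e '/^$/d'
}
```

For a card with N:Doe;Jane;;;, FN:Jane Doe and NICKNAME:Janey this prints the four lines Doe, Jane, Jane Doe and Janey. The tr -s call turns each run of semicolons into a single line break, and the last sed removes any empty line that is left over.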
[1] http://aspell.net/
[2] https://github.com/WojciechMula/aspell-python
[3] https://paulbradley.org/aspell/