Spellcheck with aspell

Alexander Keil

2014-05-10

author: akeil
date: 2014-05-10
version: 1

aspell [1] is an open-source spell checker that can be used from the command line. It is useful when working with text-based files such as reStructuredText or Markdown (or plain text, of course).

aspell has filters to spellcheck emails or HTML documents.

There are also Python bindings for aspell [2].

Invoke aspell like this to start an interactive spellcheck session in the terminal:

$ aspell check document.txt

By default, aspell uses its built-in dictionary and an additional user dictionary located at ~/.aspell.en.pws (for an English localization). The user dictionary holds any words that were selected as "add to dictionary" during an aspell session and it can be edited manually.

If the file to be checked contains specific terms which should not be treated as misspellings but should also not be included in the global or user-based dictionary, one can define an additional dictionary on the command line. This is useful if the document contains, for example, technical or domain-specific terms or abbreviations.

To check using an additional dictionary (the path must be absolute; here the shell expands $PWD before aspell sees it):

$ aspell --add-extra-dicts="$PWD/project-dict.pws" check document.txt

If the file is modified during the aspell session, a backup with the original content is created. To run without backing up the uncorrected original:

$ aspell -x check document.txt
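
Because the backup keeps the uncorrected text, it can be used to review what a session changed. The following is a minimal sketch, assuming the backup sits next to the file with a .bak suffix; review_corrections is a hypothetical helper name, not part of aspell.

```shell
# Hypothetical helper: show what an aspell session changed by comparing
# the backup (FILE.bak) with the corrected FILE.
review_corrections() {
    local file="$1"
    if [ ! -f "$file.bak" ]; then
        echo "No backup found for $file" >&2
        return 1
    fi
    # diff exits with 1 when the files differ; that is not an error here
    diff -u "$file.bak" "$file" || true
}
```

Usage would be review_corrections document.txt after an aspell check session that was run without -x.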

To list the misspelled words in a file without correcting anything, use list, which reads the text from standard input:

$ aspell list < document.txt
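
The output of list contains one line per occurrence, so it can be summarized with standard tools. A small sketch that counts how often each misspelling occurs; misspelled_summary and the ASPELL_CMD override are hypothetical names introduced for illustration only.

```shell
# Hypothetical helper: count how often each misspelled word occurs in FILE,
# most frequent first.
# ASPELL_CMD is an assumed override so that the command can be swapped out;
# it defaults to plain `aspell`.
misspelled_summary() {
    "${ASPELL_CMD:-aspell}" list < "$1" \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Calling misspelled_summary document.txt prints one line per distinct word, prefixed by its count.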

aspell check works on one file at a time, so a wildcard that the shell expands to several files (as in aspell check directory/*.txt) does not work. To spellcheck multiple files, use find (from paulbradley [3]):

$ find *.txt -exec aspell check {} \;
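
A plain shell loop is an alternative to find and copes with file names that contain spaces. This is only a sketch; check_all and ASPELL_CMD are hypothetical names, not something aspell provides.

```shell
# Hypothetical helper: run an interactive check on every file given.
# ASPELL_CMD is an assumed override; it defaults to plain `aspell`.
check_all() {
    local file
    for file in "$@"; do
        # skip anything that is not a regular file (e.g. an unmatched glob)
        [ -f "$file" ] || continue
        "${ASPELL_CMD:-aspell}" check "$file"
    done
}

# usage: check_all directory/*.txt
```

Here the shell expands the wildcard and the loop hands the files to aspell one by one.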

Interesting Command Line Options

-c FILE, check FILE: Check FILE interactively.
list: Generate a list of misspelled words from standard input. Does not correct anything.
clean: Clean up a wordlist and output a new wordlist where every line is a valid word.

--add-extra-dicts FILE: Use FILE as an additional wordlist. The absolute path must be specified and the path may not contain shell variables.
-b, --backup: Create a backup file (.bak) containing the uncorrected text.
-x, --dont-backup: Do not create a backup (.bak) file.
-l LANGUAGE, --lang LANGUAGE: Specify the language to use with the two-letter language code (e.g. "en") optionally followed by the country code (e.g. "en-US").
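
Before passing a language with --lang it can help to check that a matching dictionary is installed. aspell dump dicts prints the names of the available dictionaries, one per line. The sketch below builds on that; have_dict and ASPELL_CMD are hypothetical names used for illustration.

```shell
# Hypothetical helper: succeed if a dictionary named exactly NAME
# (e.g. "en" or "en_US") is installed.
# ASPELL_CMD is an assumed override; it defaults to plain `aspell`.
have_dict() {
    "${ASPELL_CMD:-aspell}" dump dicts | grep -qxF -- "$1"
}

# usage: have_dict de && aspell --lang=de check document.txt
```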

Install

Install with:

# apt-get install aspell

on a Debian-based system. Or

# pacman -S aspell

on Arch Linux.

Configure

Relevant configuration files for aspell are:

/etc/aspell.conf: The system-wide configuration file.
~/.aspell.conf: The per-user configuration file.
~/.aspell.en.pws: The user-specific wordlist (for the en locale).

Use additional wordlists by specifying a path in the config file:

# ~/.aspell.conf
# -------------------------------------------------------
add-extra-dicts /home/USERNAME/.aspell-tech-terms.en.pws
add-extra-dicts /home/USERNAME/.aspell-names.en.pws

Use multiple lines of add-extra-dicts to add several wordlists. The wordlists will be used in every invocation of aspell.

Note

The paths must be absolute. Shell variables/expansions like ~, /path/* or $HOME will not work.
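
Since the config file cannot contain ~ or $HOME, one option is to let the shell resolve the path once and write the result into the file. A minimal sketch under that assumption; register_wordlist is a hypothetical helper, and the default config path is the per-user file named above.

```shell
# Hypothetical helper: append an add-extra-dicts line for WORDLIST to CONF
# (default: ~/.aspell.conf), always writing the absolute path.
register_wordlist() {
    local wordlist="$1"
    local conf="${2:-$HOME/.aspell.conf}"
    local dir line
    # the shell resolves the directory here, so the config file only
    # ever contains a literal absolute path
    dir=$(cd "$(dirname "$wordlist")" && pwd) || return 1
    line="add-extra-dicts $dir/$(basename "$wordlist")"
    # do nothing if exactly this line is already present
    if [ -f "$conf" ] && grep -qxF -- "$line" "$conf"; then
        return 0
    fi
    echo "$line" >> "$conf"
}
```

Running register_wordlist project-dict.pws from the project directory adds one line to the config file; running it again leaves the file unchanged.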

Wordlists

aspell wordlists are simple text files with one word per line.

The first line must be:

personal_ws-1.1 en 10

Here en is the language of the wordlist and 10 is the number of words it contains. The number does not have to be exact, but it should be a reasonable hint at the actual number of words in the file.

The conventional file extension is .pws but any filename can be used.
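
The format is simple enough to generate with standard tools. The sketch below writes the header followed by the words; make_wordlist is a hypothetical helper name, and sorting and de-duplicating the words is a choice made here, not something the format is stated to require.

```shell
# Hypothetical helper: print a wordlist for language LANG_CODE built from
# the words on standard input (one per line).
make_wordlist() {
    local lang_code="$1"
    local words count
    # drop empty lines, sort and remove duplicates
    words=$(grep -v '^[[:space:]]*$' | sort -u)
    # count the non-empty lines; grep exits non-zero when there are none
    count=$(printf '%s\n' "$words" | grep -c . || true)
    printf 'personal_ws-1.1 %s %s\n' "$lang_code" "$count"
    [ "$count" -eq 0 ] || printf '%s\n' "$words"
}

# usage: make_wordlist en < terms.txt > project-dict.pws
```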

Scripts

A Spellcheck Command

A small shell script that invokes aspell with an additional per-file wordlist and/or a project-specific wordlist.

The script looks for a wordlist with the same name as the file to be checked (but ending in .pws), and it uses every wordlist found in the current directory. This makes it possible to use a list of allowed words without adding them to any of the globally used wordlists.

Usage:

$ spellcheck document.txt

Code

spellcheck.sh (Source)

#!/bin/bash

set -o nounset
set -o errexit


# Commands -----------------------------------------------
ASPELL=/usr/bin/aspell
BASENAME=/usr/bin/basename
DIRNAME=/usr/bin/dirname


# Script -------------------------------------------------
# options are collected in an array
# so that paths containing spaces stay intact
extra=()

# file specific dictionary
# if there is a file with the same name
# but ending in `.pws`
# use it as an extra dict
# (the directory is resolved to an absolute path, so this also
# works when the file is given with an absolute path)
directory=$(cd "$($DIRNAME "$1")" && pwd)
filename=$($BASENAME "$1")
name="${filename%.*}"
dictpath="$directory/$name.pws"
if [ -f "$dictpath" ]
then
    extra+=("--add-extra-dicts=$dictpath")
fi

# extra dicts from the current directory
# all files in the working directory
# ending in `.pws` are used as extra dicts

# return nothing if *.pws does not match anything
shopt -s nullglob

for wordlist in *.pws
do
    # avoid passing the file specific dictionary a second time
    if [ "$PWD/$wordlist" != "$dictpath" ]
    then
        extra+=("--add-extra-dicts=$PWD/$wordlist")
    fi
done

# the `+` expansion keeps `nounset` from failing on an empty array
# in older versions of bash
$ASPELL ${extra[@]+"${extra[@]}"} check "$1"

Create a Dictionary from Misspelled Words

Say you have a file which contains multiple domain-specific terms. You want to spellcheck it regularly and you want to exclude these terms from the spellcheck for this file only.

Keeping it separate from your general dictionary is especially useful if these "special terms" are similar to typical typos.

To easily generate this dictionary, produce a list of words that aspell considers misspelled, and edit that list so that it contains the "special terms" only.

mk-aspell-dict.sh (Source)

#!/bin/bash
# generates an *aspell* "dictionary"
# from a list of misspelled words found by aspell.

set -o errexit


# Configuration ----------------------------------------------------
HEADER=personal_ws-1.1
# not named LANG: that would change the locale of the commands below
DICT_LANG=en


# Commands ---------------------------------------------------------
ASPELL=/usr/bin/aspell
CAT=/bin/cat
ECHO=/bin/echo
MKTEMP=/bin/mktemp
READLINK=/bin/readlink
RM=/bin/rm
SORT=/usr/bin/sort
UNIQ=/usr/bin/uniq
WC=/usr/bin/wc


# Script -----------------------------------------------------------
if [ -z "${1:-}" ];
then
    $ECHO "Error: no source specified."
    exit 1
fi

# normalize paths
# aspell will not resolve shell variables or perform expansion
src=$($READLINK -fn "$1")

if [ ! -r "$src" ];
then
    $ECHO "Error: '$src' does not exist or is not readable"
    exit 1
fi

if [ -z "${2:-}" ];
then
    # derive the destination from the absolute source path
    dest="$src.pws"
    $ECHO "Destination not specified, using '$dest'"
else
    # use `readlink` to get absolute path.
    dest=$($READLINK -fn "$2")
fi

# unique temporary file instead of a fixed name in /tmp
templist=$($MKTEMP)

# generate a wordlist from words considered misspelled by aspell
# remove duplicates with `sort | uniq`
$ASPELL --rem-extra-dicts "$dest" list < "$src"\
 | $ASPELL clean\
 | $SORT\
 | $UNIQ > "$templist"

# generate the header for the pws file with the number of words
# reading from stdin makes `wc -l` print the count without a filename
wordcount=$($WC -l < "$templist")
$ECHO "$HEADER $DICT_LANG $wordcount" > "$dest"
$CAT "$templist" >> "$dest"
$RM "$templist"
$ECHO "Generated dictionary at '$dest'."

The command aspell list < document.txt generates a list of words that aspell considers misspelled. The command aspell clean takes that list and cleans it up so that every line is a valid entry for an aspell wordlist (this might not be necessary, as aspell list probably does not return invalid words).

sort and uniq are used to remove duplicate entries and sort alphabetically. The result is written to a temporary file.

The last part of the script generates the header line for the wordlist, using wc -l to count the number of words (lines, actually). The header is written to the dictionary file and then the list of words is appended to it with cat.

To just update the header of a wordlist file:

$ echo "personal_ws-1.1 en $(tail -n +2 wordlist.pws | wc -l)" > /tmp/wordlist.pws
$ tail -n +2 wordlist.pws >> /tmp/wordlist.pws
$ mv /tmp/wordlist.pws wordlist.pws

One could additionally pipe the output of both tail calls through sort | uniq. This would remove duplicates.
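
The header update and the duplicate removal can be combined in one small function. This is a sketch, not the author's script: refresh_wordlist is a hypothetical name, the format tag and language are taken from the existing header, and any further header fields are kept as they are.

```shell
# Hypothetical helper: rewrite WORDLIST in place with sorted,
# de-duplicated words and a header whose count matches.
refresh_wordlist() {
    local wordlist="$1"
    local tmp tag lang old_count rest count
    tmp=$(mktemp) || return 1
    # split the existing header line into its fields
    read -r tag lang old_count rest < "$wordlist"
    # words only: skip the header, sort, drop duplicates
    tail -n +2 "$wordlist" | sort -u > "$tmp"
    count=$(wc -l < "$tmp" | tr -d ' ')
    # new header first, then the cleaned-up words
    { echo "$tag $lang $count${rest:+ $rest}"; cat "$tmp"; } > "$wordlist"
    rm "$tmp"
}

# usage: refresh_wordlist wordlist.pws
```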

Create a Wordlist from Radicale Address Book

Add the names of your contacts to the list of correctly spelled words.

This assumes that the address book is available over the network as CardDAV - for example via Radicale.

This script fetches an address book from a CardDAV server and filters out the names and nicknames of all contacts.

It writes them to a wordlist named ~/.aspell-names.en.pws, which should be included via add-extra-dicts in ~/.aspell.conf.

mk-aspell-names.sh (Source)

#!/bin/bash
set -o errexit
set -o nounset

user="username"
pass="password"
host="hostname"
port=5232
url="https://$host:$port/$user/contacts.vcf"


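# check if the addressbook is available
# the pipeline is tested directly in the `if`: with errexit, a failing
# pipeline would otherwise end the script before the message is printed
# `--head` sends a HEAD request; the response headers are discarded
# so that only the status code reaches grep
if ! curl --silent --insecure\
 --basic\
 --user "$user:$pass"\
 --head\
 --output /dev/null\
 --header "Accept: text/vcard"\
 --write-out "%{http_code}"\
 "$url"\
 | grep --silent -E "^2[0-9]{2}$"
then
    echo "No access to address book at $url."
    exit 1
fi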


# Write the complete addressbook to a temp-file
curl --silent --insecure\
 --basic --user "$user:$pass"\
 --request GET\
 --header "Accept: text/vcard"\
 "$url"\
> /tmp/vcards.vcf


# Reduce to the fields we are interested in:
#   N:Name;Part;More;parts;;
#   FN:Full Name
#   NICKNAME: Nicky
sed -r\
 -e "/^(FN|N|NICKNAME):.+$/!d"\
 -e "s/(N|FN|NICKNAME)://"\
 -e "s/;+/\n/g"\
 /tmp/vcards.vcf\
 > /tmp/names

# will filter the list and include only
# names that would be regarded as spelling errors
~/scripts/mk-aspell-dict.sh /tmp/names /home/USERNAME/.aspell-names.en.pws

rm /tmp/vcards.vcf
rm /tmp/names
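
The filtering step of the script above can be tried without a server by feeding it a hand-written vCard. The sketch below is a variation, not the script's exact code: it uses tr instead of a newline in the sed replacement (which not every sed supports), and it strips carriage returns because vCard lines end in CRLF. vcard_names is a hypothetical helper name and the sample contact is made up.

```shell
# Hypothetical helper: read vCards on standard input and print the values
# of the FN, N and NICKNAME fields, one name part per line.
vcard_names() {
    tr -d '\r' \
        | sed -E -n 's/^(FN|N|NICKNAME)://p' \
        | tr ';' '\n' \
        | sed '/^[[:space:]]*$/d'
}

# made-up sample contact
printf 'BEGIN:VCARD\r\nFN:Ada Lovelace\r\nN:Lovelace;Ada;;;\r\nEND:VCARD\r\n' \
    | vcard_names
```

The sample prints the full name followed by the individual parts of the N field, each on its own line; empty parts are dropped.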

1: http://aspell.net/
2: https://github.com/WojciechMula/aspell-python
3: https://paulbradley.org/aspell/