Sometimes it is useful to count how often each character occurs in a text file. This post covers a Unicode-safe Bourne shell solution. “Unicode” was only tested with mixed Japanese and English text. The solution in this post cannot be used to count whitespace.

Software Versions

$ date -u "+%Y-%m-%d %H:%M:%S +0000"
2016-02-23 10:40:41 +0000
$ uname -vm
FreeBSD 11.0-CURRENT #0 r287598: Thu Sep 10 14:45:48 JST 2015 root@:/usr/obj/usr/src/sys/MIRAGE_KERNEL amd64

Instructions

The following script counts character occurrences in a text file. The characters to count are defined in the CHARSET variable at the top of the script. For each character, the script scans the file and prints the count. Characters are counted for each file name passed to the script. Usage is printed if no file names are passed. Whitespace in CHARSET never makes it into the for loop, because the shell splits the expanded character list on whitespace.
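To see why whitespace is dropped, the splitting step can be tried on its own. This is only a minimal sketch: split_chars is a hypothetical helper name that does not appear in the script, and it assumes a locale in which sed treats each character of the input as a single character (for Japanese text, a UTF-8 locale).

```shell
#!/bin/sh

# split_chars is a hypothetical helper, named here only for illustration.
# It appends a space to every character of its first argument.
split_chars() {
    printf '%s' "$1" | sed 's/./& /g'
}

# The unquoted command substitution is split on whitespace by the shell,
# so the space inside 'ab c!' never becomes a loop item.
for CHAR in $(split_chars 'ab c!')
do
    echo "[${CHAR}]"
done
```

The loop body runs four times, for a, b, c and !. The space gets a trailing space like every other character, but the shell treats all of that whitespace as separators.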

character_count.sh

#!/bin/sh

CHARSET='abcmwxyz~!&#jkdefghst@=+{}[]あいうえお、一二三四五。'

echo "CHARSET=${CHARSET}"
# Put a space after every character so the for loop can split on whitespace.
CHARSET=$(printf '%s' "${CHARSET}" | sed "s/./& /g")

if [ "${#}" -lt 1 ]
then
    echo "Usage:"
    echo " ${0} FILE [FILE...]"
    exit 2
fi

for FILENAME
do
    echo "---${FILENAME}---"
    for CHAR in $CHARSET
    do
        # -o prints every match on its own line, so wc -l counts occurrences.
        COUNT=$(fgrep -o "${CHAR}" "${FILENAME}" | wc -l | tr -d '[:space:]')
        echo "${CHAR} : ${COUNT}"
    done
done

The script itself can be used as a test file.

chmod +x character_count.sh
./character_count.sh character_count.sh
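The counting pipeline can also be checked in isolation against a file with known contents. The following is a minimal sketch under stated assumptions: count_char and the sample text are hypothetical, and grep -F is used as the equivalent spelling of fgrep. The -o flag is not required by POSIX but is available in both GNU and BSD grep.

```shell
#!/bin/sh

# count_char is a hypothetical helper wrapping the pipeline from the script.
# $1: the character to count, $2: the file to scan
count_char() {
    # -F matches the pattern literally, -o prints each match on its own line,
    # wc -l counts those lines, and tr strips the padding some wc versions add.
    grep -F -o -e "$1" "$2" | wc -l | tr -d '[:space:]'
}

SAMPLE=$(mktemp)
printf 'banana\nbandana\n' > "${SAMPLE}"

echo "a : $(count_char a "${SAMPLE}")"   # prints "a : 6"
echo "z : $(count_char z "${SAMPLE}")"   # prints "z : 0"

rm -f "${SAMPLE}"
```

Without -o, grep would print matching lines rather than matches, and the first result would be 2 instead of 6.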

A version that uses getopt to parse command-line options follows. -c specifies the CHARSET from the command line. -i enables a case-insensitive match. -h displays usage. See man getopt or a Bourne shell tutorial for more information.

character_count.sh

#!/bin/sh

CHARSET='abcmwxyz~!&#jkdefghst@=+{}[]あいうえお、一二三四五。'

usage() {
    echo "Usage:"
    echo " ${0} -h"
    echo " ${0} [-i] [-c CHARSET] FILE [FILE...]"
    echo "Description:"
    echo " -h : display this help"
    echo " -i : case insensitive match"
    echo " -c : characters to count in file"
    exit $1
}

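# getopt normalizes the options and appends "--" before the file names.
# Arguments that contain whitespace do not survive this re-splitting.
args=$(getopt hic: ${*})
if [ $? -ne 0 ]
then
    usage 2
fi
set -- $args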

unset CASE_INSENSITIVE BAD_FILENAME
while :; do
    case "$1" in
        -h)
            usage 1
            ;;
        -i)
            CASE_INSENSITIVE=true
            shift
            ;;
        -c)
            CHARSET="${2}"
            shift; shift
            ;;
        --)
            shift; break
            ;;
    esac
done

if [ "${#}" -lt 1 ]
then
    usage 2
fi

for FILENAME
do
    if [ ! -r "${FILENAME}" ]
    then
        echo "The file '${FILENAME}' does not exist or is not readable."
        BAD_FILENAME=true
    fi
done

[ $BAD_FILENAME ] && exit 1

echo "CHARSET=${CHARSET}"
CHARSET=$(printf '%s' "${CHARSET}" | sed "s/./& /g")

GREP_FLAGS="-o"
[ $CASE_INSENSITIVE ] && GREP_FLAGS="-i ${GREP_FLAGS}"

for FILENAME
do
    echo "---${FILENAME}---"
    for CHAR in $CHARSET
    do
        COUNT=$(fgrep ${GREP_FLAGS} "${CHAR}" "${FILENAME}" | wc -l | tr -d '[:space:]')
        echo "${CHAR} : ${COUNT}"
    done
done

A few usage examples follow.

chmod +x character_count.sh
./character_count.sh -h
./character_count.sh character_count.sh
./character_count.sh -i -- character_count.sh
./character_count.sh -i -c 'aeiouy[]{}()' character_count.sh
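Because the script re-splits the output of getopt with set -- $args, option arguments and file names that contain whitespace are broken apart. One possible alternative, which the script above does not use, is the getopts builtin defined by POSIX. It works on the positional parameters directly, so quoting is preserved. The sketch below only illustrates the parsing step. The parse_options function, the shortened default CHARSET and the summary line it prints are all hypothetical.

```shell
#!/bin/sh

# parse_options is a hypothetical function used only for this sketch.
parse_options() {
    CASE_INSENSITIVE=false
    CHARSET='abc'
    OPTIND=1    # restart option parsing on every call
    while getopts 'ic:' OPT
    do
        case "${OPT}" in
            i) CASE_INSENSITIVE=true ;;
            c) CHARSET="${OPTARG}" ;;
            *) return 2 ;;    # unknown option or missing argument
        esac
    done
    # Drop the parsed options; the remaining parameters are the file names.
    shift $((OPTIND - 1))
    echo "insensitive=${CASE_INSENSITIVE} charset=${CHARSET} files=${#}"
}

# Prints "insensitive=true charset=a b files=2"
parse_options -i -c 'a b' 'my file.txt' other.txt
```

Both the CHARSET argument and the first file name contain a space and still arrive as single arguments. This matters less for CHARSET, since the counting loop cannot count whitespace anyway, but it does matter for file names.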

References: