Bourne Shell: Counting Character Occurrence in a Unicode Text File
Sometimes it is useful to count how often each character occurs in a text file. This post covers a Unicode-safe Bourne shell solution. “Unicode” support was only tested with mixed Japanese and English text. The solution in this post cannot be used to count whitespace.
Software Versions
$ date -u "+%Y-%m-%d %H:%M:%S +0000"
2016-02-23 10:40:41 +0000
$ uname -vm
FreeBSD 11.0-CURRENT #0 r287598: Thu Sep 10 14:45:48 JST 2015 root@:/usr/obj/usr/src/sys/MIRAGE_KERNEL amd64
Instructions
The following script counts character occurrences in a text file. The characters to count are defined in the CHARSET variable at the top of the file. For each character, the script scans the file once and prints the result. Characters are counted for each file name passed to the script. Usage is printed if no file names are passed. Whitespace in CHARSET never reaches the for loop, because the shell splits the unquoted $CHARSET on whitespace.
character_count.sh
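#!/bin/sh
# Characters to count. Whitespace cannot be counted (see the for loop below).
CHARSET='abcmwxyz~!&#jkdefghst@=+{}[]あいうえお、一二三四五。'
echo "CHARSET=${CHARSET}"
# Put a space after every character so the for loop sees one word per character.
CHARSET=$(printf '%s' "${CHARSET}" | sed "s/./& /g")
if [ "${#}" -lt 1 ]
then
    echo "Usage:"
    echo " ${0} FILE [FILE...]"
    exit 2
fi
for FILENAME
do
    echo "---${FILENAME}---"
    for CHAR in $CHARSET
    do
        # grep -o prints each match on its own line, so wc -l gives the count.
        COUNT=$(grep -o -F -e "${CHAR}" "${FILENAME}" | wc -l | tr -d '[:space:]')
        echo "${CHAR} : ${COUNT}"
    done
done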
The script itself can be used as a test file.
chmod +x character_count.sh
./character_count.sh character_count.sh
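To see the two core steps in isolation, here is a minimal sketch that is not part of the original script. The function names split_chars and count_char and the temporary sample file are assumptions made for this illustration. Like the script, it assumes a grep that supports -o.

```shell
#!/bin/sh
# Illustration only: split_chars, count_char and SAMPLE are hypothetical names.

# Step 1: put a space after every character so the shell can split the
# string into one word per character. sed's "." only matches a whole
# multibyte character when the locale matches the text (e.g. UTF-8).
split_chars() {
    printf '%s' "$1" | sed 's/./& /g'
}

# Step 2: grep -o prints each match on its own line, so wc -l is the
# number of occurrences. -F treats the character as a literal string.
count_char() {
    grep -o -F -e "$1" "$2" | wc -l | tr -d '[:space:]'
}

SAMPLE=$(mktemp)
printf 'banana\nbandana\n' > "$SAMPLE"

# Unquoted on purpose: word splitting yields one word per character.
for CHAR in $(split_chars 'abn')
do
    echo "${CHAR} : $(count_char "$CHAR" "$SAMPLE")"
done
rm -f "$SAMPLE"
```

With the sample text above this should print "a : 6", "b : 2" and "n : 4", one per line.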
A version that uses getopt to parse command line options follows. -c specifies the CHARSET from the command line, -i makes the match case insensitive, and -h displays usage. See man getopt or a Bourne shell tutorial for more information.
character_count.sh
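#!/bin/sh
# Default characters to count; override with -c.
CHARSET='abcmwxyz~!&#jkdefghst@=+{}[]あいうえお、一二三四五。'
# Print usage and exit with the status given as the first argument.
usage() {
    echo "Usage:"
    echo " ${0} -h"
    echo " ${0} [-i] [-c CHARSET] FILE [FILE...]"
    echo "Description:"
    echo " -h : display this help"
    echo " -i : case insensitive match"
    echo " -c : characters to count in file"
    exit "${1}"
}
# getopt(1) normalizes the options, but arguments containing whitespace
# generally do not survive it intact.
args=$(getopt hic: ${*})
if [ $? -ne 0 ]
then
    usage 2
fi
set -- $args
unset CASE_INSENSITIVE BAD_FILENAME
while :; do
    case "$1" in
    -h)
        # Help was requested explicitly, so exit successfully.
        usage 0
        ;;
    -i)
        CASE_INSENSITIVE=true
        shift
        ;;
    -c)
        CHARSET="${2}"
        shift; shift
        ;;
    --)
        shift; break
        ;;
    esac
done
if [ "${#}" -lt 1 ]
then
    usage 2
fi
# Check every file before counting so that all problems are reported at once.
for FILENAME
do
    if [ ! -r "${FILENAME}" ]
    then
        echo "The file '${FILENAME}' does not exist or is not readable." >&2
        BAD_FILENAME=true
    fi
done
[ -n "${BAD_FILENAME}" ] && exit 1
echo "CHARSET=${CHARSET}"
# Put a space after every character so the for loop sees one word per character.
CHARSET=$(printf '%s' "${CHARSET}" | sed "s/./& /g")
GREP_FLAGS="-o -F"
[ -n "${CASE_INSENSITIVE}" ] && GREP_FLAGS="-i ${GREP_FLAGS}"
for FILENAME
do
    echo "---${FILENAME}---"
    for CHAR in $CHARSET
    do
        # GREP_FLAGS is left unquoted on purpose so it splits into separate flags.
        COUNT=$(grep ${GREP_FLAGS} -e "${CHAR}" "${FILENAME}" | wc -l | tr -d '[:space:]')
        echo "${CHAR} : ${COUNT}"
    done
done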
A few usage examples follow.
chmod +x character_count.sh
./character_count.sh -h
./character_count.sh character_count.sh
./character_count.sh -i -- character_count.sh
./character_count.sh -i -c 'aeiouy[]{}()' character_count.sh
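The getopt manual warns that arguments containing whitespace generally do not survive intact, so a file name such as 'my file.txt' breaks the script above. One possible alternative, not used in the original script, is the POSIX getopts built-in, which works on the positional parameters directly. The sketch below only covers option parsing. The names parse_args and NSHIFT and the shortened default CHARSET are assumptions for this illustration.

```shell
#!/bin/sh
# Illustration only: parse_args and NSHIFT are hypothetical names.
parse_args() {
    CASE_INSENSITIVE=
    CHARSET='abc'      # placeholder default for this sketch
    NSHIFT=0
    OPTIND=1           # restart option parsing on every call
    while getopts 'hic:' OPT
    do
        case "$OPT" in
            h) echo "Usage: ${0} [-i] [-c CHARSET] FILE [FILE...]"; exit 0 ;;
            i) CASE_INSENSITIVE=true ;;
            c) CHARSET="$OPTARG" ;;
            *) return 2 ;;   # getopts has already printed an error message
        esac
    done
    # Number of words consumed as options; the caller shifts them away.
    NSHIFT=$((OPTIND - 1))
}

# Example run with a file name that contains a space.
set -- -i -c 'aeiou' 'my file.txt'
parse_args "$@" || exit 2
shift "$NSHIFT"
echo "CHARSET=[${CHARSET}] CASE_INSENSITIVE=[${CASE_INSENSITIVE}] FILE=[${1}]"
```

This should print CHARSET=[aeiou] CASE_INSENSITIVE=[true] FILE=[my file.txt]. The file name stays a single argument because "$@" is never re-split.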
References:
- FreeBSD, man getopt
- UNIX, count occurences of specific character in the file
- UNIX, Count occurrences of a char in plain text file
- UNIX, How to perform a for loop on each character in a string in BASH?
- UNIX, Replace comma with newline in sed
- UNIX, How do I insert a space every four characters in a long line?
- UNIX, Bourne Shell Exit Status Examples
- UNIX, Bourne Shell Scripting/Control flow
- UNIX, Sh - the Bourne Shell