Bourne Shell: Counting Character Occurrence in a Unicode Text File

Sometimes it is useful to count how often each character occurs in a text file. This post covers a Unicode-safe Bourne shell solution. “Unicode” support was only tested with mixed Japanese and English text. The solution in this post cannot be used to count whitespace.

Software Versions

$ date -u "+%Y-%m-%d %H:%M:%S +0000"
2016-02-23 10:40:41 +0000
$ uname -vm
FreeBSD 11.0-CURRENT #0 r287598: Thu Sep 10 14:45:48 JST 2015     root@:/usr/obj/usr/src/sys/MIRAGE_KERNEL  amd64

Instructions

The following script counts character occurrences in a text file. The characters to count are defined in the CHARSET variable at the top of the script. For each character, the script scans the file once and prints the count. Characters are counted for each file name passed to the script. Usage is printed if no file names are passed. Whitespace in CHARSET never reaches the for loop, because the unquoted $CHARSET expansion is split on whitespace.
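The counting step described above can be tried in isolation before reading the full script. The following is a minimal sketch, not the post's script: count_char is a hypothetical helper name, and the sample file contents are made up for illustration.

```shell
#!/bin/sh
# Sketch of the counting step on its own.
# count_char is a hypothetical helper, not part of character_count.sh.

# count_char CHAR FILE: print how many times CHAR occurs in FILE.
count_char() {
  # grep -F -o prints every fixed-string match on its own line,
  # wc -l counts those lines, tr strips the padding some wc versions add.
  grep -F -o -e "$1" "$2" | wc -l | tr -d '[:space:]'
}

# Made-up sample data for the demonstration.
SAMPLE=$(mktemp)
printf 'banana\nbandana\n' > "$SAMPLE"

# sed puts a space after every character; the unquoted command
# substitution is then split on those spaces by the for loop.
for CHAR in $(printf '%s' 'abn' | sed 's/./& /g')
do
  echo "${CHAR} : $(count_char "$CHAR" "$SAMPLE")"
done

rm -f "$SAMPLE"
```

With the sample data above this prints "a : 6", "b : 2" and "n : 4", one per line. A character that never occurs is reported as 0, because grep prints nothing and wc -l counts zero lines.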

character_count.sh

#!/bin/sh

CHARSET='abcmwxyz~!&#jkdefghst@=+{}[]あいうえお、一二三四五。'

echo "CHARSET=${CHARSET}"
# Put a space after every character so the for loop below can split on it.
CHARSET=$(printf '%s' "${CHARSET}" | sed "s/./& /g")

if [ "${#}" -lt 1 ]
then
  echo "Usage:"
  echo "  ${0} FILE [FILE...]"
  exit 2
fi

for FILENAME
do
  echo "---${FILENAME}---"
  for CHAR in $CHARSET
  do
    COUNT=$(grep -F -o -e "${CHAR}" "${FILENAME}" | wc -l | tr -d '[:space:]')
    echo "${CHAR} : ${COUNT}"
  done
done

The script itself can be used as a test file.

chmod +x character_count.sh
./character_count.sh character_count.sh

A version that uses getopt to parse command-line options follows. -c specifies the CHARSET from the command line, -i enables case-insensitive matching, and -h displays usage. See man getopt or a Bourne shell tutorial for more information.

character_count.sh

#!/bin/sh

CHARSET='abcmwxyz~!&#jkdefghst@=+{}[]あいうえお、一二三四五。'

usage() {
  echo "Usage:"
  echo "  ${0} -h"
  echo "  ${0} [-i] [-c CHARSET] FILE [FILE...]"
  echo "Description:"
  echo "  -h : display this help"
  echo "  -i : case insensitive match"
  echo "  -c : characters to count in file"
  exit $1
}

args=$(getopt hic: ${*})
if [ $? -ne 0 ]
then
  usage 2
fi
set -- $args

unset CASE_INSENSITIVE BAD_FILENAME
while :; do
  case "$1" in
  -h)
    usage 0
    ;;
  -i)
    CASE_INSENSITIVE=true
    shift;
    ;;
  -c)
    CHARSET="${2}"
    shift; shift
    ;;
  --)
    shift; break
    ;;
  esac
done

if [ "${#}" -lt 1 ]
then
  usage 2
fi

for FILENAME
do
  if [ ! -r "${FILENAME}" ]
  then
    echo "The file '${FILENAME}' does not exist or is not readable."
    BAD_FILENAME=true
  fi
done
[ -n "${BAD_FILENAME}" ] && exit 1

echo "CHARSET=${CHARSET}"
# Put a space after every character so the for loop below can split on it.
CHARSET=$(printf '%s' "${CHARSET}" | sed "s/./& /g")

GREP_FLAGS="-o"
[ -n "${CASE_INSENSITIVE}" ] && GREP_FLAGS="-i ${GREP_FLAGS}"

for FILENAME
do
  echo "---${FILENAME}---"
  for CHAR in $CHARSET
  do
    COUNT=$(grep -F ${GREP_FLAGS} -e "${CHAR}" "${FILENAME}" | wc -l | tr -d '[:space:]')
    echo "${CHAR} : ${COUNT}"
  done
done

A few usage examples follow.

chmod +x character_count.sh
./character_count.sh -h
./character_count.sh character_count.sh
./character_count.sh -i -- character_count.sh
./character_count.sh -i -c 'aeiouy[]{}()' character_count.sh
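
The getopt-based script re-splits its arguments on whitespace, so a file name or CHARSET containing spaces will not survive parsing. As an alternative, the same options can be handled with the POSIX getopts builtin, which keeps each argument intact. This is a sketch of a different technique, not the method used in this post; parse_options is a hypothetical function name, and the default CHARSET of 'abc' is a placeholder.

```shell
#!/bin/sh
# Sketch: parse -h, -i and -c CHARSET with the getopts builtin
# instead of getopt(1). parse_options is a hypothetical helper.

parse_options() {
  CASE_INSENSITIVE=
  CHARSET='abc'   # placeholder default for this sketch
  OPTIND=1        # reset so the function can be called more than once
  while getopts 'hic:' OPT
  do
    case "$OPT" in
    h) echo "help requested"; return 0 ;;
    i) CASE_INSENSITIVE=true ;;
    c) CHARSET="$OPTARG" ;;
    *) return 2 ;;   # getopts already printed an error to stderr
    esac
  done
  # Drop the parsed options; what remains are the file names.
  shift $((OPTIND - 1))
  echo "insensitive=${CASE_INSENSITIVE:-false} charset=${CHARSET} files=$*"
}

parse_options -i -c 'a e' 'my file.txt'
```

The call on the last line prints "insensitive=true charset=a e files=my file.txt". In a real script the loop would sit at the top level and operate on "$@" directly, so the for FILENAME loops could follow the shift unchanged.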
