Bourne Shell: Counting Character Occurrence in a Unicode Text File
Sometimes it is useful to count how often each character occurs in a text file. This post covers a Unicode-safe Bourne shell solution. “Unicode” here was only tested with mixed Japanese and English text. The solution in this post cannot be used to count whitespace.
Software Versions
Instructions
The following script counts character occurrences in a text file. The characters to count are defined in the CHARSET variable at the top of the file. For each character, the script scans the file and prints the result. Characters are counted for each file name passed to the script, and usage is printed if no file names are passed. Whitespace in CHARSET never reaches the for loop, because the shell drops it when splitting the list into words, which is why whitespace cannot be counted.
character_count.sh
The script itself can be used as a test file.
A version that uses getopt to parse command line options follows. -c specifies the CHARSET on the command line, -i makes the match case insensitive, and -h displays usage. See man getopt or a Bourne shell tutorial for more information.
character_count.sh
A few usage examples follow.
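As a rough sketch of how the getopt variant and its usage might look, the block below wraps the option handling in a hypothetical function, count_main, and then calls it a few times on a temporary sample file. It assumes the getopt(1) utility is installed, that grep supports -o, and that mktemp is available. The function names, the default CHARSET, and the output layout are assumptions, not the original listing.

```sh
#!/bin/sh
# Sketch only: getopt(1) based option parsing for the character counter.

usage() {
    echo "usage: character_count.sh [-h] [-i] [-c charset] file ..." >&2
}

count_main() {
    charset="aeiou"        # assumed default CHARSET
    grep_flags="-o -F"     # one match per line, literal pattern

    # getopt normalizes the options and appends "--" before the operands.
    # Its output is re-split by the shell, so arguments that contain
    # whitespace do not survive this step.
    args=$(getopt c:hi "$@") || { usage; return 2; }
    set -- $args

    while [ $# -gt 0 ]; do
        case "$1" in
            -c) charset="$2"; shift 2 ;;
            -i) grep_flags="$grep_flags -i"; shift ;;
            -h) usage; return 0 ;;
            --) shift; break ;;
            *)  break ;;
        esac
    done

    # At least one file name is required.
    [ $# -gt 0 ] || { usage; return 1; }

    for f in "$@"; do
        for c in $(printf '%s\n' "$charset" | sed 's/./& /g'); do
            # $grep_flags is left unquoted on purpose so it splits into flags.
            n=$(grep $grep_flags -e "$c" "$f" | wc -l | tr -d ' ')
            printf '%s %s %s\n' "$f" "$c" "$n"
        done
    done
}

# Example invocations on a throwaway sample file.
sample=$(mktemp)
printf 'Hello, World\n' > "$sample"
count_main "$sample"            # default CHARSET
count_main -c lo "$sample"      # CHARSET given with -c
count_main -i -c h "$sample"    # case insensitive match
rm -f "$sample"
```

Keeping the logic in a function is only a convenience for the sketch. In a standalone script the same body could sit at the top level and be driven by the script's own arguments.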
References:
- FreeBSD, man getopt
- UNIX, count occurences of specific character in the file
- UNIX, Count occurrences of a char in plain text file
- UNIX, How to perform a for loop on each character in a string in BASH?
- UNIX, Replace comma with newline in sed
- UNIX, How do I insert a space every four characters in a long line?
- UNIX, Bourne Shell Exit Status Examples
- UNIX, Bourne Shell Scripting/Control flow
- UNIX, Sh - the Bourne Shell