Sheharyar Naseer

Calculate word frequency of files in Bash


I was reading Ryan Tomayko’s blog post AWK-ward Ruby, which explains how the Unix tool AWK is among the ancestors of Ruby and Perl. He wrote a few examples showing some of AWK’s advanced features, one of which lists the word frequencies of any file provided. I found this example quite useful and extracted it into a function in my Dotfiles.

#!/bin/bash

# Read text on stdin and print each word with its number of
# occurrences, most frequent first.
function word_frequency() {
    awk '
        # Treat any run of non-letter characters as a field separator
        BEGIN { FS="[^a-zA-Z]+" }

        # Count every word on every line, lowercased
        {
            for (i=1; i<=NF; i++) {
                word = tolower($i)
                words[word]++
            }
        }

        # Once all input is read, print "count word" pairs
        END {
            for (w in words)
                printf("%3d %s\n", words[w], w)
        }
    ' |
    sort -rn    # sort numerically, highest count first
}
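As a quick sanity check, piping a made-up string of words through the function shows the counts and ordering (the function is reproduced here so the snippet runs on its own):

```shell
#!/bin/bash

# Same function as above, condensed so this snippet is standalone
word_frequency() {
    awk '
        BEGIN { FS="[^a-zA-Z]+" }
        { for (i=1; i<=NF; i++) words[tolower($i)]++ }
        END { for (w in words) printf("%3d %s\n", words[w], w) }
    ' | sort -rn
}

echo "Foo foo FOO bar bar baz" | word_frequency
# →   3 foo
#     2 bar
#     1 baz
```

Note that `tolower` folds the three spellings of "Foo" into a single entry.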

Now you can pipe the contents of any file to this function and it will list every word along with its frequency, in descending order.

# Examples:

cat my_text_file.txt | word_frequency
# Pipe the contents of a text file to the function using `cat`

curl -s https://github.com/humans.txt | word_frequency
# Get word frequency of a file on the internet
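For comparison, the same counting can be sketched without AWK, using a classic pipeline of smaller Unix tools in the spirit of Doug McIlroy's famous word-count one-liner. The function name `word_frequency_tr` here is just illustrative:

```shell
#!/bin/bash

# word_frequency_tr: split stdin on non-letters, lowercase everything,
# then count duplicate lines — same "count word" output as above.
word_frequency_tr() {
    tr -cs 'A-Za-z' '\n' |   # turn runs of non-letters into newlines
        tr 'A-Z' 'a-z' |     # lowercase
        grep . |             # drop the blank line leading punctuation can produce
        sort |               # group identical words together
        uniq -c |            # count each group
        sort -rn             # highest count first
}

# Usage:  word_frequency_tr < my_text_file.txt
```

The output format differs slightly (`uniq -c` pads its counts more widely than the `printf("%3d ...")` in the AWK version), but the counts are the same.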

Looking forward to using AWK more and more.