Each year I analyze the word count of the current president's State of the Union address. The code to do this (once you have downloaded the speech as a text file) has gotten much simpler over the years, especially after the introduction of LINQ.
Here is an example of the key part of the code in C# as an extension method:
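// Needs: using System; using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
// Must live in a static class. stopwords is not shown here; it is assumed to be a static collection of lowercase words.
public static Dictionary<string, int> GetWordFrequency(this string input)
{
    return input.Split(new char[] { ' ' })
        // keep only tokens that contain at least one word character
        .Where(i => i.Trim() != String.Empty && Regex.IsMatch(i, @"\w"))
        // strip trailing punctuation and normalize case
        .Select(i => Regex.Replace(i, @"[^A-Za-z0-9]+$", "").ToLower())
        .Where(x => !stopwords.Contains(x))
        .GroupBy(w => w)
        .OrderByDescending(group => group.Count())
        // Dictionary does not guarantee enumeration order, so sort again when displaying
        .ToDictionary(group => group.Key, group => group.Count());
}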
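To try the method outside the downloadable project, a minimal self-contained sketch might look like the following. The stop-word list here is a small hypothetical stand-in (the real list ships with the source code), and the class name WordFrequencyExtensions and the sample sentence are made up for illustration. The results are sorted at display time because Dictionary does not promise an enumeration order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFrequencyExtensions
{
    // Hypothetical stand-in for the stop-word list; the real one is in the project source.
    private static readonly HashSet<string> stopwords =
        new HashSet<string> { "the", "of", "and", "a", "to", "in", "is", "our", "we" };

    public static Dictionary<string, int> GetWordFrequency(this string input)
    {
        return input.Split(new char[] { ' ' })
            // keep only tokens that contain at least one word character
            .Where(i => i.Trim() != String.Empty && Regex.IsMatch(i, @"\w"))
            // strip trailing punctuation and normalize case
            .Select(i => Regex.Replace(i, @"[^A-Za-z0-9]+$", "").ToLower())
            .Where(x => !stopwords.Contains(x))
            .GroupBy(w => w)
            .ToDictionary(group => group.Key, group => group.Count());
    }
}

public static class Program
{
    public static void Main()
    {
        // In the real project the text would come from the speech file
        // (for example via File.ReadAllText); a made-up sentence keeps this sketch standalone.
        string speech = "The state of our union is strong. Our people are strong, and our union endures.";

        // Sort by count (highest first), then alphabetically, only when printing.
        foreach (var pair in speech.GetWordFrequency()
                                   .OrderByDescending(p => p.Value)
                                   .ThenBy(p => p.Key))
        {
            Console.WriteLine($"{pair.Key},{pair.Value}");
        }
    }
}
```

Sorting at output time rather than before ToDictionary keeps the ordering explicit instead of relying on how the dictionary happens to enumerate.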
Here is the sorted list of words with their frequency (down to a count of 5):
american,29
people,23
americans,20
tonight,20
america,14
congress,13
tax,13
country,13
home,11
am,10
administration,9
america's,9
family,9
world,8
immigration,8
united,7
building,7
safe,7
finally,7
workers,7
nation,7
veterans,7
citizens,7
heroes,6
love,6
strong,6
proud,6
jobs,6
protect,6
communities,6
nuclear,6
isis,6
north,6
passed,5
help,5
police,5
including,5
stands,5
bill,5
reform,5
drugs,5
drug,5
dangerous,5
terrorists,5
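A listing like the one above can be produced by sorting the dictionary at output time and cutting it off at a minimum count. The sketch below assumes a dictionary shaped like the result of GetWordFrequency. The helper name FormatTopWords and its minCount parameter are hypothetical, and the sample entries are three lines copied from the list above plus one made-up low-count entry to show the cutoff.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FrequencyReport
{
    // Hypothetical helper: returns "word,count" lines, highest count first,
    // dropping every word that occurs fewer than minCount times.
    public static List<string> FormatTopWords(Dictionary<string, int> frequencies, int minCount)
    {
        return frequencies
            .Where(pair => pair.Value >= minCount)
            .OrderByDescending(pair => pair.Value)
            .Select(pair => pair.Key + "," + pair.Value)
            .ToList();
    }
}

public static class ReportDemo
{
    public static void Main()
    {
        var sample = new Dictionary<string, int>
        {
            { "people", 23 },
            { "american", 29 },
            { "passed", 5 },
            { "example", 4 }   // made-up entry, below the cutoff
        };

        foreach (var line in FrequencyReport.FormatTopWords(sample, 5))
        {
            Console.WriteLine(line);
        }
    }
}
```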
You can download the complete source code below. The speech.text file is in the /bin/debug folder.
SOTU.zip (215.24 kb)