Two ways of using NLTK for word frequency analysis

In this post you can run the following code either to parse the words from a text file in Python or to crawl a website and tokenize all of its key words.

Here is where we got the play text: http://shakespeare.mit.edu/julius_caesar/full.html
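If you want to grab that page yourself and save it as the file.txt used later, here is a minimal sketch using the same urllib and BeautifulSoup stack as the crawler option (the file name is just the one the script below expects):

import urllib.request
from bs4 import BeautifulSoup

# Fetch the full play and strip out the HTML markup, keeping only the text
html = urllib.request.urlopen('http://shakespeare.mit.edu/julius_caesar/full.html').read()
text = BeautifulSoup(html, "lxml").get_text(strip=True)

# Save the plain text so the file-based version of the script can read it
with open("file.txt", "w", encoding="utf-8") as f:
    f.write(text)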

I was using the PyCharm IDE and needed to install the following libraries by entering these commands into a terminal (lxml is needed because the BeautifulSoup call below uses the "lxml" parser):

pip install nltk
pip install beautifulsoup4
pip install lxml
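NLTK also needs its stopword list downloaded once before the script will run; this is a one-time setup step (it is the same call as the commented-out nltk.download() below, just narrowed to the stopwords corpus):

import nltk
nltk.download('stopwords')  # fetches the English stopword list used for filtering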

import nltk
# nltk.download()  # opens the NLTK downloader; only needed the first time

# Option 1: crawl a website and extract its text (uncomment to use)
# import urllib.request
# from bs4 import BeautifulSoup
#
# response = urllib.request.urlopen('https://www.etrade.com')
# html = response.read()
# soup = BeautifulSoup(html, "lxml")
# text = soup.get_text(strip=True)
# print(text)

# Option 2: read the text from a local file
f = open("file.txt", "r")
text = f.read()
f.close()

# Split the text on whitespace to get a rough list of words
tokens = text.split()
# print(tokens)

from nltk.corpus import stopwords

# Remove common English words ("the", "and", ...) so they don't dominate the counts
sr = stopwords.words('english')
clean_tokens = tokens[:]
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)

# Count how often each remaining token appears, print the counts, and plot the top 20
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
freq.plot(20, cumulative=False)
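If you just want the top words without the plot, FreqDist also behaves like Python's collections.Counter, so most_common works; a minimal sketch building on the freq object above:

# Print the 20 most frequent tokens and their counts, highest first
for word, count in freq.most_common(20):
    print(word, count)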

