Tokenizing a file - Python

In my CS class I have been given a task to read in the entire corpus of Shakespeare's Plays and Sonnets and print out the number of times a particular word occurs. Can anyone help me get my feet off the ground with this? Here is the first level of the stepwise refinement I was given.
Level 0
Define a function that tokenizes a file, returning an array of tokens. Loop through the array, printing each token one per line. For example, your specialized main might look something like this:
def main():
    tokens = readTokens("shakespeare.txt")
    for i in range(0, len(tokens), 1):
        print(tokens[i])
I guess my real question is how do I tokenize a file and then read it into an array in Python? Sorry if this kind of question is not what this website is for; I'm just looking for some help. Thanks.

goodletters = set("abcdefghijklmnopqrstuvwxyz' \t")

def tokenize_file(fname):
    tokens = []
    with open(fname) as inf:
        for line in inf:
            clean = ''.join(ch for ch in line.lower() if ch in goodletters)
            tokens.extend(clean.split())
    return tokens
Written this way for clarity; in production I would use inf.read().translate(), but the setup for that is significantly different between Python 2.x and 3.x, and I'd prefer not to be more confusing than necessary.
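For reference, here is a rough sketch of what the translate()-based variant might look like in Python 3 (the table setup and the function name are my own, not part of the original answer):

# Assumption: delete every ASCII character that is not a lowercase letter,
# apostrophe, space, or tab, mirroring the goodletters filter above.
_keep = "abcdefghijklmnopqrstuvwxyz' \t"
_delete = ''.join(ch for ch in map(chr, range(128)) if ch not in _keep)
_table = str.maketrans('', '', _delete)

def tokenize_file_translate(fname):
    with open(fname) as inf:
        return inf.read().lower().translate(_table).split()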

from collections import Counter

def readTokens(file):
    tokens = Counter()
    with open(file) as f:
        for line in f:
            tokens += Counter(word.strip() for word in line.split())
            # if you're trying to count "Won't", "won't", and "won't!"
            # all together, do this instead:
            ## tokens += Counter(word.strip('"!?,.;:').casefold() for word in line.split())
    return tokens

def main():
    tokens = readTokens('shakespeare.txt')
    for token in tokens:
        print(token)
    print("The most commonly used word is {}".format(
        max(tokens.items(), key=lambda x: x[1])))
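As a small aside (not part of the original answer), since readTokens returns a Counter, the same lookup can also be written with Counter's built-in helper:

# Counter.most_common(1) returns a list containing the single (word, count) pair
word, count = tokens.most_common(1)[0]
print("The most commonly used word is {} ({} occurrences)".format(word, count))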

How to check how many times a word appears in a text file

I've seen a few people ask how this would be done, but their questions were closed as 'too broad', so I decided to work out how to do it and post it below.
To do this, first you must open the file (assuming you have a text file called 'text.txt'). We do this by calling the open function.
file = open('text.txt', 'r')
The open function uses the syntax open(file, mode), where file is the text document and mode is how it's opened ('r' means read only). The read function reads the file, split separates the words into a list object, and lastly the count function tells us how many times the word appears.
word = input('word: ')
print(file.read().split().count(word))
And there you have it, counting words in a text file!
Word counts can be tricky. At a minimum, one would like to avoid differences in capitalization and punctuation. A simple way to take the next step in word counts is to use regular expressions and to convert the resulting words to lower case before we do the count. We could even use collections.Counter and count all of the words.
import re

# `word_finder(somestring)` emits all words in a string as a list
word_finder = re.compile(r'\w+').findall

filename = input('filename: ')
word = input('word: ')
# remove case for the comparison
lword = word.lower()
# `word_finder` emits all of the words, excluding punctuation
# `filter` removes the lower-cased words we don't want
# `len` counts the result
count = len(list(filter(lambda w: w.lower() == lword,
                        word_finder(open(filename).read()))))
print(count)

# we could go crazy and count all of the words in the file,
# doing it line by line to reduce the memory footprint
import collections
import itertools
from pprint import pprint

word_counts = collections.Counter(itertools.chain.from_iterable(
    word_finder(line.lower()) for line in open(filename)))
pprint(word_counts)
Splitting on whitespace isn't sufficient -- split on everything you're not counting and get your case under control:
import re
import sys
file = open(sys.argv[1])
word = sys.argv[2]
print(re.split(r"[^a-z]+", file.read().casefold()).count(word.casefold()))
You can add apostrophes to the inverted pattern [^a-z'] or whatever else you want to include in your count.
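For example, replacing the last line above with this keeps apostrophes inside words such as won't (just a sketch of the suggested tweak):

# split on everything that is not a lowercase letter or apostrophe
print(re.split(r"[^a-z']+", file.read().casefold()).count(word.casefold()))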
Hogan: Colonel, you're asking and answering your own questions. That's tops in German efficiency.
def words_frequency_counter(filename):
    """Print how many times the word appears in the text."""
    try:
        with open(filename) as file_object:
            contents = file_object.read()
    except FileNotFoundError:
        pass
    else:
        word = input("Give me a word: ")
        print("'" + word + "'" + ' appears ' +
              str(contents.lower().count(word.lower())) + ' times.\n')
First, you want to open the file. Do this with:
your_file = open('file.txt', 'r')
Next, you want to count the word. Let's set your word as brian under the variable life. No reason.
life = 'brian'
your_file.read().split().count(life)
What that does is read the file, split it into individual words, and count the instances of the word 'brian'. Hope this helps!

Counting Hashtag

I'm writing a function called HASHcount(name, list), which receives two parameters. The name parameter is the name of the file that will be analyzed, a text file structured like this:
Date|||Time|||Username|||Follower|||Text
So, basically my input is a list of tweets, with several rows structured as above. The list parameter is a list of hashtags I want to count in that text file. I want my function to check how many times each word of the given list occurred in the tweets, and to give as output a dictionary with each word's count, even if a word never occurs.
For instance, with the call HASHcount(December, [Peace, Love]) the program should give as output a dictionary built by checking how many times the words Peace and Love have been used as hashtags in the Text field of each tweet in the file called December.
Also, in the dictionary the words have to appear without the hashtag symbol.
I'm stuck on making this function; this is what I have so far, but I'm having some issues concerning the dictionary:
def HASHcount(name,list):
    f = open(name,"r")
    dic={}
    l = f.readline()
    for word in list:
        dic[word]=0
        for line in f:
            li_lis=line.split("|||")
            li_tuple=tuple(li_lis)
            if word in li_tuple[4]:
                dic[word]=dic[word]+1
    return dic
The main issue is that you are iterating over the lines in the file for each word, rather than the reverse. Thus the first word will consume all the lines of the file, and each subsequent word will have 0 matches.
Instead, you should do something like this:
def hash_count(name, words):
    dic = {word: 0 for word in words}
    with open(name) as f:
        for line in f:
            line_text = line.split('|||')[4]
            for word in words:
                # Check if word appears as a hashtag in line_text
                # If so, increment the count for word
    return dic
There are several issues with your code, some of which have already been pointed out, while others (e.g. concerning the identification of hashtags in a tweet's text) have not. Here's a partial solution not covering the fine points of the latter issue:
def HASHcount(name, words):
    dic = dict.fromkeys(words, 0)
    with open(name, "r") as f:
        for line in f:
            for w in words:
                if '#' + w in line:
                    dic[w] += 1
    return dic
This offers several simplifications keyed on the fact that hashtags in a tweet do start with # (which you don't want in the dic) -- as a result it's not worth splitting each line into its fields, since the # cannot be present anywhere except in the text field.
However, it still has a fraction of a problem seen in other answers (except the one which just commented out this most delicate of parts!-) -- it can get false positives from partial matches. When the check is just word in line_text the problem would be huge -- e.g. if a word is cat it gets counted as a hashtag even when present in perfectly ordinary text (on its own or as part of another word, e.g. vindicative). With the '#' + approach it's a bit better, but prefix matches would still lead to false positives, e.g. #catalog would erroneously be counted as a hit for cat.
As some suggested, regular expressions can help with that. However, here's an alternative for the body of the for w in words loop...
for w in words:
    where = line.find('#' + w)
    if where == -1: continue
    after = line[where + len(w) + 1]
    if after in chars_acceptable_in_hashes: continue
    dic[w] += 1
The only issue remaining is to determine which characters can be part of hashtags, i.e., the set chars_acceptable_in_hashes -- I haven't memorized Twitter's specs so I don't know it offhand, but surely you can find out. Note that this works at the end of a line, too, because line has not been stripped, so it's known to end with a \n, which is not in the acceptable set (so a hashtag at the very end of the line will be "properly terminated" too).
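As a rough illustration (my own guess, not Twitter's actual specification), the acceptable set could be approximated as letters, digits, and underscores:

import string

# Assumption: hashtag bodies consist of letters, digits, and underscores
chars_acceptable_in_hashes = set(string.ascii_letters + string.digits + '_')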
I like using the collections module. This worked for me.
from collections import defaultdict

def HASHcount(file_to_open, lst):
    with open(file_to_open) as my_file:
        my_dict = defaultdict(int)
        for line in my_file:
            line = line.split('|||')
            txt = line[4].strip(" ")
            if txt in lst:
                my_dict[txt] += 1
    return my_dict

Check for iambic pentameter?

I am kind of stuck on a question that I have to do regarding iambic pentameters, but because it is long, I'll try to simplify it.
So I need to get some words and their stress patterns from a text file that looks somewhat like this:
if, 0
music,10
be,1
the,0
food,1
of,0
love,1
play,0
on,1
hello,01
world,1
And from the file, you can assume there will be many more words for different sentences. I am trying to read sentences from a text file that contains multiple sentences, and to check whether each sentence (ignoring punctuation and case) is an iambic pentameter.
For example if the text file contains this:
If music be the food of love play on
hello world
The first sentence will be mapped through the stress dictionary to 0101010101, and the second is obviously not a pentameter (011). I would like it to print only the sentences which are iambic pentameters.
Sorry if this is a convoluted or messy question.
This is what I have so far:
import string

dict = {}
sentence = open('sentences.txt')
stress = open('stress.txt')

for some in stress:
    word, number = some.split(',')
    dict[word] = number

for line in sentence:
    one = line.split()
I don't think you are building your dictionary of stresses correctly. It's crucial to remember to get rid of the implicit \n character from lines as you read them in, as well as strip any whitespace from words after you've split on the comma. As things stand, the line if, 0 will be split to ['if', ' 0\n'] which isn't what you want.
So to create your dictionary of stresses you could do something like this:
stress_dict = {}
with open('stress.txt', 'r') as f:
    for line in f:
        word_stress = line.strip().split(',')
        word = word_stress[0].strip().lower()
        stress = word_stress[1].strip()
        stress_dict[word] = stress
For the actual checking, the answer by @khelwood is a good way, but I'd take extra care to handle the \n character as you read in the lines, and also make sure that all the characters in the line are lowercase (like in your dictionary).
Define a function is_iambic_pentameter to check whether a sentence is an iambic pentameter (returning True/False) and then check each line in sentences.txt:
def is_iambic_pentameter(line):
    line_stresses = [stress_dict[word] for word in line.split()]
    line_stresses = ''.join(line_stresses)
    return line_stresses == '0101010101'

with open('sentences.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        line = line.lower()
        if is_iambic_pentameter(line):
            print(line)
As an aside, you might be interested in NLTK, a natural language processing library for Python. Some Internet searching finds that people have written Haiku generators and other scripts for evaluating poetic forms using the library.
I wouldn't have thought iambic pentameter was that clear cut: some words always end up getting stressed or unstressed in order to fit the rhythm. But anyway. Something like this:
for line in sentences:
    words = line.split()
    stresspattern = ''.join([dict[word] for word in words])
    if stresspattern == '0101010101':
        print(line)
By the way, it's generally a bad idea to be calling your dictionary 'dict', since you're hiding the dict type.
Here's what the complete code could look like:
#!/usr/bin/env python3
def is_iambic_pentameter(words, word_stress_pattern):
    """Whether words are a line of iambic pentameter.

    word_stress_pattern is a callable that given a word returns
    its stress pattern.
    """
    return ''.join(map(word_stress_pattern, words)) == '01'*5

# create 'word -> stress pattern' mapping, to implement word_stress_pattern(word)
with open('stress.txt') as stress_file:
    word_stress_pattern = dict(map(str.strip, line.split(','))
                               for line in stress_file).__getitem__

# print lines that use iambic pentameter
with open('sentences.txt') as file:
    for line in file:
        if is_iambic_pentameter(line.casefold().split(), word_stress_pattern):
            print(line, end='')

Iterate through words of a file in Python

I need to iterate through the words of a large file, which consists of a single, long long line. I am aware of methods that iterate through the file line by line; however, they are not applicable in my case because of its single-line structure.
Any alternatives?
It really depends on your definition of word. But try this:
f = open("your-filename-here").read()
for word in f.split():
    # do something with word
    print(word)
This will use whitespace characters as word boundaries.
Of course, remember to properly open and close the file; this is just a quick example.
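For completeness, the same idea with the file opened and closed properly (a small sketch, not part of the original answer):

# using a with-block so the file is closed automatically
with open("your-filename-here") as f:
    for word in f.read().split():
        print(word)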
Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.
First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.
If not, use something like:
def read_words(input_file):
    # wrapped in a generator function (name mine) so the yields below are valid
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
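With the loop wrapped in a generator function as above (read_words is a name chosen here for illustration), usage might look like:

# 'huge_single_line.txt' is a placeholder filename
with open('huge_single_line.txt') as input_file:
    for word in read_words(input_file):
        print(word)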
You really should consider using a generator:
def word_gen(file):
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    for word in word_gen(f):
        print(word)
There are more efficient ways of doing this, but syntactically, this might be the shortest:
words = open('myfile').read().split()
If memory is a concern, you aren't going to want to do this because it will load the entire thing into memory, instead of iterating over it.
I've answered a similar question before, but I have refined the method used in that answer and here is the updated version (copied from a recent answer):
Here is my totally functional approach, which avoids having to read and split lines. It makes use of the itertools module:
Note: for Python 3, replace itertools.imap with map.
import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
                            itertools.imap(mfile.read,
                                           itertools.repeat(1))), str.isspace)
    return ("".join(group) for pred, group in byte_stream if not pred)
Sample usage:
>>> import sys
>>> for w in readwords(sys.stdin):
... print (w)
...
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
It's soo very Functional!
It's
soo
very
Functional!
>>>
I guess in your case, this would be the way to use the function:
with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
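Applying the note above about Python 3 (where itertools.imap no longer exists and map is already lazy), a sketch of the same function:

import itertools

def readwords_py3(mfile):
    # map() is lazy in Python 3, so it replaces itertools.imap directly
    byte_stream = itertools.groupby(
        itertools.takewhile(bool, map(mfile.read, itertools.repeat(1))),
        str.isspace)
    return ("".join(group) for pred, group in byte_stream if not pred)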
Read in the line as normal, then split it on whitespace to break it down into words?
Something like:
word_list = loaded_string.split()
After reading the line you could do:
l = len(pattern)   # `pattern` is the word being searched for
i = 0
while True:
    i = text.find(pattern, i)   # `text` is the line that was read in
    if i == -1:
        break
    print(text[i:i+l])  # or do whatever
    i += l
Alex.
What Donald Miner suggested looks good. Simple and short. I used the below in some code I wrote a while ago:
l = []
f = open("filename.txt", "rU")
for line in f:
    for word in line.split():
        l.append(word)
This is a longer version of what Donald Miner suggested.

Python: counting unique instance of words across several lines

I have a text file with several observations. Each observation is on one line. I would like to detect the unique occurrence of each word in a line. In other words, if the same word occurs twice or more on the same line, it is still counted only once. However, I would like to count the frequency of occurrence of each word across all observations. This means that if a word occurs in two or more lines, I would like to count the number of lines it occurred in. Here is the program I wrote, and it is really slow when processing a large number of files. I also remove certain words from the file by referencing another file. Please offer suggestions on how to improve its speed. Thank you.
import re, string
from itertools import chain, tee, izip
from collections import defaultdict

def count_words(in_file="", del_file="", out_file=""):
    d_list = re.split('\n', file(del_file).read().lower())
    d_list = [x.strip(' ') for x in d_list]
    dict2 = {}
    f1 = open(in_file, 'r')
    lines = map(string.strip, map(str.lower, f1.readlines()))
    for line in lines:
        dict1 = {}
        new_list = []
        for char in line:
            new_list.append(re.sub(r'[0-9#$?*_><#\(\)&;:,.!-+%=\[\]\-\/\^]', "_", char))
        s = ''.join(new_list)
        for word in d_list:
            s = s.replace(word, "")
        for word in s.split():
            try:
                dict1[word] = 1
            except:
                dict1[word] = 1
        for word in dict1.keys():
            try:
                dict2[word] += 1
            except:
                dict2[word] = 1
    freq_list = dict2.items()
    freq_list.sort()
    f1.close()
    word_count_handle = open(out_file, 'w+')
    for word, freq in freq_list:
        print>>word_count_handle, word, freq
    word_count_handle.close()
    return dict2

dict = count_words("in_file.txt", "delete_words.txt", "out_file.txt")
You're running re.sub on each character of the line, one at a time. That's slow. Do it on the whole line:
s = re.sub(r'[0-9#$?*_><#\(\)&;:,.!-+%=\[\]\-\/\^]', "_", line)
Also, have a look at sets and the Counter class in the collections module. It may be faster if you just count and then discard those you don't want afterwards.
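As a rough sketch of that suggestion (the function and variable names here are illustrative, not from the question), using a set for the delete-words and a Counter for the per-line unique words:

import re
from collections import Counter

def count_words_fast(in_file, del_file):
    # words to drop, kept in a set for O(1) membership tests
    with open(del_file) as f:
        stop_words = {line.strip().lower() for line in f if line.strip()}
    counts = Counter()
    with open(in_file) as f:
        for line in f:
            # unique words on this line; punctuation and digits act as separators
            words = set(re.sub(r"[^a-z']+", ' ', line.lower()).split())
            counts.update(words - stop_words)
    return counts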
Without having done any performance testing, the following come to mind:
1) you're using regexes -- why? Are you just trying to get rid of certain characters?
2) you're using exceptions for flow control -- although it can be pythonic (better to ask forgiveness than permission), throwing exceptions can often be slow. As seen here:
for word in dict1.keys():
    try:
        dict2[word] += 1
    except:
        dict2[word] = 1
3) turn d_list into a set, and use python's in to test for membership, and simultaneously ...
4) avoid heavy use of replace method on strings -- I believe you're using this to filter out the words that appear in d_list. This could be accomplished instead by avoiding replace, and just filtering the words in the line, either with a list comprehension:
[word for word in words if word not in del_words]
or with a filter (not very pythonic):
filter(lambda word: not word in del_words, words)
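Putting points (3) and (4) together might look roughly like this (a sketch reusing s and d_list from the question's code; the set would be built once, before the line loop):

del_words = set(d_list)   # (3) set for fast membership tests
kept_words = [word for word in s.split() if word not in del_words]   # (4) no replace()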
import re

u_words = set()
u_words_in_lns = []
wordcount = {}
words = []

# get unique words per line
for line in buff.split('\n'):
    u_words_in_lns.append(set(line.split(' ')))

# create a set of all unique words
map(u_words.update, u_words_in_lns)

# flatten the sets into a single list of words again
map(words.extend, u_words_in_lns)

# count everything up
for word in u_words:
    wordcount[word] = len(re.findall(word, str(words)))
