Check for iambic pentameter? - python

I am kind of stuck on a question that I have to do regarding iambic pentameters, but because it is long, I'll try to simplify it.
So I need to get some words and their stress patterns from a text file that look somewhat like this:
if, 0
music,10
be,1
the,0
food,1
of,0
love,1
play,0
on,1
hello,01
world,1
And from the file, you can assume there will be much more words for different sentences. I am trying to get sentences from a text file which have multiple sentences, and to see if the sentence (ignoring punctuation and case) is an iambic pentameter.
For example if the text file contains this:
If music be the food of love play on
hello world
The first sentence will be assigned from the stress dictionary like this: 0101010101, and the second is obviously not a pentameter(011). I would like it so that it only prints sentences which are iambic pentameters.
Sorry if this is a convoluted or messy question.
This is what I have so far:
import string
dict = {};
sentence = open('sentences.txt')
stress = open('stress.txt')
for some in stress:
word,number = some.split(',')
dict[word] = number
for line in sentence:
one = line.split()

I don't think you are building your dictionary of stresses correctly. It's crucial to remember to get rid of the implicit \n character from lines as you read them in, as well as strip any whitespace from words after you've split on the comma. As things stand, the line if, 0 will be split to ['if', ' 0\n'] which isn't what you want.
So to create your dictionary of stresses you could do something like this:
stress_dict = {}
with open('stress.txt', 'r') as f:
for line in f:
word_stress = line.strip().split(',')
word = word_stress[0].strip().lower()
stress = word_stress[1].strip()
stress_dict[word] = stress
For the actual checking, the answer by #khelwood is a good way, but I'd take extra care to handle the \n character as you read in the lines and also make sure that all the characters in the line were lowercase (like in your dictionary).
Define a function is_iambic_pentameter to check whether a sentence is an iambic pentameter (returning True/False) and then check each line in sentences.txt:
def is_iambic_pentameter(line):
line_stresses = [stress_dict[word] for word in line.split()]
line_stresses = ''.join(line_stresses)
return line_stresses == '0101010101'
with open('sentences.txt', 'r') as f:
for line in f:
line = line.rstrip()
line = line.lower()
if is_iambic_pentameter(line):
print line
As an aside, you might be interested in NLTK, a natural language processing library for Python. Some Internet searching finds that people have written Haiku generators and other scripts for evaluating poetic forms using the library.

I wouldn't have thought iambic pentameter was that clear cut: always some words end up getting stressed or unstressed in order to fit the rhythm. But anyway. Something like this:
for line in sentences:
words = line.split()
stresspattern = ''.join([dict[word] for word in words])
if stresspattern=='0101010101':
print line
By the way, it's generally a bad idea to be calling your dictionary 'dict', since you're hiding the dict type.

Here's how the complete code could look like:
#!/usr/bin/env python3
def is_iambic_pentameter(words, word_stress_pattern):
"""Whether words are a line of iambic pentameter.
word_stress_pattern is a callable that given a word returns
its stress pattern
"""
return ''.join(map(word_stress_pattern, words)) == '01'*5
# create 'word -> stress pattern' mapping, to implement word_stress_pattern(word)
with open('stress.txt') as stress_file:
word_stress_pattern = dict(map(str.strip, line.split(','))
for line in stress_file).__getitem__
# print lines that use iambic pentameter
with open('sentences.txt') as file:
for line in file:
if is_iambic_pentameter(line.casefold().split(), word_stress_pattern):
print(line, end='')

Related

How to compare contents of two large text files in Python?

Datasets: Two Large text files for train and test that all words of them are tokenized. a part of data is like the following: " the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place . "
Question: How can I replace every word in the test data not seen in training with the word "unk" in Python?
So far, I made the dictionary by the following codes to count the frequency of each word in the file:
#open text file and assign it to varible with the name "readfile"
readfile= open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-train.txt','r')
writefile=open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-trainReplaced.txt','w')
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in readfile:
# Split the line into words
words = line.split(" ")
# Iterate over each word in line
for word in words:
# Check if the word is already in dictionary
if word in d:
# Increment count of word by 1
d[word] = d[word] + 1
else:
# Add the word to dictionary with count 1
d[word] = 1
#replace all words occurring in the training data once with the token<unk>.
for key in list(d.keys()):
line= d[key]
if (line==1):
line="<unk>"
writefile.write(str(d))
else:
writefile.write(str(d))
#close the file that we have created and we wrote the new data in that
writefile.close()
Honestly the above code doesn't work with writefile.write(str(d)) which I want to write the result in the new textfile, but by print(key, ":", line) it works and shows the frequency of each word but in the console which doesn't create the new file. if you also know the reason for this, please let me know.
First off, your task is to replace the words in test file that are not seen in train file. Your code never mentions the test file. You have to
Read the train file, gather what words are there. This is mostly okay; but you need to .strip() your line or the last word in each line will end with a newline. Also, it would make more sense to use set instead of dict if you don't need to know the count (and you don't, you just want to know if it's there or not). Sets are cool because you don't have to care if an element is in already or not; you just toss it in. If you absolutely need to know the count, using collections.Counter is easier than doing it yourself.
Read the test file, and write to replacement file, as you are replacing the words in each line. Something like:
with open("test", "rt") as reader:
with open("replacement", "wt") as writer:
for line in reader:
writer.write(replaced_line(line.strip()) + "\n")
Make sense, which your last block does not :P Instead of seeing whether a word from test file is seen or not, and replacing the unseen ones, you are iterating on the words you have seen in the train file, and writing <unk> if you've seen them exactly once. This does something, but not anything close to what it should.
Instead, split the line you got from the test file and iterate on its words; if the word is in the seen set (word in seen, literally) then replace its contents; and finally add it to the output sentence. You can do it in a loop, but here's a comprehension that does it:
new_line = ' '.join(word if word in seen else '<unk>'
for word in line.split(' '))

Regular Expression to find valid words in file

I need to write a function get_specified_words(filename) to get a list of lowercase words from a text file. All of the following conditions must be applied:
Include all lower-case character sequences including those that
contain a - or ' character and those that end with a '
character.
Exclude words that end with a -.
The function must only process lines between the start and end marker lines
Use this regular expression to extract the words from each relevant line of a file: valid_line_words = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line)
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading. That is your open file call should look like open(filename, encoding='utf-8'). This will be especially helpful if your operating system doesn't set Python's default encoding to UTF-8.
The sample text file testing.txt contains this:
That are after the start and should be dumped.
So should that
and that
and yes, that
*** START OF SYNTHETIC TEST CASE ***
Toby's code was rather "interesting", it had the following issues: short,
meaningless identifiers such as n1 and n; deep, complicated nesting;
a doc-string drought; very long, rambling and unfocused functions; not
enough spacing between functions; inconsistent spacing before and
after operators, just like this here. Boy was he going to get a low
style mark.... Let's hope he asks his friend Bob to help him bring his code
up to an acceptable level.
*** END OF SYNTHETIC TEST CASE ***
This is after the end and should be ignored too.
Have a nice day.
Here's my code:
import re
def stripped_lines(lines):
for line in lines:
stripped_line = line.rstrip('\n')
yield stripped_line
def lines_from_file(fname):
with open(fname, 'rt') as flines:
for line in stripped_lines(flines):
yield line
def is_marker_line(line, start='***', end='***'):
min_len = len(start) + len(end)
if len(line) < min_len:
return False
return line.startswith(start) and line.endswith(end)
def advance_past_next_marker(lines):
for line in lines:
if is_marker_line(line):
break
def lines_before_next_marker(lines):
valid_lines = []
for line in lines:
if is_marker_line(line):
break
valid_lines.append(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line))
for content_line in valid_lines:
yield content_line
def lines_between_markers(lines):
it = iter(lines)
advance_past_next_marker(it)
for line in lines_before_next_marker(it):
yield line
def words(lines):
text = '\n'.join(lines).lower().split()
return text
def get_valid_words(fname):
return words(lines_between_markers(lines_from_file(fname)))
# This must be executed
filename = "valid.txt"
all_words = get_valid_words(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(all_words)))
print("word list:")
print("\n".join(all_words))
Here's my output:
File "C:/Users/jj.py", line 45, in <module>
text = '\n'.join(lines).lower().split()
builtins.TypeError: sequence item 0: expected str instance, list found
Here's the expected output:
valid.txt loaded ok.
73 valid words found.
word list:
toby's
code
was
rather
interesting
it
had
the
following
issues
short
meaningless
identifiers
such
as
n
and
n
deep
complicated
nesting
a
doc-string
drought
very
long
rambling
and
unfocused
functions
not
enough
spacing
between
functions
inconsistent
spacing
before
and
after
operators
just
like
this
here
boy
was
he
going
to
get
a
low
style
mark
let's
hope
he
asks
his
friend
bob
to
help
him
bring
his
code
up
to
an
acceptable
level
I need help with getting my code to work. Any help is appreciated.
lines_between_markers(lines_from_file(fname))
gives you a list of list of valid words.
So you just need to flatten it :
def words(lines):
words_list = [w for line in lines for w in line]
return words_list
Does the trick.
But I think that you should review the design of your program :
lines_between_markers should only yield lines between markers, but it does more. Regexp should be use on the result of this function and not inside the function.
What you didn't do :
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading.
That is your open file call should look like open(filename,
encoding='utf-8').

count word in textfile

I have a textfile that I wanna count the word "quack" in.
textfile named "quacker.txt" example:
This is the textfile quack.
Oh, and how quack did quack do in his exams back in 2009?\n Well, he passed with nine P grades and one B.\n He says that quack he wants to go to university in the\n future but decided to try and make a career on YouTube before that Quack....\n So, far, it’s going very quack well Quack!!!!
So here I want 7 as the output.
readf= open("quacker.txt", "r")
lst= []
for x in readf:
lst.append(str(x).rstrip('\n'))
readf.close()
#above gives a list of each row.
cv=0
for i in lst:
if "quack" in i.strip():
cv+=1
above only works for one "quack" in the element of the list
Well if the file isn't too long, you could try:
with open('quacker.txt') as f:
text = f.read().lower() # make it all lowercase so the count works below
quacks = text.count('quack')
As #PadraicCunningham mentioned in the comments, this would also count the 'quack' in
words like 'quacks' or 'quacking'. But if that's not an issue, then this is fine.
you're incrementing by one if the line contains the string, but what if the line has several occurrences of 'quack'?
try:
for line in lst:
for word in line.split():
if 'quack' in word:
cv+=1
You need to lower, strip and split to get an accurate count:
from string import punctuation
with open("test.txt") as f:
quacks = sum(word.lower().strip(punctuation) == "quack"
for line in f for word in line.split())
print(quacks)
7
You need to split each word in the file into individual words or you will get False positives using in or count. word.lower().strip(punctuation) lowers each word and removes any punctuation, sum will sum all the times word.lower().strip(punctuation) == "quack" is True.
In your own code x is already a string so calling str(x)... is unnecessary, you could also just check each line the first time you iterate, there is no need to add the strings to a list and then iterate a second time. Why you only get one returned is most like because all the data is actually on a single line, you are also comparing quack to Quack which will not work, you need to lower the string.

Counting Hashtag

I'm writing a function called HASHcount(name,list), which receives 2 parameters, the name one is the name of the file that will be analized, a text file structured like this:
Date|||Time|||Username|||Follower|||Text
So, basically my input is a tweets list, with several rows structured like above. The list parameter is a list of hashtags I want to count in that text file. I want my function to check how many times each word of the list given occurred in the tweets list, and give as output a dictionary with each word count, even if the word is missing.
For instance, with the instruction HASHcount(December,[Peace, Love]) the program should give as output a dictionary made by checking how many times the word Peace and the word Love have been used as hashtag in the Text field of each tweet in the file called December.
Also, in the dictionary the words have to be without the hashtag simbol.
I'm stuck on making this function, I'm at this point but I'm having some issues concerning the dictionary:
def HASHcount(name,list):
f = open(name,"r")
dic={}
l = f.readline()
for word in list:
dic[word]=0
for line in f:
li_lis=line.split("|||")
li_tuple=tuple(li_lis)
if word in li_tuple[4]:
dic[word]=dic[word]+1
return dic
The main issue is that you are iterating over the lines in the file for each word, rather than the reverse. Thus the first word will consume all the lines of the file, and each subsequent word will have 0 matches.
Instead, you should do something like this:
def hash_count(name, words):
dic = {word:0 for word in words}
with open(name) as f:
for line in f:
line_text = line.split('|||')[4]
for word in words:
# Check if word appears as a hashtag in line_text
# If so, increment the count for word
return dic
There are several issues with your code, some of which have already been pointed out, while others (e.g concerning the identification of hashtags in a tweet's text) have not. Here's a partial solution not covering the fine points of the latter issue:
def HASHcount(name, words):
dic = dict.fromkeys(words, 0)
with open(name,"r") as f:
for line in f:
for w in words:
if '#' + w in line:
dic[w] += 1
return dic
This offers several simplifications keyed on the fact that hashtags in a tweet do start with # (which you don't want in the dic) -- as a result it's not worth analyzing each line since the # cannot be present except in the text.
However, it still has a fraction of a problem seen in other answers (except the one which just commented out this most delicate of parts!-) -- it can get false positives by partial matches. When the check is just like word in linetext the problem would be huge -- e.g if a word is cat it gets counted as hashtag even if present in perfectly ordinary text (on its own or as part of another word, e.g vindicative). With the '#' + approach, it's a bit better, but still, prefix matches would lead to a false positive, e.g #catalog would erroneously be counted as a hit for cat.
As some suggested, regular expressions can help with that. However, here's an alternative for the body of the for w in words loop...
for w in words:
where = line.find('#' + w)
if where == -1: continue
after = line[where + len(w) + 1]
if after in chars_acceptable_in_hashes: continue
dic[w] += 1
The only issue remaining is to determine which characters can be part of hashtags, i.e, the set chars_acceptable_in_hashes -- I haven't memorized Twitter's specs so I don't know it offhand, but surely you can find out. Note that this works at end of line, too, because line has not be stripped, so it's known to end with a \n. which is not in the acceptable set (so a hashtag at the very end of the line will be "properly terminated" too).
I like using collections module. This worked for me.
from collections import defaultdict
def HASHcount(file_to_open, lst):
with open(file_to_open) as my_file:
my_dict= defaultdict(int)
for line in my_file:
line = line.split('|||')
txt = line[4].strip(" ")
if txt in lst:
my_dict[txt] += 1
return my_dict

Iterate through words of a file in Python

I need to iterate through the words of a large file, which consists of a single, long long line. I am aware of methods iterating through the file line by line, however they are not applicable in my case, because of its single line structure.
Any alternatives?
It really depends on your definition of word. But try this:
f = file("your-filename-here").read()
for word in f.split():
# do something with word
print word
This will use whitespace characters as word boundaries.
Of course, remember to properly open and close the file, this is just a quick example.
Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.
First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.
If not, use something like:
line = ''
while True:
word, space, line = line.partition(' ')
if space:
# A word was found
yield word
else:
# A word was not found; read a chunk of data from file
next_chunk = input_file.read(1000)
if next_chunk:
# Add the chunk to our line
line = word + next_chunk
else:
# No more data; yield the last word and return
yield word.rstrip('\n')
return
You really should consider using Generator
def word_gen(file):
for line in file:
for word in line.split():
yield word
with open('somefile') as f:
word_gen(f)
There are more efficient ways of doing this, but syntactically, this might be the shortest:
words = open('myfile').read().split()
If memory is a concern, you aren't going to want to do this because it will load the entire thing into memory, instead of iterating over it.
I've answered a similar question before, but I have refined the method used in that answer and here is the updated version (copied from a recent answer):
Here is my totally functional approach which avoids having to read and
split lines. It makes use of the itertools module:
Note for python 3, replace itertools.imap with map
import itertools
def readwords(mfile):
byte_stream = itertools.groupby(
itertools.takewhile(lambda c: bool(c),
itertools.imap(mfile.read,
itertools.repeat(1))), str.isspace)
return ("".join(group) for pred, group in byte_stream if not pred)
Sample usage:
>>> import sys
>>> for w in readwords(sys.stdin):
... print (w)
...
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
It's soo very Functional!
It's
soo
very
Functional!
>>>
I guess in your case, this would be the way to use the function:
with open('words.txt', 'r') as f:
for word in readwords(f):
print(word)
Read in the line as normal, then split it on whitespace to break it down into words?
Something like:
word_list = loaded_string.split()
After reading the line you could do:
l = len(pattern)
i = 0
while True:
i = str.find(pattern, i)
if i == -1:
break
print str[i:i+l] # or do whatever
i += l
Alex.
What Donald Miner suggested looks good. Simple and short. I used the below in a code that I have written some time ago:
l = []
f = open("filename.txt", "rU")
for line in f:
for word in line.split()
l.append(word)
longer version of what Donald Miner suggested.

Categories