IndexError while iterating through a generator - python

I am trying to solve a problem for my programming classes. I am given a folder that contains e-mails and special files. Special files always begin with "!". I am supposed to add a method emails() inside the Corpus class. The method should be a generator. This is the example of its use:
    corpus = Corpus('/path/to/directory/with/emails')
    count = 0
    # Go through all emails and print the filename and the message body
    for fname, body in corpus.emails():
        print(fname)
        print(body)
        print('-------------------------')
        count += 1
    print('Finished: ', count, 'files processed.')
This is the class and the method I've written:
    class Corpus:
        def __init__(self, path_to_mails_directory):
            self.path_to_mails_directory = path_to_mails_directory

        def emails(self):
            iterator = 0
            mail_body = None
            mails_folder = os.listdir(self.path_to_mails_directory)
            lenght = len(mails_folder)
            while iterator <= lenght:
                if not mails_folder[iterator].startswith("!"):
                    with open(self.path_to_mails_directory+"/"+mails_folder[iterator]) as an_e_mail:
                        mail_body = an_e_mail.read()
                    yield mails_folder[iterator], mail_body
                iterator += 1
I tried to run the example code this way:
    if __name__ == "__main__":
        my_corpus = Corpus("data/1")
        my_gen = my_corpus.emails()
        count = 0
        for fname, body in my_gen:
            print(fname)
            print(body)
            print("------------------------------")
            count += 1
        print("finished: " + str(count))
Python prints a lot of mails as expected (the folder contains about a thousand files) and then fails with:
    Traceback (most recent call last):
      File "C:/Users/tvavr/PycharmProjects/spamfilter/corpus.py", line 26, in <module>
        for fname, body in my_gen:
      File "C:/Users/tvavr/PycharmProjects/spamfilter/corpus.py", line 15, in emails
        if not mails_folder[iterator].startswith("!"):
    IndexError: list index out of range
I have no clue what the problem is and would appreciate any help. Thx
EDIT: I updated the code a bit based on your suggestions.

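The condition "while iterator <= lenght" lets iterator reach len(mails_folder), which is one past the last valid index (len(mails_folder) - 1), so the final lookup raises the IndexError. Changing <= to < would fix that, but a better way is to iterate over the list directly:
    def emails(self):
        mail_body = None
        mails_folder = os.listdir(self.path_to_mails_directory)
        for mail in mails_folder:
            if mail.startswith("!"):
                pass
            else:
                with open(self.path_to_mails_directory+"/"+mail) as an_e_mail:
                    mail_body = an_e_mail.read()
                yield mail, mail_body
Index-based iteration is not considered Pythonic. Prefer the "for mail in mails_folder:" form.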
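For reference, here is a self-contained sketch of the same generator that can be run without the course data. The temporary directory, the file names and the sorted() call are assumptions made only for this demo; os.path.join replaces the manual "/" concatenation.

```python
import os
import tempfile


class Corpus:
    def __init__(self, path_to_mails_directory):
        self.path_to_mails_directory = path_to_mails_directory

    def emails(self):
        # sorted() only makes the demo order predictable;
        # os.listdir() itself returns names in arbitrary order.
        for fname in sorted(os.listdir(self.path_to_mails_directory)):
            if fname.startswith("!"):
                continue  # skip special files
            path = os.path.join(self.path_to_mails_directory, fname)
            with open(path, encoding="utf-8") as an_e_mail:
                mail_body = an_e_mail.read()
            yield fname, mail_body


# Hypothetical demo data: two e-mails and one special file.
demo_dir = tempfile.mkdtemp()
for name, text in [("a.txt", "hello"), ("b.txt", "world"), ("!truth.txt", "a.txt OK")]:
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
        f.write(text)

found = list(Corpus(demo_dir).emails())
print(found)
```

Because the for loop never computes an index, there is no off-by-one condition left to get wrong.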

Related

Attribute Error - int object has no attribute keys - Python

I'm running this code but I seem to keep getting an attribute error and I don't know how this would be fixed. I've included code as well as the shell window when I run it!
    # the Count class. The wordleFromObject function takes a Count object as
    # input, and calls its getTopWords method.
    import string

    class Count:
        # method to initialize any data structures, such as a dictionary to
        # hold the counts for each word, and a list of stop words
        def __init__(self):
            #print("Initializing Word Counter")
            # set the attribute wordCounts to an empty dictionary
            self.wordCounts = {}
            infile = open("stop_words.txt", "r")
            self.stop_word_dict = {};
            for line in infile.readlines():
                self.stop_word_dict = 1

        # method to add one to the count for a word in the dictionary.
        # if the word is not yet in the dictionary, we'll need to add a
        # record for the word, with a count of one.
        def incCount(self,word):
            my_table = str.maketrans('', '', string.punctuation)
            self.wordCounts = {}
            if word in self.stop_word_dict.keys():
                return
            else:
                self.stop_word_dict += 1
            cleaned_word = word.translate(my_table).lower()
            if cleaned_word != '':
                if cleaned_word in self.wordCounts.keys():
                    self.wordCounts[cleaned_word] += 1
                else:
                    self.wordCounts[cleaned_word] = 1

        # method to look up the count for a word
        def lookUpCount(self, word):
            return self.wordCounts.get(word.lower(), 0)

    def main():
        print("Initializing Word Counter")
        filename = input("Enter book file:")
        infile = open(filename, "r")
        counter = Count()
        for line in infile.readlines():
            words = [word.strip() for word in line.strip().split()]
            for word in words:
                counter.incCount(word)
        infile.close()
        # Test code for Part 2 and 3
        # Comment this code once you have completed part 3.
        print(counter.lookUpCount("alice"))
        print(counter.lookUpCount("rabbit"))
        print(counter.lookUpCount("and"))
        print(counter.lookUpCount("she"))
        return
        # Test code for Part 4 and 5
        # topTen = counter.getTopWords(10)
        # print(topTen)
        # Test code for Part 5
        # Import the wordle module and uncomment the call to the wordle function!
        # wordle.wordleFromObject(counter,30)

    # run the main program
    main()
Error Message:
Initializing Word Counter
Enter book file:Alice.txt
    Traceback (most recent call last):
      line 69, in <module>
        main()
      line 50, in main
        counter.incCount(word)
      line 28, in incCount
        if word in self.stop_word_dict.keys():
    AttributeError: 'int' object has no attribute 'keys'
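    for line in infile.readlines():
        self.stop_word_dict = 1
In these lines you rebind stop_word_dict from a dict to an int, so the later call self.stop_word_dict.keys() fails: an int has no "keys" attribute. Store each stop word as a key instead, for example self.stop_word_dict[line.strip()] = 1. The statement self.stop_word_dict += 1 in incCount has the same problem and should be removed.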
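A minimal sketch of a corrected Count class follows. It assumes the stop words are passed in as an iterable (a hypothetical stop_words parameter standing in for stop_words.txt), and it also drops the self.wordCounts = {} line from incCount, which would otherwise reset the counts on every call.

```python
import string


class Count:
    def __init__(self, stop_words):
        # stop_words: any iterable of words (hypothetical parameter; the
        # question reads them from stop_words.txt instead).
        self.wordCounts = {}
        self.stop_word_dict = {w.strip().lower(): 1 for w in stop_words if w.strip()}
        self._table = str.maketrans('', '', string.punctuation)

    def incCount(self, word):
        # Clean first, so that "And" and "and," are recognised as stop words too.
        cleaned_word = word.translate(self._table).lower()
        if cleaned_word == '' or cleaned_word in self.stop_word_dict:
            return
        self.wordCounts[cleaned_word] = self.wordCounts.get(cleaned_word, 0) + 1

    def lookUpCount(self, word):
        return self.wordCounts.get(word.lower(), 0)


counter = Count(["and", "she"])
for word in "Alice! And the rabbit, and ALICE.".split():
    counter.incCount(word)
print(counter.lookUpCount("alice"), counter.lookUpCount("rabbit"), counter.lookUpCount("and"))
```

Cleaning the word before the stop-word check is a design choice of this sketch; the question's code checks the raw word first.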

replace function not working with list items

I am trying to use the replace function to take items from a list and replace the fields below with their corresponding values. No matter what I do, it only seems to work at the end of the range: on the last possible value of i it successfully replaces a substring, but before that it does not.
    for i in range(len(fieldNameList)):
        foo = fieldNameList[i]
        bar = fieldValueList[i]
        msg = msg.replace(foo, bar)
        print msg
This is what I get after running that code
<<name>> <<color>> <<age>>
<<name>> <<color>> <<age>>
<<name>> <<color>> 18
I've been stuck on this for way too long. Any advice would be greatly appreciated. Thanks :)
Full code:
    def writeDocument():
        msgFile = raw_input("Name of file you would like to create or write to?: ")
        msgFile = open(msgFile, 'w+')
        msg = raw_input("\nType your message here. Indicate replaceable fields by surrounding them with \'<<>>\' Do not use spaces inside your fieldnames.\n\nYou can also create your fieldname list here. Write your first fieldname surrounded by <<>> followed by the value you'd like to assign, then repeat, separating everything by one space. Example: \"<<name>> ryan <<color>> blue\"\n\n")
        msg = msg.replace(' ', '\n')
        msgFile.write(msg)
        msgFile.close()
        print "\nDocument written successfully.\n"

    def fillDocument():
        msgFile = raw_input("Name of file containing the message you'd like to fill?: ")
        fieldFile = raw_input("Name of file containing the fieldname list?: ")
        msgFile = open(msgFile, 'r+')
        fieldFile = open(fieldFile, 'r')
        fieldNameList = []
        fieldValueList = []
        fieldLine = fieldFile.readline()
        while fieldLine != '':
            fieldNameList.append(fieldLine)
            fieldLine = fieldFile.readline()
            fieldValueList.append(fieldLine)
            fieldLine = fieldFile.readline()
        print fieldNameList[0]
        print fieldValueList[0]
        print fieldNameList[1]
        print fieldValueList[1]
        msg = msgFile.readline()
        for i in range(len(fieldNameList)):
            foo = fieldNameList[i]
            bar = fieldValueList[i]
            msg = msg.replace(foo, bar)
            print msg
        msgFile.close()
        fieldFile.close()

    ###Program Starts#####--------------------
    while True==True:
        objective = input("What would you like to do?\n1. Create a new document\n2. Fill in a document with fieldnames\n")
        if objective == 1:
            writeDocument()
        elif objective == 2:
            fillDocument()
        else:
            print "That's not a valid choice."
Message file:
<<name>> <<color>> <<age>>
Fieldname file:
<<name>>
ryan
<<color>>
blue
<<age>>
18
Cause:
Every line read from the fieldname file, except the last one, ends with a "\n" character. So when the program reaches the replacement loop, fieldNameList, fieldValueList and msg look like this:
    fieldNameList = ['<<name>>\n', '<<color>>\n', '<<age>>\n']
    fieldValueList = ['ryan\n', 'blue\n', '18']
    msg = '<<name>> <<color>> <<age>>\n'
The replace() function therefore searches msg for '<<name>>\n', '<<color>>\n' and '<<age>>\n', and only '<<age>>\n' is found, because <<age>> is the only field directly followed by a newline. (The message file must end with "\n"; otherwise not even <<age>> would be replaced.)
Solution:
Use the rstrip() method when reading lines to strip the newline character at the end:
    fieldLine = fieldFile.readline().rstrip()
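The effect can be reproduced without any files. The sketch below is written for Python 3 (the question uses Python 2) and uses io.StringIO as a stand-in for the fieldname file; pairing names and values with zip is an illustrative choice, not part of the original code.

```python
import io

# Stand-ins for the two files from the question.
field_file = io.StringIO("<<name>>\nryan\n<<color>>\nblue\n<<age>>\n18")
msg = "<<name>> <<color>> <<age>>\n"

# Strip the line endings, then pair each field name with the value after it.
lines = [line.rstrip("\n") for line in field_file]
fields = dict(zip(lines[0::2], lines[1::2]))

for name, value in fields.items():
    msg = msg.replace(name, value)

print(msg)
```

With the newlines stripped, every field name matches and all three placeholders are replaced.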

self modifying python script

I want to create a Python script that can modify its own code, using Python Language Services or any other way.
For example, a script which keeps track of its count of successful executions:
    import re
    COUNT = 0

    def updateCount():
        # code to update second line e.g. COUNT = 0
        pass

    if __name__ == '__main__':
        print('This script has run {} times'.format(COUNT))
        updateCount()
On successful execution of this script, the code should change to:
    import re
    COUNT = 1

    def updateCount():
        # code to update second line e.g. COUNT = 0
        pass

    if __name__ == '__main__':
        print('This script has run {} times'.format(COUNT))
        updateCount()
The simple approach that came to mind was to open __file__ in write mode and make the required modification using regular expressions etc. But that did not work: I got the exception io.UnsupportedOperation: not readable. Even if this approach worked, it would be very risky because it could spoil my whole script, so I am looking for a solution using Python Language Services.
Yes, you can use the language services to achieve self-modification, as in the following example:
    >>> def foo(): print("original foo")
    ...
    >>> foo()
    original foo
    >>> rewrite_text = "def foo(): print('I am new foo')"
    >>> newcode = compile(rewrite_text, "", 'exec')
    >>> eval(newcode)
    >>> foo()
    I am new foo
So, with dynamically generated code you can replace things defined in the original source file without modifying the file itself. Note that the change lasts only for the current process, so on its own it will not make COUNT persist between runs.
A Python script is nothing more than a text file, so you can open it like any other file and read and write it. (The __file__ variable gives you the exact name of your script.)
    def updateCount():
        fin = open(__file__, 'r')
        code = fin.read()
        fin.close()
        second_line = code.split('\n')[1]
        second_line_parts = second_line.split(' ')
        second_line_parts[2] = str(int(second_line_parts[2])+1)
        second_line = ' '.join(second_line_parts)
        lines = code.split('\n')
        lines[1] = second_line
        code = '\n'.join(lines)
        fout = open(__file__, 'w')
        fout.write(code)
        fout.close()
@kyriakosSt's answer works, but it hard-codes that the assignment to COUNT must be on the second line. That can lead to unexpected behavior later, when the line number changes because the source was modified for some other reason.
For a more robust solution, you can use lib2to3 to parse and update the source code instead. Subclass lib2to3.refactor.RefactoringTool and give it a fixer that is a subclass of lib2to3.fixer_base.BaseFix. The fixer has a pattern that looks for an expression statement of the form 'COUNT' '=' any, and a transform method that updates the last child node by incrementing its integer value:
    from lib2to3 import fixer_base, refactor

    COUNT = 0 # this should be incremented every time the script runs

    class IncrementCount(fixer_base.BaseFix):
        PATTERN = "expr_stmt< 'COUNT' '=' any >"

        def transform(self, node, results):
            node.children[-1].value = str(int(node.children[-1].value) + 1)
            return node

    class Refactor(refactor.RefactoringTool):
        def __init__(self, fixers):
            self._fixers = [cls(None, None) for cls in fixers]
            super().__init__(None)

        def get_fixers(self):
            return self._fixers, []

    with open(__file__, 'r+') as file:
        source = str(Refactor([IncrementCount]).refactor_string(file.read(), ''))
        file.seek(0)
        file.write(source)
Demo: https://repl.it/#blhsing/MushyStrangeClosedsource
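The exception in the question comes from opening the file with mode 'w', which is write-only and also truncates the file as soon as it is opened. Reading first and writing afterwards avoids both problems. Below is a hedged, regex-based sketch of that idea that does not depend on the line number and does not need lib2to3 (which has been removed from the standard library as of Python 3.13). The helper name increment_count and the temporary demo file are assumptions for illustration; the pattern only matches a bare "COUNT = <integer>" line without a trailing comment.

```python
import os
import re
import tempfile

# Matches a module-level line such as "COUNT = 0" wherever it appears.
COUNT_LINE = re.compile(r"^COUNT = (\d+)[ \t]*$", flags=re.MULTILINE)


def increment_count(path):
    """Increment the first 'COUNT = <int>' line in the file at path."""
    with open(path, "r", encoding="utf-8") as f:  # read the whole source first
        source = f.read()
    new_source, replaced = COUNT_LINE.subn(
        lambda m: "COUNT = {}".format(int(m.group(1)) + 1), source, count=1
    )
    if replaced:  # only rewrite the file when the line was found
        with open(path, "w", encoding="utf-8") as f:
            f.write(new_source)
    return bool(replaced)


# Demo on a temporary file instead of __file__, so nothing real is modified.
fd, demo_path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("import re\n\nCOUNT = 9\nprint(COUNT)\n")

increment_count(demo_path)
with open(demo_path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

In a real script, updateCount() could call increment_count(__file__). The risk the question mentions remains: a crash between truncating and writing would leave the script damaged, so writing to a temporary file and renaming it over the original may be safer.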
This approach edits the module-level variables defined before _local_config. After the script has done its work, it rebuilds the dictionary with the new values and then, while iterating over its own source file, replaces each matching assignment line with the new _local_config value:
    count = 0
    a = 0
    b = 1
    c = 1
    _local_config = dict(
        filter(
            lambda elem: (elem[0][:2] != "__") and (str(elem[1])[:1] != "<"),
            globals().items(),
        ),
    )
    # do some stuff
    count += 1
    c = a + b
    a = b
    b = c
    # update with new values
    _local_config = dict(
        filter(
            lambda elem: elem[0] in _local_config.keys(),
            globals().items(),
        )
    )
    # read self
    with open(__file__, "r") as f:
        new_file = ""
        for line in f.read().split("\n"):
            for k, v in _local_config.items():
                search = f"{k} = "
                if search == line[: len(k) + 3]:
                    line = search + str(v)
                    _local_config.pop(k)
                    break
            new_file += line + "\n"
    # write self
    with open(__file__, "w") as f:
        f.write(new_file[:-1])

Taking a class and making a new class that is object oriented (python)

I have to take code that was already given and improve it by turning it into an object-oriented class.
The code below was given, and we use it in our new code. The file 'students2.txt' is read line by line (each line is split on ':'), and the StudentFileReader class is imported into the new class StudentReport(object). The finished project is supposed to print a student list with ID numbers, first and last names, and GPAs. All of that information is given in 'students2.txt'; I just have to make the code print it.
filereader.py:
    class StudentFileReader:
        def __init__(self, inputSrc):
            self._inputSrc = inputSrc
            self._inputFile = None

        def open(self):
            self._inputFile = open(self._inputSrc, 'r')

        def close(self):
            self._inputFile.close()
            self._inputFile = None

        def fetchRecord(self):
            line = self._inputFile.readline()
            if line == "":
                return None
            record = StudentRecord()
            #change
            record.idNum = int(line)
            record.firstName = self._inputFile.readline().rstrip().rsplit(':')
            record.lastName = self._inputFile.readline().rstrip().rsplit(':')
            record.classCode = int(self._inputFile.readline())
            record.gpa = float(self._inputFile.readline())
            return record

    class StudentRecord:
        def __init__(self):
            self.idNum = 0
            self.firstName = ""
            self.lastName = ""
            self.classCode = 0
            self.gpa = 0.0
New file:
    from filereader import StudentFileReader

    class StudentReport(object):
        def __init__(self):
            self._theList = None

        def loadRecords(self, filename):
            self.reader = StudentFileReader(filename)
            self.reader.open()
            theList = []
            record = self.reader.fetchRecord()
            while record is not None:
                theList.append(record)
                record = self.reader.fetchRecord()
            reader.close()
            return theList

        def sortByid(self):
            self._studentList.sort(key = lambda rec: rec.idNum)

        def sortByName(self):
            pass

        def __str__(self):
            classNames = [ "", "Freshman", "Sophomore", "Junior", "Senior" ]
            print( "LIST OF STUDENTS".center(50) )
            print( "" )
            print( "%-5s %-25s %-10s %-4s" % ('ID', 'NAME', 'CLASS', 'GPA'))
            print( "%5s %25s %10s %4s" % ('-' * 5, '-' * 25, '-' * 10, '-' * 4))
            # Print the body.
            for record in theList :
                print( "%5d %-25s %-10s %4.2f" % \
                       (record.idNum, \
                        record.lastName + ', ' + record.firstName,
                        classNames[record.classCode], record.gpa) )
            # Add a footer.
            print( "-" * 50 )
            print( "Number of students:", len(theList) )

    if __name__ == "__main__":
        s = StudentReport()
        s.loadRecords('students2.txt')
        s.sortByName()
        print str(s)
This code was taken from the textbook Data Structures and Algorithms Using Python. I'm supposed to make an object-oriented class. I've started the StudentRecord class and written the __init__, but I'm not really sure what to do after that. When I try to run anything, it gives me an "invalid literal for int() with base 10" error. I'm very new to Python, so I'm not sure how to make a class object-oriented easily.
Edit: yes, the error comes from the fetchRecord function:
    Traceback (most recent call last):
      File "C:\Users\...\studentreport.py", line 24, in <module>
        s.loadRecords('students2.txt')
      File "C:\Users\...\studentreport.py", line 13, in loadRecords
        record = self.reader.fetchRecord()
      File "C:\Users\...\filereader.py", line 22, in fetchRecord
        record.idNum = int(line)
    ValueError: invalid literal for int() with base 10: '10015:John:Smith:2:3.01\n'
Your line parsing code doesn't match the format of the file.
You are trying to interpret the whole line as an integer, but the line contains more than that.
Perhaps you wanted to split the line first? That one line contains all elements of the record:
    parts = line.strip().split(':')
    record.idNum = int(parts[0])
    record.firstName = parts[1]
    record.lastName = parts[2]
    record.classCode = int(parts[3])  # used as a list index in the report, so it must be an int
    record.gpa = float(parts[4])
You can override the original StudentFileReader.fetchRecord() method by subclassing the class in your own code:
    class MyStudentFileReader(StudentFileReader):
        def fetchRecord(self):
            line = self._inputFile.readline()
            if not line:
                return None
            record = StudentRecord()
            parts = line.strip().split(':')
            record.idNum = int(parts[0])
            record.firstName = parts[1]
            record.lastName = parts[2]
            record.classCode = int(parts[3])
            record.gpa = float(parts[4])
            return record
Then use MyStudentFileReader() instead of StudentFileReader().
You need to split your line before you start trying to convert the pieces into the formats you want for your individual data items. Right now, you're calling readline repeatedly, so each of the values you're calculating for a student comes from a separate line from the file.
Instead, try splitting and unpacking the result directly into local variables:
idNum, firstName, lastName, classCode, GPA = line.rstrip().split(':')
Then do whatever conversions each of those need (e.g. record.idNum = int(idNum)).
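Put together, the split-and-convert step might look like the sketch below. The parse_record helper and the dataclass version of StudentRecord are illustrative stand-ins, not the textbook's classes; the sample line is the one shown in the traceback.

```python
from dataclasses import dataclass


@dataclass
class StudentRecord:
    # Stand-in for the textbook's StudentRecord, with the same field names.
    idNum: int
    firstName: str
    lastName: str
    classCode: int
    gpa: float


def parse_record(line):
    """Parse one 'id:first:last:class:gpa' line into a StudentRecord."""
    idNum, firstName, lastName, classCode, gpa = line.rstrip().split(':')
    return StudentRecord(int(idNum), firstName, lastName, int(classCode), float(gpa))


record = parse_record('10015:John:Smith:2:3.01\n')
print(record)
```

Unpacking into five names also acts as a cheap format check: a line with too few or too many fields raises a ValueError instead of silently producing a wrong record.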

python tweet parsing

I'm trying to parse tweet data.
My data looks like this:
59593936 3061025991 null null <d>2009-08-01 00:00:37</d> <s><a href="http://help.twitter.com/index.php?pg=kb.page&id=75" rel="nofollow">txt</a></s> <t>honda just recalled 440k accords...traffic around here is gonna be light...win!!</t> ajc8587 15 24 158 -18000 0 0 <n>adrienne conner</n> <ud>2009-07-23 21:27:10</ud> <t>eastern time (us & canada)</t> <l>ga</l>
22020233 3061032620 null null <d>2009-08-01 00:01:03</d> <s><a href="http://alexking.org/projects/wordpress" rel="nofollow">twitter tools</a></s> <t>new blog post: honda recalls 440k cars over airbag risk http://bit.ly/2wsma</t> madcitywi 294 290 9098 -21600 0 0 <n>madcity</n> <ud>2009-02-26 15:25:04</ud> <t>central time (us & canada)</t> <l>madison, wi</l>
I want to get the total number of tweets and the number of keyword-related tweets. I prepared the keywords in a text file. In addition, I want to get the tweet text contents and the total number of tweets which contain a mention (#), a retweet (RT), or a URL (I want to save every URL in another file).
So I wrote this code:
    import time
    import os

    total_tweet_count = 0
    related_tweet_count = 0
    rt_count = 0
    mention_count = 0
    URLs = {}

    def get_keywords(filepath):
        with open(filepath) as f:
            for line in f:
                yield line.split()

    for line in open('/nas/minsu/2009_06.txt'):
        tweet = line.strip()
        total_tweet_count += 1
        with open('./related_tweets.txt', 'a') as save_file_1:
            keywords = get_keywords('./related_keywords.txt', 'r')
            if keywords in line:
                text = line.split('<t>')[1].split('</t>')[0]
                if 'http://' in text:
                    try:
                        url = text.split('http://')[1].split()[0]
                        url = 'http://' + url
                        if url not in URLs:
                            URLs[url] = []
                        URLs[url].append('\t' + text)
                        save_file_3 = open('./URLs_in_related_tweets.txt', 'a')
                        print >> save_file_3, URLs
                    except:
                        pass
                if '#' in text:
                    mention_count +=1
                if 'RT' in text:
                    rt_count += 1
                related_tweet_count += 1
                print >> save_file_1, text
    save_file_2 = open('./info_related_tweets.txt', 'w')
    print >> save_file_2, str(total_tweet_count) + '\t' + srt(related_tweet_count) + '\t' + str(mention_count) + '\t' + str(rt_count)
    save_file_1.close()
    save_file_2.close()
    save_file_3.close()
The keyword set looks like this:
Happy
Hello
Together
I think my code has many problems, but the first error is as follows:
    Traceback (most recent call last):
      File "health_related_tweets.py", line 21, in <module>
        keywords = get_keywords('./public_health_related_words.txt', 'r')
    TypeError: get_keywords() takes exactly 1 argument (2 given)
Please help me out!
The error message explains the issue: your call to get_keywords() passes two arguments, but your implementation accepts only one parameter. You should change your get_keywords implementation to something like:
    def get_keywords(filepath, mode):
        with open(filepath, mode) as f:
            for line in f:
                yield line.split()
Then you can use the following line without that specific error:
    keywords = get_keywords('./related_keywords.txt', 'r')
Now you are getting this error:
    Traceback (most recent call last):
      File "health_related_tweets.py", line 23, in <module>
        if keywords in line:
    TypeError: 'in <string>' requires string as left operand, not generator
The reason is that keywords = get_keywords(...) returns a generator, and a generator cannot be tested for membership in a string. Thinking about it logically, keywords should be a collection of all the keywords, and for each keyword you want to check whether it is in the tweet/line or not. Note that get_keywords() yields line.split(), that is, a list of words per line of the keyword file, so each item has to be unpacked as well.
Sample code:
    keywords = get_keywords('./related_keywords.txt', 'r')
    has_keyword = False
    for words in keywords:  # each item is the list of words on one line
        if any(keyword in line for keyword in words):
            has_keyword = True
            break
    if has_keyword:
        # Your code here (for the case when the line has at least one keyword)
        pass
(The above code would replace "if keywords in line:".)
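Since the keyword file does not change between tweets, it may be simpler to load it once before the loop rather than re-reading it for every line. The sketch below is written for Python 3 and uses in-memory stand-ins for both files; load_keywords, tweet_text and the lower-casing (the sample tweets are lower case while the keywords are capitalised) are assumptions, not part of the original code.

```python
import io


def load_keywords(lines):
    """Return a set of lower-cased keywords from an iterable of text lines."""
    return {word.lower() for line in lines for word in line.split()}


def tweet_text(line):
    """Return the content of the first <t>...</t> element, or '' if absent."""
    if '<t>' not in line or '</t>' not in line:
        return ''
    return line.split('<t>', 1)[1].split('</t>', 1)[0]


# Stand-ins for the keyword file and the tweet file.
keywords = load_keywords(io.StringIO("Happy\nHello\nTogether\n"))
tweets = [
    "1 2 null null <t>hello world http://bit.ly/2wsma</t> user",
    "3 4 null null <t>honda recalls 440k cars</t> user",
]

# Keep the text of every tweet that contains at least one keyword.
related = [tweet_text(t) for t in tweets
           if any(k in tweet_text(t).lower() for k in keywords)]
print(len(tweets), len(related), related)
```

This is plain substring matching, so "hello" would also match inside a longer word; splitting the text into words first would be stricter.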
