I sincerely apologize if this is the incorrect way to ask my question. This is my first time posting in Stack.
My inFile is six edited lines of the poem "Do not go gentle into that good night". The function should write an outFile containing the lines that have a duplicated word longer than 3 letters. For example, "rage rage against the dying of the light" would be written because of "rage".
Edit: When I run this, it gives me an error saying "i" is undefined.
Oh, and I can't use any modules.
Here is my code:
def duplicateWordLines(inFile,outFile):
    inFile=open(inFileName, "r")
    outFile=open(outFileName, "w")
    for line in inFile:
        words=line.split() #split the lines
        og=[] #original words
        dups=[] #duplicate words
        for word in words: #for each word in words
            if og.count(i)>0 and line not in dups: #if the word appears more than once and not already in duplicates
                dups.append(line) #add to duplicates
            else: #if not a duplicate
                og.append(i) #add to the original list - not to worry about it
    for line in dups: #for the newly appended lines
        outFile.write(line+'\n') #write in the outFile
#test case
inFileName="goodnightPoem.txt"
outFileName="goodnightPoemDUP.txt"
duplicateWordLines(inFileName,outFileName)
#should print
#rage rage against the dying of the light
#do not go gentle into that good good night
Thank you!
Try this out...
def duplicateWordLines(inFileName, outFileName):
    # use the parameters rather than the global names, and let "with" close both files
    with open(inFileName, "r") as inFile, open(outFileName, "w") as outFile:
        for line in inFile:
            # split the line into words
            words = line.split()
            # drop all words of 3 characters or fewer
            words = [word for word in words if len(word) > 3]
            # make the list a set, so all duplicates are removed
            no_dups = set(words)
            # if there are more words in the words list than in the
            # no-duplicate set, we must have a duplicate, so
            # write the line
            if len(words) > len(no_dups):
                # strip the line's own newline so no blank line follows it
                outFile.write(line.rstrip('\n') + '\n')
#test case
inFileName="file.txt"
outFileName="file_1.txt"
duplicateWordLines(inFileName,outFileName)
Regarding the i is undefined error, let's look at your for loop
for word in words: #for each word in words
    if og.count(i)>0 and line not in dups: #if the word appears more than once and not already in duplicates
        dups.append(line) #add to duplicates
    else: #if not a duplicate
        og.append(i) #add to the original list - not to worry about it
You don't actually define i anywhere; your loop defines word. You are blending a "smart" loop, i.e. for word in words, with a range loop, like for i in range(0, len(words)). If we were to fix your loop, I think it would look something like this...
for word in words: #for each word in words
    if og.count(word)>0 and line not in dups: #if the word appears more than once and not already in duplicates
        dups.append(line) #add to duplicates
    else: #if not a duplicate
        og.append(word) #add to the original list - not to worry
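To check the logic without touching any files, the same idea can be sketched as a small function that takes a list of lines and returns the ones containing a repeated word longer than three letters. The name lines_with_duplicate_words and the min_len parameter are made up for illustration, and no modules are used:

```python
def lines_with_duplicate_words(lines, min_len=4):
    """Return the lines that contain a repeated word of at least min_len letters."""
    result = []
    for line in lines:
        seen = []  # long words already met on this line
        for word in line.split():
            if len(word) < min_len:
                continue  # short words such as "the" are ignored
            if word in seen:
                result.append(line)
                break  # one duplicate is enough for this line
            seen.append(word)
    return result


poem = [
    "rage rage against the dying of the light",
    "old age should burn and rave at close of day",
    "do not go gentle into that good good night",
]
print(lines_with_duplicate_words(poem))
```

With these three sample lines, the first and third are returned; "the the" in the first line does not count because "the" is too short.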
Related
I am creating a game project which will include palindrome words.
I have a list of all the words in English, and I want to check every word in the list and find the ones that are equal to their own reverse.
file1 = open ('words.txt')
file2reversed = open ('words.txt')
words = file1.readlines()
print(words[3][::-1])
print()
if words[3][::-1] == words[3]:
    print("equal")
else:
    print("not")
My code looks like this. I wrote the 3rd word as a palindrome and wanted to check whether it works, and the output looks like this:
aaa
aaa
not
Why is words[3][::-1] not equal to words[3] even though it is a palindrome?
Use file.read().splitlines() instead. file.readlines() returns lines with a newline appended to each string at the end, so when reversed, '\naaa' != 'aaa\n'.
More cleanly
file = open('words.txt')
text = file.read()
words = text.splitlines()
# words is a list of strings without '\n' at the end of each line.
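As a minimal sketch of the whole check, the function below strips each line before comparing it with its reverse, so the trailing newline no longer gets in the way. The name find_palindromes and the sample list are illustrative only; the list stands in for what file.readlines() would return:

```python
def find_palindromes(lines):
    """Return the words that read the same forwards and backwards."""
    palindromes = []
    for line in lines:
        word = line.strip()  # drop the trailing '\n' and any surrounding spaces
        if word and word == word[::-1]:
            palindromes.append(word)
    return palindromes


# stands in for the result of file.readlines(), newlines included
sample = ["apple\n", "level\n", "banana\n", "aaa\n", "noon"]
print(find_palindromes(sample))
```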
I am working on counting the length of split sentences, but I always get an index out of range error when trying to print the lines/lists that have a [1] element.
The code:
for line in open("testing.txt"):
    strip = line.rstrip()
    words = strip.split(';')
    first = words[0]
    for test in words:
        if words[1] in words:
            print(words)
        else:
            continue
The split output of the sample .txt file are for example:
['"What does Bessie say I have done?" I asked.']
['Be seated somewhere', ' and until you can speak pleasantly, remain silent."']
['Of farthest Thule', ' and the Atlantic surge']
['Pours in among the stormy Hebrides."']
['"Alright, let's get out of here!" I yelled.']
So some sentences only have a [0] element, while the ones with a [1] element are the sentences I am trying to print (the current if/else statement doesn't work).
The expected output (basically any split sentence/list that has a second element):
['Be seated somewhere', ' and until you can speak pleasantly, remain silent."']
['Of farthest Thule', ' and the Atlantic surge']
You're getting this error because you try to access the second element of a list that contains only one string. In this case you want to check the length of the list:
for line in open("testing.txt"):
    strip = line.rstrip()
    words = strip.split(';')
    for test in words:
        if len(words) > 1:
            print(words)
        else: # this else is not necessary
            continue
Edit: If you want to print each sentence containing at least one ';' only once, you don't actually have to use the inner for loop. One concise way to get the desired output would be this:
for line in open("testing.txt"):
    strip = line.rstrip()
    words = strip.split(';')
    if len(words) > 1:
        print(words)
As far as I understand from your question, you are only trying to print the words lists that have more than one element.
One simple way to do it is:
for line in open("testing.txt"):
    strip = line.rstrip()
    words = strip.split(';')
    # first = words[0]
    for test in words:
        if len(words) > 1:
            print(words)
Here you are just checking if the length of the words is greater than 1 and printing if that is the case
EDIT:
I think the for loop is unnecessary. All you want is to print lists of words greater than length 1.
So for that purpose:
for line in open("testing.txt"):
    strip = line.rstrip()
    words = strip.split(';')
    if len(words) > 1:
        print(words)
Here you are just splitting the sentences on ; and then checking after splitting if the length of the list (named words) is greater than 1; if so you are printing the list named words.
EDIT2:
As S3DEV pointed out, you are opening the file inside the for statement, which won't close the file automatically once the loop finishes. As a result the file can stay open longer than needed, possibly until the program exits, and that might cause weird issues. The best practice is to use the with keyword: the with statement closes the file automatically once the block finishes executing, so you won't face odd issues from keeping a file open.
with open("testing.txt", "r") as f: # this line opens the file as f in read-only mode
    for line in f:
        strip = line.rstrip()
        words = strip.split(';')
        if len(words) > 1:
            print(words)
Datasets: Two large text files, for train and test, in which all words are tokenized. A part of the data looks like the following: " the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place . "
Question: How can I replace every word in the test data not seen in training with the word "unk" in Python?
So far, I have built a dictionary with the following code to count the frequency of each word in the file:
#open text file and assign it to variable with the name "readfile"
readfile= open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-train.txt','r')
writefile=open('C:/Users/amtol/Desktop/NLP/Homework_1/brown-trainReplaced.txt','w')
# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in readfile:
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
#replace all words occurring in the training data once with the token<unk>.
for key in list(d.keys()):
    line= d[key]
    if (line==1):
        line="<unk>"
        writefile.write(str(d))
    else:
        writefile.write(str(d))
#close the file that we have created and we wrote the new data in that
writefile.close()
Honestly, the above code doesn't work with writefile.write(str(d)), which I wanted to use to write the result to the new text file. With print(key, ":", line) it works and shows the frequency of each word, but only in the console, without creating the new file. If you also know the reason for this, please let me know.
First off, your task is to replace the words in the test file that are not seen in the train file. Your code never mentions the test file. You have to:
1. Read the train file and gather the words that are there. This is mostly okay, but you need to .strip() your line or the last word in each line will end with a newline. Also, it would make more sense to use a set instead of a dict if you don't need to know the count (and you don't, you just want to know if it's there or not). Sets are cool because you don't have to care if an element is in already or not; you just toss it in. If you absolutely need to know the count, using collections.Counter is easier than doing it yourself.
2. Read the test file, and write to the replacement file, replacing the words in each line as you go. Something like:
with open("test", "rt") as reader:
    with open("replacement", "wt") as writer:
        for line in reader:
            writer.write(replaced_line(line.strip()) + "\n")
3. Make sense, which your last block does not :P Instead of checking whether a word from the test file has been seen or not, and replacing the unseen ones, you are iterating over the words you have seen in the train file, and writing <unk> if you've seen them exactly once. This does something, but not anything close to what it should.
Instead, split the line you got from the test file and iterate over its words; if a word is not in the seen set (word not in seen, literally), replace it with <unk>; and finally add it to the output sentence. You can do it in a loop, but here's a comprehension that does it:
new_line = ' '.join(word if word in seen else '<unk>'
                    for word in line.split(' '))
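Putting the steps together, here is a minimal sketch that works on in-memory lists of lines instead of the real files. The function names build_vocab and replace_unseen, and the sample sentences, are assumptions made for illustration; it also uses split() without arguments, which splits on any whitespace and so takes care of the trailing newline:

```python
def build_vocab(train_lines):
    """Collect every word that appears in the training lines."""
    seen = set()
    for line in train_lines:
        seen.update(line.split())  # split() also discards the trailing newline
    return seen


def replace_unseen(line, seen, unk="<unk>"):
    """Replace every word of line that is not in seen with the unk token."""
    return " ".join(word if word in seen else unk for word in line.split())


# tiny stand-ins for the tokenized train and test files
train = ["the jury said friday .\n", "no evidence was produced .\n"]
test = ["the jury produced new evidence .\n"]

seen = build_vocab(train)
for line in test:
    print(replace_unseen(line, seen))
```

For real files, the same two functions could be fed the open file objects directly, writing each returned line plus a newline to the replacement file.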
I'm writing a function called HASHcount(name,list), which receives 2 parameters. The name parameter is the name of the file that will be analyzed, a text file structured like this:
Date|||Time|||Username|||Follower|||Text
So, basically my input is a list of tweets, with several rows structured like above. The list parameter is a list of hashtags I want to count in that text file. I want my function to check how many times each word of the given list occurs in the tweets, and output a dictionary with each word's count, even if the word is missing.
For instance, with the instruction HASHcount(December,[Peace, Love]) the program should output a dictionary made by checking how many times the word Peace and the word Love have been used as a hashtag in the Text field of each tweet in the file called December.
Also, in the dictionary the words have to be without the hashtag symbol.
I'm stuck on writing this function. This is where I am so far, but I'm having some issues with the dictionary:
def HASHcount(name,list):
    f = open(name,"r")
    dic={}
    l = f.readline()
    for word in list:
        dic[word]=0
        for line in f:
            li_lis=line.split("|||")
            li_tuple=tuple(li_lis)
            if word in li_tuple[4]:
                dic[word]=dic[word]+1
    return dic
The main issue is that you are iterating over the lines in the file for each word, rather than the reverse. Thus the first word will consume all the lines of the file, and each subsequent word will have 0 matches.
Instead, you should do something like this:
def hash_count(name, words):
    dic = {word:0 for word in words}
    with open(name) as f:
        for line in f:
            line_text = line.split('|||')[4]
            for word in words:
                # Check if word appears as a hashtag in line_text
                # If so, increment the count for word
                pass  # placeholder so the skeleton parses; the check goes here
    return dic
There are several issues with your code, some of which have already been pointed out, while others (e.g concerning the identification of hashtags in a tweet's text) have not. Here's a partial solution not covering the fine points of the latter issue:
def HASHcount(name, words):
    dic = dict.fromkeys(words, 0)
    with open(name,"r") as f:
        for line in f:
            for w in words:
                if '#' + w in line:
                    dic[w] += 1
    return dic
This offers several simplifications keyed on the fact that hashtags in a tweet do start with # (which you don't want in the dic) -- as a result it's not worth analyzing each line since the # cannot be present except in the text.
However, it still has a fraction of a problem seen in other answers (except the one which just commented out this most delicate of parts!-) -- it can get false positives by partial matches. When the check is just like word in linetext the problem would be huge -- e.g if a word is cat it gets counted as hashtag even if present in perfectly ordinary text (on its own or as part of another word, e.g vindicative). With the '#' + approach, it's a bit better, but still, prefix matches would lead to a false positive, e.g #catalog would erroneously be counted as a hit for cat.
As some suggested, regular expressions can help with that. However, here's an alternative for the body of the for w in words loop...
for w in words:
    where = line.find('#' + w)
    if where == -1: continue
    after = line[where + len(w) + 1]
    if after in chars_acceptable_in_hashes: continue
    dic[w] += 1
The only issue remaining is to determine which characters can be part of hashtags, i.e., the set chars_acceptable_in_hashes -- I haven't memorized Twitter's specs so I don't know it offhand, but surely you can find out. Note that this works at the end of a line, too, because line has not been stripped, so it's known to end with a \n, which is not in the acceptable set (so a hashtag at the very end of the line will be "properly terminated" too).
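A fuller sketch of that idea follows, under a loudly stated assumption: hashtags are taken to consist of ASCII letters, digits and underscores, which is a guess and not Twitter's actual specification. The names count_hashtags and HASHTAG_CHARS are hypothetical, the function takes an iterable of lines rather than a file name, it looks only at the fifth field, and it checks every occurrence on a line rather than just the first:

```python
import string

# Assumption, not Twitter's spec: hashtags are ASCII letters, digits and underscores.
HASHTAG_CHARS = set(string.ascii_letters + string.digits + "_")


def count_hashtags(lines, words):
    """Count how often each word is used as a complete hashtag in the Text field."""
    counts = dict.fromkeys(words, 0)
    for line in lines:
        fields = line.split("|||")
        text = fields[4] if len(fields) > 4 else ""
        for w in words:
            tag = "#" + w
            start = text.find(tag)
            while start != -1:
                end = start + len(tag)
                # count the hit only if the tag ends here, i.e. the next
                # character (if any) cannot be part of a longer hashtag
                if end == len(text) or text[end] not in HASHTAG_CHARS:
                    counts[w] += 1
                start = text.find(tag, end)
    return counts


tweets = [
    "2014-12-01|||10:00|||alice|||120|||All you need is #Love and #Peace\n",
    "2014-12-02|||11:30|||bob|||45|||New #Lovely #catalog, plus #Love!\n",
]
print(count_hashtags(tweets, ["Peace", "Love", "cat"]))
```

Matching here is case-sensitive, so #love would not count towards Love; whether that is desired depends on the task.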
I like using the collections module. This worked for me.
from collections import defaultdict
def HASHcount(file_to_open, lst):
    with open(file_to_open) as my_file:
        my_dict= defaultdict(int)
        for line in my_file:
            line = line.split('|||')
            txt = line[4].strip(" ")
            if txt in lst:
                my_dict[txt] += 1
        return my_dict
def myfunc(filename):
    filename=open('hello.txt','r')
    lines=filename.readlines()
    filename.close()
    lengths={}
    for line in lines:
        for punc in ".,;'!:&?":
            line=line.replace(punc," ")
        words=line.split()
        for word in words:
            length=len(word)
            if length not in lengths:
                lengths[length]=0
            lengths[length]+=1
    for length,counter in lengths.items():
        print(length,counter)
    filename.close()
Use Counter (available in collections from Python 2.7 onward).
You are counting the frequency of words in a single line.
for line in lines:
    for word in length.keys():
        print(wordct,length)
length is dict of all distinct words plus their frequency, not their length
length.get(word,0)+1
so you probably want to replace the above with
for line in lines:
    ....
#keep this at this indentation - will have a very large dict but of all words
for word in sorted(length.keys(), key=lambda x:len(x)):
    #word, freq, length
    print(word, length[word], len(word), "\n")
I would also suggest
Don't bring the file into memory like that; file objects are now iterators and are well optimised for reading from files.
drop the wordct and so on in the main lines loop.
rename length to something else - perhaps words or dict_words
Errr, maybe I misunderstood - are you trying to count the number of distinct words in the file, in which case use len(length.keys()) or the length of each word in the file, presumably ordered by length....
The question has been more clearly defined now so replacing the above answer
The aim is to get a frequency of word lengths throughout the whole file.
I would not even bother with line by line but use something like:
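with open(file) as fo:  # file is the path of the text file
    text = fo.read()  # file objects have no find(), so search the text instead
d_freq = {}
st = 0
while st < len(text):
    next_space_index = text.find(" ", st)
    if next_space_index == -1:  # no more spaces: the last word runs to the end
        next_space_index = len(text)
    word_len = next_space_index - st
    if word_len > 0:  # skip empty "words" between consecutive spaces
        d_freq[word_len] = d_freq.get(word_len, 0) + 1
    st = next_space_index + 1
print(d_freq)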
I think that will work, not enough time to try it now. HTH
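For comparison, here is a small sketch of the Counter route suggested earlier, counting how many words of each length occur. The function name word_length_counts and the sample lines are made up for illustration; the punctuation string is the one from the question, and the lines list stands in for iterating over the open file:

```python
from collections import Counter


def word_length_counts(lines, punctuation=".,;'!:&?"):
    """Map each word length to the number of words of that length."""
    counts = Counter()
    for line in lines:
        for punc in punctuation:
            line = line.replace(punc, " ")  # treat punctuation as a separator
        counts.update(len(word) for word in line.split())
    return counts


sample = ["Hello, world!\n", "It's a fine day.\n"]
for length, count in sorted(word_length_counts(sample).items()):
    print(length, count)
```

Note that replacing the apostrophe with a space splits "It's" into "It" and "s", just as the question's code does.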