Words are not printed in the file - Python

I am working on a project for an information retrieval course.
I have an array called corpus; I want to read the words from it, extract the irregular verbs, and save them to another file.
I also have another list containing the irregular verbs, read from "irregular_verbs.txt".
I compare the two, extract the irregular verbs that occur in the corpus, and save them to a file called "irregular_verbs_output.txt".
The problem is that when the program is executed, nothing is written to that output file.
In the file below I load the corpus array and compare it against the irregular-verbs list; whenever a verb is irregular it should be appended to "irregular_verbs_output.txt".
second_request.py:
from firstRequest import read_corpus_file_and_delete_stop_words

corpus = read_corpus_file_and_delete_stop_words()
# print(corpus)

z = []
irregular_verbs = []

def read_irregular_verbs():
    with open('C:/Users/Super/PycharmProjects/pythonProject/Files/irregular_verbs.txt', 'r') as file:
        for line in file:
            for word in line.split():
                irregular_verbs.append(word)
    return irregular_verbs

# print(read_irregular_verbs())

irregular_verbs_file = []

def access_irregular_verbs():
    irregular_verbs_list = read_irregular_verbs()
    for t in irregular_verbs_list:
        for words in corpus:
            for word in words:
                if t != word[0]:
                    continue
                else:
                    with open('../Files/irregular_verbs_output.txt', 'a+') as irregular_file:
                        irregular_file.write(t)
    return irregular_verbs_list

print(access_irregular_verbs())
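For reference, here is a minimal sketch of the intended comparison written with a set lookup. It assumes the corpus is a list of token lists and that each whole token should be compared against the verb list (the code above compares t against word[0], i.e. only the first character of each token), so treat it as a starting point rather than a drop-in fix.

def extract_irregular_verbs(corpus, irregular_verbs_list,
                            out_path='../Files/irregular_verbs_output.txt'):
    # set lookup is O(1) per token instead of rescanning the verb list
    irregular_set = set(irregular_verbs_list)
    found = []
    for words in corpus:              # one token list per document
        for word in words:
            if word in irregular_set:  # compare the whole token, not word[0]
                found.append(word)
    with open(out_path, 'w') as irregular_file:
        irregular_file.write(' '.join(found))
    return found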
In the file below I go through a folder named Corpus that contains many files, and I save the token lists of these files in an array.
first.py:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import os

def read_corpus_file_and_delete_stop_words():
    stop_words_list = stopwords.words('english')
    additional_stopwords = []
    with open("C:/Users/Super/Desktop/IR/homework/Lab4/IR Homework/stop words.txt", 'r') as file:
        for word in file:
            word = word.split('\n')
            additional_stopwords.append(word[0])
    stop_words_list += additional_stopwords

    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    # save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
    files_without_sw = []
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            # save_file = open(save_dir + document, 'w')
            text = reader.read()
            text = text.replace('.', ' ').replace(',', ' ')
            text = text.replace(':', ' ').replace('?', ' ').replace('!', ' ')
            text = text.replace('  ', ' ')  # convert double space into single space
            text = text.replace('"', ' ').replace('``', ' ')
            text = text.strip()  # remove space at the end
            text_tokens = word_tokenize(text)
            tokens_without_sw = [word for word in text_tokens if word not in stop_words_list]
            # save_file.writelines(["%s " % item for item in tokens_without_sw])
            # print(document, ':', tokens_without_sw)
            files_without_sw.append(tokens_without_sw)
    return files_without_sw

print(read_corpus_file_and_delete_stop_words())


Opening a folder and modifying the files while saving the new modification [duplicate]

I have this function.
I have a folder called "Corpus" with many files inside.
I open the files one by one and modify them; the modification is to delete the periods, commas, question marks, and so on.
But the problem is that I do not want to keep the modified text only in an array; I want to write the modification back to each of the files in the Corpus folder.
For example, check whether the first file in the corpus folder contains periods or commas; if it does, delete them from that file and then move on to the second file.
In other words, I want to modify the same files that are in the Corpus folder in place and return all of them at the end.
How can I do that?
# read files from corpus folder + tokenize
def read_files_from_corpus():
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    all_tokens_without_sw = []
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            dir_path = open(dir_path + document, 'w')
            text = reader.read()
            # --------
            text = text.replace('.', ' ').replace(',', ' ')
            text = text.replace(':', ' ').replace('?', ' ').replace('!', ' ')
            text = text.replace('  ', ' ')  # convert double space into single space
            text = text.replace('"', ' ').replace('``', ' ')
            text = text.strip()  # remove space at the end
            # ------
            text_tokens = word_tokenize(text)
            dir_path.writelines(["%s " % item for item in text_tokens])
            all_tokens_without_sw = all_tokens_without_sw + text_tokens
    return all_tokens_without_sw
You need to open the file for both reading and writing; after reading the whole file content, seek back to the start of the file to overwrite the data once you have made the needed changes.
def read_files_from_corpus():
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    all_tokens_without_sw = []
    for document in os.listdir(dir_path):
        # open for reading and writing
        with open(dir_path + document, "r+") as reader:
            text = reader.read()
            # --------
            text = text.replace('.', ' ').replace(',', ' ')
            text = text.replace(':', ' ').replace('?', ' ').replace('!', ' ')
            text = text.replace('  ', ' ')  # convert double space into single space
            text = text.replace('"', ' ').replace('``', ' ')
            text = text.strip()  # remove space at the end
            # seek to start of file to overwrite data
            reader.seek(0)
            text_tokens = word_tokenize(text)
            # write data back to the file
            reader.writelines(["%s " % item for item in text_tokens])
            all_tokens_without_sw = all_tokens_without_sw + text_tokens
    return all_tokens_without_sw
This code only opens the file once, as reader, and edits it in place. Hope that is what you want.
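One caveat worth adding (not part of the original answer): with r+, if the rewritten content is shorter than the original, the leftover bytes beyond the new end remain in the file after seek(0). Calling truncate() after writing avoids that. A minimal sketch:

with open(dir_path + document, "r+") as reader:
    text = reader.read()
    new_text = text.replace('.', ' ').replace(',', ' ')
    reader.seek(0)          # go back to the start of the file
    reader.write(new_text)  # overwrite the old content
    reader.truncate()       # drop any leftover bytes from the longer original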

Return all files in the folder, not just the last file

I have this function; with it I want to go through all the files in the folder and then return them, but the problem is that only the last file is returned.
How can I solve this problem?
def read_corpus_file_and_delete_stop_words():
    stop_words_list = stopwords.words('english')
    additional_stopwords = []
    with open("C:/Users/Super/Desktop/IR/homework/Lab4/IR Homework/stop words.txt", 'r') as file:
        for word in file:
            word = word.split('\n')
            additional_stopwords.append(word[0])
    stop_words_list += additional_stopwords

    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            save_file = open(save_dir + document, 'w')
            text = reader.read()
            text_tokens = word_tokenize(text)
            tokens_without_sw = [word.replace(',', ' ').replace('.', ' ') for word in text_tokens
                                 if word not in stop_words_list]
            save_file.writelines(["%s " % item.replace(',', ' ').replace('.', ' ') for item in tokens_without_sw])
            # print(document, ':', tokens_without_sw)
    return tokens_without_sw
Did you mean to return tokens_without_sw for every file, in a list?
def read_corpus_file_and_delete_stop_words():
    stop_words_list = stopwords.words('english')
    additional_stopwords = []
    with open("C:/Users/Super/Desktop/IR/homework/Lab4/IR Homework/stop words.txt", 'r') as file:
        for word in file:
            word = word.split('\n')
            additional_stopwords.append(word[0])
    stop_words_list += additional_stopwords

    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
    files_without_sw = []
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            save_file = open(save_dir + document, 'w')
            text = reader.read()
            text_tokens = word_tokenize(text)
            tokens_without_sw = [word.replace(',', ' ').replace('.', ' ') for word in text_tokens
                                 if word not in stop_words_list]
            save_file.writelines(["%s " % item.replace(',', ' ').replace('.', ' ') for item in tokens_without_sw])
            # print(document, ':', tokens_without_sw)
            files_without_sw.append(tokens_without_sw)
    return files_without_sw
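A side note on the snippet above (an observation added here, not part of the original answer): save_file is opened with a bare open() and never closed, so buffered output can be lost; opening both files in one with statement closes them automatically. A sketch of the loop under the same paths:

for document in os.listdir(dir_path):
    with open(dir_path + document, "r") as reader, \
         open(save_dir + document, "w") as save_file:
        text = reader.read()
        text_tokens = word_tokenize(text)
        tokens_without_sw = [w.replace(',', ' ').replace('.', ' ')
                             for w in text_tokens if w not in stop_words_list]
        save_file.writelines(["%s " % item for item in tokens_without_sw])
        files_without_sw.append(tokens_without_sw)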

Only the content of the first file in the folder is printed even though I want to print all the files

I have this function that is supposed to return all the files in the folder after deleting the stop words from them. The problem is that when I print the result of this function, only the content of the first file is printed, and I want to print all the files after the stop words have been removed.
How can I solve the problem?
def remove_stop_word_from_files():
    stop_words_list = get_stop_words()
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            save_file = open(save_dir + document, 'w')
            text = reader.read()
            text_tokens = word_tokenize(text)
            tokens_without_sw = [word.replace(',', '').replace('.', '') for word in text_tokens
                                 if word not in stop_words_list]
            save_file.writelines(["%s " % item.replace(',', '').replace('.', '') for item in tokens_without_sw])
            return tokens_without_sw

print(remove_stop_word_from_files())
The line
return tokens_without_sw
will cause the function to end on the first iteration of the for loop. Instead of returning tokens_without_sw inside the for loop, you could create another variable like all_tokens_without_sw, to which you append tokens_without_sw at the end of each iteration. Then, after the for loop, you return all_tokens_without_sw.
def remove_stop_word_from_files():
    stop_words_list = get_stop_words()
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
    all_tokens_without_sw = []
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            save_file = open(save_dir + document, 'w')
            text = reader.read()
            text_tokens = word_tokenize(text)
            tokens_without_sw = [word.replace(',', '').replace('.', '') for word in text_tokens
                                 if word not in stop_words_list]
            save_file.writelines(["%s " % item.replace(',', '').replace('.', '') for item in tokens_without_sw])
            all_tokens_without_sw = all_tokens_without_sw + tokens_without_sw
    return all_tokens_without_sw

print(remove_stop_word_from_files())
Your return statement is within the loop, so the function returns after its first iteration; you need to reduce its indent by one level.
In addition, you are overwriting the result on each iteration rather than appending to a running tally.
def remove_stop_word_from_files():
    stop_words_list = get_stop_words()
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
    all_tokens_without_sw = []
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader, \
                open(save_dir + document, 'w') as save_file:
            text = reader.read()
            text_tokens = word_tokenize(text)
            tokens_without_sw = [word.replace(',', '').replace('.', '')
                                 for word in text_tokens
                                 if word not in stop_words_list]
            save_file.writelines(["%s " % item.replace(',', '').replace('.', '')
                                  for item in tokens_without_sw])
            all_tokens_without_sw.extend(tokens_without_sw)
    return all_tokens_without_sw

print(remove_stop_word_from_files())
The problem is that you used return tokens_without_sw inside the loop, which exits the function on its first iteration. Instead, use yield so the function keeps producing one list of tokens per file (the function then becomes a generator). So your code should look like:
def remove_stop_word_from_files():
    stop_words_list = get_stop_words()
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            save_file = open(save_dir + document, 'w')
            text = reader.read()
            text_tokens = word_tokenize(text)
            tokens_without_sw = [word.replace(',', '').replace('.', '') for word in text_tokens
                                 if word not in stop_words_list]
            save_file.writelines(["%s " % item.replace(',', '').replace('.', '') for item in tokens_without_sw])
            yield tokens_without_sw

print(list(remove_stop_word_from_files()))
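A short usage note on the generator version (added here, not from the original answer): nothing runs until the generator is consumed, so you can either materialize it with list() as above or iterate it lazily, one file at a time:

for tokens in remove_stop_word_from_files():
    print(tokens)  # one list of tokens per document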

Replace periods and commas with space in each file within the folder

I have a folder that contains a group of files, and each file contains a text string with periods and commas. I want to replace the periods and commas with spaces and print all the files afterwards.
I used replace, but I got this error:
AttributeError: 'list' object has no attribute 'replace'
How can I solve it?
codes.py:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import os

# 1-stop word processing
stop_words_list = stopwords.words('english')
additional_stopwords = []
with open("C:/Users/Super/Desktop/IR/homework/Lab4/IR Homework/stop words.txt", 'r') as file:
    for word in file:
        word = word.split('\n')
        additional_stopwords.append(word[0])
stop_words_list += additional_stopwords
# --------------

# 2-tokenize and stemming
dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
for document in os.listdir(dir_path):
    with open(dir_path + document, "r") as reader:
        save_file = open(save_dir + document, 'w')
        text = reader.read()
        tokens_without_sw = [word for word in text if (word not in stop_words_list)]
        cleaned = tokens_without_sw.replace(',', ' ')
        cleaned = cleaned.replace('.', ' ')
        ps = PorterStemmer()
        text_tokens = word_tokenize(cleaned)
        save_file.writelines(["%s " % item for item in text_tokens])
        # cleaned = (" ").join(tokens_without_sw)
        print(document, ':', tokens_without_sw)
        with open("../Files/stemmer_words.txt", "a+") as stemFile:
            for stemWord in tokens_without_sw:
                stemFile.write(stemWord)
                stemFile.write(":")
                stemFile.write(ps.stem(stemWord))
                stemFile.write('\n')
It seems you are trying to use the string method "replace" on a list. If your intention is to use it on all of the list's members, you can do it like so:
cleaned = [item.replace(',', ' ') for item in tokens_without_sw]
cleaned = [item.replace('.', ' ') for item in cleaned]
You can even take it one step forward and do both of the replaces at once, instead of doing two list comprehensions.
cleaned = [item.replace(',', ' ').replace('.', ' ') for item in tokens_without_sw]
Another way without list comprehensions was mentioned in the comments by Andreas.
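For completeness, here is one more comprehension-free option (added here; not necessarily the approach from the comments): since the AttributeError comes from calling replace on a list, you can do the replacement on the raw text string before filtering, where replace is valid:

text = reader.read()
cleaned_text = text.replace(',', ' ').replace('.', ' ')  # replace on the string itself
text_tokens = word_tokenize(cleaned_text)
tokens_without_sw = []
for word in text_tokens:
    if word not in stop_words_list:
        tokens_without_sw.append(word)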

Changing words in a string to capitalize a text file

In order to fix a bunch of all-uppercase text files, I have written a script that:
1. Lowers all characters and capitalizes the first word of each line and the first word after a period.
2. Capitalizes all words that are in a list of city and country names (from another text file).
def lowit(line):
    line = line.lower()
    sentences = line.split('. ')
    sentences2 = [sentence[0].capitalize() + sentence[1:] for sentence in sentences]
    string2 = '. '.join(sentences2)
    return string2

def capcico(line, allKeywords):
    allWords = line.split(' ')
    original = line.split(' ')
    for i, words in enumerate(allWords):
        words = words.replace(',', '')
        words = words.replace('.', '')
        words = words.replace(';', '')
        if words in allKeywords:
            original[i] = original[i].capitalize()
    return ' '.join(original)

def main():
    dfile = open('fixed.txt', 'w')
    f = open('allist.txt', 'r')
    allKeywords = f.read().split('\n')
    with open('ulm.txt', 'r') as fileinput:
        for line in fileinput:
            low_line = lowit(line)
            dfile.write('\n' + capcico(low_line, allKeywords))
    dfile.close()

if __name__ == '__main__':
    main()
It works, but the problem is that it doesn't capitalize a city/country name when more than one appears on the same line:
TOWN IN WUERTTEMBERG, GERMANY.
changes to:
Town in Wuerttemberg, germany.
Any ideas as to what's wrong?
Thanks!
It is because "germany" is really "germany\n".
Strip the EOL off the word...
words = words.replace(',', '')
words = words.replace('.', '')
words = words.replace(';', '')
# Add in this line to strip the EOL
words = words.rstrip('\r\n')
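Put together, the fixed helper might look like this (a sketch; only the rstrip line is new relative to the question's function):

def capcico(line, allKeywords):
    allWords = line.split(' ')
    original = line.split(' ')
    for i, words in enumerate(allWords):
        words = words.replace(',', '')
        words = words.replace('.', '')
        words = words.replace(';', '')
        words = words.rstrip('\r\n')  # strip the end-of-line so the last word on the line matches
        if words in allKeywords:
            original[i] = original[i].capitalize()
    return ' '.join(original)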
An alternative, whole-file approach: lower-case the entire text, restore the capitalization of the keywords, then re-capitalize the first word after each period:
# Input: read the whole source file
fileinput = open("ulm.txt").read()
# Lower-case everything
filow = fileinput.lower()
# Keywords: capitalize every city/country name from the list
allKeywords = open("allist.txt").read().split("\n")
for kw in allKeywords:
    filow = filow.replace(kw.strip().lower(), kw.capitalize())
# Dots: try to capitalize the first word after each period
fidots = filow.split(".")
for i, d in enumerate(fidots):
    c = d.strip().capitalize()
    dc = d.replace(c.lower(), c)
    fidots[i] = dc
# Result: write the fixed text out
dfile = open("fixed.txt", "w")
result = ".".join(fidots)
dfile.write(result)
dfile.close()
