Replace periods and commas with space in each file within the folder - python

I have a folder that contains a group of files, and each file contains text with periods and commas. I want to replace the periods and commas with spaces and then print all the files.
I used replace(), but this error appeared:
AttributeError: 'list' object has no attribute 'replace'
How can I solve it?
codes.py:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import os

# 1-stop word processing
stop_words_list = stopwords.words('english')
additional_stopwords = []
with open("C:/Users/Super/Desktop/IR/homework/Lab4/IR Homework/stop words.txt", 'r') as file:
    for word in file:
        word = word.split('\n')
        additional_stopwords.append(word[0])
stop_words_list += additional_stopwords

# --------------
# 2-tokenize and stemming
dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
for document in os.listdir(dir_path):
    with open(dir_path + document, "r") as reader:
        save_file = open(save_dir + document, 'w')
        text = reader.read()
        tokens_without_sw = [word for word in text if (word not in stop_words_list)]
        cleaned = tokens_without_sw.replace(',', ' ')
        cleaned = cleaned.replace('.', ' ')
        ps = PorterStemmer()
        text_tokens = word_tokenize(cleaned)
        save_file.writelines(["%s " % item for item in text_tokens])
        # cleaned = (" ").join(tokens_without_sw)
        print(document, ':', tokens_without_sw)
        with open("../Files/stemmer_words.txt", "a+") as stemFile:
            for stemWord in tokens_without_sw:
                stemFile.write(stemWord)
                stemFile.write(":")
                stemFile.write(ps.stem(stemWord))
                stemFile.write('\n')

It seems you are trying to use the string function "replace" on a list. If your intention is to use it on all of the list's members, you can do it like so:
cleaned = [item.replace(',', ' ') for item in tokens_without_sw]
cleaned = [item.replace('.', ' ') for item in cleaned]
You can even take it one step further and do both replaces at once, instead of two list comprehensions.
cleaned = [item.replace(',', ' ').replace('.', ' ') for item in tokens_without_sw]
Another way without list comprehensions was mentioned in the comments by Andreas.
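For illustration only (this is not necessarily the approach from the comments): since the tokens eventually get joined back into text anyway, another option is to join the list into a single string first and call replace on that. A minimal sketch, assuming tokens_without_sw is a list of strings as in the question:
tokens_without_sw = ["Hello", "world,", "this", "is", "a", "test."]
joined = " ".join(tokens_without_sw)                  # one string instead of a list
cleaned = joined.replace(',', ' ').replace('.', ' ')  # str.replace is valid here
print(cleaned)  # -> "Hello world  this is a test "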

Related

Opening a folder and modifying the files while saving the new modification [duplicate]

I have this function.
I have a folder called "Corpus" with many files inside.
I opened the files one by one and modified them. The modification was to delete the periods, commas, question marks, and so on.
But the problem is that I do not want to save the modification in an array; I want to save the modification in each of the files in the Corpus folder.
For example, test whether the first file within the corpus folder contains a period or a comma, and if it does, delete it from the first file and then move on to the second file.
This means that I want to modify the same files that are in the Corpus folder and return all the files at the end.
How can I do that?
# read files from corpus folder + tokenize
def read_files_from_corpus():
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    all_tokens_without_sw = []
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            dir_path = open(dir_path + document, 'w')
            text = reader.read()
            # --------
            text = text.replace('.', ' ').replace(',', ' ')
            text = text.replace(':', ' ').replace('?', ' ').replace('!', ' ')
            text = text.replace('  ', ' ')  # convert double space into single space
            text = text.replace('"', ' ').replace('``', ' ')
            text = text.strip()  # remove space at the end
            # ------
            text_tokens = word_tokenize(text)
            dir_path.writelines(["%s " % item for item in text_tokens])
            all_tokens_without_sw = all_tokens_without_sw + text_tokens
    return all_tokens_without_sw
You need to open the file for reading and writing; after reading the whole file content, seek back to the start of the file to overwrite the data after making the needed changes.
def read_files_from_corpus():
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    all_tokens_without_sw = []
    for document in os.listdir(dir_path):
        # open for reading and writing
        with open(dir_path + document, "r+") as reader:
            text = reader.read()
            # --------
            text = text.replace('.', ' ').replace(',', ' ')
            text = text.replace(':', ' ').replace('?', ' ').replace('!', ' ')
            text = text.replace('  ', ' ')  # convert double space into single space
            text = text.replace('"', ' ').replace('``', ' ')
            text = text.strip()  # remove space at the end
            # seek to start of file to overwrite data
            reader.seek(0)
            text_tokens = word_tokenize(text)
            # write data back to the file
            reader.writelines(["%s " % item for item in text_tokens])
            # truncate in case the new content is shorter than the original
            reader.truncate()
            all_tokens_without_sw = all_tokens_without_sw + text_tokens
    return all_tokens_without_sw
This code only opens a single file handle per document and edits the file in place. Hope that is what you want.
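If juggling read and write positions on one handle feels fragile, a simpler pattern (my sketch, not from the answer above; it assumes the same folder layout as the question) is to read each file completely, transform the text, and then reopen the file in write mode, which truncates the old content automatically:
import os

dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'  # folder from the question

for document in os.listdir(dir_path):
    full_path = os.path.join(dir_path, document)
    with open(full_path, "r") as reader:    # first pass: read only
        text = reader.read()
    for mark in '.,:?!':
        text = text.replace(mark, ' ')      # drop the punctuation marks
    with open(full_path, "w") as writer:    # second pass: 'w' truncates the old content
        writer.write(text.strip())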

Words are not printed in the file

I have this project, which is an information search system.
I have an array called corpus and I want to read the words from it, extract the irregular verbs, and save them to another file.
I have another list containing the irregular verbs.
I compare the two lists, extract the irregular verbs, and save them in a file called "irregular_verbs_output".
But the problem is that when the program is executed, nothing is printed in the file "irregular_verbs_output".
Inside this file I call the corpus array in order to iterate over it and compare it with the list of irregular verbs, and if a verb is irregular it is put in the file "irregular_verbs_output".
second_request.py:
from firstRequest import read_corpus_file_and_delete_stop_words

corpus = read_corpus_file_and_delete_stop_words()
# print(corpus)

z = []
irregular_verbs = []

def read_irregular_verbs():
    with open('C:/Users/Super/PycharmProjects/pythonProject/Files/irregular_verbs.txt', 'r') as file:
        for line in file:
            for word in line.split():
                irregular_verbs.append(word)
    return irregular_verbs

# print(read_irregular_verbs())
irregular_verbs_file = []

def access_irregular_verbs():
    irregular_verbs_list = read_irregular_verbs()
    for t in irregular_verbs_list:
        for words in corpus:
            for word in words:
                if t != word[0]:
                    continue
                else:
                    with open('../Files/irregular_verbs_output.txt', 'a+') as irregular_file:
                        irregular_file.write(t)
    return irregular_verbs_list

print(access_irregular_verbs())
Through this file, I went to a folder named Corpus that contains many files and saved the tokens of each file as a list inside an array.
first.py:
def read_corpus_file_and_delete_stop_words():
    stop_words_list = stopwords.words('english')
    additional_stopwords = []
    with open("C:/Users/Super/Desktop/IR/homework/Lab4/IR Homework/stop words.txt", 'r') as file:
        for word in file:
            word = word.split('\n')
            additional_stopwords.append(word[0])
    stop_words_list += additional_stopwords
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    # save_dir = "C:/Users/Super/Desktop/IR/homework/Files_Without_SW/"
    files_without_sw = []
    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            # save_file = open(save_dir + document, 'w')
            text = reader.read()
            text = text.replace('.', ' ').replace(',', ' ')
            text = text.replace(':', ' ').replace('?', ' ').replace('!', ' ')
            text = text.replace('  ', ' ')  # convert double space into single space
            text = text.replace('"', ' ').replace('``', ' ')
            text = text.strip()  # remove space at the end
            text_tokens = word_tokenize(text)
            tokens_without_sw = [word for word in text_tokens if
                                 (word not in stop_words_list)]
            # save_file.writelines(["%s " % item for item in tokens_without_sw])
            # print(document, ':', tokens_without_sw)
            files_without_sw.append(tokens_without_sw)
    return files_without_sw

print(read_corpus_file_and_delete_stop_words())
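For reference, here is a rough sketch of the comparison the question describes (illustrative only; it assumes corpus is a list of token lists and read_irregular_verbs() returns a flat list of verbs, as in the code above): build a set of irregular verbs and write every corpus token that appears in it.
def write_irregular_verbs_found(corpus, irregular_verbs,
                                out_path='../Files/irregular_verbs_output.txt'):
    verb_set = set(irregular_verbs)          # set lookup instead of rescanning a list
    with open(out_path, 'w') as out_file:    # open the output file once, not inside the loops
        for words in corpus:
            for word in words:
                if word in verb_set:         # compare the whole token, not just word[0]
                    out_file.write(word + '\n')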

Data generation Python

I'm trying to generate a dataset based on an existing one. I was able to implement a method to randomly change the contents of files, but I can't write all of this back to a file. Moreover, I also need to write the number of changed words to the file, since I want to use this dataset to train a neural network. Could you help me?
Input: files with 2 lines of text in each.
Output: files with 3 (maybe) lines: the first line does not change, the second changes according to the method, and the third shows the number of words changed (if it is better to do this differently for deep learning tasks, I would be glad of advice, since I'm a beginner).
from random import randrange
import os

Path = "D:\corrected data\\"
filelist = os.listdir(Path)

if __name__ == "__main__":
    new_words = ['consultable', 'partie ', 'celle ', 'également ', 'forte ', 'statistiques ', 'langue ',
                 'cadeaux', 'publications ', 'notre', 'nous', 'pour', 'suivr', 'les', 'vos', 'visitez ',
                 'thème ', 'thème ', 'thème ', 'produits', 'coulisses ', 'un ', 'atelier ', 'concevoir ',
                 'personnalisés ', 'consultable', 'découvrir ', 'fournit ', 'trace ', 'dire ', 'tableau',
                 'décrire', 'grande ', 'feuille ', 'noter ', 'correspondant', 'propre', ]
    nb_words_to_replace = randrange(10)

    # with open("1.txt") as file:
    for i in filelist:
        # if i.endswith(".txt"):
        with open(Path + i, "r", encoding="utf-8") as file:
            # for line in file:
            data = file.readlines()
            first_line = data[0]
            second_line = data[1]
            print(f"Original: {second_line}")
            # print(f"FIle: {file}")
            second_line_array = second_line.split(" ")
            for j in range(nb_words_to_replace):
                replacement_position = randrange(len(second_line_array))
                old_word = second_line_array[replacement_position]
                new_word = new_words[randrange(len(new_words))]
                print(f"Position {replacement_position} : {old_word} -> {new_word}")
                second_line_array[replacement_position] = new_word
            res = " ".join(second_line_array)
            print(f"Result: {res}")
            with open(Path + i, "w") as f:
                for line in file:
                    if line == second_line:
                        f.write(res)
In short, you have two questions:
How to properly replace line number 2 (and 3) of the file.
How to keep track of number of words changed.
How to properly replace line number 2 (and 3) of the file.
Your code:
with open(Path + i, "w") as f:
    for line in file:
        if line == second_line:
            f.write(res)
Reading is not enabled, so for line in file will not work. Also, f is defined, but file is used instead. To fix this, do the following instead:
with open(Path + i, "r+") as file:
    lines = file.read().splitlines()  # splitlines() removes the \n characters
    lines[1] = second_line.rstrip("\n")
    file.seek(0)  # rewind to the start of the file before overwriting
    file.writelines(line + "\n" for line in lines)
    file.truncate()  # remove leftover content if the new text is shorter
However, you want to add more lines to it. I suggest you structure the logic differently.
How to keep track of number of words changed.
Add a variable changed_words_count and increment it when old_word != new_word.
Resulting code:
for i in filelist:
    filepath = Path + i
    # The lines that will be replacing the file
    new_lines = [""] * 3
    with open(filepath, "r", encoding="utf-8") as file:
        data = file.readlines()
        first_line = data[0]
        second_line = data[1]
        second_line_array = second_line.split(" ")
        changed_words_count = 0
        for j in range(nb_words_to_replace):
            replacement_position = randrange(len(second_line_array))
            old_word = second_line_array[replacement_position]
            new_word = new_words[randrange(len(new_words))]
            # A word replaced does not mean the word has changed.
            # It could be replacing itself.
            # Check if the replacing word is different
            if old_word != new_word:
                changed_words_count += 1
            second_line_array[replacement_position] = new_word
        # Add the lines to the new file lines
        new_lines[0] = first_line
        new_lines[1] = " ".join(second_line_array)
        new_lines[2] = str(changed_words_count)
        print(f"Result: {new_lines[1]}")
    with open(filepath, "w") as file:
        file.writelines(new_lines)
Note: Code not tested.
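As a quick sanity check (my sketch, assuming Path and filelist from the question and that each output file ends up with one value per line), reading one generated file back should show the three expected lines:
sample = Path + filelist[0]
with open(sample, "r", encoding="utf-8") as f:
    original, modified, count = f.read().splitlines()
print("unchanged first line:", original)
print("modified second line:", modified)
print("words changed:", count)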

How to create table to find mean of document using python

I have a directory containing corpus text files. I want to create a table of the number of words in each document: one row per document and one column per unique word, holding that word's count in that document. All of this should be done in Python. Please help, thank you.
The table should look like this:
       word1  word2  word3  ...
doc1      14      5     45
doc2       6      1      0
...
import nltk
import collections
import os.path

def cleanDoc(doc):
    stopset = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = nltk.WordPunctTokenizer().tokenize(doc)
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

path = "c://Users/Desktop/corpus files"
i = 0
for file in os.listdir(path):
    f = open("c://Users/Desktop/corpus files/file%d.txt" % i, 'r')
    data = f.read()
    words = cleanDoc(data)
    fw = open("c://Users/Desktop/words/words%d.txt" % i, 'w')
    fd = collections.Counter(words)
    # fd = nltk.FreqDist(words)
    # plot(fd)
    row_format = "{:>15}" * (len(words) + 1)
    print row_format.format("document %d" % i, *words)
    # for
    fw.write(str(fd))
    fw.write(str(words))
    fw.close()
    i = i + 1
    f.close()
I think this is fairly close to, if not exactly, what you want. In case it isn't, I tried to make things easy to change.
To produce the desired table, processing is done in two phases. In the first, the unique words in each document file of the form file<document-number>.txt are found and saved in a corresponding words<document-number>.txt file, plus they are added to a set comprising all the unique words seen among all document files. This set is needed to produce table columns that consist of all the unique words in all the files, and is why two phases of processing are required.
In the second phase, the word files are read back in and turned back into dictionaries, which are used to fill in the corresponding columns of the table being printed.
import ast
import collections
import nltk
import re
import os

user_name = "UserName"
path = "c://Users/%s/Desktop/corpus files" % user_name

def cleanDoc(doc):
    stopset = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = nltk.WordPunctTokenizer().tokenize(doc)
    clean = [token.lower() for token in tokens
             if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

# phase 1 -- find unique words, create word files, update overall unique word set
corpus_file_pattern = re.compile(r"""file(\d+).txt""")
unique_words = set()
longest_filename = 0
document_nums = []
for filename in os.listdir(path):
    corpus_file_match = corpus_file_pattern.match(filename)
    if corpus_file_match:  # corpus text file?
        if len(filename) > longest_filename:
            longest_filename = len(filename)
        document_num = int(corpus_file_match.group(1))
        document_nums.append(document_num)
        with open(os.path.join(path, filename)) as file:
            data = file.read()
        words = cleanDoc(data)
        unique_words.update(words)
        fd = collections.Counter(words)
        words_filename = "words%d.txt" % document_num
        with open(os.path.join(path, words_filename), mode='wt') as fw:
            fw.write(repr(dict(fd)) + '\n')  # write representation as dict

# phase 2 -- create table using unique_words and data in word files
unique_words_list = sorted(unique_words)
unique_words_empty_counter = collections.Counter({word: 0 for word
                                                  in unique_words})
document_nums = sorted(document_nums)
padding = 2  # spaces between columns
min_col_width = 5
col_headings = ["Document"] + unique_words_list
col_widths = [max(min_col_width, len(word)) + padding for word in col_headings]
col_widths[0] = longest_filename + padding  # first col is special case

# print table headings
for i, word in enumerate(col_headings):
    print "{:{align}{width}}".format(word, align='>' if i else '<',
                                     width=col_widths[i]),
print

for document_num in document_nums:
    # read word in document dictionary back in
    filename = "words%d.txt" % document_num
    file_words = unique_words_empty_counter.copy()
    with open(os.path.join(path, filename)) as file:
        data = file.read()
    # convert data read into dict and update with file word counts
    file_words.update(ast.literal_eval(data))
    # print row of data
    print "{:<{width}}".format(filename, width=col_widths[0]),
    for i, word in enumerate(col_headings[1:], 1):
        print "{:>{width}n}".format(file_words[word], width=col_widths[i]),
    print
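A more compact alternative (my sketch, not part of the answer above; it assumes pandas is installed and reuses cleanDoc() and the file naming scheme from the question) builds the same document-by-word count table as a DataFrame:
import collections
import os
import pandas as pd

path = "c://Users/Desktop/corpus files"  # directory from the question
counts = {}
for filename in sorted(os.listdir(path)):
    if filename.startswith("file") and filename.endswith(".txt"):
        with open(os.path.join(path, filename)) as f:
            counts[filename] = collections.Counter(cleanDoc(f.read()))

# rows are documents, columns are unique words; words missing from a document become 0
table = pd.DataFrame.from_dict(counts, orient="index").fillna(0).astype(int)
print(table)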

Replace four letter word in python

I am trying to write a program that opens a text document and replaces all four letter words with **. I have been messing around with this program for multiple hours now. I can not seem to get anywhere. I was hoping someone would be able to help me out with this one. Here is what I have so far. Help is greatly appreciated!
def censor():
    filename = input("Enter name of file: ")
    file = open(filename, 'r')
    file1 = open(filename, 'w')
    for element in file:
        words = element.split()
        if len(words) == 4:
            file1 = element.replace(words, "xxxx")
            alist.append(bob)
    print(file)
    file.close()
Here is a revised version; I don't know if this is much better:
def censor():
    filename = input("Enter name of file: ")
    file = open(filename, 'r')
    file1 = open(filename, 'w')
    i = 0
    for element in file:
        words = element.split()
        for i in range(len(words)):
            if len(words[i]) == 4:
                file1 = element.replace(i, "xxxx")
        i = i + 1
    file.close()
for element in file:
    words = element.split()
    for word in words:
        if len(word) == 4:
            etc etc
Here's why:
say the first line in your file is 'hello, my name is john'
then for the first iteration of the loop: element = 'hello, my name is john'
and words = ['hello,','my','name','is','john']
You need to check what is inside each word, hence for word in words.
Also it might be worth noting that in your current method you do not pay any attention to punctuation. Note the first word in words above...
To get rid of punctuation rather say:
import string
blah blah blah ...
for word in words:
    cleaned_word = word.strip(string.punctuation)
    if len(cleaned_word) == 4:
        etc etc
Here is a hint: len(words) returns the number of words on the current line, not the length of any particular word. You need to add code that would look at every word on your line and decide whether it needs to be replaced.
Also, if the file is more complicated than a simple list of words (for example, if it contains punctuation characters that need to be preserved), it might be worth using a regular expression to do the job.
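As an illustration of that suggestion (my sketch, not part of the original answer), a regular expression can match standalone four-letter words in one pass:
import re

line = "Give me five good reasons."
censored = re.sub(r"\b\w{4}\b", "**", line)  # \b\w{4}\b = exactly four word characters
print(censored)  # ** me ** ** reasons.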
It can be something like this:
def censor():
    filename = input("Enter name of file: ")
    with open(filename, 'r') as f:
        lines = f.readlines()
    newLines = []
    for line in lines:
        words = line.split()
        for i, word in enumerate(words):
            if len(word) == 4:
                words[i] = '**'
        newLines.append(' '.join(words))
    with open(filename, 'w') as f:
        for line in newLines:
            f.write(line + '\n')
def censor(filename):
    """Takes a file and writes it into file censored.txt with every 4-letter word replaced by xxxx"""
    infile = open(filename)
    content = infile.read()
    infile.close()
    outfile = open('censored.txt', 'w')
    table = content.maketrans('.,;:!?', '      ')  # one blank per punctuation mark
    noPunc = content.translate(table)  # replace all punctuation marks with blanks, so they won't tie two words together
    wordList = noPunc.split(' ')
    for word in wordList:
        if '\n' in word:
            count = word.count('\n')
            wordLen = len(word) - count
        else:
            wordLen = len(word)
        if wordLen == 4:
            censoredWord = word.replace(word, 'xxxx ')
            outfile.write(censoredWord)
        else:
            outfile.write(word + ' ')
    outfile.close()
