Removing phrases from a file based on another file (Python)

How do I do this in python?
badphrases.txt contains
Go away
Don't do that
Stop it
allphrases.txt contains
I don't know why you do that. Go away.
I was wondering what you were doing.
You seem nice
I want allphrases.txt to be clean of the lines in badphrases.txt.
It's trivial in bash:
cat badphrases.txt | while read b
do
    cat allphrases.txt | grep -v "$b" > tmp
    cat tmp > allphrases.txt
done
Oh, you thought I hadn't looked or tried. I searched for over an hour.
Here's my code:
# Files
ttv = "/tmp/tv.dat"
tmp = "/tmp/tempfile"
bad = "/tmp/badshows"
# badfiles already exists
# ...code right here creates ttv
# Function grep_v
def grep_v(f,str):
    file = open(f, "r")
    for line in file:
        if line in str:
            return True
    return False
t = open(tmp, 'w')
tfile = open(ttv, "r")
for line in tfile:
    if not grep_v(bad,line):
        t.write(line)
tfile.close
t.close
os.rename(tmp, ttv)

First google how to read a file in python:
you will probably get something like this: How do I read a file line-by-line into a list?
Use this to read both the files in lists
with open('badphrases.txt') as f:
    content = f.readlines()
badphrases = [x.strip() for x in content]
with open('allphrases.txt') as f:
    content = f.readlines()
allphrases = [x.strip() for x in content]
Now you have both the content in lists.
Iterate over allphrases and check if phrases from badphrases are present in it.
At this point you might consider googling:
how to iterate over a list python
how to check if string present in another string python
Take the code from those places and build a brute-force algorithm like this:
for line in allphrases:
    flag = True
    for badphrase in badphrases:
        if badphrase in line:
            flag = False
            break
    if flag:
        print(line)
If you understand this code, you will notice that you need to replace print with output to a file.
Now google how to print to file python.
Then think about how to improve the algorithm. All the best.
UPDATE:
@COLDSPEED suggested you can simply google
- how to replace lines in a file in python:
You might get something like this: Search and replace a line in a file in Python
That also works.
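Putting the steps above together, a minimal sketch might look like the following. The function names remove_bad_lines and clean_file are made up for illustration, and the matching is a plain case-sensitive substring test, the same as in the brute-force loop. Writing to a temporary file and swapping it in is one possible way to update the file; it is an assumption here, not something the answer prescribes.

```python
import os
import tempfile


def remove_bad_lines(lines, badphrases):
    """Return the lines that contain none of the bad phrases (substring match)."""
    # Skip empty phrases: an empty string is a substring of every line.
    phrases = [p for p in badphrases if p]
    return [line for line in lines if not any(p in line for p in phrases)]


def clean_file(all_path, bad_path):
    """Rewrite all_path, dropping lines that contain a phrase from bad_path."""
    with open(bad_path) as f:
        badphrases = [x.strip() for x in f]
    with open(all_path) as f:
        lines = f.readlines()
    kept = remove_bad_lines(lines, badphrases)
    # Write to a temporary file in the same directory, then swap it in,
    # so a crash midway does not leave a half-written file behind.
    directory = os.path.dirname(os.path.abspath(all_path))
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.writelines(kept)
    os.replace(tmp.name, all_path)
```

Calling clean_file('allphrases.txt', 'badphrases.txt') would then play the role of the bash loop from the question.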

Solution not too bad.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import feedparser, os, re
# Files
h = os.environ['HOME']
ttv = h + "/WEB/Shows/tv.dat"
old = h + "/WEB/Shows/old.dat"
moo = h + "/WEB/Shows/moo.dat"
tmp = h + "/WEB/Shows/tempfile"
bad = h + "/WEB/Shows/badshows"
# Function not_present
def not_present(f, text):
    # True if no line of file f contains text
    with open(f, "r") as file:
        for line in file:
            if text in line:
                return False
    return True
# Sources (shortened)
sources = ['http://predb.me/?cats=tv&rss=1']
# Grab all the feeds and put them into ttv and old
k = open(old, 'a')
f = open(ttv, 'a')
for url in sources:
    d = feedparser.parse(url)
    for post in d.entries:
        if not_present(old, post.link):
            f.write(post.title + "|" + post.link + "\n")
            k.write(post.title + "|" + post.link + "\n")
f.close()
k.close()
# Remove shows without [Ss][0-9] and put them in moo
m = open(moo, 'a')
t = open(tmp, 'w')
with open(ttv, "r") as file:
    for line in file:
        if re.search(r's[0-9]', line, re.I) is None:
            m.write(line)
            # print("moo", line)
        else:
            t.write(line)
            # print("tmp", line)
t.close()
m.close()
os.rename(tmp, ttv)
# Remove badshows
t = open(tmp, 'w')
with open(bad) as f:
    content = f.readlines()
bap = [x.strip() for x in content]
with open(ttv) as f:
    content = f.readlines()
all_lines = [x.strip() for x in content]
for line in all_lines:
    flag = True
    for b in bap:
        if b in line:
            flag = False
            break
    if flag:
        t.write(line + "\n")
t.close()
os.rename(tmp, ttv)

Related

I am trying to create a program that edits text by letting you select from a few things and changes them to one of a few options

import os, re
config_file = "jsm_gyro_config.txt"
#fptr = open(config, "w")
#text = "demo text"
#fptr.write(text)
#fptr.close()
file = open(config_file, 'r')
file-read = file.read()
for line in file-read:
    if re.search(userinput, file-read):
        x = re.search(userinput, file-read)
        # iteminputted is what the user wants to replace
        iteminputted = "ref"
        startpostion = x.span[1] + 3
        endpostion = startposition + len(iteminputted)
        # Find out how to write to a specific location in a file that will finish this off
    else:
        print("Item not found")
This is what I've tried, and the comments show my thought process. As always, any help is appreciated, and please make it understandable for an idiot :(

To begin with, you should not use - in your variable names, because it is an operator and will always be treated as one: Python will attempt to subtract.
Here is the same code with that fixed and with the input added. I also call x.span() as a method, spell startposition consistently, and drop the for loop, which iterated over single characters and repeated the same search each time.
import os, re
config_file = "jsm_gyro_config.txt"
#fptr = open(config, "w")
#text = "demo text"
#fptr.write(text)
#fptr.close()
file = open(config_file, 'r')
file_read = file.read()
file.close() # You should always close your files.
# userinput is assumed to be defined earlier, as in the question
x = re.search(userinput, file_read)
if x:
    # iteminputted is what the user wants to replace
    iteminputted = input("Input what you would like to replace > ")
    startposition = x.span()[1] + 3
    endposition = startposition + len(iteminputted)
    # Find out how to write to a specific location in a file that will finish this off
else:
    print("Item not found")
However, your question is very unclear; I did the best I could.
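The remaining step, writing at a specific location, is usually done for small text files by rebuilding the whole string and writing it back, rather than by seeking inside the file. The sketch below shows only the string part. The function name replace_after_match, the gap parameter (mirroring the + 3 offset in the code above), and the sample MODE key are all hypothetical and say nothing about the real format of jsm_gyro_config.txt.

```python
import re


def replace_after_match(text, pattern, new_value, old_length, gap=3):
    """Replace old_length characters starting gap characters after the first match.

    Returns the new text, or None when the pattern is not found.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    start = match.end() + gap   # same idea as x.span()[1] + 3
    end = start + old_length
    return text[:start] + new_value + text[end:]


# Hypothetical sample content, not the real config format.
sample = "MODE = ref\nOTHER = 1\n"
updated = replace_after_match(sample, "MODE", "abs", old_length=len("ref"))
```

To update the file itself, one would read it into a string, pass it through such a function, and then reopen the file with mode 'w' to write the result.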

replace line if found or append - python

I have text that is key-value pairs separated by '='. I would like to replace the line if the key matches. If not, I would like to append it at the bottom. I've tried several ways, including:
def split_command_key_and_value(command):
    if '=' in command:
        command2 = command.split('=')
        return command2
def test(command, path):
    command2 = split_command_key_and_value(command)
    pattern = command2[0]
    myfile = open(path,'r') # open file handle for read
    # use r'', you don't need to replace '\' with '/'
    result = open(path, 'w') # open file handle for write
    for line in myfile:
        line = line.strip() # it's always a good behave to strip what you read from files
        if pattern in line:
            line = command # if match, replace line
        result.write(line) # write every line
    myfile.close() # don't forget to close file handle
    result.close()
I know the above is just to replace text, but it deletes the text in the file, and I can't see why. Could someone point me in the right direction?
Thanks
Update:
I'm almost there, but some of my lines have similar keys, so multiple lines are matching when only one should. I've tried to incorporate a regex boundary in my loop with no luck. My code is below. Does anyone have a suggestion?
There is some text in the file that isn't key-value, so I would like to skip that.
def modify(self, name, value):
    comb = name + ' ' + '=' + ' ' + value + '\n'
    with open('/file/', 'w') as tmpstream:
        with open('/file/', 'r') as stream:
            for line in stream:
                if setting_name in line:
                    tmpstream.write(comb)
                else:
                    tmpstream.write(line)
I think I got it. See code below.
def modify(self, name, value):
    comb = name + ' ' + '=' + ' ' + value + '\n'
    mylist = []
    with open('/file/', 'w') as tmpstream:
        with open('/file/', 'r') as stream:
            for line in stream:
                a = line.split()
                b = re.compile('\\b'+name+'\\b')
                if len(a) > 0:
                    if b.search(a[0]):
                        tmpstream.write(comb)
                    else:
                        tmpstream.write(line)
I spoke too soon. It stops at the key-value I provide. So, it only writes one line, and doesn't write the lines that don't match.
def modify(name, value):
    comb = name + ' ' + '=' + ' ' + value + '\n'
    mylist = []
    with open('/file1', 'w') as tmpstream:
        with open('/file2', 'r') as stream:
            for line in stream:
                a = line.split()
                b = re.compile('\\b'+name+'\\b')
                if len(a) > 0:
                    if b.search(a[0]):
                        tmpstream.write(comb)
                    else:
                        tmpstream.write(line)
Can anyone see the issue?

Because when you open a file for writing
result = open(path, 'w') # open file handle for write
you erase its content. Try writing to a different file and, once all the work is done, replace the old file with the new one. Or read all the data into memory, then process it and write it back to the file.
with open(path) as f:
    data = f.read()
with open(path, 'w') as f:
    for l in data.splitlines():
        # do the work here, then write the line back
        f.write(l + '\n')

First of all, you are reading and writing the same file...
You could first read it all and then write it line by line:
with open(path,'r') as f:
    myfile = f.read() # read everything in the variable "myfile"
result = open(path, 'w') # open file handle for write
for line in myfile.splitlines(): # process the original file content 1 line at a time
    # as before

I strongly recommend reading python's documentation on how to read and write files.
If you open an existing file in write-mode open(path, 'w'), its content will be erased:
mode can be (...) 'w' for only writing (an existing file with the same name will be erased)
To replace a line in python you can have a look at this: Search and replace a line in a file in Python
Here is one of the solutions provided there, adapted to your context (tested for python3):
from tempfile import mkstemp
from shutil import move
from os import close
def test(filepath, command):
    # Split command into key/value
    key, _ = command.split('=')
    matched_key = False
    # Create a temporary file
    fh, tmp_absolute_path = mkstemp()
    with open(tmp_absolute_path, 'w') as tmp_stream:
        with open(filepath, 'r') as stream:
            for line in stream:
                if key in line:
                    matched_key = True
                    tmp_stream.write(command + '\n')
                else:
                    tmp_stream.write(line)
        if not matched_key:
            tmp_stream.write(command + '\n')
    close(fh)
    move(tmp_absolute_path, filepath)
Note that with the code above every line that matches key (key=blob or blob=key) will be replaced.
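If only an exact key should match, one possible approach is to compare the text to the left of the first '=' instead of using a substring test. The sketch below works on a list of lines so that it stays independent of how the file is read and written. The name set_key and the "key = value" output spacing are assumptions for illustration.

```python
def set_key(lines, key, value):
    """Return a new list of lines with `key = value` replaced or appended."""
    new_line = "{} = {}\n".format(key, value)
    result = []
    found = False
    for line in lines:
        # Compare the text left of the first '=' exactly, so that
        # "timeout" does not match "timeout_max".
        name, sep, _ = line.partition("=")
        if sep and name.strip() == key:
            result.append(new_line)
            found = True
        else:
            result.append(line)  # non key-value text passes through unchanged
    if not found:
        # Make sure the appended line starts on its own line.
        if result and not result[-1].endswith("\n"):
            result[-1] += "\n"
        result.append(new_line)
    return result
```

The lines can come from readlines() on the original file, and the returned list can be written to a temporary file with writelines() before moving it over the original, as in the answer above.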

Python -- How to split headers/chapters into separate files automatically

I'm converting text directly to epub and I'm having a problem automatically splitting the HTML book file into separate header/chapter files. At the moment, the code below partially works but only creates every other chapter file. So half the header/chapter files are missing from the output. Here is the code:
def splitHeaderstoFiles(fpath):
    infp = open(fpath, 'rt', encoding=('utf-8'))
    for line in infp:
        # format and split headers to files
        if '<h1' in line:
            #-----------format header file names and other stuff ------------#
            # create a new file for the header/chapter section
            path = os.getcwd() + os.sep + header
            with open(path, 'wt', encoding=('utf-8')) as outfp:
                # write html top meta headers
                outfp = addMetaHeaders(outfp)
                # add the header
                outfp = outfp.write(line)
                # add the chapter/header bodytext
                for line in infp:
                    if '<h1' not in line:
                        outfp.write(line)
                    else:
                        outfp.write('</body>\n</html>')
                        break
        else:
            continue
    infp.close()
The problem occurs in the second 'for loop' at the bottom of the code, when I look for the next h1 tag to stop the split. I cannot use seek() or tell() to rewind or move back one line so the program can find the next header/chapter on the next iteration. Apparently you cannot use these in python in a for loop containing an implicit iter or next object in operation. Just gives a 'can't do non-zero cur-relative seeks' error.
I've also tried the while line != ' ' + readline() combination in the code which also gives the same error as above.
Does anyone know an easy way to split HTML headers/chapters of varying lengths into separate files in python? Are there any special python modules(such as pickles) that could help make this task easier?
I'm using Python 3.4
My grateful thanks in advance for any solutions to this problem...

I ran into a similar problem a while ago; here is a simplified solution:
from itertools import count
chapter_number = count(1)
output_file = open('000-intro.html', 'w')
with open('index.html', 'rt') as input_file:
    for line in input_file:
        if '<h1' in line:
            output_file.close()
            output_file = open('{:03}-chapter.html'.format(next(chapter_number)), 'w')
        output_file.write(line)
output_file.close()
In this approach, the first block of text leading up to the first h1 block is written into 000-intro.html, the first chapter is written into 001-chapter.html, and so on. Please modify it to taste.
The solution is a simple one: upon encountering the h1 tag, close the last output file and open a new one.

You are looping over your input file twice, which is likely causing your problems:
for line in infp:
    ...
    with open(path, 'wt', encoding=('utf-8')) as outfp:
        ...
        for line in infp:
            ...
Both for loops pull lines from the same file object, so when the inner loop reads the next '<h1' line and breaks, that line has already been consumed. The outer loop never sees it, which is why every other chapter is skipped.
You might try transforming your for loops into while loops that share one explicit current line, so the header that ends one chapter is still available to start the next:
line = infp.readline()
while line:
    if '<h1' in line:
        with open(...) as outfp:
            outfp.write(line)
            line = infp.readline()
            while line and '<h1' not in line:
                outfp.write(line)
                line = infp.readline()
    else:
        line = infp.readline()
Alternatively, you may wish to use an HTML parser (e.g., BeautifulSoup). Then you can do something like what is described here: https://stackoverflow.com/a/8735688/65295.
Update from comment: essentially, read the entire file at once so you can freely move back or forward as necessary. This probably won't be a performance issue unless you have a really big file (or very little memory).
lines = infp.readlines() # read the entire file
i = 0
while i < len(lines):
    if '<h1' in lines[i]:
        with open(...) as outfp:
            outfp.write(lines[i]) # write the header line itself
            j = i + 1
            while j < len(lines):
                if '<h1' in lines[j]:
                    break
                outfp.write(lines[j])
                j += 1
        # line j has an <h1>, set i to j so we detect it at the
        # top of the next loop iteration.
        i = j
    else:
        i += 1
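The same idea can also be expressed without index bookkeeping by grouping the lines first and writing the files afterwards. This is only a sketch under the assumption that each chapter starts on a line containing '<h1'. The function name split_chapters and the marker parameter are invented for illustration.

```python
def split_chapters(lines, marker="<h1"):
    """Group lines into chapters; each chapter starts at a line containing marker.

    Lines before the first marker form the first group, which may be empty.
    """
    chapters = [[]]
    for line in lines:
        if marker in line:
            chapters.append([])  # start a new chapter at every header
        chapters[-1].append(line)
    return chapters
```

Each inner list can then be written to its own file with writelines(), and no header line is lost because every line is appended exactly once.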

I eventually found the answer to the above problem. The code below does a lot more than just get the file header. It also loads two parallel lists, one with the formatted file names (with extension) and one with the pure header names, so I can use them to fill in the title in each of these html files within a while loop in one hit. The code now works well and is shown below.
def splitHeaderstoFiles(dir, inpath):
    count = 1
    t_count = 0
    out_path = ''
    header = ''
    write_bodytext = False
    file_path_names = []
    pure_header_names = []
    inpath = dir + os.sep + inpath
    with open(inpath, 'rt', encoding=('utf-8')) as infp:
        for line in infp:
            if '<h1' in line:
                #strip html tags, convert to start caps
                p = re.compile(r'<.*?>')
                header = p.sub('', line)
                header = capwords(header)
                line_save = header
                # Add 0 for count below 10
                if count < 10:
                    header = '0' + str(count) + '_' + header
                else:
                    header = str(count) + '_' + header
                # remove all spaces + add extension in header
                header = header.replace(' ', '_')
                header = header + '.xhtml'
                count = count + 1
                #create two parallel lists used later
                out_path = dir + os.sep + header
                outfp = open(out_path, 'wt', encoding=('utf-8'))
                file_path_names.insert(t_count, out_path)
                pure_header_names.insert(t_count, line_save)
                t_count = t_count + 1
                # Add html meta headers and write it
                outfp = addMainHeaders(outfp)
                outfp.write(line)
                write_bodytext = True
            # add header bodytext
            elif write_bodytext == True:
                outfp.write(line)
    # now add html titles and close the html tails on all files
    max_num_files = len(file_path_names)
    tmp = dir + os.sep + 'temp1.tmp'
    i = 0
    while i < max_num_files:
        outfp = open(tmp, 'wt', encoding=('utf-8'))
        infp = open(file_path_names[i], 'rt', encoding=('utf-8'))
        for line in infp:
            if '<title>' in line:
                line = line.strip(' ')
                line = line.replace('<title></title>', '<title>' + pure_header_names[i] + '</title>')
                outfp.write(line)
            else:
                outfp.write(line)
        # add the html tail
        if '</body>' in line or '</html>' in line:
            pass
        else:
            outfp.write(' </body>' + '\n</html>')
        # clean up
        infp.close()
        outfp.close()
        shutil.copy2(tmp, file_path_names[i])
        os.remove(tmp)
        i = i + 1
    # now rename just the title page
    if os.path.isfile(file_path_names[0]):
        title_page_name = file_path_names[0]
        new_title_page_name = dir + os.sep + '01_Title.xhtml'
        os.rename(title_page_name, new_title_page_name)
        file_path_names[0] = '01_Title.xhtml'
    else:
        logmsg27(DEBUG_FLAG)
        os._exit(0)
    # xhtml file is no longer needed
    if os.path.isfile(inpath):
        os.remove(inpath)
    # returned list values are also used
    # later to create epub opf and ncx files
    return(file_path_names, pure_header_names)
@Hai Vu and @Seth -- Thanks for all your help.

Deleting one line within txt file in Python

I am having problems deleting a specific line/entry within a text file. With the code I have the top line in the file is deleted no matter what line number I select to delete.
def erase():
    contents = {}
    f = open('members.txt', 'a')
    f.close()
    f = open('members.txt', 'r')
    index = 0
    for line in f:
        index = index + 1
        contents[index] = line
        print ("{0:3d}) {1}".format(index,line))
    f.close()
    total = index
    entry = input("Enter number to be deleted")
    f = open('members.txt', 'w')
    index = 0
    for index in range(1,total):
        index = index + 1
        if index != entry:
            f.write(contents[index])

Try this:
import sys
import os
def erase(file):
    assert os.path.isfile(file)
    with open(file, 'r') as f:
        content = f.read().split("\n")
    #print(content)
    # input() returns a string in Python 3, so convert it to an int
    entry = int(input("Enter number to be deleted:"))
    assert entry >= 0 and entry < len(content)
    new_file = content[:entry] + content[entry+1:]
    #print(new_file)
    with open(file,'w') as f:
        f.write("\n".join(new_file))
if __name__ == '__main__':
    erase(sys.argv[1])
As already noted, you were starting the range from 1, which is incorrect. The list slicing I used in new_file = content[:entry] + content[entry+1:] makes the code more readable and is less prone to similar errors. Note that entry is a zero-based index here.
Also, you seem to open and close the input file at the beginning for no reason, and you should use with whenever possible when working with files.
Finally, I used join and split to simplify the code so you don't need a for loop to process the lines of the file.
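Since the question prints the entries numbered from 1, a variant that accepts the number exactly as displayed may be more convenient. The following is a small sketch of the same slicing idea with a 1-based position. The function name delete_line is hypothetical, and the number is expected to be an int already (for example from int(input(...))).

```python
def delete_line(lines, number):
    """Return a copy of lines without the entry at 1-based position number."""
    if not 1 <= number <= len(lines):
        raise ValueError("line number out of range: {}".format(number))
    # Everything before the entry plus everything after it.
    return lines[:number - 1] + lines[number:]
```

The result can be written back with writelines() when the lines were read with readlines(), because the trailing newlines are preserved.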

Python: Issue when trying to read and write multiple files

This script reads and rewrites all the individual html files in a directory. For each file, it iterates over the matches of the search terms, highlights them, and writes the output. The issue is that, after highlighting the last instance of a search item, the script drops all the remaining content after that last match from the output of each file. Any help here is appreciated.
import os
import sys
import re
source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)
for f in listfiles:
    filepath = os.path.join(source+'\\'+f)
    infile = open(filepath, 'r+')
    source_content = infile.read()
    color = ('red')
    regex = re.compile(r"(\b in \b)|(\b be \b)|(\b by \b)|(\b user \b)|(\bmay\b)|(\bmight\b)|(\bwill\b)|(\b's\b)|(\bdon't\b)|(\bdoesn't\b)|(\bwon't\b)|(\bsupport\b)|(\bcan't\b)|(\bkill\b)|(\betc\b)|(\b NA \b)|(\bfollow\b)|(\bhang\b)|(\bbelow\b)", re.I)
    i = 0; output = ""
    for m in regex.finditer(source_content):
        output += "".join([source_content[i:m.start()],
                           "<strong><span style='color:%s'>" % color[0:],
                           source_content[m.start():m.end()],
                           "</span></strong>"])
        i = m.end()
    outfile = open(filepath, 'w')
    outfile.seek(0, 2)
    outfile.write(output)
    print "\nProcess Completed!\n"
    infile.close()
    outfile.close()
raw_input()

After your for loop is over, you need to include whatever is left after the last match:
...
        i = m.end()
    output += source_content[i:] # Here's the end of your file
    outfile = open(filepath, 'w')
...
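As a self-contained illustration of the fix, the highlighting step can be pulled into a function that always appends the tail. This sketch is written for Python 3, whereas the question's script is Python 2. The function name highlight and the short two-word pattern in the usage lines are made up for the example.

```python
import re


def highlight(text, regex, color="red"):
    """Wrap every match of regex in a colored <strong><span> element."""
    parts = []
    last = 0
    for m in regex.finditer(text):
        parts.append(text[last:m.start()])   # text between matches
        parts.append("<strong><span style='color:%s'>%s</span></strong>"
                     % (color, m.group(0)))
        last = m.end()
    parts.append(text[last:])                # the tail after the last match
    return "".join(parts)


# Hypothetical pattern, much shorter than the one in the question.
words = re.compile(r"\bmay\b|\bwill\b", re.I)
result = highlight("It may rain. Bye.", words)
```

Because the tail is appended unconditionally, a file with no matches at all comes back unchanged instead of empty.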
