I am new to Python, and with some really great assistance from StackOverflow, I've written a program that:
1) Looks in a given directory, and for each file in that directory:
2) Runs an HTML-cleaning program, which:
   - Opens each file with BeautifulSoup
   - Removes blacklisted tags & content
   - Prettifies the remaining content
   - Runs Bleach to remove all non-whitelisted tags & attributes
   - Saves the result out as a new file
It works very well, except when it hits a certain kind of file content that throws up a bunch of BeautifulSoup errors and aborts the whole thing. I want it to be robust against that, as I won't have control over what sort of content winds up in this directory.
So, my question is: How can I re-structure the program so that when it errors on one file within the directory, it reports that it was unable to process that file, and then continues to run through the remaining files?
Here is my code so far (with extraneous detail removed):
import os
import bleach
from bs4 import BeautifulSoup  # or `from BeautifulSoup import BeautifulSoup` on older versions

def clean_dir(directory):
    os.chdir(directory)
    for filename in os.listdir(directory):
        clean_file(filename)

def clean_file(filename):
    tag_black_list = ['iframe', 'script']
    tag_white_list = ['p', 'div']
    attr_white_list = {'*': ['title']}
    with open(filename, 'r') as fhandle:
        text = BeautifulSoup(fhandle)
        text.encode("utf-8")
        print "Opened " + filename
        # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
        [s.decompose() for s in text(tag_black_list)]
        pretty = (text.prettify())
        print "Prettified"
        # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
        cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
        fout = open("../posts-cleaned/" + filename, "w")
        fout.write(cleaned.encode("utf-8"))
        fout.close()
        print "Saved " + filename + " in /posts-cleaned"
    print "Done"

clean_dir("../posts/")
I'm looking for any guidance on how to write this so that it keeps running after hitting a parsing/encoding/content/attribute/etc. error inside the clean_file function.
You can handle the errors using try-except-finally.
You can do the error handling inside clean_file or in the for loop:
for filename in os.listdir(directory):
    try:
        clean_file(filename)
    except:
        print "Error processing file %s" % filename
If you know what exception gets raised you can use a more specific catch.
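For example, here is a sketch of the loop that catches a few likely error types and keeps going; the exact exception classes in the tuple are an assumption, so adjust them to whatever your BeautifulSoup/Bleach versions actually raise:

import traceback

def clean_dir(directory):
    os.chdir(directory)
    for filename in os.listdir(directory):
        try:
            clean_file(filename)
        except (IOError, UnicodeDecodeError, ValueError) as err:
            # Report the failure and carry on with the next file.
            print "Unable to process %s: %s" % (filename, err)
            traceback.print_exc()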
What I need to do is write some messages to a .txt file, close it, and send it to a server. This happens in an infinite loop, so the code should look more or less like this:
import time
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

num = 0
while True:
    num += 1
    filename = f"example{num}.txt"
    with open(filename, "w") as f:
        f.write("Hello")
        f.close()
    mp_encoder = MultipartEncoder(
        fields={
            'file': ("file", open(filename, 'rb'), 'text/plain')
        }
    )
    r = requests.post("my_url/save_file", data=mp_encoder, headers=my_headers)
    time.sleep(10)
The post works if the file is created manually inside my working directory, but if I try to create it and write to it through code, I receive this response message:
500 - Internal Server Error
System.IO.IOException: Unexpected end of Stream, the content may have already been read by another component.
I don't see the file appearing in the project window of PyCharm. I even used time.sleep(10) because at first I thought it could be a timing problem, but that didn't solve anything. In fact, the file appears in my working directory only when I stop the program, so it seems the file is held by the program even after I explicitly called f.close(). I know the with statement should take care of closing files, but it didn't look like that, so I added a close() to check whether that was the problem (spoiler: it was not).
I solved the problem by using another file
with open(filename, "r") as firstfile, open("new.txt", "a+") as secondfile:
    secondfile.write(firstfile.read())

with open(filename, 'w'):
    pass

r = requests.post("my_url/save_file", data=mp_encoder, headers=my_headers)

if r.status_code == requests.codes.ok:
    os.remove("new.txt")
else:
    print("File not saved")
I make a copy of the file, empty the original file to save space, and send the copy to the server (and then delete the copy). It looks like the problem was that the original file was being held open by the Python logging module.
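Another option, if the file really is being held open by a logging FileHandler, is to close and detach that handler before sending the file. This is only a sketch, assuming the handler hangs off a logger you can reach (the logger name here is hypothetical):

import logging

logger = logging.getLogger("my_app")        # hypothetical logger name
for handler in list(logger.handlers):
    if isinstance(handler, logging.FileHandler):
        handler.close()                     # flush and release the file on disk
        logger.removeHandler(handler)
# The file is now complete on disk and can be copied or uploaded safely.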
Firstly, can you change open(f, 'rb') to open("example.txt", 'rb')? You should be passing open a file name, not a closed file object.
Also, you can use os.path.abspath to find out where the file is actually written:
import os
os.path.abspath('.')
Third point: when you use a with context manager to open a file, you don't need to close the file yourself. The context manager is supposed to do it:
with open("example.txt", "w") as f:
f.write("Hello")
I am parsing images from a webpage into a specific folder. Everything goes very well and a huge part of the images are parsed into the desired folder, but before the process ends it gives this error:
IOError: [Errno 2] No such file or directory: u'C:\\Users\\pro\\Downloads\\AAA\\photos\\'
The code is something like this:
import os

save_path = raw_input("give save path. like '/home/user/dalbums'")
album = raw_input("the name of album: ")
completeName = os.path.join(save_path, album)

class X:
    def saver(self, info):
        path_name = os.path.join(completeName, 'photos')
        if not os.path.exists(path_name):
            os.makedirs(path_name)
        with open(os.path.join(path_name, info), 'a') as f:
            for i in lo:
                f.write(lo)
If I keep only this part, the error goes away, but then the images go to the wrong place:
with open(info, 'a') as f:
    for i in lo:
        f.write(lo)
When I try to use the URL https://www.google.com I get this error for the same code:
InvalidSchema: No connection adapters were found for 'javascript:void(0)'
The code you show in your question is different from the code that actually causes the error.
This is the relevant code:
with open(os.path.join(imgs_folder, My_imgs.strip()), 'wb') as f:
Since My_imgs.strip() returns an empty string, your file name is an empty string. Therefore, after joining that empty string onto a directory name, you end up trying to write to the directory itself.
Here is where you create My_imgs:
My_imgs = data_fetched.split('/')[-1].split("?")[0]
For debugging you could do:
if not My_imgs.strip():
    print('data_fetched:', data_fetched)
to see what data_fetched actually is.
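As a rough guard against both failure modes (the empty file name and the 'javascript:void(0)' link), you could filter each URL before downloading it. safe_image_name below is a hypothetical helper, not part of your code:

from urlparse import urlparse   # on Python 3: from urllib.parse import urlparse

def safe_image_name(url):
    """Return a usable file name derived from the URL, or None if it should be skipped."""
    if urlparse(url).scheme not in ('http', 'https'):
        return None                      # skips links such as 'javascript:void(0)'
    name = url.split('/')[-1].split('?')[0].strip()
    return name or None                  # skips URLs that yield an empty file name

Only open and write a file when the helper returns a name; otherwise move on to the next URL.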
I'm trying to handle exceptions when reading files, but I have a problem. I'm new to Python, and I don't yet know how to catch an exception and still continue reading text from the remaining files I am accessing. This is my code:
import errno
import sys

class Read:
    # FIXME: make these two constants immutable
    ROUTE = "d:\\Profiles\\user\\Desktop\\"
    EXT = ".txt"

    def setFileReaded(self, fileToRead):
        content = ""
        try:
            infile = open(self.ROUTE + fileToRead + self.EXT)
        except FileNotFoundError as error:
            if error.errno == errno.ENOENT:
                print("File not found, please check the name and try again")
            else:
                raise
            sys.exit()
        with infile:
            content = infile.read()
            infile.close()
        return content
And from another class I call it like this:
read = Read()
print(read.setFileReaded("verbs"))
print(read.setFileReaded("object"))
print(read.setFileReaded("sites"))
print(read.setFileReaded("texts"))
But it only prints this:
turn on
connect
plug
File not found, please check the name and try again
And it does not continue with the next files. How can I make the program keep reading all the files?
It's a little difficult to understand exactly what you're asking here, but I'll try and provide some pointers.
sys.exit() will terminate the Python script gracefully. In your code, this is called when the FileNotFoundError exception is caught. Nothing further will be run after this, because your script will terminate, so none of the other files will be read.
Another thing to point out is that you close the file after reading it, which is not needed when you open it like this:
with open('myfile.txt') as f:
    content = f.read()
The file will be closed automatically after the with block.
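Here is a minimal sketch of how setFileReaded could be restructured so that a missing file is reported but execution continues with the next call. Names are kept from your question; returning an empty string on failure is an assumption about what the callers can tolerate:

class Read:
    ROUTE = "d:\\Profiles\\user\\Desktop\\"
    EXT = ".txt"

    def setFileReaded(self, fileToRead):
        try:
            with open(self.ROUTE + fileToRead + self.EXT) as infile:
                return infile.read()
        except FileNotFoundError:
            # Report the problem and keep going instead of exiting the script.
            print("File not found, please check the name and try again:", fileToRead)
            return ""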
I have a script that regularly reads a text file on a server and overwrites a local copy of it. The problem is that the process adds extra carriage returns, plus an extra invisible character after the last character. How do I make an identical copy of the server file?
I use the following to read the file:
import urllib

for link in links:
    try:
        f = urllib.urlopen(link)
        myfile = f.read()
    except IOError:
        pass
and to write it to the local file
f = open("C:\\localfile.txt", "w")
try:
f.write(myfile)
except NameError:
pass
finally:
f.close()
This is how the file looks on the server: http://i.imgur.com/rAnUqmJ.jpg
And this is how the file looks locally (there is also an additional invisible character after the last 75): http://i.imgur.com/xfs3E8D.jpg
I have seen quite a few similar questions, but I'm not sure how to make urllib read the file in binary.
Any solution, please?
If you want to copy a remote file denoted by a URL to a local file, I would use urllib.urlretrieve:
import urllib
urllib.urlretrieve("http://anysite.co/foo.gz", "foo.gz")
I think urllib is reading binary.
Try changing
f = open("C:\\localfile.txt", "w")
to
f = open("C:\\localfile.txt", "wb")
Based on the script here: .doc to pdf using python, I've got a semi-working script to export .docx files to PDF from C:\Export_to_pdf into a new folder.
The problem is that it gets through the first couple of documents and then fails with:
(-2147352567, 'Exception occurred.', (0, u'Microsoft Word', u'Command failed', u'wdmain11.chm', 36966, -2146824090), None)
This, apparently, is an unhelpfully general error message. If I debug it slowly using pdb, I can loop through all the files and export successfully. If I also keep an eye on the processes in Windows Task Manager, I can see that WINWORD starts and ends when it is supposed to, but on the larger files it takes longer for the memory usage to stabilise. This makes me think the script trips up when WINWORD doesn't have time to initialize or quit before the next method is called on the client.Dispatch object.
Is there a way with win32com or comtypes to identify and wait for a process to start or finish?
My script:
import os
from win32com import client

folder = "C:\\Export_to_pdf"
file_type = 'docx'
out_folder = folder + "\\PDF"

os.chdir(folder)

if not os.path.exists(out_folder):
    print 'Creating output folder...'
    os.makedirs(out_folder)
    print out_folder, 'created.'
else:
    print out_folder, 'already exists.\n'

for files in os.listdir("."):
    if files.endswith(".docx"):
        print files
print '\n\n'

try:
    for files in os.listdir("."):
        if files.endswith(".docx"):
            out_name = files.replace(file_type, r"pdf")
            in_file = os.path.abspath(folder + "\\" + files)
            out_file = os.path.abspath(out_folder + "\\" + out_name)
            word = client.Dispatch("Word.Application")
            doc = word.Documents.Open(in_file)
            print 'Exporting', out_file
            doc.SaveAs(out_file, FileFormat=17)
            doc.Close()
            word.Quit()
except Exception, e:
    print e
The working code: I just replaced the try block with this. Note that I moved the DispatchEx statement outside the for loop and moved word.Quit() into a finally clause to ensure Word closes.
try:
    word = client.DispatchEx("Word.Application")
    for files in os.listdir("."):
        if files.endswith(".docx") or files.endswith('doc'):
            out_name = files.replace(file_type, r"pdf")
            in_file = os.path.abspath(folder + "\\" + files)
            out_file = os.path.abspath(out_folder + "\\" + out_name)
            doc = word.Documents.Open(in_file)
            print 'Exporting', out_file
            doc.SaveAs(out_file, FileFormat=17)
            doc.Close()
except Exception, e:
    print e
finally:
    word.Quit()
This might not be the problem, but dispatching a separate Word instance and then closing it within each iteration is not necessary and may be the cause of the strange memory behaviour you are seeing. You only need to open the instance once; within that instance you can open and close all the documents you need, like the following:
try:
    word = client.DispatchEx("Word.Application")  # Using DispatchEx for an entirely new Word instance
    word.Visible = True  # Added so you can see what I'm talking about with the movement of the dispatch and Quit lines.
    for files in os.listdir("."):
        if files.endswith(".docx"):
            out_name = files.replace(file_type, r"pdf")
            in_file = os.path.abspath(folder + "\\" + files)
            out_file = os.path.abspath(out_folder + "\\" + out_name)
            doc = word.Documents.Open(in_file)
            print 'Exporting', out_file
            doc.SaveAs(out_file, FileFormat=17)
            doc.Close()
    word.Quit()
except Exception, e:
    print e
Note: Be careful when using try/except around win32com instances and files: if you open them and an error occurs before you close them, they won't be closed (since execution never reaches that command).
Also, you may want to consider using DispatchEx instead of just Dispatch. DispatchEx opens a new instance (an entirely new .exe), whereas I believe plain Dispatch will try to find an open instance to latch onto, although the documentation on this is foggy. Use DispatchEx if you in fact want more than one instance (i.e. open one file in one instance and another file in another).
As for waiting, the program should simply wait on that line when more time is needed, but I'm not certain.
Oh! Also, you can use word.Visible = True if you want to be able to see the instance and the files actually open (it might be useful for seeing the problem visually, but turn it off once it's fixed because it will definitely slow things down ;-) ).