Python: Unicode characters in file or folder names - python

We process a lot of files where path can contain an extended character set like this:
F:\Site Section\Cieślik
My Python scripts fail to open such files or chdir to such folders whatever I try.
Here is an extract from my code:
import zipfile36 as zipfile
import os
from pathlib import Path
outfile = open("F:/zip_pdf3.log", "w", encoding="utf-8")
with open('F:/zip_pdf.txt') as f: # Input file list - note the forward slashes!
for line in f:
print (line)
path, filename = os.path.split(line)
file_no_ext = os.path.splitext(os.path.basename(line))[0]
try:
os.chdir(path) # Go to the file path
except Exception as exception:
print (exception, file = outfile) #3.7
print (exception)
continue
I tried the following:
Converting path to a raw string
raw_string = r"{}".format(path)
try:
os.chdir(raw_string)
Converting a string to Path
Ppath = Path(path)
try:
os.chdir(Ppath.decode("utf8"))
Out of ideas... Anyone knows how to work with Unicode file and folder names? Using Python 3.7 or higher on Windows.
Could be as simple as that - thanks #SergeBallesta:
with open('F:/pdf_err.txt', encoding="utf-8") as f:
I may post updates after more runs with different input.
This, however, leads to a slightly different question: if, instead of reading from the file, I walk over folders and files with extended character set - how do I deal with those, i.e.
for subdir, dirs, files in os.walk(rootdir): ?
At present I'm getting either a "The filename, directory name, or volume label syntax is incorrect" or "Can't open the file".

Related

using "With open" in python to write to another directory

I want to do the following:
1) Ask the user for input for a file path they wish a directory listing for.
2) Take this file path and enter the results, in a list, in a text file in the directory they input NOT the current directory.
I am very nearly there but the last step is that I can't seem to save the file to the directory the user has input only the current directory. I have set out the current code below (which works for the current directory). I have tried various variations to try and save it to the directory input by the user but to no avail - any help would be much appreciated.
CODE BELOW
import os
filenames = os.path.join(input('Please enter your file path: '))
with open ("files.txt", "w") as a:
for path, subdirs, files in os.walk(str(filenames)):
for filename in files:
f = os.path.join(path, filename)
a.write(str(f) + os.linesep)
I came across this link https://cmdlinetips.com/2012/09/three-ways-to-write-text-to-a-file-in-python/. I think your issue has something to do with you needing to provide the full path name and or the way you are using the close() method.
with open(out_filename, 'w') as out_file:
..
..
.. parsed_line
out_file.write(parsed_line)
You have to alter the with open ("files.txt", "w") as a: statement to not only include the filename, but also the path. This is where you should use os.path.join(). Id could be handy to first check the user input for existence with os.path.exists(filepath).
os.path.join(input(...)) does not really make sense for the input, since it returns a single str, so there are no separate things to be joined.
import os
filepath = input('Please enter your file path: ')
if os.path.exists(filepath):
with open (os.path.join(filepath, "files.txt"), "w") as a:
for path, subdirs, files in os.walk(filepath):
for filename in files:
f = os.path.join(path, filename)
a.write(f + os.linesep)
Notice that your file listing will always include a files.txt-entry, since the file is created before os.walk() gets the file list.
As ShadowRanger kindly points out, this LBYL (look before you leap) approach is unsafe, since the existence check could pass, although the file system is changed later while the process is running, leading to an exception.
The mentioned EAFP (it's easier to ask for forgiveness than permission) approach would use a try... except block to handle all errors.
This approach could look like this:
import os
filepath = input('Please enter your file path: ')
try:
with open (os.path.join(filepath, "files.txt"), "w") as a:
for path, subdirs, files in os.walk(filepath):
for filename in files:
f = os.path.join(path, filename)
a.write(f + os.linesep)
except:
print("Could not generate directory listing file.")
You should further refine it by catching specific exceptions. The more code is in the try block, the more errors unrelated to the directory reading and file writing are also caught and suppressed.
Move to the selected directory then do things.
Extra tip: In python 2 use raw_input to avoid special chars error like : or \ ( just use input in python 3 )
import os
filenames = raw_input('Please enter your file path: ')
if not os.path.exists(filenames):
print 'BAD PATH'
return
os.chdir(filenames)
with open ("files.txt", "w") as a:
for path, subdirs, files in os.walk('.'):
for filename in files:
f = os.path.join(path, filename)
a.write(str(f) + os.linesep)

gzip multiple files in python

I have to compress a lot of XML files into and split them by the data in the file name, just for clarification's sake, there is a parser which collects information from XML file and then moves it to a backup folder. My code needs to gzip it according to the date in the filename and group those files in a compressed .gz file.
Please find the code bellow:
import os
import re
import gzip
import shutil
import sys
import time
#
timestr = time.strftime("%Y%m%d%H%M")
logfile = 'D:\\Coleta\\log_compactador_xml_tar'+timestr+'.log'
ptm_dir = "D:\\PTM\\monitored_programs\\"
count_files_mdc = 0
count_files_3gpp = 0
count_tar = 0
#
for subdir, dir, files in os.walk(ptm_dir):
for file in files:
path = os.path.join(subdir, file)
try:
backup_files_dir = path.split(sep='\\')[4]
parser_id = path.split(sep='\\')[3]
if re.match('backup_files_*', backup_files_dir):
if file.endswith('xml'):
# print(time.strftime("%Y-%m-%d %H:%M:%S"), path)
data_arq = file[1:14]
if parser_id in ('parser-924'):
gzip_filename_mdc = os.path.join(subdir,'E4G_PM_MDC_IP51_'+timestr+'_'+data_arq)
with open(path, 'r')as f_in, gzip.open(gzip_filename_mdc + ".gz", 'at') as f_out_mdc:
shutil.copyfileobj(f_in, f_out_mdc)
count_files_mdc += 1
f_out_mdc.close()
f_in.close()
print(time.strftime("%Y-%m-%d %H:%M:%S"), "Compressing file MDC: ",path)
os.remove(path)
except PermissionError:
print(time.strftime("%Y-%m-%d %H:%M:%S"), "Permission error on file:", fullpath, file=logfile)
pass
except IndexError:
print(time.strftime("%Y-%m-%d %H:%M:%S"), "IndexError: ", path, file=logfile)
pass
As long as I seem it creates a stream of data, then compress and write it to a new file with the specified filename. However, instead of grouping each XML file independently inside a ".gz" file, it does creates inside the "gzip" file, a big file (big stream of data?) with the same name of the output "gzip" file, but without any extension. After the files are totally compressed, it's not possible to uncompress the big file generated inside the "gzip" output file. Does someone know where is the problem with my code?
PS: I have edited the code for readability purposes.
Not sure whether the solution is still needed, but I will just leave it here for anyone who faces the same issue.
There is a way to create a gzip archive in python using tarfile, the code is quite simple:
with tarfile.open(filename, mode="w:gz") as archive:
archive.add(name=name_of_file_to_add, recursive=True)
in this case name_of_file_to_add can be a directory, in which case tarfile will add it recursively with all its contents. Obviously you will need to import the tarfile module.
If you need to add files without a directory a simple for with calls to add will do (recursive flag is not required in this case).

Using terminal command to search through files for a specific string within a python script

I have a parent directory, and I'd like to go through that directory and grab each file with a specific string for editing in python. I have been using grep -r 'string' filepath in terminal, but I want to be able to do everything using python. I'm hoping to get all the files into an array and go through each of them to edit them.
Is there a way to do this by only running a python script?
changing current folder to parent
import os
os.chdir("..")
changing folder
import os
os.chdir(dir_of_your_choice)
finding files with a rule in the current folder
import glob
import os
current_dir = os.getcwd()
for f in glob.glob('*string*'):
do_things(f)
import os
#sourceFolder is the folder you're going to be looking inside for backslashes are a special character in python so they're escaped as double backslashes
sourceFolder = "C:\\FolderBeingSearched\\"
myFiles = []
# Find all files in the directory
for file in os.listdir(sourceFolder):
myFiles.append(file)
#open them for editing
for file in myFiles:
try:
open(sourceFolder + file,'r')
except:
continue
#run whatever code you need to do on each open file here
print("opened %s" % file)
EDIT: If you want to separate all files that contain a string (this just prints the list at the end currently):
import os
#sourceFolder is the folder you're going to be looking inside for backslashes are a special character in python so they're escaped as double backslashes
sourceFolder = "C:\\FolderBeingSearched\\"
myFiles = []
filesContainString = []
stringsImLookingFor = ['first','second']
# Find all files in the directory
for file in os.listdir(sourceFolder):
myFiles.append(file)
#open them for editing
for file in myFiles:
looking = sourceFolder + file
try:
open(looking,'r')
except:
continue
print("opened %s" % file)
found = 0
with open(looking,encoding="latin1") as in_file:
for line in in_file:
for x in stringsImLookingFor:
if line.find(x) != -1:
#do whatever you need to do to the file or add it to a list like I am
filesContainString.append(file)
found = 1
break
if found:
break
print(filesContainString)

Concatenate last 100 files in an only one

Begginer in Python needs a bit of help. I am using Python 2.7.
I want to make a program that concatenates the last 100 files I have in a folder. In that folder I have lots of files but I only want the concatenation of the last 100 ones. I am able to do the concatenation of all of them (if I don´t specify number and change the for loop), but I am not able to select the last 100 files. These files are saved in binary by the software.They are saved in the folder specified below. I would like to remove that 100 files once are concatenated in teh new one.The program I have done is the following:
#!/usr/bin/python
import os
import glob
os.chdir("C:\AFM_test\jpk_files")
rout=""
filename=glob.glob("*-*-*.*.*-*.*.*.jpk-force")
filename.sort(key=os.path.getmtime)
for filename in range(0,99):
filename=open(filename,"rb")
tout=filename.read()+\r\n"
rout = rout+tout
os.remove(filename)
filename.close()
fout = open("output.jpk-force","wb+")
fout.write(rout)
fout.close()
It doesn´t do anything and the error is the following:
Traceback (most recent call last):
File "C:\AFM_test\jpk_files\AFM_test.py", line 12, in <module>
filename = open(filename,"rb")
TypeError: coercing to Unicode: need string or buffer, int found
[Finished in 0.1s]
I guess the problem is the loop and its structure "range(0,99)",as when I have concatenated all the files contained in the folder:
#!/usr/bin/python
import os
import glob
os.chdir("C:\AFM_test\jpk_files")
rout=""
filename=glob.glob("*-*-*.*.*-*.*.*.jpk-force")
for filename in files:
filename=open(filename,"rb")
tout=filename.read()+\r\n"
rout = rout+tout
os.remove(filename)
filename.close()
fout = open("output.jpk-force","wb+")
fout.write(rout)
fout.close()
it worked okay except the remove order, which showed this error:
Traceback (most recent call last):
File "C:\try\AFM_test_2.py", line 17, in <module>
os.remove(filename)
must be string, not file
Any ideas how can I achieve my goal?
I hope I have explained myself properly. Maybe I have missed something important, sorry, I am just a beginner in this field.
Thank you.
TypeError: coercing to Unicode: need string or buffer, int found
That is because filename is an integer and then you are trying to concatenate it with a string.
os.remove(filename)
must be string, not file
That is because you are re-assigning the variable filename (which was a string path) to a file handle/object. os.remove(..) expects the variable from the for-loop, not the result of open(..). Its generally a good practice to give meaningful names to variables – filepath and infile etc.
A better approach would be:
def processFile(filepath):
with open(filepath) as f:
content = f.read()
os.remove(filepath)
return content
def main():
paths = glob.glob("..*..*..")
last100paths = paths[-100:]
with open(outFilePath, "w") as f:
f.write("\r\n".join(processFile(path) for path in last100paths))
You need to change:
filename=open(filename,"rb")
...to something like:
inf = open(filename, "rb")
...
inf.close()
Then, when you're calling os.remove(filename), it will still be the filename from the original loop, not a file object that your code is reassigning to this variable.
Note: rather than doing this explicit opening and closing of files, though, try using the with statement (see this helpful guide).
Checn if glob is matching patterns
pattern = r"*-*-*.*.*-*.*.*.jpk-force"
filenames=glob.glob(pattern)
if not filenames:
print 'no files matched ', pattern
sys.exit(1)
Get mtime sorted file list by building list of tuples each containing file name and mtime
filenames = [ (filename,os.stat(filename)[8]) for filename in filenames ]
sort the list with mtime in descending order
filenames.sort(key=lambda x:x[1],reverse=True)
The above two lines can be simplified as;
filenames = [ filename for filename in sorted(filenames,key=os.path.getmtime,reverse=True) ]
The above line can be refactored, because we can sort in place
filenames.sort(key=os.path.getmtime,reverse=True)
#!/usr/bin/python
import os
import glob
os.chdir("C:\AFM_test\jpk_files")
rout=""
pattern = r"*-*-*.*.*-*.*.*.jpk-force"
filenames=glob.glob(pattern)
if not filenames:
print 'no files matched ', pattern
sys.exit(1)
filenames.sort(key=os.path.getmtime,reverse=True)
for filename in filenames[:100]
filecontent=open(filename,"rb")
tout=filecontent.read()+"\r\n"
filecontent.close()
rout = rout+tout
os.remove(filename)
fout = open("output.jpk-force","wb+")
fout.write(rout)
fout.close()
You didn't check for exceptions.

How to extract a file within a folder within a zip?

I need to extract a file called Preview.pdf from a folder called QuickLooks inside of a zip file.
Right now my code looks a little like this:
with ZipFile(newName, 'r') as newName:
newName.extract(\QuickLooks\Preview.pdf)
newName.close()
(In this case, newName has been set equal to the full path to the zip).
It's important to note that the backslash is correct in this case because I'm on Windows.
The code doesn't work; here's the error it gives:
Traceback (most recent call last):
File "C:\Users\Asit\Documents\Evam\Python_Scripts\pageszip.py", line 18, in <module>
ZF.extract("""QuickLooks\Preview.pdf""")
File "C:\Python33\lib\zipfile.py", line 1019, in extract
member = self.getinfo(member)
File "C:\Python33\lib\zipfile.py", line 905, in getinfo
'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'QuickLook/Preview.pdf' in the archive"
I'm running the Python script from inside Notepad++, and taking the output from its console.
How can I accomplish this?
Alternatively, how could I extract the whole QuickLooks folder, move out Preview.pdf, and then delete the folder and the rest of it's contents?
Just for context, here's the rest of the script. It's a script to get a PDF of a .pages file. I know there are bonified converters out there; I'm just doing this as an excercise with some sort of real-world application.
import os.path
import zipfile
from zipfile import *
import sys
file = raw_input('Enter the full path to the .pages file in question. Please note that file and directory names cannot contain any spaces.')
dir = os.path.abspath(os.path.join(file, os.pardir))
fileName, fileExtension = os.path.splitext(file)
if fileExtension == ".pages":
os.chdir(dir)
print (dir)
fileExtension = ".zip"
os.rename (file, fileName + ".zip")
newName = fileName + ".zip" #for debugging purposes
print (newName) #for debugging purposes
with ZipFile(newName, 'w') as ZF:
print("I'm about to list names!")
print(ZF.namelist()) #for debugging purposes
ZF.extract("QuickLook/Preview.pdf")
os.rename('Preview.pdf', fileName + '.pdf')
finalPDF = fileName + ".pdf"
print ("Check out the PDF! It's located at" + dir + finalPDF + ".")
else:
print ("Sorry, this is not a valid .pages file.")
sys.exit
I'm not sure if the import of Zipfile is redundant; I read on another SO post that it was better to use from zipfile import * than import zipfile. I wasn't sure, so I used both. =)
EDIT: I've changed the code to reflect the changes suggested by Blckknght.
Here's something that seems to work. There were several issues with your code. As I mentioned in a comment, the zipfile must be opened with mode 'r' in order to read it. Another is that zip archive member names always use forward slash / characters in their path names as separators (see section 4.4.17.1 of the PKZIP Application Note). It's important to be aware that there's no way to extract a nested archive member to a different subdirectory with Python's currentzipfilemodule. You can control the root directory, but nothing below it (i.e. any subfolders within the zip).
Lastly, since it's not necessary to rename the .pages file to .zip — the filename you passZipFile() can have any extension — I removed all that from the code. However, to overcome the limitation on extracting members to a different subdirectory, I had to add code to first extract the target member to a temporary directory, and then copy that to the final destination. Afterwards, of course, this temporary folder needs to deleted. So I'm not sure the net result is much simpler...
import os.path
import shutil
import sys
import tempfile
from zipfile import ZipFile
PREVIEW_PATH = 'QuickLooks/Preview.pdf' # archive member path
pages_file = input('Enter the path to the .pages file in question: ')
#pages_file = r'C:\Stack Overflow\extract_test.pages' # hardcode for testing
pages_file = os.path.abspath(pages_file)
filename, file_extension = os.path.splitext(pages_file)
if file_extension == ".pages":
tempdir = tempfile.gettempdir()
temp_filename = os.path.join(tempdir, PREVIEW_PATH)
with ZipFile(pages_file, 'r') as zipfile:
zipfile.extract(PREVIEW_PATH, tempdir)
if not os.path.isfile(temp_filename): # extract failure?
sys.exit('unable to extract {} from {}'.format(PREVIEW_PATH, pages_file))
final_PDF = filename + '.pdf'
shutil.copy2(temp_filename, final_PDF) # copy and rename extracted file
# delete the temporary subdirectory created (along with pdf file in it)
shutil.rmtree(os.path.join(tempdir, os.path.split(PREVIEW_PATH)[0]))
print('Check out the PDF! It\'s located at "{}".'.format(final_PDF))
#view_file(final_PDF) # see Bonus below
else:
sys.exit('Sorry, that isn\'t a .pages file.')
Bonus: If you'd like to actually view the final pdf file from the script, you can add the following function and use it on the final pdf created (assuming you have a PDF viewer application installed on your system):
import subprocess
def view_file(filepath):
subprocess.Popen(filepath, shell=True).wait()

Categories