List files with spaces - python

I have the following code, which pulls the MD5 of each file in a directory. The problem is with files that have spaces in their names. What would be the solution so the program takes those files into account and doesn't skip them?
import hashlib
import logging
import os

def onepath(archivo):
    logging.basicConfig(filename=salida, filemode="w", format='%(message)s', level=logging.DEBUG)
    for filename in (file for file in os.listdir(archivo)):
        with open(filename) as checkfile:
            logging.info("MD5 " + "(%s) = " % filename + hashlib.md5(checkfile.read()).hexdigest())
I was reading about the shlex module, but I'm not sure how to implement it. Can you help me?
Update: the files with spaces do show up in the listing. I wrote a short snippet, but now I'm facing a problem: I can't control how Linux interprets the spaces in the filenames when I do the following:
import os
import subprocess
import sys

files_destino = [f for f in os.listdir(os.path.join(sys.argv[1].strip()))]
for i in files_destino:
    print i
    subprocess.call(["cp", "-v", "%s" % i, "/tmp/"])
The shell shows:
bash-3.2$ ./comodin.py ./espacio/
Boxx view.pdf
cp: Boxx view.pdf: No such file or directory
hola.txt
hola.txt -> /tmp/hola.txt
bash-3.2$
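The `cp` failure above isn't actually about spaces: `os.listdir()` returns bare filenames, while the files live under the directory passed in `argv[1]`, so `cp` can't find them from the current working directory. A minimal sketch of a fix (the function name `copy_all` is my own), joining the directory back onto each name:

```python
import os
import subprocess

def copy_all(directory, dest):
    """Copy every file in `directory` to `dest`, spaces and all."""
    for name in os.listdir(directory):
        # os.listdir() gives bare names; join the directory back on so
        # cp receives a path that actually exists.
        full_path = os.path.join(directory, name)
        # Passing the arguments as a list means no shell is involved,
        # so spaces in filenames need no quoting at all.
        subprocess.call(["cp", "-v", full_path, dest])
```

This assumes a Unix-like system with `cp` on the PATH; `shutil.copy(full_path, dest)` would be the portable, pure-Python alternative.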

Build the full path with os.path.join before opening each file:

def onepath(archivo):
    logging.basicConfig(filename=salida, filemode="w", format='%(message)s', level=logging.DEBUG)
    for filename in os.listdir(archivo):
        filepath = os.path.join(archivo, filename)
        with open(filepath) as checkfile:
            logging.info("MD5 " + "(%s) = " % filename + hashlib.md5(checkfile.read()).hexdigest())

Related

Python: Check if files exist and copy only the missing files

I'm a beginner in Python, trying to write a script that does the following:
Check whether a set of files exists in destFile
If they all exist, exit the script (don't do anything)
If some files are missing, copy only the missing files from srcFile to destFile
The script I have made works, but I'd like your help making it copy only the missing file(s). Right now it copies everything from the first file (test1.txt) up to the missing one. For example, if test4.txt and test5.txt are missing in destFile, my script copies test1.txt through test5.txt instead of only the two missing files.
import os, shutil
from datetime import datetime

count = 0
error = "ERROR! file is missing! (files have been copied)"
sttime = datetime.now().strftime('%d/%m/%Y - %H:%M:%S - ')
os.chdir("C:\log")
log = "log.txt"
srcFile = [r"C:\srcFile\test1.txt",
           r"C:\srcFile\test2.txt",
           r"C:\srcFile\test3.txt",
           r"C:\srcFile\test4.txt",
           r"C:\srcFile\test5.txt"]
destFile = [r"C:\destFile\test1.txt",
            r"C:\destFile\test2.txt",
            r"C:\destFile\test3.txt",
            r"C:\destFile\test4.txt",
            r"C:\destFile\test5.txt"]
for file in destFile:
    if not os.path.exists(file):
        for file_sr in srcFile:
            if not os.path.exists(file):
                shutil.copy(file_sr, 'C:\destFile')
                count += 1
                with open(log, 'a') as logfile:
                    logfile.write(sttime + error + " " + str(count) + " => " + file + '\n')
The problem is that you're iterating over all of the source files whenever you detect a missing destination file: for file_sr in srcFile:. Instead, you can copy just the missing file by keeping track of the position (in the array) of the missing destination file:
for position, file in enumerate(destFile):
    if not os.path.exists(file):
        file_sr = srcFile[position]
        shutil.copy(file_sr, 'C:\destFile')
Using your code, you can do:
import os, shutil
from datetime import datetime

count = 0
error = "ERROR! file is missing! (files have been copied)"
sttime = datetime.now().strftime('%d/%m/%Y - %H:%M:%S - ')
os.chdir("C:\log")
log = "log.txt"
srcFile = [r"C:\srcFile\test1.txt",
           r"C:\srcFile\test2.txt",
           r"C:\srcFile\test3.txt",
           r"C:\srcFile\test4.txt",
           r"C:\srcFile\test5.txt"]
destFile = [r"C:\destFile\test1.txt",
            r"C:\destFile\test2.txt",
            r"C:\destFile\test3.txt",
            r"C:\destFile\test4.txt",
            r"C:\destFile\test5.txt"]
for file in destFile:
    if not os.path.exists(file):
        # replace on the single missing path, not on the destFile list
        src_file = file.replace("destFile", "srcFile")
        shutil.copy(src_file, file)
        count += 1
        with open(log, 'a') as logfile:
            logfile.write(sttime + error + " " + str(count) + " => " + file + '\n')
Thank you for your help, guys. My problem was exactly that I was iterating over all of the source files whenever I detected a missing destination file. The following logic from mackorone does what I was looking for.
for position, file in enumerate(destFile):
    if not os.path.exists(file):
        file_sr = srcFile[position]
        shutil.copy(file_sr, 'C:\destFile')
I have updated the script, so it now compares two folders, a source folder and a destination folder. If the destination folder is missing files that exist in the source folder, they are copied over. The script is working fine.
import os
import shutil
from datetime import datetime

sttime = datetime.now().strftime('%d/%m/%Y - %H:%M:%S - ')
error = "ERROR! file is missing! (files have been copied)"
des_path = r'C:\des_folder'
sr_path = r'C:\sr_folder'
des_folder = os.listdir(des_path)
sr_folder = os.listdir(sr_path)
count = 0
os.chdir(r"C:\log")
log = "log.txt"

def compare_folder(folder1, folder2):
    # set difference: names present in the source listing but not the destination
    return set(folder1) - set(folder2)

files_missing = compare_folder(sr_folder, des_folder)
for file in files_missing:
    full_path_files = os.path.join(sr_path, file)
    shutil.copy(full_path_files, des_path)
    count += 1
    with open(log, 'a') as logfile:
        logfile.write(sttime + error + " " + str(count) + " => " + file + '\n')
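The hand-rolled set comparison can also be done with the standard library's filecmp module, which compares two directory listings directly. A minimal sketch (the function name `copy_missing` is my own):

```python
import filecmp
import os
import shutil

def copy_missing(src_dir, dst_dir):
    """Copy files that exist in src_dir but are absent from dst_dir."""
    comparison = filecmp.dircmp(src_dir, dst_dir)
    # left_only lists names present only on the left (source) side
    for name in comparison.left_only:
        shutil.copy(os.path.join(src_dir, name), dst_dir)
    return comparison.left_only
```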

Python convert files in directory one by one

How do I convert all the files in a directory one by one with the code below?
This code takes all the files in a folder and converts them together, but that uses too much memory. I need to do it in a loop, for each file separately.
i.e. Find file. Convert. Move. Repeat.
import os
import shutil
import glob

command = ('convert -compress LZW -alpha off -density 320 -depth 4 '
           '-contrast-stretch 700x0 -gamma .45455 *.pdf -set filename:base '
           '"%[basename]" +adjoin "%[filename:base].tiff"')
newpath = r'...'
new_dir = 'tiff'
if not os.path.exists(newpath):
    try:
        os.mkdir(new_dir)
        os.system(command)
    except:
        print "The folder already exists"
for file in glob.glob("*.tiff"):
    try:
        print('"' + file + '"' + ' has just moved to ' + '"' + new_dir + '"' + ' folder')
        shutil.move(file, new_dir)
    except:
        print "Error"
using rename?
import os

os.mkdir("new_folder")
for file in ['file1.txt', 'file2.txt']:
    os.rename(file, f'new_folder/{file}')
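Since the question was about converting one file at a time, here is a hedged sketch that invokes ImageMagick once per PDF instead of once for the whole batch (it assumes `convert` is on the PATH; the flags are copied from the question, and the function name `convert_one_by_one` is my own):

```python
import glob
import os
import shutil
import subprocess

def convert_one_by_one(out_dir="tiff"):
    """Convert each PDF in the current directory to TIFF, one at a time."""
    os.makedirs(out_dir, exist_ok=True)
    for pdf in glob.glob("*.pdf"):
        tiff = os.path.splitext(pdf)[0] + ".tiff"
        # One subprocess per file keeps ImageMagick's memory use bounded.
        subprocess.call(["convert", "-compress", "LZW", "-alpha", "off",
                         "-density", "320", "-depth", "4",
                         "-contrast-stretch", "700x0", "-gamma", ".45455",
                         pdf, tiff])
        if os.path.exists(tiff):
            shutil.move(tiff, os.path.join(out_dir, tiff))
```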

Finding Files in File Tree Based on List of Extensions

I'm working on a small python 3 utility to build a zip file based on a list of file extensions. I have a text file of extensions and I'm passing a folder into the script:
import glob
import os
import sys
import zipfile

working_folder = sys.argv[1]
zip_name = sys.argv[2]

# Open the extension file
extensions = []
with open('CxExt.txt') as fp:
    lines = fp.readlines()
    for line in lines:
        extensions.append(line)

# Now get the files in the directory. If they have the right extension, add them to the list.
files = os.listdir(working_folder)
files_to_zip = []
for ext in extensions:
    results = glob.glob(working_folder + '**/' + ext, recursive=True)
    print(str(len(results)) + " results for " + working_folder + '**/*' + ext)
    #search = "*"+ext
    #results = [y for x in os.walk(working_folder) for y in glob(os.path.join(x[0], search))]
    #results = list(Path(".").rglob(search))
    for result in results:
        files_to_zip.append(result)
if len(files_to_zip) == 0:
    print("No Files Found")
    sys.exit()
for f in files:
    print("Checking: " + f)
    filename, file_extension = os.path.splitext(f)
    print(file_extension)
    if file_extension in extensions:
        print(f)
        files_to_zip.append(file)
ZipFile = zipfile.ZipFile(zip_name, "w")
for z in files_to_zip:
    ZipFile.write(os.path.basename(z), compress_type=zipfile.ZIP_DEFLATED)
I've tried using glob, os.walk, and Path.rglob and I still can't get a list of files. There's got to be something just obvious that I'm missing. I built a test directory that has some directories, py files, and a few zip files. It returns 0 for all file types. What am I overlooking?
This is my first answer, so please don't expect it to be perfect.
I notice you're using file.readlines(). According to the Python docs here, file.readlines() returns a list of lines including the newline at the end. If your text file has the extensions separated by newlines, maybe try using file.read().split("\n") instead. Besides that, your code looks okay. Tell me if this fix doesn't work.
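The newline problem above can be demonstrated in a few lines: readlines() keeps the trailing '\n' on each entry, so a read extension like '.py\n' never matches the '.py' that os.path.splitext() produces. Stripping each line fixes the comparison (io.StringIO stands in for the real extensions file here):

```python
import io

# Simulate reading an extensions file whose lines end in newlines.
fake_file = io.StringIO(".py\n.zip\n")
raw = fake_file.readlines()
stripped = [line.strip() for line in raw]

print(raw)                # ['.py\n', '.zip\n']
print(stripped)           # ['.py', '.zip']
print('.py' in raw)       # False -- the newline spoils the match
print('.py' in stripped)  # True
```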

Get all files from my C drive - Python

Here is what I'm trying to do:
I would like to get a list of all files larger than 35 MB on my C drive.
Here is my code:
import os

def getAllFileFromDirectory(directory, temp):
    files = os.listdir(directory)
    for file in files:
        if (os.path.isdir(file)):
            getAllFileFromDirectory(file, temp)
        elif (os.path.isfile(file) and os.path.getsize(file) > 35000000):
            temp.write(os.path.abspath(file))

def getFilesOutOfTheLimit():
    basePath = "C:/"
    tempFile = open('temp.txt', 'w')
    getAllFileFromDirectory(basePath, tempFile)
    tempFile.close()
    print("Get all files ... Done !")
For some reason, the interpreter never enters the if-block inside getAllFileFromDirectory.
Can someone tell me what I'm doing wrong and why (learning is my aim)? How do I fix it?
Thanks a lot for your comments.
I fixed your code. Your problem was that os.path.isdir can only tell whether something is a directory if it receives the full path. So I changed the code to the following and it works. The same applies to os.path.getsize and os.path.isfile.
import os

def getAllFileFromDirectory(directory, temp):
    files = os.listdir(directory)
    for file in files:
        if (os.path.isdir(directory + file)):
            if file[0] == '.': continue  # i added this because i'm on a UNIX system
            print(directory + file)
            # keep the trailing separator so the next level concatenates correctly
            getAllFileFromDirectory(directory + file + '/', temp)
        elif (os.path.isfile(directory + file) and os.path.getsize(directory + file) > 35000000):
            temp.write(directory + file + '\n')

def getFilesOutOfTheLimit():
    basePath = "/"
    tempFile = open('temp.txt', 'w')
    getAllFileFromDirectory(basePath, tempFile)
    tempFile.close()
    print("Get all files ... Done !")

getFilesOutOfTheLimit()
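The standard library can also do the tree traversal for you: os.walk visits every subdirectory and hands back full directory paths, which sidesteps the relative-path pitfall entirely. A minimal sketch (the function name `big_files` and the error handling are my own):

```python
import os

def big_files(base_path, min_size=35_000_000):
    """Yield paths of files under base_path larger than min_size bytes."""
    for dirpath, dirnames, filenames in os.walk(base_path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if os.path.getsize(full) > min_size:
                    yield full
            except OSError:
                pass  # skip permission errors, broken symlinks, vanished files
```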

Script that reads PDF metadata and writes to CSV

I wrote a script to read PDF metadata to ease a task at work. The current working version is not very usable in the long run:
from pyPdf import PdfFileReader

BASEDIR = ''
PDFFiles = []

def extractor():
    output = open('windoutput.txt', 'r+')
    for file in PDFFiles:
        try:
            pdf_toread = PdfFileReader(open(BASEDIR + file, 'r'))
            pdf_info = pdf_toread.getDocumentInfo()
            #print str(pdf_info) #print full metadata if you want
            x = file + "~" + pdf_info['/Title'] + " ~ " + pdf_info['/Subject']
            print x
            output.write(x + '\n')
        except:
            x = file + '~' + ' ERROR: Data missing or corrupt'
            print x
            output.write(x + '\n')
            pass
    output.close()

if __name__ == "__main__":
    extractor()
Currently, as you can see, I have to manually input the working directory and manually populate the list of PDF files. It also just prints out the data in the terminal in a format that I can copy/paste/separate into a spreadsheet.
I'd like the script to work automatically in whichever directory I throw it in and populate a CSV file for easier use. So far:
from pyPdf import PdfFileReader
import csv
import os

def extractor():
    basedir = os.getcwd()
    extension = '.pdf'
    pdffiles = [filter(lambda x: x.endswith('.pdf'), os.listdir(basedir))]
    with open('pdfmetadata.csv', 'wb') as csvfile:
        for f in pdffiles:
            try:
                pdf_to_read = PdfFileReader(open(f, 'r'))
                pdf_info = pdf_to_read.getDocumentInfo()
                title = pdf_info['/Title']
                subject = pdf_info['/Subject']
                csvfile.writerow([file, title, subject])
                print 'Metadata for %s written successfully.' % (f)
            except:
                print 'ERROR reading file %s.' % (f)
                #output.writerow(x + '\n')
                pass

if __name__ == "__main__":
    extractor()
In its current state it just prints a single error message (the one from my exception handler, not an error raised by Python) and then stops. I've been staring at it for a while and I'm not sure where to go from here. Can anyone point me in the right direction?
writerow([file, title, subject]) should be writerow([f, title, subject])
You can use sys.exc_info() to print the details of your error
http://docs.python.org/2/library/sys.html#sys.exc_info
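A minimal sketch of inspecting the active exception with sys.exc_info() instead of swallowing it in a bare except:

```python
import sys

try:
    1 / 0
except:
    # sys.exc_info() returns (type, value, traceback) for the
    # exception currently being handled.
    exc_type, exc_value, _ = sys.exc_info()
    print(exc_type.__name__)  # ZeroDivisionError
    print(exc_value)          # division by zero
```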
Did you check the pdffiles variable contains what you think it does? I was getting a list inside a list... so maybe try:
for files in pdffiles:
    for f in files:
        # do stuff with f
I personally like glob. Notice I add * before the .pdf in the extension variable:
import os
import glob

basedir = os.getcwd()
extension = '*.pdf'
pdffiles = glob.glob(os.path.join(basedir, extension))
Figured it out. The script I used to download the files was saving the files with '\r\n' trailing after the file name, which I didn't notice until I actually ls'd the directory to see what was up. Thanks for everyone's help.
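Cleaning up filenames like the ones described above can be automated; a minimal sketch (the function name `strip_crlf_names` is my own) that renames any file whose name carries trailing carriage-return/newline characters:

```python
import os

def strip_crlf_names(directory):
    """Rename files whose names end in stray CR/LF characters."""
    for name in os.listdir(directory):
        clean = name.rstrip('\r\n')
        if clean != name:
            # Both '\r' and '\n' are legal in Unix filenames, which is
            # why the downloader could create them in the first place.
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, clean))
```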
