Python File System Reader Performance - python

I need to scan a file system for a list of files and log those that don't exist. Currently I have an input file listing the 13 million files that need to be investigated. The script has to run from a remote location, as I cannot run scripts directly on the storage server.
My current approach works, but is relatively slow. I'm still fairly new to Python, so I'm looking for tips on speeding things up.
import sys, os
from pz import padZero  # prepends 0's to string until desired length
output = open('./out.txt', 'w')
input = open('./in.txt', 'r')
rootPath = r'\\server\share'  # UNC path to storage
for ifid in input:
    ifid = padZero(str(ifid)[:-1], 8)  # extracts/formats fileName
    dir = padZero(str(ifid)[:-3], 5)  # extracts/formats the directory containing the file
    fPath = rootPath + '\\' + dir + '\\' + ifid + '.tif'
    try:
        size = os.path.getsize(fPath)  # don't actually need size, better approach?
    except OSError:
        output.write(ifid + '\n')
Thanks.

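import collections, glob, os
root = r'\\server\share'  # UNC path to storage, as in the question
dirs = collections.defaultdict(set)
for file_id in open('in.txt'):
    name = file_id.strip().rjust(8, "0")
    dirs[name[:-3]].add(name + ".tif")  # directory -> expected file names
for dir, files in dirs.items():
    found = set(os.path.basename(p) for p in glob.glob(os.path.join(root, dir, "*.tif")))
    for missing_file in files - found:
        print(missing_file)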
Explanation
First read the input file into a dictionary that maps each directory to the set of filenames expected in it. Then, for each directory, list all the TIFF files in that directory on the server and (set) subtract this from the collection of filenames you should have. Print anything that's left.
EDIT: Fixed silly things. Too late at night when I wrote this!

That padZero and string concatenation stuff looks to me like it would take a good share of the time.
What you want is for the script to spend nearly all its time reading the directory and very little on anything else.
Do you have to do it in Python? I've done similar stuff in C and C++. Java should be pretty good too.

You're going to be I/O bound, especially on a network, so any changes you can make to your script will result in very minimal speedups, but off the top of my head:
import os
input, output = open("in.txt"), open("out.txt", "w")
root = r'\\server\share'
for fid in input:
    fid = fid.strip().rjust(8, "0")
    dir = fid[:-3]  # no need to re-pad
    path = os.path.join(root, dir, fid + ".tif")
    if not os.path.isfile(path):
        output.write(fid + "\n")
I don't really expect that to be any faster, but it is arguably easier to read.
Other approaches may be faster. For example, if you expect to touch most of the files, you could just pull a complete recursive directory listing from the server, convert it to a Python set(), and check for membership in that rather than hitting the server for many small requests. I will leave the code as an exercise...
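For what it's worth, the "one listing, then set membership" idea could be sketched roughly like this. It is only an illustration, not the answerer's code: the helper names list_existing and find_missing are made up, and the layout (8-digit zero-padded ID, directory named after the ID minus its last three digits, .tif extension) is assumed from the question.

```python
import os


def list_existing(root):
    """Walk root once and return the set of .tif paths relative to root."""
    existing = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        for name in filenames:
            if name.lower().endswith(".tif"):
                rel = os.path.normpath(os.path.join(rel_dir, name))
                existing.add(os.path.normcase(rel))
    return existing


def find_missing(ids, existing):
    """Return the zero-padded IDs whose expected file is not in existing."""
    missing = []
    for raw in ids:
        if not raw.strip():
            continue  # skip blank lines in the input file
        fid = raw.strip().rjust(8, "0")
        # Assumed layout: <root>/<first 5 digits>/<8-digit id>.tif
        expected = os.path.normcase(os.path.join(fid[:-3], fid + ".tif"))
        if expected not in existing:
            missing.append(fid)
    return missing
```

With the question's files this would be called as find_missing(open("in.txt"), list_existing(r"\\server\share")). Whether one big walk beats 13 million individual checks depends on how many files live on the share, so it is worth timing on a small subtree first.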

I would probably use a shell command to get the full listing of files in all directories and subdirectories in one hit. Hopefully this will minimise the amount of requests you need to make to the server.
You can get a listing of the remote server's files by doing something like:
Linux: mount the shared drive as /shared/directory/ and then do ls -R /shared/directory > ~/remote_file_list.txt
Windows: Use Map Network Drive to mount the shared drive as drive letter X:, then do dir /S X:\shared_directory > C:\remote_file_list.txt
Use the same methods to create a listing of your local folder's contents as local_file_list.txt. Your Python script will then reduce to an exercise in text processing.
Note: I did actually have to do this at work.
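The text-processing step might look something like the sketch below. It assumes a listing with one full path per line (for example from dir /S /B on Windows or find on Linux) rather than the grouped output of plain ls -R or dir /S, and both function names are invented for the example.

```python
def names_in_listing(lines):
    """Collect lower-cased base names from a listing with one path per line."""
    names = set()
    for line in lines:
        line = line.strip()
        if line:
            # Accept both Windows and POSIX separators in the listing.
            names.add(line.replace("\\", "/").rsplit("/", 1)[-1].lower())
    return names


def missing_from_listing(wanted, listing_lines):
    """Return the wanted file names that do not appear in the listing."""
    present = names_in_listing(listing_lines)
    return [name for name in wanted if name.lower() not in present]
```

Comparing base names only is a simplification; if the same file name can occur in several directories, the comparison would have to use the directory part as well.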

Related

Lost files while tried to move files using python shutil.move

I had 120 files in my source folder which I need to move to a new directory (destination). The destination is made in the function I wrote, based on the string in the filename. For example, here is the function I used.
import fnmatch, os, shutil
import pandas as pd
path = '/path/to/source'
dropbox = '/path/to/dropbox'
files = [os.path.join(path, i).split('/')[-1] for i in os.listdir(path) if i.startswith("SSE")]
sam_lis = list()
for sam in files:
    sam_list = sam.split('_')[5]
    sam_lis.append(sam_list)
sam_lis = pd.unique(sam_lis).tolist()
# Using the above list
ID = sam_lis
def filemover(ID, files, dropbox):
    """
    Function to move files from the common place to the destination folder
    """
    for samples in ID:
        for fs in files:
            if samples in fs:
                destination = dropbox + "/" + samples + "/raw/"
                if not os.path.isdir(destination):
                    os.makedirs(destination)
        for rawfiles in fnmatch.filter(files, pat="*"):
            if samples in rawfiles:
                shutil.move(os.path.join(path, rawfiles),
                            os.path.join(destination, rawfiles))
In the function, I am creating the destination folders based on the IDs derived from the files list. When I tried to run this for the first time, it raised a "file does not exist" error.
However, when I later checked the source folder, all files starting with SSE were missing, although they had been there at the beginning. I would like some insight here:
Does shutil.move ever move files to somewhere like a temp folder instead of the destination folder?
Does shutil.move delete the files from the source under any circumstances?
Is there any way I can test my script to find the potential reasons for the missing files?
Any help or suggestions are much appreciated.
It is late, but people don't seem to understand the OP's question. If you move a file to a folder that does not exist, the file appears to turn into an unreadable binary blob and seems lost forever. It has happened to me twice, once in Git Bash and once using shutil.move in Python. In Python this happens when the shutil.move destination is a folder path that does not exist yet rather than the full destination file path: the file is then renamed to that path, and each later move to the same path overwrites the previous file.
For example, if you run the code below, a similar situation to what the op described will happen:
src_folder = r'C:/Users/name'
dst_folder = r'C:/Users/name/data_images'
file_names = glob.glob(r'C:/Users/name/*.jpg')
for file in file_names:
    file_name = os.path.basename(file)
    shutil.move(os.path.join(src_folder, file_name), dst_folder)
Note that the destination passed to shutil.move is just a folder. It should be os.path.join(dst_folder, file_name). If dst_folder does not exist, this causes what the OP described in the question. I found something similar at the link here, with a more detailed explanation of what went wrong: File moving mistake with Python
shutil.move does not delete your files. If for any reason your files failed to move to the given location, check the directory where your code is stored for a '+' folder; your files are most likely stored there.
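One way to guard against this kind of loss is sketched below. It is a minimal illustration rather than anything from the answers above: move_into is a made-up helper that creates the destination folder first, always passes shutil.move a full destination file path, and refuses to overwrite an existing file.

```python
import os
import shutil


def move_into(src_path, dst_folder):
    """Move src_path into dst_folder, creating the folder if it is missing."""
    os.makedirs(dst_folder, exist_ok=True)
    target = os.path.join(dst_folder, os.path.basename(src_path))
    if os.path.exists(target):
        # Stop instead of silently replacing a file that is already there.
        raise FileExistsError("refusing to overwrite " + target)
    shutil.move(src_path, target)
    return target
```

To test a script like the one in the question without risk, it may help to run it against a copy of the source folder, or to print the source and target of each planned move before enabling the actual shutil.move call.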

OneDrive free up space with Python

I have been using OneDrive to store a large number of images and now I need to process them, so I have synced my OneDrive folder to my computer, which takes almost no space on disk. However, since I have to open() them in my code, they all get downloaded, which would take much more than the available disk space on my computer. I can manually use the Free up space action in the right-click context menu, which keeps them synced without taking space.
I'm looking for a way to do the same thing but in my code instead, after every image I process.
Trying to find how to get the commands of contextual menu items led me to these two places in the registry:
HKEY_LOCAL_MACHINE\SOFTWARE\Classes\Directory\shell
HKEY_LOCAL_MACHINE\SOFTWARE\Classes\*\shellex\ContextMenuHandlers
However I couldn't find anything related to it and those trees have way too many keys to check blindly. Also this forum post (outside link) shows a few ways to free up space automatically, but it seems to affect all files and is limited to full days intervals.
So is there any way to either access that command or free up the space from Python?
According to this microsoft post it is possible to call Attrib.exe to do that sort of manipulation on files.
This little snippet does the job for a per-file usage. As shown in the linked post, it's also possible to do it on the full contents of a folder using the /s argument, and much more.
import subprocess
def process_image(path):
    # Open the file, which downloads it automatically
    with open(path, 'rb') as img:
        print(img)
    # Free up space (OneDrive) after usage
    subprocess.run(['attrib', '+U', '-P', path])
The download and freeing up space are fairly quick, but in the case of running this heavily in parallel, it is possible that some disk space will be consumed for a short amount of time. In general though, this is pretty instantaneous.
In addition to Mat's answer. If you are working on a Mac then you can replace Attrib.exe with "/Applications/OneDrive.App/Contents/MacOS/OneDrive /unpin" to make the file online only.
import subprocess
path = "/Users/OneDrive/file.png"
subprocess.run(["/Applications/OneDrive.App/Contents/MacOS/OneDrive", "/unpin", path])
Free up space for multiple files.
import os
import subprocess
path = r"C:\Users\yourUser\Folder"
diret = os.listdir(path)
for di in diret:
    dir_atual = os.path.join(path, di)
    for root, dirs, files in os.walk(dir_atual):
        for file in files:
            arquivos = os.path.join(root, file)
            print(arquivos)
            subprocess.run(['attrib', '+U', '-P', arquivos])

Using os.system() in a specific directory only

I have a directory containing multiple files with similar names, plus subdirectories named after them, so that files with like names are located in the matching subdirectory. I'm trying to concatenate all the .sdf files in a given subdirectory into a single .sdf file.
import os
from os import system
for ele in os.listdir(Path):
    if ele.endswith('.sdf'):
        os.chdir(Path + '/' + ele[0:5])
        system('cat' + ' ' + '*.sdf' + '>' + ele[0:5] + '.sdf')
However when I run this, the concatenated file includes every .sdf file from the original directory rather than just the .sdf files from the desired one. How do I alter my script to concatenate the files in the subdirectory only?
This is a very clumsy way of doing it. Using chdir is not recommended, and neither is system (the documentation recommends subprocess instead, and calling cat is overkill anyway).
Let me propose a pure Python implementation that uses glob.glob to find the .sdf files, then reads each file one by one and writes it to the big file opened before the loop:
import glob, os
big_sdf_file = "all_data.sdf"  # I'll let you compute the name/directory you want
with open(big_sdf_file, "wb") as fw:
    for sdf_file in glob.glob(os.path.join(Path, "*.sdf")):
        with open(sdf_file, "rb") as fr:
            fw.write(fr.read())
I left big_sdf_file not computed, I would not recommend to put it in the same directory as the other files, since running the script twice would result in taking the output as input as well.
Note that the drawback of this approach is that if the files are big, they're read fully into memory, which can cause problems. In that case, replace
fw.write(fr.read())
by:
shutil.copyfileobj(fr,fw)
(importing shutil is necessary in that case). That copies the data in chunks instead of reading each whole file into memory.
I'll add that it's probably not the full solution you're expecting, since there seems to be something about scanning the sub-directories of Path to create one big .sdf file per sub-directory, but the provided code, which doesn't use any system command or chdir, should be easier to adapt to your needs.
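A possible adaptation to the per-sub-directory case is sketched here. It guesses at what the question wants (one combined .sdf per sub-directory, named after that sub-directory); the function name concat_sdf_per_subdir and the out_dir parameter are made up, and out_dir is assumed to lie outside the scanned tree so that outputs are never picked up as inputs.

```python
import glob
import os
import shutil


def concat_sdf_per_subdir(top, out_dir):
    """Write out_dir/<subdir>.sdf for every sub-directory of top."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for entry in sorted(os.listdir(top)):
        sub = os.path.join(top, entry)
        if not os.path.isdir(sub):
            continue  # ignore the loose .sdf files at the top level
        parts = sorted(glob.glob(os.path.join(sub, "*.sdf")))
        if not parts:
            continue
        out_path = os.path.join(out_dir, entry + ".sdf")
        with open(out_path, "wb") as fw:
            for part in parts:
                with open(part, "rb") as fr:
                    shutil.copyfileobj(fr, fw)  # chunked copy, low memory use
        written.append(out_path)
    return written
```

No chdir is involved: every path is built with os.path.join, so the current working directory never matters.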

Why is my original folder not kept after compression? Why is my compression so slow? - python 3.4

The purpose of this program is to zip a directory or a folder as simply as possible and write the generated .tar.gz to one of my USB flash drives (or any other location). Plans are to add a function that will also use GnuPG to encrypt the folder, and another that will allow the user to input a time in order to perform this task daily, weekly, monthly, etc. I also want the user to be able to choose the destination of the zipped folder. Just wanted to post this up now to see if it worked on first attempt and to get a bit of feedback.
My main question is why I lose the main folder upon extraction of the compressed files. For example, say I compress "Documents", which contains the two folders "Videos" and "Pictures" and the file "manual.txt". When I extract the archive, it does not put "Documents" at the extraction point; it dumps "Videos", "Pictures" and "manual.txt" there directly. That is fine in that no data is lost and everything is still intact, but it creates a bit of clutter, and I would like to keep the original directory.
I am also wondering why this program takes so long to create the archive, and why in some cases the .tar.gz file is just as large as the original folder. This happens with video files; text files compress well and much more quickly.
Are video files just hard to compress? It takes about 5 minutes to process 2 GB of video files, and the result is the same size as the original, which is kind of pointless.
Also, would it make sense to use regex to validate user input in this case, or could I just use a couple of if statements instead? The preferred input in this program is 'root', not '/root'. Couldn't I just cut the '/' off if the input starts with one?
I mainly want to see if this is the right, most efficient way of doing things, I'd rather not be given the answer in the usual stack overflow copy/paste way, lets get a discussion going.
So why is this program so slow when processing larger amounts of data? I expect a reduction in speed but not by that much
#!/usr/bin/env python3
'''
author: ryan st***
date: 12/5/2015
time: 18:55 Eastern time (GMT -5)
language: python 3.4
'''
# Import, import, import.
import os, subprocess, sys, zipfile, re
import shutil
import time
# Backup (zip) files
def zipDir():
    try:
        # Get file to be zipped and zip file destination from user
        Dir = "~"
        str1 = input ('Input directory to be zipped(eg. Documents, Downloads, Desktop/programs): ')
        # an input example that works "bin/mans"
        str2 = input ('Zipped output directory (eg. root, myBackups): ')
        # an output example that works "bin2/test"
        zipName = input ("What would you like to name your zipped folder? ")
        path1 = Dir, str1, "/"
        path2 = Dir, str2, "/"
        # Zip it up
        # print (zipFile, ".tar.gz will be created from the folder ", path1[0]+path1[1]+path1[2])
        #"and placed into the folder ", path2[0]+path2[1]+path2[2])
        zipDirTo = os.path.expanduser(os.path.join(path2[0], path2[1]+path2[2], zipName))
        zipDir = os.path.expanduser(os.path.join(path1[0], path1[1]))
        print ('Directory "',zipDir,'" will be zipped and saved to the location: "' ,zipDirTo,'.tar.gz"')
        shutil.make_archive(zipDirTo, 'gztar', zipDir)
        print ("file zipped")
    # In Case of mistake
    except:
        print ("Something went wrong in compression.\n",
               "Ending Task, Please try again")
        quit()
# Execute the program
def main():
    print ("It will be a fucking miracle if this succeeds.")
    zipDir()
    print ("Success!!!!!!")
    time.sleep(2)
    quit()
# Wrap it all up
if __name__ == '__main__':
    main()
Video files are normally already compressed, so recompressing them doesn't help. For image and video files, use plain tar only.
My main question is why I lose the main folder upon extraction of the compressed files
Because you're not storing that folder's name in the zip file. The paths you're using don't include Documents, they start with the name of the items inside Documents.
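To keep the top-level folder, shutil.make_archive can be given both root_dir (where archiving starts) and base_dir (the path that is actually stored). The sketch below shows one way to do that; the wrapper name archive_keeping_folder is invented for the example, and the third positional argument used in the question's code corresponds to root_dir.

```python
import os
import shutil


def archive_keeping_folder(folder, archive_base):
    """Create archive_base.tar.gz whose entries start with the folder's name."""
    folder = os.path.abspath(os.path.expanduser(folder))
    parent, name = os.path.split(folder)
    # root_dir is where archiving starts; base_dir is the path stored inside.
    return shutil.make_archive(archive_base, "gztar",
                               root_dir=parent, base_dir=name)
```

Called as archive_keeping_folder("~/Documents", "/media/usb/backup"), where both paths are placeholders, this should produce an archive that extracts to a Documents folder instead of spilling its contents into the extraction point.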
Are video files just hard to compress?
Any file that is already compressed, such as most video and audio formats, will be hard to compress further, and it will take quite a bit of time to find that out if the size is large. You might consider detecting compressed files and storing them in the zip file without further compression using the ZIP_STORED constant.
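A rough sketch of that suggestion follows. Note that it switches from the question's .tar.gz (where gzip is applied to the whole archive) to the zipfile module, which can choose a compression method per file. The extension list and the function name zip_folder are assumptions for illustration, and zip_path should point outside the folder being archived.

```python
import os
import zipfile

# Hypothetical list of extensions treated as "already compressed".
ALREADY_COMPRESSED = {".mp4", ".mkv", ".mp3", ".jpg", ".png", ".zip", ".gz"}


def zip_folder(folder, zip_path):
    """Zip folder, storing already-compressed files without recompression."""
    folder = os.path.abspath(folder)
    parent = os.path.dirname(folder)
    with zipfile.ZipFile(zip_path, "w") as zf:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in filenames:
                full = os.path.join(dirpath, name)
                ext = os.path.splitext(name)[1].lower()
                method = (zipfile.ZIP_STORED if ext in ALREADY_COMPRESSED
                          else zipfile.ZIP_DEFLATED)
                # arcname relative to the parent keeps the top-level folder.
                zf.write(full, os.path.relpath(full, parent),
                         compress_type=method)
```

Detecting compressed content by extension is crude but cheap; it avoids spending minutes deflating gigabytes of video for no size gain.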
let[']s get a discussion going.
Stack Overflow's format is not really suited to discussions.

taking data from files which are in folder

How do I get the data from multiple txt files that are placed in a specific folder? I started with the code below but could not fix it. It gives an error like "No such file or directory: '.idea'".
(Let's say I have a folder A, and in it there are x.txt, y.txt, z.txt and so on. I am trying to get and print the information from all the files x, y, z.)
def find_get(folder):
    for file in os.listdir(folder):
        f = open(file, 'r')
        for data in open(file, 'r'):
            print(data)
find_get('filex')
Thanks.
If you just want to print each line:
import glob
import os
def find_get(path):
    # glob returns paths that already include the directory, so open them directly
    for f in glob.glob(os.path.join(path, "*.txt")):
        with open(f) as data:
            for line in data:
                print(line)
glob will find only your .txt files in the specified path.
Your error comes from not joining the path to the filename: unless the file is in the directory you run the code from, Python cannot find it without the full path. Another issue is that you seem to have a directory named .idea, which would also give an error when you try to open it as a file. This also presumes you actually have permission to read the files in the directory.
If your files were larger I would avoid reading all into memory and/or storing the full content.
First of all, make sure you add the folder name to the file name, so the file can be found relative to where the script is executed.
To do so, use os.path.join, which, as its name suggests, joins paths. So, using a generator:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield f.read()
# this consumes the generator to a list
files_data = list(find_get('filex'))
See what we got in the list that consumed the generator:
print(files_data)
It may be more convenient to produce tuples which can be used to construct a dict:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield (relative_file_path, f.read(), )
# this consumes the generator to a dict
files_data = dict(find_get('filex'))
You will now have a mapping from each file's path to its content.
Also, take a look at the answer by @Padraic Cunningham. He brought up the glob module, which is suitable in this case.
The error you're facing is simple: listdir returns filenames, not full pathnames. To turn them into pathnames you can access from your current working directory, you have to join them to the directory path:
for filename in os.listdir(directory):
    pathname = os.path.join(directory, filename)
    with open(pathname) as f:
        pass  # do stuff
So, in your case, there's a file named .idea in the folder directory, but you're trying to open a file named .idea in the current working directory, and there is no such file.
There are at least four other potential problems with your code that you also need to think about and possibly fix after this one:
1. You don't handle errors. There are many very common reasons you may not be able to open and read a file--it may be a directory, you may not have read access, it may be exclusively locked, it may have been moved since your listdir, etc. And those aren't logic errors in your code or user errors in specifying the wrong directory, they're part of the normal flow of events, so your code should handle them, not just die. Which means you need a try statement.
2. You don't do anything with the files but print out every line. Basically, this is like running cat folder/* from the shell. Is that what you want? If not, you have to figure out what you want and write the corresponding code.
3. You open the same file twice in a row, without closing in between. At best this is wasteful, at worst it will mean your code doesn't run on any system where opens are exclusive by default. (Are there such systems? Unless you know the answer to that is "no", you should assume there are.)
4. You don't close your files. Sure, the garbage collector will get to them eventually--and if you're using CPython and know how it works, you can even prove the maximum number of open file handles that your code can accumulate is fixed and pretty small. But why rely on that? Just use a with statement, or call close.
However, none of those problems are related to your current error. So, while you have to fix them too, don't expect fixing one of them to make the first problem go away.
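For illustration, a version that addresses the points above might look like the following sketch. The function name read_text_files and the choice to collect failures in a second dict are assumptions made for the example, not the only reasonable design.

```python
import os


def read_text_files(directory):
    """Return ({filename: text}, {filename: error}) for entries in directory."""
    contents, errors = {}, {}
    for filename in sorted(os.listdir(directory)):
        pathname = os.path.join(directory, filename)
        try:
            with open(pathname) as f:  # opened once, closed by the with block
                contents[filename] = f.read()
        except (OSError, UnicodeDecodeError) as exc:
            # Directories, permission problems, vanished or binary files
            # all end up here instead of stopping the whole run.
            errors[filename] = exc
    return contents, errors
```

Returning the errors rather than printing them leaves it to the caller to decide whether an unreadable entry such as .idea is worth reporting.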
Full variant:
import os
def find_get(path):
    files = {}
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            with open(os.path.join(path, file), "r") as data:
                files[file] = data.read()
    return files
print(find_get("filex"))
Output:
{'1.txt': 'dsad', '2.txt': 'fsdfs'}
After that you could generate a single file from that content, etc.
Key things:
os.listdir returns a list of names without the full path, so you need to join the initial path with each found item before operating on it.
a dict is an ideal structure to use here :)
os.listdir returns both files and folders, so you need to check that each list item really is a file
You should check if the file is actually file and not a folder, since you can't open folders for reading. Also, you can't just open a relative path file, since it is under a folder, so you should get the correct path with os.path.join. Check below:
import os
def find_get(folder):
    for file in os.listdir(folder):
        full_path = os.path.join(folder, file)
        if not os.path.isfile(full_path):
            continue # skip other directories
        with open(full_path, 'r') as f:
            for line in f:
                print(line)
