Efficiently move a large number of files into a different folder structure - Python

I am trying to reorganize a large number of PDF files (3 million files, average size 300 KB). Currently, the files are stored in randomly named folders, but I want to organize them by their file names, which are 8-digit integers such as 12345678.pdf.
At the moment the files are stored like this:
/old/a/12345678.pdf
/old/a/12345679.pdf
/old/b/22345679.pdf
I want them to be stored like this:
/new/12/345/12345678.pdf
/new/12/345/12345679.pdf
/new/22/345/22345679.pdf
I thought this was an easy task using shutil:
from pathlib import Path
import shutil

for path_old in Path('old').rglob('*.pdf'):
    r = int(path_old.stem)
    path_new = '/new/' + str(r // 1000**2) + '/' + str(r // 1000 % 1000) + '/' + path_old.name
    shutil.move(str(path_old), path_new)
Unfortunately, this takes forever. My script is only moving ~15 files per second, which means it will take days to complete.
I am not exactly sure whether this is a Python/shutil problem or a more general IO problem - sorry if I misplaced the question. I am open to any type of solution that makes this process faster.
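A sketch of one possible speed-up, assuming /old and /new live on the same filesystem: pre-create every destination directory in a first pass, then use Path.rename, which is a metadata-only operation and skips any copy-and-delete fallback. The bucket layout is taken from the question; the two-pass structure is my own assumption, not a verified fix.
from pathlib import Path

old_root = Path('old')
new_root = Path('/new')

# First pass: collect every PDF and create every destination bucket it will need.
pdfs = list(old_root.rglob('*.pdf'))
buckets = {new_root / str(int(p.stem) // 1000**2) / str(int(p.stem) // 1000 % 1000) for p in pdfs}
for bucket in buckets:
    bucket.mkdir(parents=True, exist_ok=True)

# Second pass: rename (not copy) each file into its bucket.
for path_old in pdfs:
    r = int(path_old.stem)
    # rename() raises OSError if /old and /new are on different filesystems
    path_old.rename(new_root / str(r // 1000**2) / str(r // 1000 % 1000) / path_old.name)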

Related

Most efficient way to check if a string contains any file format?

I have a .txt file with hundreds of thousands of paths, and I simply have to check whether each line is a folder or a file. The hard drive is not with me, so I can't use the os module's os.path.isdir() function. I've tried the code below, but it is just not reliable since some folder names contain a "." at the end.
for row in files:
    if row[-6:].find(".") < 0:
        folders_count += 1
It is just not worth testing whether the end of the string contains any known file format (.zip, .pdf, .doc ...) since there are dozens of different file formats on this HD. When my code reads the .txt, it stores each line as a string inside a list, so my code has to work on the strings alone.
An example of a folder path:
'path1/path2/truckMV.34'
An example of a file path:
'path1/path2/certificates.pdf'
It's impossible for us to judge whether it's a file or a folder just from the string, since an extension is just an arbitrary, agreed-upon string that programs choose to interpret in a certain way.
Having said that, if I had the same problem I would do my best to estimate with the following pseudo code:
Create a hash map (or a dictionary as you are in Python)
For every line of the file, look at the last path component and see if there's a "." in it
Create a key for it on the hash map with a counter of how many times you have encountered the "possible extensions".
After you go through the whole list you will have a collection of possible extensions and how many times you encountered each. Treat the ones with only 1 occurrence (or below some other low, arbitrary threshold) as part of a folder name rather than an extension.
The basis of this heuristic is that it's unlikely for a person to have a lot of unique extensions on their desktop - but that's just an assumption I came up with.
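A minimal sketch of that heuristic, assuming the paths are already loaded as strings; the threshold of 1 and all the names below are my own illustrative choices.
from collections import Counter

def guess_folders(paths, threshold=1):
    """Count candidate extensions across all paths, then classify each path."""
    candidates = Counter()
    for p in paths:
        last = p.rsplit('/', 1)[-1]  # last path component
        if '.' in last:
            candidates[last.rsplit('.', 1)[-1].lower()] += 1

    folders = []
    for p in paths:
        last = p.rsplit('/', 1)[-1]
        if '.' not in last:
            folders.append(p)  # no dot at all: assume it is a folder
        elif candidates[last.rsplit('.', 1)[-1].lower()] <= threshold:
            folders.append(p)  # rare "extension": probably a folder like truckMV.34
    return folders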

Automate (or loop) moving certain files to subdirectories by name

I have 207 directories full of .WAV files, where each directory contains a certain number of files recorded on one day (the number varies from directory to directory). The names of the directories are just dates in YYYYMMDD format, and the filenames have already been modified so that their names are in ‘HHMMSS.WAV’ format (the time the recording was taken, i.e. 024545.WAV) in each directory. Each directory has a different recording period, so for example, directory1 contains files that were recorded on a certain day between 02am and 11am, while directory2 contains files that were recorded on a certain day between 11am and 6pm, etc.
I need to concatenate the files by hourly intervals; so for example, in directory1 there are 1920 clips, and I need to move files in each hourly interval into a separate directory – so effectively, there will be x number of new subdirectories for directory1 where x is the number of hourly intervals that are present in directory1 (i.e. directory1_00-01 for all the files in directory1 that were recorded between 00am and 01am, directory1_01-02 for all the files in directory1 that were recorded between 01am and 02am, etc. and if there were 6 hour intervals in directory1, I will need 6 subdirectories, one for each hour interval). I need to have these separate directories because it’s the only way I’ve figured out how to concatenate .WAV files together (see Script 2). The concatenated files should also be in a separate directory to contain all stitched files for directory1.
Currently, I’m doing everything manually using two python scripts and it’s getting extremely cumbersome since I’m physically changing the numbers and intervals for every hour (silly, I know):
Script 1 (to move all files in an hour into another directory; in this particular bit of code, I'm finding all the clips between 01am and 02am and moving them to the subdirectory within directory1 so that the subdirectory only contains files from 01am to 02am):
import os
import shutil

origin = r'PATH/TO/DIRECTORY1'
destination = r'PATH/TO/DIRECTORY1/DIRECTORY1_01-02'
startswith_ = '01'

[os.rename(os.path.join(origin, i), os.path.join(destination, i)) for i in os.listdir(origin) if i.startswith(startswith_)]
Script 2 (to concatenate all files in a folder and write the output to another directory; in this particular bit of code, I'm in the subdirectory from Script 1, concatenating all the files within it, and saving the output file "directory1_01-02.WAV" in another subdirectory of directory1 called "directory1_concatenated"):
import os
import glob
from pydub import AudioSegment

os.chdir('PATH/TO/DIRECTORY1/DIRECTORY1_01-02')
wav_segments = [AudioSegment.from_wav(wav_file) for wav_file in glob.glob("*.wav")]
combined = AudioSegment.empty()
for clip in wav_segments:
    combined += clip
combined.export('PATH/TO/DIRECTORY1/DIRECTORY1_CONCATENATED/DIRECTORY1_01-02.WAV', format="wav")
The idea is that by the end of it, "directory1_concatenated" should contain all the concatenated files from each hour interval within directory1.
Can anyone please help me somehow automate this process so I don’t have to do it manually for all 207 directories? Feel free to ask any questions about the process just in case I haven't explained myself very well (sorry!).
Edit:
Figured out how to automate Script 1 to run thanks to the os.walk suggestions :) Now I have a follow-up question about Script 2. How do you increment the saved files so that they're numbered? When I try the following, I get an "invalid syntax" error.
rootdir = 'PATH/TO/DIRECTORY1'
for root, dirs, files in os.walk(rootdir):
    for i in dirs:
        wav_segments = [AudioSegment.from_wav(wav_file) for wav_file in glob.glob("*.wav")]
        combined = AudioSegment.empty()
        for clip in wav_segments:
            combined += clip
        combined.export("PATH/TO/DIRECTORY1/DIRECTORY1_CONCATENATED/DIRECTORY1_%s.wav", format = "wav", % i)
        i++
I've been reading some other stack overflow questions but they all seem to deal with specific files? Or maybe I'm just not understanding os.walk fully yet - sorry, beginner here.
I am pretty sure you could do something with os.walk() to walk through your directories and files. Look at this snippet:
import os

rootdir = '.'
for root, dirs, files in os.walk(rootdir):
    print(root, dirs, files)
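To address the follow-up edit: the "invalid syntax" comes from passing % i to export() as a stray argument, and from i++, which is not Python. Below is a sketch (untested against your data, and not part of the original answer) of one way to automate Script 2 across every hourly subdirectory, naming each output file after the subdirectory instead of a counter; the directory names are taken from the question, everything else is an assumption.
import os
import glob
from pydub import AudioSegment

rootdir = 'PATH/TO/DIRECTORY1'
outdir = os.path.join(rootdir, 'DIRECTORY1_CONCATENATED')
os.makedirs(outdir, exist_ok=True)

for subdir in sorted(os.listdir(rootdir)):
    subpath = os.path.join(rootdir, subdir)
    # skip plain files and the output directory itself
    if not os.path.isdir(subpath) or subdir == 'DIRECTORY1_CONCATENATED':
        continue
    wav_files = sorted(glob.glob(os.path.join(subpath, '*.WAV')))
    if not wav_files:
        continue
    combined = AudioSegment.empty()
    for wav_file in wav_files:
        combined += AudioSegment.from_wav(wav_file)
    # name the stitched file after its hour subdirectory, e.g. DIRECTORY1_01-02.wav
    combined.export(os.path.join(outdir, subdir + '.wav'), format='wav')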

How do I find the average of the numbers in multiple text files?

I have multiple (around 50) text files in a folder, and I wish to find the mean of the numbers across all these files. Is there a way for Python to add up all the numbers in each of these files automatically and find the average for them?
I assume you don't want to type the names of all the files manually, so the first step is to get the file names in Python so that you can use them in the next step.
import os
import numpy as np

Initial_directory = "<the full address to the 50 files you have, ending with />"
Files = []
for file in os.listdir(Initial_directory):
    Files.append(Initial_directory + file)
Now the list called "Files" has all the 50 files. Let's make another list to save the average of each file.
Reading the data from each file depends on how the data is stored, but I assume that there is a single value on each line.
Averages = []
for i in range(len(Files)):
    Data = np.loadtxt(Files[i])
    Averages.append(np.average(Data))
Looping over all the files, Data holds the values from each file, and their average is then appended to the list Averages.
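If what's wanted is one overall mean across every number in every file, rather than one average per file, a minimal sketch reusing Files and np from the snippet above (this assumes all values should be weighted equally regardless of file length):
all_values = np.concatenate([np.loadtxt(f, ndmin=1) for f in Files])
overall_average = np.mean(all_values)
print(overall_average)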
This can be done if we unpack the steps needed to accomplish it.
Steps:
Python has a module called os that allows you to interact with the file system. You'll need this for accessing the files and reading from them.
declare a few variables as counters to be used for the duration of your script, including the directory name where the files reside.
loop over files in the directory, increment the total file_count variable by 1 (to get the total number of files, used for averaging at the end of the script).
join the file's specific name with the directory to create a path for the open function to find the accurate file.
read each file and add each line (assuming it's a number) within the file to the total number count (used for averaging at the end of the script), removing the newline character.
finally, print the average or continue using it in the script for whatever you need.
You could try something like the following:
#!/usr/bin/env python
import os

file_count = 0
total = 0
dir_name = 'your_directory_path_here'

for file_name in os.listdir(dir_name):
    file_count += 1
    # join the file's name with the directory to get a full path for open()
    file_path = os.path.join(dir_name, file_name)
    with open(file_path, 'r') as file:
        # add each line (assumed to be a number) to the running total
        for line in file:
            total += int(line.strip('\n'))

avg = total / file_count
print(avg)

Limitation of Python's glob?

I'm using glob to feed file names to a loop like so:
import glob

inputcsvfiles = glob.iglob('NCCCSM*.csv')
for x in inputcsvfiles:
    csvfilename = x
    # do stuff here
The toy example that I used to prototype this script works fine with 2, 10, or even 100 input csv files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.
Given that the script works absolutely fine with a "reasonable" number of entries (2-100), but not with what I need (10,959) is there a better way to handle this situation, or some sort of parameter that I can set to allow for a high number of iterations?
PS - initially I was using glob.glob, but glob.iglob fares no better.
Edit:
An expansion of above for more context...
import glob
import arcpy

# typical input file looks like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
inputcsvfiles = glob.iglob('NCCCSM*.csv')

# loop over individual input files
for x in inputcsvfiles:
    csvfile = x
    modelname = x[0:5]
    # ArcPy
    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # do more stuff after
The script fails at the ArcPy line, where the "csvfile" variable gets passed into the command. The error reported is that it can't find a specified csv file (e.g., "NCCSM20110101.csv"), when in fact, the csv is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times as I have above? Again, this will work fine if the directory being glob'd only has 100 or so files, but if there's a whole lot (e.g., 10,959), it fails seemingly arbitrarily somewhere down the list.
Try doing an ls * in a shell for those 10,000 entries and the shell would fail too. How about walking the directory and yielding those files one by one instead?
# credit - @dabeaz - generators tutorial
import os
import fnmatch

def gen_find(filepat, top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Example use
if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv", ".")
    for name in lognames:
        print(name)
One issue that arose was not with Python per se, but rather with ArcPy and/or MS handling of CSV files (more the latter, I think). As the loop iterates, it creates a schema.ini file whereby information on each CSV file processed in the loop gets added and stored. Over time, the schema.ini gets rather large and I believe that's when the performance issues arise.
My solution, although perhaps inelegant, was to delete the schema.ini file during each loop to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and Bash scripting in the end.
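A sketch of that workaround, assuming schema.ini is written into the same directory as the CSV files; inputshape below is a placeholder for whatever layer the original script joins against, and the loop body is illustrative only.
import os
import glob
import arcpy

inputshape = 'PATH/TO/INPUTSHAPE'  # placeholder for the layer used in the original script

for csvfile in glob.iglob('NCCCSM*.csv'):
    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # ... more processing here ...
    # Remove the schema.ini that grows with every CSV touched, so it never
    # becomes large enough to slow the joins down.
    schema_path = os.path.join(os.path.dirname(os.path.abspath(csvfile)), 'schema.ini')
    if os.path.exists(schema_path):
        os.remove(schema_path)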
If it works for 100 files but fails for 10000, then check that arcpy.AddJoin_management closes csvfile after it is done with it.
There is a limit on the number of open files that a process may have at any one time (which you can check by running ulimit -n).
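For reference, a minimal sketch of checking that limit from inside Python on Unix-like systems (the resource module is not available on Windows):
import resource

# soft limit is what currently applies; hard limit is the ceiling it can be raised to
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open file limit: soft=%d, hard=%d" % (soft, hard))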

Simple way to store data from multiple processes

I have a Python script that does something along the line of:
def MyScript(input_filename1, input_filename2):
    # ... some very intensive computation ...
    return val
i.e. for every pair of input, I calculate some float value. Note that val is a simple double/float.
Since this computation is very intensive, I will be running them across different processes (might be on the same computer, might be on multiple computers).
What I did before was output each value to a text file named input1_input2.txt. Then I would have 1,000,000 files that I need to reduce into one file. This process is not very fast, since the OS doesn't like folders that contain too many files.
How do I efficiently get all these data into one single computer? Perhaps having MongoDB running on a computer and all the processes send the data along?
I want something easy. I know that I can do this in MPI but I think it is overkill for such a simple task.
If the inputs have a natural order to them, and each worker can find out "which" input it's working on, you can get away with one file per machine. Since Python floats are 8 bytes long, each worker would write the result to its own 8-byte slot in the file.
import struct

RESULT_FORMAT = 'd'  # Double-precision float.
RESULT_SIZE = struct.calcsize(RESULT_FORMAT)
RESULT_FILE = '/tmp/results'

def worker(position, input_filename1, input_filename2):
    val = MyScript(input_filename1, input_filename2)
    with open(RESULT_FILE, 'rb+') as f:
        f.seek(RESULT_SIZE * position)
        f.write(struct.pack(RESULT_FORMAT, val))
Compared to writing a bunch of small files, this approach should also be a lot less I/O intensive, since many workers will be writing to the same pages in the OS cache.
(Note that on Windows, you may need some additional setup to allow sharing the file between processes.)
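One practical detail worth adding (my note, not part of the original answer): opening with 'rb+' requires the results file to already exist at full size, so it would need to be pre-allocated once before the workers start. A minimal sketch, where NUM_RESULTS is a hypothetical count of input pairs:
import struct

RESULT_FILE = '/tmp/results'
NUM_RESULTS = 1000000  # hypothetical: one 8-byte slot per input pair

# Create the file once, sized so every worker can seek to its own slot.
with open(RESULT_FILE, 'wb') as f:
    f.truncate(struct.calcsize('d') * NUM_RESULTS)
The results can later be read back in one pass with struct.unpack or numpy.fromfile.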
You can use Python's parallel processing support.
http://wiki.python.org/moin/ParallelProcessing
Specifically, I would mention NetWorkSpaces.
http://www.drdobbs.com/web-development/200001971
You can generate a folder structure made of generated subfolders that themselves contain generated subfolders.
For example, a main folder that contains 256 subfolders, each of which contains 256 more; 3 levels deep will be enough. You can use substrings of GUIDs to generate unique folder names.
So the GUID AB67E4534678E4E53436E becomes folder AB, which contains subfolder 67, which contains folder E4534678E4E53436E.
Using two 2-character substrings makes it possible to generate 256 * 256 folders - more than enough to store 1 million files.
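A minimal sketch of that layout; uuid4 and the 2+2 character split are illustrative choices, and .txt is just a stand-in for whatever each worker writes.
import os
import uuid

def bucketed_output_path(root):
    """Build a nested path like root/AB/67/E4534678E4E53436E.txt from a fresh GUID."""
    guid = uuid.uuid4().hex.upper()
    folder = os.path.join(root, guid[:2], guid[2:4])
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, guid[4:] + '.txt')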
You could run one program that collects the outputs, for example over XML-RPC.
