Split large directory into chunks of files - python

I have a directory structure with a lot of files in it (~1 million) which I would like to zip into chunks of 10k files. So far I have this, which creates garbage files: when I unzip them it looks like all of the files are glommed into one long file instead of individual files, and I'm stuck. Any help would be greatly appreciated.
dirctr = 1
for root, dirs, files in os.walk(args.input_dir, followlinks=False):
    counter = 1
    curtar = args.output_dir + 'File' + str(dirctr) + '.gz'
    tar = tarfile.open(name=curtar, mode="w:gz")
    for filename in files:
        if ((counter - 1) % args.files_per_dir) == 0:
            if tarfile.is_tarfile(curtar):
                tar.close(curtar)
                dirctr = dirctr + 1
                curtar = args.output_dir + 'File' + str(dirctr) + '.gz'
                tar.open(name=curtar, mode="w:gz")
        tar.add(os.path.join(root, filename))
        counter = counter + 1
    tar.close(curtar)
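A couple of things stand out: mode="w:gz" writes a .tar.gz, so naming the output File1.gz and decompressing it once yields a single tar stream (the "one long file"); also, TarFile.close() takes no argument, and tar.open(...) does not rebind tar to a new archive. A minimal sketch of one way to chunk files into numbered .tar.gz archives, with plain parameters standing in for the question's args values (not from a posted answer):

import os
import tarfile

def chunk_into_tars(input_dir, output_dir, files_per_chunk=10000):
    """Walk input_dir and pack files into File1.tar.gz, File2.tar.gz, ...
    with at most files_per_chunk files in each archive."""
    chunk = 1
    count = 0
    tar = tarfile.open(os.path.join(output_dir, "File%d.tar.gz" % chunk), mode="w:gz")
    for root, dirs, files in os.walk(input_dir, followlinks=False):
        for filename in files:
            if count and count % files_per_chunk == 0:
                # Current archive is full: close it and start the next one.
                tar.close()
                chunk += 1
                tar = tarfile.open(os.path.join(output_dir, "File%d.tar.gz" % chunk), mode="w:gz")
            tar.add(os.path.join(root, filename))
            count += 1
    tar.close()

Each archive then extracts back into individual files, with their original paths preserved inside the tarball.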

Related

Python: How to make a zip with files, not inside a folder

A folder contains my files and I want to make a zip with those files, then save the zip into another folder. Here are my files:
- file:
  - file_0.txt
  - file_1.txt
  - file_2.txt
- zip:
  // save zip
script.py
Here is my code:
from zipfile import ZipFile

zip_name = "Zipfile"
zipObj = ZipFile("zip/{}.zip".format(zip_name), "w")
count = 0
while count < 3:
    file_name = "file_"
    zipObj.write('file/' + file_name + str(count) + ".txt")
    count += 1
This makes a zip file with a folder named file inside it, and all the txt files inside that folder. I want to remove the folder and only zip the files.
import os
from zipfile import ZipFile

zip_name = "Zipfile"
zipObj = ZipFile("zip/{}.zip".format(zip_name), "w")
# Change into the folder so only the bare filenames are stored in the archive
os.chdir('file')
count = 0
while count < 3:
    file_name = "file_"
    zipObj.write(file_name + str(count) + ".txt")
    count += 1
# Close the archive so the zip's central directory gets written out
zipObj.close()
This should work for you
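For what it's worth, another option that avoids changing the working directory is the arcname parameter of ZipFile.write(), which sets the name stored inside the archive. A small sketch, assuming the same file/ and zip/ layout as above:

from zipfile import ZipFile

with ZipFile("zip/Zipfile.zip", "w") as zipObj:
    for count in range(3):
        name = "file_" + str(count) + ".txt"
        # arcname stores just the bare filename, so no file/ folder appears inside the zip
        zipObj.write("file/" + name, arcname=name)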

How to unzip all files from the same filetype with python

I want to extract all files that have the same filetype from a zip file.
I have this code:
from zipfile import ZipFile

counter = 0
with ZipFile('Video.zip', 'r') as zipObject:
    listOfFileNames = zipObject.namelist()
    for fileName in listOfFileNames:
        if fileName.endswith('.MXF'):
            zipObject.extract(fileName, 'Greenscreen')
            print('File ' + str(counter) + ' extracted')
            counter += 1
print('All ' + str(counter) + ' files extracted')
The problem is that the zip file also has multiple sub-folders with the required .MXF files in them.
Thus, after running the script, my Greenscreen folder also contains all of those sub-folders.
But I just need the files of that file type, without the sub-folder structure.
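One way to flatten the output, sketched here with the same Video.zip and Greenscreen names as above, is to skip extract() (which always recreates the stored folder structure) and instead copy each matching member out under its base name:

import os
import shutil
from zipfile import ZipFile

counter = 0
os.makedirs('Greenscreen', exist_ok=True)
with ZipFile('Video.zip', 'r') as zipObject:
    for fileName in zipObject.namelist():
        if fileName.endswith('.MXF'):
            # Drop any sub-folder prefix so the file lands directly in Greenscreen/
            target = os.path.join('Greenscreen', os.path.basename(fileName))
            with zipObject.open(fileName) as source, open(target, 'wb') as dest:
                shutil.copyfileobj(source, dest)
            counter += 1
print('All ' + str(counter) + ' files extracted')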

The filename, directory name, or volume label syntax is incorrect: ':' Python

I'm trying to add some new features to my file-management system, but I got stuck on the following. I have 3 folders (Source, Destination and Archive). Each of them has 3 sub-folders (A, B and C). Only the sub-folders in Source contain 1 or more files. These will be re-written (moved) to Destination or Archive (depending on some requirements).
The files in the sub-folders (A, B and C) in Source will only be rewritten to the sub-folders (A, B, and C) in Destination if:
They ARE the last created file AND if they are at least 120 seconds old.
If they ARE NOT the last created file but they ARE at least 120 seconds old, they will be moved to the sub-folders (A, B, and C) in Archive.
If they ARE NOT the last created file and they ARE NOT at least 120 seconds old, they will stay in the current sub-folders (A, B, and C) in Source
During the re-writing process, the content of Power in the file will be multiplied by 10.
Which looks like this: https://i.stack.imgur.com/ee0r7.png
This is the code I have; I get the following error: The filename, directory name, or volume label syntax is incorrect: ':'.
Can someone help and tell me what I'm doing wrong and why nothing is getting re-written, moved or deleted? Much appreciated!
import os, os.path
import time

# Make source, destination and archive paths.
src = r'c:\data\AM\Desktop\Source'
dst = r'c:\data\AM\Desktop\Destination'
arc = r'c:\data\AM\Desktop\Archive'
os.chdir(src)

# Now we want to get the absolute paths of the files which will be used to re-write the files.
for root, subdirs, files in os.walk('.'):
    src_path = os.path.join(src, root)
    dst_path = os.path.join(dst, root)
    arc_path = os.path.join(arc, root)
    for f in files:
        src_fpath = os.path.join(src_path, f)
        dst_fpath = os.path.join(dst_path, f)
        arc_fpath = os.path.join(arc_path, f)
        # Get only the newest files inside the src_fpath and store it in newest_file_paths.
        newest_file_paths = max(src_fpath, key=os.path.getctime)
        # Now we start reading data from the old files, write it into the new files and delete the old files.
        with open(src_fpath, 'r') as read1, open(dst_fpath, 'w') as write1, open(arc_fpath, 'w') as write2:
            data = {
                'Power': None,
            }
            for line in read1:
                splitter = (ID, Item, Content, Status) = line.strip().split()
                # If the file(s) ARE the last created file(s) AND they are at least 120 seconds old: rewrite in Destination and remove in Source.
                if read1 == newest_file_paths and os.path.getctime(newest_file_paths) < time.time() - 120 and Item in data:
                    Content = str(int(Content) * 10)
                    write1.write(ID + '\t' + Item + '\t' + Content + '\t' + Status + '\n')
                    write1.write(line)
                    os.remove(src_fpath)
                # If the file(s) ARE NOT the last created file(s) but they ARE at least 120 seconds old: rewrite in Archive and remove in Source.
                elif read1 == newest_file_paths and os.path.getctime(newest_file_paths) > time.time() - 120 and Item in data:
                    write2.write(line)
                    os.remove(src_fpath)
                # If they ARE NOT the last created file and they ARE NOT at least 120 seconds old: stay in Source.
                else:
                    continue
Something like this might get you started with what I mean in the comments:
import os, os.path
import time


def rewrite(src_fpath, dst_fpath):
    # Ensure the directory exists:
    os.makedirs(os.path.dirname(dst_fpath), exist_ok=True)
    with open(src_fpath, "r") as read1, open(dst_fpath, "w") as write1:
        for line in read1:
            (ID, Item, Content, Status) = line.strip().split()
            Content = str(int(Content) * 10)
            write1.write(ID + "\t" + Item + "\t" + Content + "\t" + Status + "\n")
            write1.write(line)
    print("Wrote", dst_fpath, "based on", src_fpath)
    # Uncomment this to actually remove the originals:
    # os.remove(src_fpath)


def process(src_dir, dst_dir, arc_dir, max_age=120):
    operations = []
    for dirpath, dirnames, filenames in os.walk(src_dir):
        if not filenames:
            # Empty directory, skip
            continue
        rel_dir = os.path.relpath(dirpath, src_dir)
        filename_to_ctime = {
            filename: os.path.getctime(os.path.join(dirpath, filename))
            for filename in filenames
        }
        newest_ctime = max(filename_to_ctime.values())
        for filename, ctime in filename_to_ctime.items():
            abspath = os.path.join(dirpath, filename)
            age = time.time() - ctime
            # TODO: is this logic correct?
            if age >= max_age:
                # Move file(s) with newest ctime to dst_dir, everything else to arc_dir
                new_root = dst_dir if ctime == newest_ctime else arc_dir
                new_path = os.path.join(new_root, rel_dir, filename)
                operations.append(("rewrite", abspath, new_path))
            else:
                operations.append(("noop", abspath))
    for operation, *args in operations:
        print(operation, args)
        if operation == "rewrite":
            rewrite(args[0], args[1])


if __name__ == "__main__":
    process(src_dir="./src", dst_dir="./dst", arc_dir="./arc")
For instance, if I have a structure like
|____src
| |____a
| | |____2.txt
| | |____1.txt
| |____b
| | |____5-old.txt
| | |____4.txt
| | |____3.txt
where 5-old.txt has been artificially aged,
the output is
('rewrite', './src/a/1.txt', './arc/a/1.txt')
('rewrite', './src/a/2.txt', './dst/a/2.txt')
('rewrite', './src/b/3.txt', './arc/b/3.txt')
('rewrite', './src/b/4.txt', './dst/b/4.txt')
('noop', './src/b/5-old.txt')

How to move batches of files in Python?

I have a folder with 1092 files. I need to move those files to a new directory in batches of 10 (each new folder will have only 10 files, so at most 110 folders).
I tried this code, and now the folders have been created, but I can't find any of the original files (???). They are neither in the original nor in the newly created folders...
path = "/home/user/Documents/MSc/Imagens/Dataset"
paths = []
for root, dirs, file in os.walk(path):
    for name in file:
        paths.append(os.path.join(root, name))

start = 0
end = 10
while end <= 1100:
    dest = str(os.mkdir("Dataset_" + str(start) + "_" + str(end)))
    for i in paths[start:end]:
        shutil.move(i, dest)
    start += 10
    end += 10
Any ideas?
With your move command you are moving all 10 files onto one single target rather than into the new folder, because the filenames are missing. On top of that, dest is the string 'None', since os.mkdir() doesn't return anything.
You need to append the filename to the destination path:
dataset_dirname = "Dataset_" + str(start) + "_" + str(end)
dataset_fullpath = os.path.join(path, dataset_dirname)
# Create the batch folder at its full path (not relative to the current working directory)
os.mkdir(dataset_fullpath)
for i in paths[start:end]:
    # Append the filename to dataset_fullpath and move the file
    shutil.move(i, os.path.join(dataset_fullpath, os.path.basename(i)))
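Put together with the original loop, the whole thing might look something like this sketch (same path and 10-file batches as above, not verbatim from the answer):

import os
import shutil

path = "/home/user/Documents/MSc/Imagens/Dataset"
paths = []
for root, dirs, files in os.walk(path):
    for name in files:
        paths.append(os.path.join(root, name))

for start in range(0, len(paths), 10):
    end = start + 10
    dataset_fullpath = os.path.join(path, "Dataset_" + str(start) + "_" + str(end))
    os.mkdir(dataset_fullpath)
    for i in paths[start:end]:
        # Move each file into the new batch folder under its own name
        shutil.move(i, os.path.join(dataset_fullpath, os.path.basename(i)))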

Python: Getting files into an archive without the directory?

I've been learning Python for about 3 weeks now, and I'm currently trying to write a little script for sorting files (about 10,000) by keywords and date appearing in the filename. Files before a given date should be added to an archive. The sorting works fine, but not the archiving.
It creates an archive (the name is fine), but the archive contains the complete path to the files.
If I open it, it looks like: folder1 -> folder2 -> folder3 -> files.
How can I change it such that the archive only contains the files and not the whole structure?
Below is a snippet with my zip function. node is the path where the files were before sorting, folder is a sub-folder with the files sorted by a keyword in the name, and items are the folders with files sorted by date.
I am using Python 2.6
def ZipFolder(node, zipdate):
    xynode = node + '/xy'
    yznode = node + '/yz'
    for folder in [xynode, yznode]:
        items = os.listdir(folder)
        for item in items:
            itemdate = re.findall('(?<=_)\d\d\d\d-\d\d', item)
            print item
            if itemdate[0] <= zipdate:
                arcname = str(item) + '.zip'
                x = zipfile.ZipFile(folder + '/' + arcname, mode='w', compression=zipfile.ZIP_DEFLATED)
                files = os.listdir(folder + '/' + item)
                for f in files:
                    x.write(folder + '/' + item + '/' + f)
                    print 'writing ' + str(folder + '/' + item + '/' + f) + ' in ' + str(item)
                x.close()
                shutil.rmtree(folder + '/' + item)
    return
I am also open to any suggestions and improvements.
From help(zipfile):
| write(self, filename, arcname=None, compress_type=None)
| Put the bytes from filename into the archive under the name
| arcname.
So try changing your write() call to:
x.write(folder + '/' + item + '/' + f, arcname=f)
As for your code, it seems good enough to me, especially for a 3-week Pythonista, although a few comments would have been welcome ;-)
