Python: Find files in very large folder (over 100 TB) - python

I am working on a program that compares entries in cataloguing software (Rucio) with the files in storage. From the catalogue, I get the path where it believes each file is stored. I then check that location to see whether the file exists there. I have a bash script that does this successfully, but it would be much better if it could be redone in Python.
The problem I have encountered is that Python does not find the files, even when I know they exist. I have tried things like
if path.exists(fulladdress):
    ...  # does stuff
Even when I provide a file that I know exists, it is not found. I suspect this is because the folder is huge (over 100 TB and over 287,000 files), so Python does not search the whole folder and therefore does not find the file.
Is there a Python solution that works for folders that big?
Best regards
Piotr
The bash command that works is:
os.system("cd; cd directory_with_files; test -e file_in_directory_exist && echo filename >> found.txt || echo filename >> not_found")
I tried running this:
def findfile(name, path):
    for dirpath, dirname, filename in os.walk(path):
        if name in filename:
            return os.path.join(dirpath, name)
def compere_checksum(not_missing_files):
    not_missing_files_file = open(not_missing_files, 'r')
    lines_not_missing_files_file = not_missing_files_file.readlines()
    # Extract a list of files I know exist
    for line in lines_not_missing_files_file:
        line.replace(' ','')
        line_list=line.split(",")
        address=line_list[0].replace("LUND: file://", "")
        # address = path to the folder
        fille=address[address.rindex('/')+1:]
        # fille = the name of the file
        address=address.replace(fille,"")
        # search for the file using bash
        os.system("test -e {} && echo Found {}".format(line_list[0],fille))
        # search for the file using the Python function above
        filepath=findfile(address,fille)
        print(filepath)
address is something along the lines of "/projects/dir/dir/dir/dir/dir/mc20/v12/4.0GeV/v2.2.1-3e/"
and fille looks like this: "mc_v12-4GeV-3e-inclusive_run1310195_t1601591250.root"
The script returns:
Found mc_v12-4GeV-3e-inclusive_run1310220_t1601591602.root
None
Found mc_v12-4GeV-3e-inclusive_run1310246_t1601592829.root
None
Found mc_v12-4GeV-3e-inclusive_run1310247_t1601591229.root
None
Found mc_v12-4GeV-3e-inclusive_run1310248_t1601591216.root
None
Found mc_v12-4GeV-3e-inclusive_run1310249_t1601591416.root
None
Found mc_v12-4GeV-3e-inclusive_run1310250_t1601591472.root
None
So the bash command finds the files but the Python function does not.

I can use:
with open(file) as f:
    ...  # do stuff
I don't know why this works when neither
path.exists
nor
def findfile(name, path):
    for dirpath, dirname, filename in os.walk(path):
        if name in filename:
            return os.path.join(dirpath, name)
does, but as long as it works it is fine.

import os
def findfile(name, path):
    for dirpath, dirname, filename in os.walk(path):
        if name in filename:
            return os.path.join(dirpath, name)
filepath = findfile("file2.txt", "/")
print(filepath)
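A note on the snippet in the question: findfile is defined as findfile(name, path) but is called as findfile(address, fille), so os.walk() is handed the file name instead of the folder and yields nothing. That alone would explain the None results. Folder size should not matter for a direct check, because os.path.exists() tests one path and does not scan the directory. Below is a minimal sketch, assuming each catalogue line looks like "LUND: file:///path/to/file.root,<other fields>". The helper names storage_path and file_is_present are made up for illustration and are not part of Rucio.

```python
import os

PREFIX = "LUND: file://"  # assumed prefix, taken from the replace() call in the question


def storage_path(line):
    """Return the filesystem path from one catalogue line (hypothetical helper)."""
    first_field = line.split(",")[0].strip()  # strip() also drops a trailing newline
    if first_field.startswith(PREFIX):
        first_field = first_field[len(PREFIX):]
    return first_field


def file_is_present(line):
    """Check one path directly; no directory walk is needed."""
    return os.path.isfile(storage_path(line))
```

If the walk-based function is still wanted, the call would need the arguments in the declared order, i.e. findfile(fille, address).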

Related

Extract full Path and File Name

Attempting to write a function that walks a file system and returns the absolute path and filename for use in another function.
Example "/testdir/folderA/222/filename.ext".
Having tried multiple versions of this I cannot seem to get it to work properly.
filesCheck=[]
def findFiles(filepath):
    files=[]
    for root, dirs, files in os.walk(filepath):
        for file in files:
            currentFile = os.path.realpath(file)
            print (currentFile)
            if os.path.exists(currentFile):
                files.append(currentFile)
    return files
filesCheck = findFiles(/testdir)
This returns
"filename.ext" (only one).
Substitute in currentFile = os.path.join(root, file) for os.path.realpath(file) and it goes into a loop in the first directory. Tried os.path.join(dir, file) and it fails as one of my folders is named 222.
I have gone round in circles and get somewhat close but haven't been able to get it to work.
Running on Linux with Python 3.6
There are several things wrong with your code:
Multiple values are being assigned to the variable name files.
You're not adding the root directory to each filename os.walk() returns, which can be done with os.path.join().
You're not passing a string to the findFiles() function.
If you fix those things, there's no longer a need to call os.path.exists(), because you can be sure each file exists.
Here's a working version:
import os
def findFiles(filepath):
    found = []
    for root, dirs, files in os.walk(filepath):
        for file in files:
            currentFile = os.path.realpath(os.path.join(root, file))
            found.append(currentFile)
    return found
filesCheck = findFiles('/testdir')
print(filesCheck)
Hi I think this is what you need. Perhaps you could give it a try :)
from os import walk
path = "C:/Users/SK/Desktop/New folder"
files = []
for (directoryPath, directoryNames, allFiles) in walk(path):
    for file in allFiles:
        files.append([file, f"{directoryPath}/{file}"])
print(files)
Output:
[ ['index.html', 'C:/Users/SK/Desktop/New folder/index.html'], ['test.py', 'C:/Users/SK/Desktop/New folder/test.py'] ]
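For comparison, the same listing can be written with pathlib, which joins the directory and file name for you. This is only a sketch; the function name find_files is my own choice and not from the question.

```python
from pathlib import Path


def find_files(top):
    """Return the resolved absolute path of every regular file below `top`."""
    return sorted(str(p.resolve()) for p in Path(top).rglob("*") if p.is_file())
```

Calling find_files('/testdir') should then give entries such as "/testdir/folderA/222/filename.ext", provided those files exist.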

Move pairs of files (.txt & .xml) into their corresponding folder using Python

I have been working this challenge for about a day or so. I've looked at multiple questions and answers asked on SO and tried to 'MacGyver' the code used for my purpose, but still having issues.
I have a directory (lets call it "src\") with hundreds of files (.txt and .xml). Each .txt file has an associated .xml file (let's call it a pair). Example:
src\text-001.txt
src\text-001.xml
src\text-002.txt
src\text-002.xml
src\text-003.txt
src\text-003.xml
Here's an example of how I would like it to turn out so each pair of files are placed into a single unique folder:
src\text-001\text-001.txt
src\text-001\text-001.xml
src\text-002\text-002.txt
src\text-002\text-002.xml
src\text-003\text-003.txt
src\text-003\text-003.xml
What I'd like to do is create an associated folder for each pair and then move each pair of files into its respective folder using Python. I've already tried working from code I found (thanks to a post from Nov '12 by Sethdd), but I am having trouble figuring out how to use the move function to grab pairs of files. Here's where I'm at:
import os
import shutil
srcpath = "PATH_TO_SOURCE"
srcfiles = os.listdir(srcpath)
destpath = "PATH_TO_DEST"
# grabs the name of the file before extension and uses as the dest folder name
destdirs = list(set([filename[0:9] for filename in srcfiles]))
def create(dirname, destpath):
    full_path = os.path.join(destpath, dirname)
    os.mkdir(full_path)
    return full_path
def move(filename, dirpath):
    shutil.move(os.path.join(srcpath, filename), dirpath)
# create destination directories and store their names along with full paths
targets = [
    (folder, create(folder, destpath)) for folder in destdirs
]
for dirname, full_path in targets:
    for filename in srcfile:
        if dirname == filename[0:9]:
            move(filename, full_path)
I feel like it should be easy, but Python isn't something I work with everyday and it's been a while since my scripting days... Any help would be greatly appreciated!
Thanks,
WK2EcoD
Use the glob module to iterate over all of the 'txt' files. From that you can parse the names, create the folders, and copy the files.
The process should be as simple as it appears to you as a human.
for file_name in os.listdir(srcpath):
    dir = file_name[:9]
    # if dir doesn't exist, create it
    # move file_name to dir
You're doing a lot of intermediate work that seems to be confusing you.
Also, insert some simple print statements to track data flow and execution flow. It appears that you have no tracing output so far.
You can do it with os module. For every file in directory check if associated folder exists, create if needed and then move the file. See the code below:
import os
SRC = 'path-to-src'
for fname in os.listdir(SRC):
    filename, file_extension = os.path.splitext(fname)
    # os.path.splitext() keeps the leading dot in the extension
    if file_extension not in ['.xml', '.txt']:
        continue
    folder_path = os.path.join(SRC, filename)
    if not os.path.exists(folder_path):
        os.mkdir(folder_path)
    os.rename(
        os.path.join(SRC, fname),
        os.path.join(folder_path, fname)
    )
My approach would be:
Find the pairs that I want to move (do nothing with files without a pair)
Create a directory for every pair
Move the pair to the directory
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import os, shutil
import re
def getPairs(files):
    pairs = []
    file_re = re.compile(r'^(.*)\.(.*)$')
    for f in files:
        match = file_re.match(f)
        if match:
            (name, ext) = match.groups()
            if ext == 'txt' and name + '.xml' in files:
                pairs.append(name)
    return pairs
def movePairsToDir(pairs):
    for name in pairs:
        os.mkdir(name)
        shutil.move(name+'.txt', name)
        shutil.move(name+'.xml', name)
files = os.listdir()
pairs = getPairs(files)
movePairsToDir(pairs)
NOTE: This script works when called inside the directory with the pairs.
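If you would rather not depend on the current working directory, the same idea can take the source folder as a parameter. This is a rough sketch using pathlib; the function name move_pairs is hypothetical, and it assumes a pair is any name.txt with a matching name.xml next to it.

```python
import shutil
from pathlib import Path


def move_pairs(src):
    """Move each name.txt / name.xml pair in `src` into src/name/."""
    src = Path(src)
    moved = []
    # sorted() builds the full list first, so moving files does not disturb the loop
    for txt in sorted(src.glob("*.txt")):
        xml = txt.with_suffix(".xml")
        if not xml.is_file():
            continue  # leave unpaired files where they are
        target = src / txt.stem
        target.mkdir(exist_ok=True)
        shutil.move(str(txt), str(target / txt.name))
        shutil.move(str(xml), str(target / xml.name))
        moved.append(txt.stem)
    return moved
```

The returned list of names makes it easy to print or log which pairs were actually moved.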

using subprocess over different files python

I've got a problem with a short script, it'd be great if you could have a look!
import os
import subprocess
root = "/Users/software/fmtomov1.0/remaker_lastplot/source_relocation/observed_arrivals_loc3d"
def loop_loc3d(file_in):
    """Loops loc3d over the source files"""
    return subprocess.call (['loc3d'], shell=True)
def relocation ():
    for subdir, dirs, files in os.walk(root):
        for file in files:
            file_in = open(os.path.join(subdir, file), 'r')
            return loop_loc3d(file_in)
I think the script is quite easy to understand; it's very simple. However, I'm not getting the result I want. In a few words, I just want 'loc3d' to operate on the contents of all the files in the 'observed_arrivals_loc3d' directory, which means that I need to open all the files, and that's what I've done. In fact, if I try to 'print files' after:
for subdir, dirs, files in os.walk(root)
I'll get the name of every file. Furthermore, if I try a 'print file_in' after
file_in = open(os.path.join(subdir, file), 'r')
I get something like this line for every file:
<open file '/Users/software/fmtomov1.0/remaker_lastplot/source_relocation/observed_arrivals_loc3d/EVENT2580', mode 'r' at 0x78fe38>
subprocess has been tested alone on only one file and it's working.
Overall I'm getting no errors, just -11, which means absolutely nothing to me. The output from loc3d should be completely different.
So does the code look fine to you? Is there anything I'm missing? Any suggestion?
Thanks for your help!
I assume you would call loc3d filename from the CLI. If so, then:
def loop_loc3d(filename):
    """Loops loc3d over the source files"""
    return subprocess.call (['loc3d',filename])
def relocation():
    for subdir, dirs, files in os.walk(root):
        for file in files:
            filename = os.path.join(subdir, file)
            return loop_loc3d(filename)
In other words, don't open the file yourself, let loc3d do it.
Currently your relocation method will return after the first iteration (for the first file). You shouldn't need to return at all.
def loop_loc3d(filename):
    """Loops loc3d over the source files"""
    return subprocess.call (['loc3d',filename])
def relocation ():
    for subdir, dirs, files in os.walk(root):
        for file in files:
            filename = os.path.join(subdir, file)
            loop_loc3d(filename)
This is only one of the issues. The other is concerning loc3d itself. Try providing the full path for loc3d.
A -11 return code might mean that the command was killed by signal 11 (SIGSEGV, segmentation fault).
It is a bug in loc3d. A well-behaved program should not produce 'Segmentation fault' on any user input.
Feed loc3d only files that it can understand. Print filenames or use subprocess.check_call() to find out which file it doesn't like:
#!/usr/bin/env python
import fnmatch
import os
import subprocess
def loc3d_files(root):
    for dirpath, dirs, files in os.walk(root, topdown=True):
        # skip hidden directories
        dirs[:] = [d for d in dirs if not d.startswith('.')]
        # process only known files
        for file in fnmatch.filter(files, "*some?pattern[0-9][0-9].[ch]"):
            yield os.path.join(dirpath, file)
for path in loc3d_files(root):
    print(path)
    subprocess.check_call(['loc3d', path]) # raise on any error
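To make a value like -11 less mysterious: on POSIX systems subprocess.call() returns -N when the child process was terminated by signal N. The small helper below (describe_returncode is a name I made up) turns the number into something readable. It is a sketch and assumes Python 3.5 or newer for signal.Signals.

```python
import signal


def describe_returncode(returncode):
    """Explain a return code from subprocess.call() on POSIX systems."""
    if returncode < 0:
        # A negative value -N means the child was terminated by signal N.
        return "killed by " + signal.Signals(-returncode).name
    if returncode == 0:
        return "success"
    return "exited with status {}".format(returncode)
```

Printing describe_returncode(subprocess.call(['loc3d', path])) for each file should show which input makes loc3d crash.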
I just found out that loc3d, as unutbu said, relies on several variables and, in this specific case, on one called 'observed_arrivals' that I have to create and delete every time in my directory. In Python terms it means:
import os
import shutil
import subprocess
def loop_loc3d(file_in):
    """Loops loc3d over the source files"""
    return subprocess.call(["loc3d"], shell=True)
path = "/Users/software/fmtomo/remaker_lastplot/source_relocation"
path2 = "/Users/Programming/working_directory/2test"
new_file_name = 'observed_arrivals'
def define_object_file ():
    for filename in os.listdir("."):
        file_in = os.rename (filename, new_file_name) # get the observed_arrivals file
        file_in = shutil.copy ("/Users/simone/Programming/working_directory/2test/observed_arrivals", "/Users/software/fmtomo/remaker_lastplot/source_relocation")
        os.chdir(path) # goes where loc3d is
        loop_loc3d (file_in)
        os.remove("/Users/software/fmtomo/remaker_lastplot/source_relocation/observed_arrivals")
        os.remove ("/Users/Programming/working_directory/2test/observed_arrivals")
        os.chdir(path2)
Now, this is working very well, so it should answer my question. I guess it's quite easy to understand, it's just copying, changing dir and that kind of stuff.

Process a set of files from a source directory to a destination directory in Python

Being completely new in python I'm trying to run a command over a set of files in python. The command requires both source and destination file (I'm actually using imagemagick convert as in the example below).
I can supply both source and destination directories, however I can't figure out how to easily retain the directory structure from the source to the destination directory.
E.g. say the srcdir contains the following:
srcdir/
    file1
    file3
    dir1/
        file1
        file2
Then I want the program to create the following destination files on destdir: destdir/file1, destdir/file3, destdir/dir1/file1 and destdir/dir1/file2
So far this is what I came up with:
import os
from subprocess import call
srcdir = os.curdir # just use the current directory
destdir = 'path/to/destination'
for root, dirs, files in os.walk(srcdir):
    for filename in files:
        sourceFile = os.path.join(root, filename)
        destFile = '???'
        cmd = "convert %s -resize 50%% %s" % (sourceFile, destFile)
        call(cmd, shell=True)
The walk method doesn't directly provide what directory the file is under srcdir other than concatenating the root directory string with the file name. Is there some easy way to get the destination file, or do I have to do some string manipulation in order to do this?
Change your loop to:
for root, dirs, files in os.walk(srcdir):
    # strip the leading separator so os.path.join() does not treat it as absolute
    destroot = os.path.join(destdir, root[len(srcdir):].lstrip(os.sep))
    for adir in dirs:
        os.makedirs(os.path.join(destroot, adir))
    for filename in files:
        sourceFile = os.path.join(root, filename)
        destFile = os.path.join(destroot, filename)
        processFile(sourceFile, destFile)
There are a few relative path scripts out there that will do what you want -- namely find the relative path between two paths. E.g.:
http://www.voidspace.org.uk/python/pathutils.html
(relpath method)
http://code.activestate.com/recipes/302594-another-relative-filepath-script/
http://groups.google.com/group/comp.lang.python/browse_thread/thread/390d8d3e3ac8ef44/d8c74f96468c6a36?q=relative+path&rnum=1&pli=1
Note that the standard library has provided this functionality as os.path.relpath() since Python 2.6.
While not pretty, this will preserve the directory structure of the tree:
_, _, subdirs = root.partition(srcdir)
destfile = os.path.join(destdir, subdirs[1:], filename)
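Another way to keep the directory structure is os.path.relpath(), which gives the part of root that lies below srcdir without any string slicing. The sketch below is only an illustration; mirrored_paths is a hypothetical name, and it assumes destdir is not located inside srcdir (otherwise the walk would descend into the output).

```python
import os


def mirrored_paths(srcdir, destdir):
    """Yield (source_file, dest_file) pairs, creating destination folders."""
    for root, dirs, files in os.walk(srcdir):
        rel = os.path.relpath(root, srcdir)  # "." for the top directory
        destroot = os.path.normpath(os.path.join(destdir, rel))
        os.makedirs(destroot, exist_ok=True)
        for filename in files:
            yield os.path.join(root, filename), os.path.join(destroot, filename)
```

Each pair can then be handed to the command from the question, for example call(["convert", sourceFile, "-resize", "50%", destFile]). Passing a list instead of a shell string also avoids quoting problems with unusual file names.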

Deleting files which start with a name Python

I have a few files I want to delete, they have the same name at the start but have different version numbers. Does anyone know how to delete files using the start of their name?
Eg.
version_1.1
version_1.2
Is there a way of deleting any file that starts with the name version?
Thanks
import os, glob
for filename in glob.glob("mypath/version*"):
    os.remove(filename)
Substitute the correct path (or . (= current directory)) for mypath. And make sure you don't get the path wrong :)
This will raise an Exception if a file is currently in use.
If you really want to use Python, you can just use a combination of os.listdir(), which returns a listing of all the files in a certain directory, and os.remove().
I.e.:
import os
my_dir = "mypath"  # placeholder: enter the dir name
for fname in os.listdir(my_dir):
    if fname.startswith("version"):
        os.remove(os.path.join(my_dir, fname))
However, as other answers pointed out, you really don't have to use Python for this, the shell probably natively supports such an operation.
In which language?
In bash (Linux / Unix) you could use:
rm version*
or in batch (Windows / DOS) you could use:
del version*
If you want to write something to do this in Python it would be fairly easy - just look at the documentation for regular expressions.
edit:
just for reference, this is how to do it in Perl:
opendir (folder, "./") || die ("Cannot open directory!");
@files = readdir (folder);
closedir (folder);
unlink foreach (grep /^version/, @files);
import os
os.chdir("/home/path")
for file in os.listdir("."):
    if os.path.isfile(file) and file.startswith("version"):
        try:
            os.remove(file)
        except Exception as e:
            print(e)
The following function will remove all files and folders in a directory which start with a common string:
import os
import shutil
def cleanse_folder(directory, prefix):
    for item in os.listdir(directory):
        path = os.path.join(directory, item)
        if item.startswith(prefix):
            if os.path.isfile(path):
                os.remove(path)
            elif os.path.isdir(path):
                shutil.rmtree(path)
            else:
                print("A symlink or something called {} was not deleted.".format(item))
import os
import re
directory = "./uploaded"
pattern = "1638813371180"
files_in_directory = os.listdir(directory)
filtered_files = [file for file in files_in_directory if ( re.search(pattern,file))]
for file in filtered_files:
    path_to_file = os.path.join(directory, file)
    os.remove(path_to_file)
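Since deleting by prefix is easy to get wrong, it can help to list the matches first and delete only when asked. Here is a small sketch along those lines; the function name remove_by_prefix and the dry_run parameter are my own additions, not something from the answers above.

```python
from pathlib import Path


def remove_by_prefix(directory, prefix, dry_run=True):
    """Delete regular files in `directory` whose names start with `prefix`."""
    matches = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )
    if not dry_run:
        for p in matches:
            p.unlink()
    return [p.name for p in matches]
```

Run it once with the default dry_run=True to check the returned names, then again with dry_run=False to actually remove the files.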
