Python - Efficiently building a dictionary

I am trying to build a dict(dict(dict())) out of multiple files, which are stored in different numbered directories, i.e.
/data/server01/datafile01.dat
/data/server01/datafile02.dat
...
/data/server02/datafile01.dat
/data/server02/datafile02.dat
...
/data/server86/datafile01.dat
...
/data/server86/datafile99.dat
I have a couple of problems at the moment:
Switching between directories
I know that I have 86 servers, but the number of files per server may vary. I am using:
for i in range(1, 86):
    basedir = '/data/server%02d' % i
    for file in glob.glob(basedir + '*.dat'):
        # do reading and sorting here
but I can't seem to switch between the directories properly. It just sits in the first one and seems to get stuck when there are no files in the directory.
Checking if key already exists
I would like a function that checks whether a key is already present and, if it isn't, creates that key and certain subkeys, since one can't just write dict[Key1][Subkey1][Subsubkey1] = value.
By the way, I am using Python 2.6.6.

Björn helped with the defaultdict half of your question. His suggestion should get you very close to where you want to be in terms of the default value for keys that do not yet exist.
The best tool for walking a directory and looking at files is os.walk. You can combine the directory and file names that you get from it with os.path.join to find the files you are interested in. Something like this:
import os

data_path = '/data'

# option 1: a nested generator expression (use [] instead of () for a list comprehension)
data_files = (os.path.join(root, f)
              for (root, dirs, files) in os.walk(data_path)
              for f in files)

# option 2: nested for loops
data_files = []
for root, dirs, files in os.walk(data_path):
    for f in files:
        data_files.append(os.path.join(root, f))

for data_file in data_files:
    # ... process data_file ...
(See the Python docs for list comprehensions.)

I can't help you with your first problem, but the second one can be solved by using a defaultdict. This is a dictionary that calls a factory function to generate a default value when a requested key does not exist. Using lambda, you can nest them:
>>> from collections import defaultdict
>>> your_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
>>> your_dict[1][2][3]
0
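With that in place, the nested assignment from the question works directly; for example (the key names here are just illustrative):
>>> your_dict['server01']['datafile01.dat']['linecount'] = 42
>>> your_dict['server01']['datafile01.dat']['linecount']
42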

I'm assuming these 'directories' are remotely mounted shares?
Couple of things:
I'd use os.path.join instead of 'basedir' + '*.dat'
For filesystem-related work I've had very good results parallelizing the computation with multiprocessing.Pool, to get around those times where a remote filesystem might be extremely slow and hold up the whole process.
import os
import glob
import multiprocessing as mp

def processDir(path):
    results = {}
    for file in glob.iglob(os.path.join(path, '*.dat')):
        # ... read the file and add to results here ...
        pass
    return results

dirpaths = ['/data/server%02d' % i for i in range(1, 87)]
_results = mp.Pool(8).map(processDir, dirpaths)
# ... combine the per-directory dicts in _results here ...
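A minimal sketch of that combine step, assuming the per-directory dicts use non-overlapping keys (for example, each keyed by server or file name):
results = {}
for partial in _results:
    results.update(partial)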
For your dict-related problems, use defaultdict as mentioned in the other answers, or even your own dict subclass or a small helper function:
def addresult(results, key, subkey, subsubkey, value):
    if key not in results:
        results[key] = {}
    if subkey not in results[key]:
        results[key][subkey] = {}
    if subsubkey not in results[key][subkey]:
        results[key][subkey][subsubkey] = value
There are almost certainly more efficient ways to accomplish this, but that's a start.
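One slightly more compact equivalent, with the same "only set if not already present" behaviour, uses dict.setdefault:
def addresult(results, key, subkey, subsubkey, value):
    # setdefault returns the existing value, or inserts and returns the default
    results.setdefault(key, {}).setdefault(subkey, {}).setdefault(subsubkey, value)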

Related

How to iterate over files in specific directories?

I'd like to iterate over files in two folders in a directory only, and ignore any other files/directories.
e.g. the paths "dirA/subdirA/folder1" and "dirA/subdirA/folder2".
I tried passing both to pathlib as:
from pathlib import Path

root_dir_A = "dirA/subdirA/folder1"
root_dir_B = "dirA/subdirA/folder2"

for file in Path(root_dir_A, root_dir_B).glob('**/*.json'):
    json_data = open(file, encoding="utf8")
    ...
But it only iterates over the 2nd path in Path(root_dir_A,root_dir_B).
You can't pass two separate directories to Path(). You'll need to loop over them.
for dirpath in (root_dir_A, root_dir_B):
    for file in Path(dirpath).glob('**/*.json'):
        ...
According to the documentation, Path("foo", "bar") produces "foo/bar"; when a later segment is an absolute path, all previous segments are ignored. Either way, it doesn't do what you seemed to hope it would.
Please check the output of Path(root_dir_A,root_dir_B) to see if it returns what you want.
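A quick check in the interpreter shows what actually happens (PosixPath on a POSIX system):
>>> from pathlib import Path
>>> Path("foo", "bar")
PosixPath('foo/bar')
>>> Path("foo", "/bar")   # an absolute segment discards everything before it
PosixPath('/bar')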
In your specific case this should work:
path_root = Path('dirA')

for path in path_root.glob('subdirA/folder[12]/**/*.json'):
    ...
If your paths aren't homogeneous enough, you might have to chain generators, i.e.:
from itertools import chain

content_dir_A = Path(root_dir_A).glob('**/*.json')
content_dir_B = Path(root_dir_B).glob('**/*.json')
content_all = chain(content_dir_A, content_dir_B)

for path in content_all:
    ...

Loop through list of files

I'm in the process of developing a data column check, but I'm having a tough time figuring out how to properly loop through a list of files. I have a folder with a list of csv files. I need to check whether each file maintains a certain structure. I'm not worried about checking the structure of each file; I'm more worried about how to properly pull each individual file from the directory, load it into a DataFrame, and then move on to the next file. Any help would be much appreciated.
def files(path):
    files = os.listdir(path)
    len_files = len(files)
    cnt = 0
    while cnt < len_files:
        print(files)
        for file in os.listdir(path):
            if os.path.isfile(os.path.join(path, file)):
                with open(path + file, 'r') as f:
                    return data_validate(f)

def data_validate(file):
    # Validation check code will eventually go here...
    print(pd.read_csv(file))

def run():
    files("folder/subfolder/")
Which version of python do you use?
I use pathlib and Python 3.6+ to do a lot of file processing with pandas. I find pathlib easy to use, though you still have to dip back into os for a couple of functions it hasn't implemented yet. A plus is that Path objects can be passed into the os functions without modification, so I like the flexibility.
This is a function I used to recursively go through an arbitrary directory structure; I have modified it to look more like what you're trying to achieve above, returning a list of DataFrames.
If your directory is always going to be flat, you can simplify this even more (see the sketch after the function below).
from pathlib import Path

def files(directory):
    top_dir = Path(directory)
    validated_files = list()
    for item in top_dir.iterdir():
        if item.is_file():
            validated_files.append(data_validate(item))
        elif item.is_dir():
            validated_files.append(files(item))
    return validated_files
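For a flat directory, that simplification might look like this (a sketch; note that data_validate as written in the question prints the frame, so it would need to return it instead for the list to actually hold DataFrames):
def files(directory):
    top_dir = Path(directory)
    # validate every regular file directly inside the directory
    return [data_validate(item) for item in top_dir.iterdir() if item.is_file()]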

Finding files with Python using os.walk() in a list comprehension?

I have been using the os.walk() method in Python to make a list of the paths to all the folders and subfolders where a specific file can be found.
I was tired of using a bunch of loops and elifs, so I packed it all into a (quite messy) list comprehension that does exactly what I want:
import os
directory = "C:\\Users\\User\\Documents"
file_name = "example_file.txt"
list_of_paths = [path for path in (os_tuple[0] for os_tuple in os.walk(directory) if file_name in (item.lower() for item in os_tuple[2]))]
I have two questions. The first, and most important, is: is there a more efficient way to do this? I often expect to find several hundred files in just as many folders, and if it's on a server it can take several minutes.
The second question is: How can I make it more readable? Having two generator comprehensions inside a list comprehension feels pretty messy.
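For comparison, the same search written as plain nested loops does exactly what the comprehension above does and may be easier to read:
list_of_paths = []
for root, dirs, files in os.walk(directory):
    # collect the directory if the target file name appears among its files
    if file_name in (f.lower() for f in files):
        list_of_paths.append(root)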
Update: I was told to use glob, so naturally I had to try it. It seems to work just as well as my list comprehension with os.walk(). My next step will therefore be to test the two versions on a couple of different files and folders.
import glob
directory = "C:\\Users\\User\\Documents"
file_name = "example_file.txt"
list_of_paths = [path.lower().replace(("\\" + file_name), "") for path in (glob.glob(directory + "/**/*" + file_name, recursive=True))]
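If an exact filename match is what's wanted, os.path.dirname is arguably a cleaner way to strip the filename than the replace() call; a sketch (without the lowercasing):
list_of_paths = [os.path.dirname(p)
                 for p in glob.glob(os.path.join(directory, "**", file_name), recursive=True)]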
Any additional comments are very welcome.
Update 2: After testing both methods, the results I'm getting suggests that the os.walk() method is about twice as fast as the glob.glob() method. The test was performed on about 400 folders with a total of 326 copies of the file I was looking for.

pythonic way to access each file in a list of directories

I have working code that looks like this:
# Wow. Much nesting. So spacebar
if __name__ == '__main__':
    for eachDir in list_of_unrelated_directories:
        for eachFile in os.listdir(eachDir):
            if eachFile.endswith('.json'):
                # do stuff here
I'd like to know if there is a more elegant way of doing that. I would like to not have my code nested three layers deep like that, and if I could get this to a one-liner like
for each file that ends with .json in all these directories:
    # do stuff
That would be even more awesome. I also edited this to point out that the directories are not all in the same folder. Like you might be looking for .json files in your home folder and also your /tmp folder. So I'm not trying to move recursively through a single folder.
The most Pythonic way is (in my opinion) to write a function that yields the files of a certain type and use it. Then your calling code is very clear and concise. Some of the other answers are very concise but incredibly confusing; in your actual code you should value clarity over brevity (although when you can get both that is, of course, preferred).
Here's the function:
import glob
import os

def files_from_directories(directories, filetype):
    """Yield files of filetype from all directories given."""
    for directory in directories:
        for file in glob.glob(os.path.join(directory, '*' + filetype)):
            yield file
Now your calling code really is a one-liner:
# What a good one-liner!!!!
for json_file in files_from_directories(directories, '.json'):
    # do stuff
So now you have a one-liner, and a very clear one. Plus if you want to process any other type of file, you can just reuse the function with a different filetype.
You can use a generator expression to remove the nested loops:
for json_file in (f for dir in list_of_unrelated_dirs
                  for f in os.listdir(dir)
                  if f.endswith('.json')):
    print json_file
If you want to apply a function to the files, you could even remove the remaining for loop with the map() function:
map(fun,
    (f for dir in list_of_unrelated_dirs
     for f in os.listdir(dir)
     if f.endswith('.json'))
)
Hope this helps!
Surely the following code is not Pythonic, because it is not the simplest or the clearest and it definitely doesn't follow the Zen of Python.
However, it's a one-line approach and it was fun to do ;-):
def manage_file(filepath):
    print('File to manage:', filepath)
EDIT: Based on the accepted answer I've updated my answer to use glob(); the result is still a bit freaky, but it's less code than my previous approach:
map(manage_file, [fn for fn in sum((glob('%s/*.json' % eachDir) for eachDir in data_path), [])])
glob() can reduce you to two levels:
from glob import glob
from os.path import join

for d in list_of_unrelated_directories:
    for f in glob(join(d, '*.json')):
        _process_json_file(f)
If your list_of_unrelated_directories is really a list of totally unrelated directories, I don't see how you can avoid the first loop. If they do have something in common (say a common root, and some common prefix), then you can use os.walk() to traverse the tree and grab all matching files.
It's not really less nesting, it's just nesting within a comprehension.
This gets all things that end with '.json' and are also confirmed as files (ignores folders that end with '.json').
Standalone Code
import os

unrelated_paths = ['c:/', 't:/']
json_files = (os.path.join(p, o) for p in unrelated_paths
              for o in os.listdir(p)
              if (o.lower().endswith('.json')
                  and os.path.isfile(os.path.join(p, o))))

for json_file in json_files:
    print json_file

Find duplicate filenames, and only keep newest file using python

I have 20,000+ files that look like the ones below, all in the same directory:
8003825.pdf
8003825.tif
8006826.tif
How does one find all duplicate filenames, while ignoring the file extension?
Clarification: by a duplicate I mean a file with the same filename, ignoring the file extension. I do not care whether the contents are 100% identical (e.g. by hash or size or anything like that).
For example:
"8003825" appears twice
Then look at the metadata of each duplicate file and only keep the newest one.
Similar to this post:
Keep latest file and delete all other
I think I have to create a list of all files and check whether a filename already exists. If so, use os.stat to determine the modification date?
I'm a little concerned about loading all those filenames into memory, and I'm wondering if there is a more pythonic way of doing things...
Python 2.6
Windows 7
You can do it with O(n) complexity. The solutions with sort have O(n*log(n)) complexity.
import os
from collections import namedtuple

directory = '...'  # path to the directory containing the files
os.chdir(directory)

newest_files = {}
Entry = namedtuple('Entry', ['date', 'file_name'])

for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cached_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cached_file is None:
        newest_files[name] = Entry(this_file_date, file_name)
    else:
        if this_file_date > cached_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
newest_files is a dictionary whose keys are the file names without extensions and whose values are named tuples holding the full file name and the modification date. If a newly encountered file is already in the dictionary, its date is compared with the stored one and the entry is replaced if necessary.
In the end you have a dictionary with the most recent files.
You may then use this dictionary to perform the second pass. Note that lookup in a dictionary is O(1), so the overall complexity of looking up all n files is O(n).
For example, if you want to keep only the newest file of each name and delete the others, this can be achieved in the following way:
for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cached_file_name = newest_files.get(name).file_name
    if file_name != cached_file_name:  # it's not the newest with this name
        os.remove(file_name)
As suggested by Blckknght in the comments, you can even avoid the second pass and delete the older file as soon as you encounter the newer one, just by adding one line of the code:
    else:
        if this_file_date > cached_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
            os.remove(cached_file.file_name)  # this line added
First, get a list of file names and sort them. This will put any duplicates next to each other.
Then, strip off the file extension and compare to neighbors, os.path.splitext() and itertools.groupby() may be useful here.
Once you have grouped the duplicates, pick the one you want to keep using os.stat().
In the end your code might look something like this:
import os, itertools

files = os.listdir(base_directory)
files.sort()

for k, g in itertools.groupby(files, lambda f: os.path.splitext(f)[0]):
    dups = list(g)
    if len(dups) > 1:
        # figure out which file(s) to remove
You shouldn't have to worry about memory here, you're looking at something on the order of a couple of megabytes.
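A sketch of the "figure out which file(s) to remove" step inside the if len(dups) > 1: branch, assuming modification time decides which copy is the newest:
# sort the duplicates oldest-to-newest by modification time
dups.sort(key=lambda f: os.stat(os.path.join(base_directory, f)).st_mtime)
for older in dups[:-1]:  # everything except the newest copy
    os.remove(os.path.join(base_directory, older))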
For the filename counter you could use a defaultdict that stores how many times each file appears:
import os
from collections import defaultdict

counter = defaultdict(int)
for file_name in file_names:
    file_name = os.path.splitext(os.path.basename(file_name))[0]
    counter[file_name] += 1
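From that counter, the duplicate base names are simply the ones seen more than once:
duplicates = [name for name, count in counter.items() if count > 1]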
