Data structures suitable for searching - python

I have a common problem: I have some data and I want to search in it. My issue is that I don't know a proper data structure and algorithm suitable for this situation.
There are two kinds of objects - Process and Package. Both have some properties, but they are only data structures (they don't have any methods). Next, there are PackageManager and something that could be called ProcessManager, which both have a function returning the list of files that belong to some Package, or the files that are used by some Process.
So semantically, we can imagine these data as:
Packages:
Package_1
    file_1
    file_2
    file_3
Package_2
    file_4
    file_5
    file_6
Actually, a file that belongs to Package_k cannot belong to Package_l for k != l :-)
Processes:
Process_1
    file_2
    file_3
Process_2
    file_1
The files used by processes correspond to the files owned by packages. However, the exclusivity rule above doesn't apply here as it does for packages - that is, n processes can use the same file at the same time.
Now for the task: I need to find matches between processes and packages - for a given list of packages, I need to find the list of processes that use any of the files owned by those packages.
My temporary solution was to build a list of [package_name, package_files] and a list of [process_name, process_files], and for every file of every package, search through every file of every process looking for a match. Of course this can only be a temporary solution, given its horrible time complexity (even when I sort the files and use binary search on them).
What can you recommend for this kind of searching, please?
(I am coding it in python)
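For concreteness, the answers below assume plain data objects roughly like this minimal sketch (the attribute name list_of_files is an assumption taken from the first answer; the real classes and managers may look different):
class Package:
    # Plain data holder: a package and the files it owns.
    def __init__(self, name, list_of_files):
        self.name = name
        self.list_of_files = list_of_files

class Process:
    # Plain data holder: a process and the files it uses.
    def __init__(self, name, list_of_files):
        self.name = name
        self.list_of_files = list_of_files

# Example data matching the description above.
Package_1 = Package('Package_1', ['file_1', 'file_2', 'file_3'])
Package_2 = Package('Package_2', ['file_4', 'file_5', 'file_6'])
Process_1 = Process('Process_1', ['file_2', 'file_3'])
Process_2 = Process('Process_2', ['file_1'])
all_processes = [Process_1, Process_2]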

Doing the matching with sets should be faster:
watched_packages = [Package_1, Package_3]  # Packages to consider
watched_files = {  # set comprehension
    file_
    for package in watched_packages
    for file_ in package.list_of_files
}
watched_processes = [
    process
    for process in all_processes
    if any(
        file_ in watched_files
        for file_ in process.list_of_files
    )
]

Based on my understanding of what you are trying to do - given a file name, you want to find a list of all the processes that use that file - this snippet of code should help:
from collections import defaultdict

# First make a dictionary that contains a file, and all processes it is a member of.
file_process_map = defaultdict(list)
for p in processes:
    for fn in p.file_list:
        file_process_map[fn].append(p)
Basically, we're converting your existing structure (where a process has one or more files) into a structure where we have a filename and a list of processes that use it.
Now, when you have a file you need to search for (in the processes), just look it up in the file_process_map dictionary and you'll have a list of all the processes that use the given file.
It is assumed here that processes is a list of objects, and each object has a file_list attribute that contains a list of associated files. Obviously, depending on your data structure, you might need to alter the code.
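A short usage sketch on top of that map (assuming packages expose a list_of_files attribute, as in the previous answer):
# Hypothetical usage: collect every process that uses any file of the watched packages.
matching_processes = []
for package in watched_packages:
    for fn in package.list_of_files:
        for proc in file_process_map.get(fn, []):
            if proc not in matching_processes:  # keep each process only once
                matching_processes.append(proc)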

Related

Creating arrays based on folder name

I have data that has been collected and organized in multiple folders.
In each folder, there can be multiple similar runs -- e.g. data collected under the same conditions, at different times. These filenames contain a number in them that increments. Each folder contains similar data collected under different conditions. For example, I can have an idle folder, and in it can be files named idle_1.csv, idle_2.csv, idle_3.csv, etc. Then I can have another folder, pos1, and similarly, pos1_1.csv, pos1_2.csv, etc.
In order to keep track of what folder and what file the data in the arrays came from, I want to use the folder name, "idle", "pos1", etc, as the array name. Then, each file within that folder (or the data resulting from processing each file in that folder, rather) becomes another row in that array.
For example, if the name of the folder is stored in variable arrname, and the file index is stored in variable arrndx, I want to write the value into that array:
arrname[arrndx]=value
This doesn't work, giving the following error:
TypeError: 'str' object does not support item assignment
Then, I thought about using a dictionary to do this, but I think I still would run into the same issue. If I use a dictionary, I think I need each dictionary's name to be the name derived from the folder name -- creating the same issue. If I instead try to use it as a key in a dictionary, the entries get overwritten with data from every file from the same folder since the name is the same:
arrays['name']=arrname
arrays['index']=int(arrndx)
arrays['val']=value
arrays['name': arrname, 'index':arrndx, 'val':value]
I can't use 'index' either since it is not unique across each different folder.
So, I'm stumped. I guess I could predefine all the arrays, and then write to the correct one based on the variable name, but that could result in a large case statement (is there such a thing in python?) or a big if statement. Maybe there is no avoiding this in my case, but I'm thinking there has to be a more elegant way...
EDIT
I was able to work around my issue using globals():
globals()[arrname].insert(int(arrndx),value)
However, I believe this is not the "correct" solution, although I don't understand why it is frowned upon to do this.
Use a nested dictionary with the folder names at the first level and the file indices (or names) at the second.
from pathlib import Path

data = {}
base_dir = 'base'
for folder in Path(base_dir).resolve().glob('*'):
    if not folder.is_dir():
        continue
    data[folder.name] = {}
    for csv in folder.glob('*.csv'):
        file_id = csv.stem.split('_')[1]
        data[folder.name][file_id] = csv
The above example just saves the file path in the structure, but you could alternatively load the file's data (e.g. using Pandas) and save that to the dictionary. It all depends on what you want to do with it afterwards.
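For example, a rough usage sketch that loads one run with pandas (assuming the data dictionary built above and folders named as in the question):
import pandas as pd  # assumption: pandas is installed

df_idle_3 = pd.read_csv(data['idle']['3'])  # DataFrame for idle/idle_3.csv
print(sorted(data.keys()))           # all folder names, e.g. ['idle', 'pos1']
print(sorted(data['idle'].keys()))   # all run indices found in 'idle'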
What about:
foldername = 'idle'  # Say your folder name is 'idle', for example
files = {}
files[foldername] = [foldername + "_" + str(i) + ".csv"
                     for i in range(1, number_of_files_inside_folder + 1)]
Does that solve your problem?

Why does appending to a list take forever?

I wrote the following code:
import fnmatch
import os

ll = []
for items in src:
    for file in os.listdir('/Users/swaghccc/Downloads/PulledImages/'):
        if fnmatch.fnmatch(file, items.split('/')[-1]):
            print file
            ll.append(file)
My src list contains paths to images,
something like:
/path/to/image.jpg
These images are a subset of the images contained in the directory PulledImages.
The printing of the matched images works correctly.
But when I try to put those image names into a list ll, it takes forever.
What on earth am I doing wrong?
Appending doesn't take forever. Searching through a list, however, takes more time the longer your list is; and os.listdir(), being an operating system call, can be unavoidably slow when running against a large directory.
To avoid that, use a dictionary or set, not a list, to track the names you want to compare against -- and build that set only once, outside your loop.
# Run os.listdir only once, storing the results in a set for constant-time lookup.
import os

files = set(os.listdir('/Users/swaghccc/Downloads/PulledImages/'))
ll = []
for item in src:
    name = item.split('/')[-1]
    if name in files:
        ll.append(name)
Community Wiki because I don't believe this question to be within topic guidelines without a MCVE; thus, not taking rep/credit for this answer.

Get file path of continuously updating file

I have found a few approaches to search for the newest file created by a user in a directory, but I need to determine if an easier approach exists. Most posts on this topic work in some instances or have major hurdles, so I am hoping to unmuddy the water.
I am having difficulty searching through a growing file system, and bringing in more users introduces more potential for error.
I get data from a Superlogics Winview CP 32 for a continuously streaming system. On each occasion of use of the system, I have the operator input a unique identifier for the file name containing a few of the initial conditions of the system we need to track. I would like to get that file name with no help from the operator/user.
Eventually, the end goal is to pare down a list of files I want to search, filtered based on keys, so my first instinct was to use only matching file types, trim all folders in a pathway into a list, and sort based on max timestamp. I used some pretty common functions from these pages:
import fnmatch
import os

def fileWalkIn(path='.', matches=[], filt='*.csv'):  # Useful for walking through a given directory
    """Iterates through all files under the given path using a filter."""
    for root, dirnames, filenames in os.walk(path):
        for filename in fnmatch.filter(filenames, filt):
            matches.append(os.path.join(root, filename))
            yield os.path.join(root, filename)

def getRecentFile(path='.', matches=[], filt='*.dat'):
    rr = max(fileWalkIn(path=path, matches=matches, filt=filt), key=os.path.getmtime)
    return rr
This got me far, but it is rather bulky and slow, which means I cannot run it repeatedly if I want to explore the files that match, unless I carry around a bulky list of the matching files.
Ideally, I will be able to process the data on the fly, executing and printing live while it writes, so this approach is not usable in that instance.
I borrowed from these pages a new approach by alex-martelli, which does not use a filter, gives the option of returning files as opposed to directories, is much slimmer than fileWalkIn, and works more quickly when using the timestamp.
def all_subdirs_of(b='.'):  # Useful for walking through a given directory
    # Create hashable list of files or directories in the parent directory
    results = []
    for d in os.listdir(b):
        bd = os.path.join(b, d)
        if os.path.isfile(bd):
            results.append(bd)
        elif os.path.isdir(bd):
            results.append(bd)
    # return both
    return results

def newest(path='.'):
    rr = max(all_subdirs_of(b=path), key=os.path.getmtime)
    return rr

def getActiveFile(newFile='.'):
    while os.path.exists(newFile):
        newFile = newest(newFile)
        if os.path.isfile(newFile):
            return newFile
        else:
            if newFile:
                continue
            else:
                return newFile
This gets me the active file in a directory much more quickly, but only if no other files have been written since launching my data collection. I can see all kinds of problems here and need some help determining if I have gone down a rabbit hole and there is a simpler solution, like testing file sizes, or whether a more cohesive solution with fewer potential snags exists.
I found other answers for different languages (Java, how-to-get-the-path-of-a-running-jar-file), but would need something in Python. I have explored packages like watchdog and win32, but both require steep learning curves, and I feel like I am either very close, or need to change my paradigm entirely.
dircache might speed up the second approach a bit. It's a wrapper around listdir that checks the directory timestamp and only re-reads directory contents if there's been a change.
Beyond that, you really need something that listens to file system events. A quick Google search turned up two pip packages: pyinotify (Linux only) and watchdog.
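A rough watchdog sketch, for illustration, that reacts whenever a new .dat file appears under a directory (the path, pattern, and handler body are placeholders to adapt):
import time
from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler

class NewDataHandler(PatternMatchingEventHandler):
    def on_created(self, event):
        # event.src_path is the file that just appeared; hook your processing here.
        print('New data file:', event.src_path)

observer = Observer()
observer.schedule(NewDataHandler(patterns=['*.dat']), path='.', recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)  # keep the watcher alive
finally:
    observer.stop()
    observer.join()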
Hope this helps.

Overriding os.walk to return a generator object as the third item

While checking the efficiency of os.walk, I created 600,000 files, each containing the string Hello <number> (where number is just the number of the file in the directory), e.g. the contents of the files in the directory look like:
File Name  | Contents
1.txt      | Hello 1
2.txt      | Hello 2
...
600000.txt | Hello 600000
Now, I ran the following code:
import os

# Here, I am just passing the actual path where those 600,000 txt files are present.
a = os.walk(os.path.join(os.getcwd(), 'too_many_same_type_files'))
print a.next()
The problem I noticed was that a.next() takes too much time and memory, because the third item that a.next() returns is the list of files in the directory (which has 600,000 items). So I am trying to figure out a way to reduce the space complexity (at least) by somehow making a.next() return a generator object as the third item of the tuple, instead of a list of file names.
Would that be a good idea to reduce the space complexity?
As folks have mentioned already, 600,000 files in a directory is a bad idea. Initially I thought that there's really no way to do this because of how you get access to the file list, but it turns out that I'm wrong. You could use the following steps to achieve what you want:
Use subprocess or os.system to call ls or dir (whichever OS you happen to be on), and direct the output of that command to a temporary file (say /tmp/myfiles or something; in Python there's a module that can give you a new temp file).
Open that file for reading in Python.
File objects are iterable and return one line at a time, so as long as you have just the filenames, you'll be fine.
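A rough sketch of those steps (Unix ls assumed; the directory name is the one from the question):
import subprocess
import tempfile

# Dump the directory listing into a temp file instead of building a huge Python list.
listing = tempfile.NamedTemporaryFile(mode='w+', delete=False)
subprocess.call(['ls', 'too_many_same_type_files'], stdout=listing)
listing.flush()

with open(listing.name) as f:
    for line in f:                  # file objects iterate lazily, line by line
        filename = line.rstrip('\n')
        print(filename)             # process one file name at a time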
It's such a good idea, that's the way the underlying C API works!
If you can get access to readdir, you can do it: unfortunately this isn't directly exposed by Python.
This question shows two approaches (both with drawbacks).
A cleaner approach would be to write a module in C to expose the functionality you want.
os.walk calls listdir() under the hood to retrieve the contents of the root directory, then proceeds to split the returned list of items into dirs and non-dirs.
To achieve what you want, you'll need to dig much lower down and implement not only your own version of walk() but also an alternative listdir() that returns a generator. Note that even then you will not be able to provide independent generators for both dirs and files unless you make two separate calls to the modified listdir() and filter the results on the fly.
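For what it's worth, on newer Pythons (3.5+) os.scandir() yields directory entries lazily, so a generator-style listing can be sketched roughly like this:
import os

def lazy_listdir(path='.'):
    # Yield file names one at a time instead of materialising a 600,000-item list.
    for entry in os.scandir(path):
        if entry.is_file():
            yield entry.name

# Usage: consume lazily, e.g. peek at just the first few names.
for i, name in enumerate(lazy_listdir('too_many_same_type_files')):
    if i >= 5:
        break
    print(name)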
As suggested by Sven in the comments above, it might be better to address the actual problem (too many files in a dir) rather than over-engineer a solution.

Simple way of storing data from multiple processes

I have a Python script that does something along the lines of:
def MyScript(input_filename1, input_filename2):
    # ... some very intensive computation ...
    return val
i.e. for every pair of inputs, I calculate some float value. Note that val is a simple double/float.
Since this computation is very intensive, I will be running them across different processes (might be on the same computer, might be on multiple computers).
What I did before was output this value to a text file named input1_input2.txt. Then I would have 1,000,000 files that I need to reduce into one file. This process is not very fast, since the OS doesn't like folders that have too many files.
How do I efficiently get all this data onto one single computer? Perhaps by having MongoDB running on a computer and having all the processes send the data along?
I want something easy. I know that I can do this in MPI but I think it is overkill for such a simple task.
If the inputs have a natural order to them, and each worker can find out "which" input it's working on, you can get away with one file per machine. Since Python floats are 8 bytes long, each worker would write the result to its own 8-byte slot in the file.
import struct

RESULT_FORMAT = 'd'  # Double-precision float.
RESULT_SIZE = struct.calcsize(RESULT_FORMAT)
RESULT_FILE = '/tmp/results'  # Must already exist for 'rb+' to succeed (e.g. pre-create it once).

def worker(position, input_filename1, input_filename2):
    val = MyScript(input_filename1, input_filename2)
    with open(RESULT_FILE, 'rb+') as f:
        f.seek(RESULT_SIZE * position)
        f.write(struct.pack(RESULT_FORMAT, val))
Compared to writing a bunch of small files, this approach should also be a lot less I/O intensive, since many workers will be writing to the same pages in the OS cache.
(Note that on Windows, you may need some additional setup to allow sharing the file between processes.)
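Reading the slots back afterwards could look roughly like this sketch (same RESULT_* constants as above assumed):
import struct

results = []
with open(RESULT_FILE, 'rb') as f:
    while True:
        chunk = f.read(RESULT_SIZE)
        if len(chunk) < RESULT_SIZE:
            break  # end of file (or a partial slot)
        results.append(struct.unpack(RESULT_FORMAT, chunk)[0])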
You can use Python's parallel processing support.
http://wiki.python.org/moin/ParallelProcessing
In particular, I would mention NetWorkSpaces.
http://www.drdobbs.com/web-development/200001971
You can generate a folder structure that contains generated subfolders, which in turn contain generated subfolders.
For example, you could have a main folder that contains 256 subfolders, and each subfolder contains 256 subfolders; 3 levels deep will be enough. You can use substrings of GUIDs to generate unique folder names.
So GUID AB67E4534678E4E53436E becomes folder AB, which contains subfolder 67, and that folder contains folder E4534678E4E53436E.
Using 2 substrings of 2 hex characters makes it possible to generate 256 * 256 folders - more than enough to store 1 million files.
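A rough sketch of that layout (uuid4 used as the GUID source here; the base directory and file name are placeholders):
import os
import uuid

def result_path(base_dir):
    # Split a GUID into two 2-character levels plus a leaf folder, as described above.
    guid = uuid.uuid4().hex.upper()
    folder = os.path.join(base_dir, guid[:2], guid[2:4], guid[4:])
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, 'result.txt')

path = result_path('/tmp/results_tree')  # hypothetical base directory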
You could run one program that collects the outputs, for example over XML-RPC.
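A minimal sketch of such a collector using the standard library's xmlrpc modules (the host name, port, and submit function are placeholders):
# Collector process: receives (key, value) pairs from the workers.
from xmlrpc.server import SimpleXMLRPCServer

results = {}

def submit(key, value):
    results[key] = value
    return True

server = SimpleXMLRPCServer(('0.0.0.0', 8000), allow_none=True)
server.register_function(submit)
server.serve_forever()

# Worker side (separate process/machine):
#   from xmlrpc.client import ServerProxy
#   ServerProxy('http://collector-host:8000').submit('input1_input2', val)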
