How to create a dependency tree for Python functions

I'm writing some code using Python/NumPy that I will be using for data analysis of some experimental data sets. Certain steps of these analysis routines can take a while. It is not practical to rerun every step of the analysis every time (i.e., when debugging), so it makes sense to save the output from these steps to a file and reuse it if it's already available.
The data I ultimately want to obtain can be derived from various steps along this analysis process. For example, A can be used to calculate B and C, D can be calculated from B, E can then be calculated using C and D, and so on.
The catch here is that it's not uncommon to make it through a few (or many) datasets only to find that there's some tiny gotcha in the code somewhere that requires part of the tree to be recalculated. For example, I discover a bug in B, so anything that depends on B also needs to be recalculated because it was derived from incorrect data.
The end goal here is to basically protect myself from having sets of data that I forget to reprocess when bugs are found. In other words, I want to be confident that all of my data is calculated using the newest code.
Is there a way to implement this in Python? I have no specific form this solution needs to take, so long as it's extensible as I add new steps. I'm also okay with the "recalculation step" only being performed when the dependent quantity is recalculated (rather than at the time one of the parents is changed).
My first thought of how this might be done is to embed information in a header of each saved file (A, B, C, etc) indicating what version of each module it was created with. Then, when loading the saved data the code can check if the version in the file matches the current version of the parent module. (Some sort of parent.getData() which checks if the data has been calculated for that dataset and if it's up to date)
At first glance, though, I can see that this could have problems when the change happens several steps up the dependency chain, because a derived file may still be up to date with its own module even though its parents are out of date. I suppose I could add some sort of parent.checkIfUpToDate() that checks its own files and then asks each of its parents whether they're up to date (which then ask their parents, and so on), and updates the data if not. The version number could just be a static string stored in each module.
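A minimal sketch of what that recursive check could look like, assuming a hypothetical Step class (nothing here is an existing library; each step stores the version string of the module that produced it and knows its parent steps):

class Step:
    def __init__(self, cache_path, current_version, parents=()):
        self.cache_path = cache_path              # file holding this step's saved output
        self.current_version = current_version    # static version string from the module
        self.parents = list(parents)

    def saved_version(self):
        # Read only the first (header) line of the cached file, if it exists.
        try:
            with open(self.cache_path) as f:
                return f.readline().strip()
        except FileNotFoundError:
            return None

    def is_up_to_date(self):
        if self.saved_version() != self.current_version:
            return False
        # A step is stale if any of its ancestors are stale.
        return all(parent.is_up_to_date() for parent in self.parents)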
My concern with that approach is that it might mean reading potentially large files from disk just to get a version number. If I went with the "file header" approach, does Python actually load the whole file into memory when I do an open(myFile), or can I open it, read just the header lines, and close the file without loading the whole thing into memory?
Lastly, is there a good way to embed this type of information beyond just having the first line of the file be some variation of "# MyFile made with MyModule V x.y.z" and writing some code to parse that line?
I'm kind of curious if this approach makes sense, or if I'm reinventing the wheel and there's already something out there to do this.
Edit: something else occurred to me after I submitted this. Does Python have any mechanism to define templates that modules must follow, just as a means of keeping the format of the data-reading steps consistent from module to module?
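(For reference on the edit above: Python does not enforce templates on modules themselves, but if each data-reading step is wrapped in a class, the standard abc module can pin down a consistent interface. A sketch, with illustrative method names:)

import abc

class AnalysisStep(abc.ABC):
    @abc.abstractmethod
    def load(self, path):
        """Read previously saved results for this step."""

    @abc.abstractmethod
    def compute(self, *inputs):
        """Recalculate this step's data from its parents' outputs."""

# Any subclass that forgets to implement load() or compute() cannot be instantiated.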

I cannot answer all of your questions, but you can read only a small part of the data from a large file, as you can see here:
How to read specific part of large file in Python
I do not see why you would need a parent.checkIfUpToDate() function. You could just as well store the version numbers of the parent functions in the file itself.
To me your approach sounds reasonable, although I have never done anything similar. Alternatively, you could create an additional file that holds this information, but storing it in the actual data file should prevent version mismatches between your "data file" and a separate "function version file".
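On the open() question above: open() by itself does not pull the file contents into memory, and readline() reads only the first line, so a one-line version header is cheap to check. A minimal sketch of that approach (the header format is just the one suggested in the question):

VERSION = "x.y.z"   # static version string kept in the module

def save_with_header(path, lines):
    with open(path, "w") as f:
        f.write("# MyFile made with MyModule V %s\n" % VERSION)
        f.writelines(lines)

def read_header_version(path):
    # Only the header line is read; the rest of the file stays on disk.
    with open(path) as f:
        header = f.readline()
    return header.rstrip().rsplit(None, 1)[-1]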

Related

Python - Change and update the same header files from two different projects

I am performing data analysis. I want to segment the steps of data analysis into different projects, as the analysis will be performed in the same order, but not usually all at the same time. There is so much code and data cleaning that keeping all of this in the same project may get confusing.
However, I have been keeping the header files that track the columns of information in the data consistent. It is possible I will change the sequence of these headers at some point and want to rerun all of the code. I also want to make sure that the header used remains the same so I don't erroneously analyze one piece of data instead of another. I use headers so that if the column order changes at any time, I access the data by index based on the header that matches the data, rather than changing every occurrence of a particular column number throughout my code.
To accomplish this, I would like to track multiple projects that access the SAME header files, and update and alter the header files without having to edit them in each project individually.
Finally, I don't want to just store it somewhere on my computer and not track it, because I work from two different work stations.
Any good solutions or best practices for what I want to do? Have I made an error somewhere in my project set-up? I am mostly self-taught and have developed my own project organization and sequence of data analysis based on my own ideas and research as I go, so if I've adopted some terribly bad practice, it would be great to know.
I've found a possible solution that uses independent branches in the same repo for tracking two separate projects, but I'm not convinced this is the best solution either.
Thanks!

How to write a line or block in the middle of bgzf

I'd like to reference the following post as well, and mention that I'm familiar with BioPython.
How to obtain random access of a gzip compressed file
I'm familiar with Bio.bgzf's support for indexing and random reads. I'm building a library that uses the module to build an index of the blocks containing data that is relevant to my interests. The technology is very interesting, but I'm struggling to understand the pace of development and the limitations of what Bio.bgzf, or even the BGZF format itself, is capable of.
Can Bio.bgzf overwrite a specific line in the file, just as it can read from a virtual offset to the end of the line? If it could, would the new data necessarily need to be exactly the same size?
After using make_virtual_offset() to acquire a position in the .bgzf file for a line that I'd like to overwrite, I'm looking for a method like filehandle.writeline() to replace that line in the block with some new text. If that's not possible, is it possible to get the coordinates of the entire block and rewrite that? And if not, would it be fair to say that bgzf index files are suitable for reading only? Is this correct?
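For context, a sketch of the read side being described, using Bio.bgzf (the file name and offsets are placeholders; the in-place overwrite is exactly the part in question and is not shown):

from Bio import bgzf

# Combine a block's start offset in the compressed file with an offset
# inside the decompressed block into a single virtual offset.
virtual_offset = bgzf.make_virtual_offset(0, 0)

handle = bgzf.BgzfReader("example.bgzf")
handle.seek(virtual_offset)      # jump straight to the indexed position
line = handle.readline()         # read just that line
next_offset = handle.tell()      # virtual offset of the following line
handle.close()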

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a Python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew every time, because they will not change. I mainly want to use this for debugging.
It happens too often that I run a script with a bug in it but only see the error message after several minutes, because reading the files took so long.
Are there any tricks to do something like this?
(If it is feasible, I could create smaller test files.)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
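A sketch of that idea: keep one interpreter session alive so the large data stays in memory, and reload only the module you are editing (module and function names are placeholders):

import importlib

import my_analysis                                   # the module being edited

data = my_analysis.read_big_files("huge_input.dat")  # slow, done only once

# ... edit my_analysis.py, then, without restarting the interpreter:
importlib.reload(my_analysis)
result = my_analysis.process(data)                   # rerun only the cheap part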
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to reproduce the bug you are observing? Most probably only a small part of your input file is relevant.
Secondly, are these particular files required, or will the problem show up on any large amount of data? If it shows up only on particular files, then once again it is most probably related to some feature of those files and will also show up on a smaller file with the same feature. If the main reason is simply the large amount of data, you might be able to avoid reading it at all by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the file? Is it just hard-drive performance, or do you do some heavy processing of the data in your script before actually getting to the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load the processed data instead of redoing the processing every time.
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
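If the bottleneck is the processing rather than the disk, here is a minimal caching sketch of the "process once, save, reload" idea above (expensive_parse and the file names are placeholders):

import os
import pickle

def load_processed(raw_path, cache_path="processed.pkl"):
    # Reuse the cached result if it is newer than the raw input file.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(raw_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = expensive_parse(raw_path)   # placeholder for the slow processing step
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data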

Optimizing a Mass ID3 Tag Scan [duplicate]

This question already has an answer here:
Optimizing Python Code Using SQLite3 + Mutagen
I'm building a small tool that I want to scan over a music collection, read the ID3 info of a track, and store it as long as that particular artist does not have a song that has been accessed more than twice. I'm planning on using Mutagen for reading the tags.
However, the music collections of myself and many others are massive, exceeding 20,000 songs. As far as I know, libraries like Mutagen have to open and close every song to get the ID3 info from it. While MP3s aren't terribly performance-heavy, that's a lot of songs. I'm already planning a minor optimization in the form of keeping a count of each artist and not storing any info if their song count exceeds 2, but as far as I can tell I still need to open every song to check the artist ID3 tag.
I toyed with the idea of using directories as a hint for the artist name and not reading any more info in that directory once the artist song count exceeds 2, but not everyone has their music set up in neat Artist/Album/Songs directories.
Does anyone have any other optimizations in mind that might cut down on the overhead of opening so many MP3s?
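For reference, a sketch of the naive scan with Mutagen's EasyID3 (the music directory is a placeholder); any optimization has to beat this baseline:

import os
from mutagen.easyid3 import EasyID3

artist_counts = {}
for root, dirs, files in os.walk("/path/to/music"):
    for name in files:
        if not name.lower().endswith(".mp3"):
            continue
        try:
            tags = EasyID3(os.path.join(root, name))
        except Exception:
            continue                                  # unreadable or untagged file
        artist = tags.get("artist", ["Unknown"])[0]
        artist_counts[artist] = artist_counts.get(artist, 0) + 1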
Beware of premature optimization. Are you really sure that this will be a performance problem? What are your requirements -- how quickly does the script need to run? How fast does it run with the naïve approach? Profile and evaluate before you optimize. I think there's a serious possibility that you're seeing a performance problem where none actually exists.
You can't avoid visiting each file once if you want a guaranteed correct answer. As you've seen, optimizations that entirely skip files will basically amount to automated guesswork.
Can you keep a record of previous scans you've done, and on a subsequent scan use the last-modified dates of the files to avoid re-scanning files you've already scanned once? This could mean that your first scan might take a little bit of time, but subsequent scans would be faster.
If you need to do a lot of complex queries on a music collection quickly, consider importing the metadata of the entire collection into a database (for instance SQLite or MySQL). Importing will take time -- updating to insert new files will take a little bit of time (checking the last-modified dates as above). Once the data is in your database, however, everything should be fairly snappy assuming that the database is set up sensibly.
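A sketch of that last-modified-date idea with the standard sqlite3 module (table and column names are just examples):

import os
import sqlite3

db = sqlite3.connect("scan_cache.db")
db.execute("""CREATE TABLE IF NOT EXISTS tracks
              (path TEXT PRIMARY KEY, mtime REAL, artist TEXT, title TEXT)""")

def needs_rescan(path):
    # Rescan only if the file is new or has changed since it was last recorded.
    row = db.execute("SELECT mtime FROM tracks WHERE path = ?", (path,)).fetchone()
    return row is None or row[0] < os.path.getmtime(path)

def remember(path, artist, title):
    db.execute("INSERT OR REPLACE INTO tracks VALUES (?, ?, ?, ?)",
               (path, os.path.getmtime(path), artist, title))
    db.commit()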
In general, for this I would recommend using multiple ways of detecting the artist and track title:
First way to check: is the filename in ARTIST-TITLE.mp3 format (or similar)?
(A filename like that would be "Artist-Track.mp3".)
import os
import re

for file in os.listdir(PATH_TO_MP3s):
    parts = re.split(r"[_\-.]", file)  # "Artist-Track.mp3" -> ["Artist", "Track", "mp3"]
    artist = parts[-3]
    track = parts[-2]
    filetype = parts[-1]
Of course, you first have to make sure the file actually is in that format.
The second step (if the first doesn't fit for that file) would be to check whether the directory names fit (like you said).
The third and last step would be to check the ID3 tags.
But make sure the values look right before trusting them.
For example, if someone used "Track-Artist.mp3", the code I provided would swap artist and track.

How does git fetch the commits associated with a file?

I'm writing a simple parser of .git/* files. I've covered almost everything: objects, refs, pack files, etc. But I have a problem. Let's say I have a big 300 MB repository (in a pack file) and I want to find all the commits that changed the file /some/deep/inside/file. What I'm doing now is:
- fetching the last commit
- finding the file in it by:
  - fetching the parent tree
  - finding the subtree inside it
  - recursively repeating until I get to the file
- additionally, I check the hashes of each subfolder on my way to the file; if one of them is the same as in the commit before, I assume the file was not changed (because its parent dir didn't change)
- then I store the hash of the file and fetch the parent commit
- I find the file again and check whether its hash changed
- if yes, then the original commit (i.e. the one before the parent) changed the file
I repeat this over and over until I reach the very first commit.
This solution works, but it sucks. In the worst-case scenario, the first search can take up to 3 minutes (for a 300 MB pack).
Is there any way to speed it up? I've tried to avoid putting such large objects in memory, but right now I don't see any other way. And even then, the initial load into memory will take forever :(
Greets and thanks for any help!
That's the basic algorithm that git uses to track changes to a particular file. That's why "git log -- some/path/to/file.txt" is a comparatively slow operation compared to many other SCM systems, where it would be simple (e.g. in CVS, P4 et al., each repository file is a server-side file with its own history).
It shouldn't take so long to evaluate though: the amount you ever have to keep in memory is quite small. You already mentioned the main point: remember the tree IDs going down to the path to quickly eliminate commits that didn't even touch that subtree. It's rare for tree objects to be very big, just like directories on a filesystem (unsurprisingly).
Are you using the pack index? If you're not, then you essentially have to unpack the entire pack to find this out, since trees can sit at the end of a long delta chain. If you have an index, you'll still have to apply deltas to get your tree objects, but at least you should be able to find them quickly. Keep a cache of applied deltas, since it's very common for trees to reuse the same or similar bases: most tree-object changes just alter about 20 bytes relative to a previous tree object. So if, in order to get tree T1, you have to start with object T8 and apply Td7 to get T7, then T6, and so on, it's entirely likely that those other trees T2 through T8 will be referenced again.
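A sketch of the subtree-ID pruning described above; read_commit() and read_tree() stand in for the asker's own object parser and are not a real API (assume read_commit returns a dict with "tree" and "parent" hashes, read_tree returns {entry name: hash}, and merges are ignored for simplicity):

def subtree_hash(commit_hash, path_parts, read_commit, read_tree):
    # Follow the path components through nested trees; None if the path is absent.
    obj = read_commit(commit_hash)["tree"]
    for part in path_parts:
        obj = read_tree(obj).get(part)
        if obj is None:
            return None
    return obj

def commits_touching(path, head, read_commit, read_tree):
    # Yield every commit whose version of `path` differs from its parent's.
    parts = path.split("/")
    commit = head
    while commit is not None:
        parent = read_commit(commit).get("parent")
        # Cheap prune: identical root trees mean nothing changed in this commit.
        if parent is None or read_commit(commit)["tree"] != read_commit(parent)["tree"]:
            before = subtree_hash(parent, parts, read_commit, read_tree) if parent else None
            if subtree_hash(commit, parts, read_commit, read_tree) != before:
                yield commit
        commit = parent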
