Parsing through data with variable values

I'm running into a confusing issue with a current project. Basically, I am analyzing a MIDI file byte by byte (yes, I have to do it this way). MIDI files are structured such that the data for any given event in a track is conveyed via a series of bytes called a 'message'. A MIDI device knows what type of message based on a status code, which is 1-3 bytes within the 'message' that have specific IDs for specific functions.
I need to be able to look through a MIDI file's data and record each 'message' as a self-contained object so that I can put them in sequence for later use. The problem I run into is that as I progress through the data, I need an efficient way to read each status code and match it with the function I've written that processes the data for that type of message. What I'm trying to do now is this:
First, at any given point in the MIDI file, I look at the next 3 bytes and compare them with my list of all possible 3-byte MIDI status codes. Here's my first problem. For every 3-byte status code (equivalent to 6 hexadecimal digits, the encoding often used in MIDI editing), the 5th-8th bits (that is, the second hexadecimal digit) specify the MIDI channel that the current message affects, which is variable and unpredictable. Since I can't predict the channel, I need to treat it as a wildcard value. In other words, I need to take the 6 digits and ignore the 2nd one when I check them against my list of codes. I don't think I explained that well, so here's an example:
6 hex-digit code inside the file: Bn7800 where n is the channel number
List of 6 digit codes: ['Bn7800','Bn7900','FF2001','FF2F00',...]
I need to search through that list for the correct code while ignoring the 2nd digit since I have no way of telling what it is ahead of time.
My other issue is that once I have that code, I need to call the specific function that I've written that handles that particular status code. I've named all of the functions after the status code they correspond to, but the problem is, I don't know how to take the string that I find in the list above and use it to call a function. For instance:
I need: Bn7800
In the list, I find it! So I store it as a new variable called stat_code = 'Bn7800'.
But I can't then call my function in this way -> self.stat_code(data)
because stat_code isn't the name of the function I want. I want the function that's named after the value of stat_code.
I'm not sure how I can go about calling a function based on the value of a stored variable.
Sorry if this is confusing, I find it difficult to explain fully without trying to explain the whole MIDI file structure. Please ask me questions if something is unclear. Thanks!
EDIT: Here is some of the code:
self.event_status_codes = {'ff0002': ff_00_02,'ff01': ff_01,'ff02': ff_02,'ff03': ff03,'ff04': ff_04,'ff05': ff_05,'ff06': ff_06,'ff07': ff_07,'ff2001': ff_20_01,'ff2f00': ff_2f_00,'ff5103': ff_51_03,'ff5405': ff_54_05,'ff5804': ff_58_04,'ff5902': ff_59_02,'ff7f': ff_7f,'8n': _8n,'9n': _9n,'an': an,'bn': bn,'cn': cn,'dn': dn,'en': en,'bn7800': bn_78_00,'bn7900': bn_79_00,'bn7a': bn_7a,'bn7b00': bn_7b_00,'bn7c00': bn_7c_00,'bn7d00': bn7d00,'bn7e': bn_7e,'bn7f00': bn_7f_00,'f0': f0,'f7': f7}
if track_data[hbc:hbc+3] in self.event_status_codes:
    status_code = track_data[hbc:hbc+3]
    self.status_code(track_data)
For reference, track_data is the raw hexadecimal sequence that I'm parsing through and hbc is a simple pointer that iterates through the file one hex digit at a time. I posted the whole status code dictionary, but right now am mainly concerned about the 3 digit ones because they typically specify channel number. Also, it's technically a Python dictionary because I had to name each corresponding function something slightly different than the actual code to make it readable and functional, so you can ignore the values of each key in the dictionary.

I'm not sure how I can go about calling a function based on the value of a stored variable.
This is doable by referring to the function by name. In your question, you imply that you're calling the function on self, which makes it easier: you might be able to get the attribute containing the function pointer by using the getattr() function, and then call it:
stat_code = 'B67800'
func = getattr(self, f'function_name_{stat_code}')
value = func()
# do stuff with value
This also works if your function is part of another module: just replace self in that example with the module object you'd call it from.
If you need to call a function defined at the top level of the current module, you might need to index into globals() (or locals(), if it is defined in a local scope):
func = globals()[f'function_name_{stat_code}']
value = func()
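For the wildcard part of the question, one possible approach (a sketch, not the only way) is to normalize the code before the lookup: replace the channel digit with a literal 'n', then index a dictionary that maps normalized codes directly to handler functions, so no name-based lookup is needed at all. The handler names, their return values, and the helper normalize_status below are hypothetical; the sketch assumes the codes are hex strings and that only status bytes 0x80-0xEF carry a channel in their second digit.

```python
def normalize_status(code):
    """Return (lookup_key, channel) for a hex status code string.

    Assumption: status bytes 0x80-0xEF hold the channel in their second
    hex digit, while 0xF0-0xFF (system and meta messages) have no channel.
    """
    code = code.lower()
    if code[0] in "89abcde":
        # Swap the channel digit for the literal 'n' used in the lookup table.
        return code[0] + "n" + code[2:], int(code[1], 16)
    return code, None


# Hypothetical handlers; real ones would parse the bytes that follow.
def handle_bn7800(channel, data):
    return ("all sound off", channel)


def handle_ff2f00(channel, data):
    return ("end of track", channel)


# Map normalized codes straight to function objects.
HANDLERS = {
    "bn7800": handle_bn7800,
    "ff2f00": handle_ff2f00,
}


def dispatch(code, data=None):
    """Look up the handler for `code`, ignoring the channel digit, and call it."""
    key, channel = normalize_status(code)
    handler = HANDLERS.get(key)
    if handler is None:
        raise KeyError(f"no handler registered for status code {code!r}")
    return handler(channel, data)


result = dispatch("B67800")  # channel 6 is extracted, key becomes 'bn7800'
```

Because the dictionary values are the functions themselves, the getattr() step is only needed if the handlers must stay as methods looked up by name.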

Related

how to open a file closest to the file specified python

I am using Python and I want to run a program that will open the file specified by the user. The problem is that if the user doesn't specify the exact file name, it will give an error. If the user wants to open "99999-file-name.mp3"
and he has typed "filename.mp3", then how can the program open the file closest to the one specified?

First get a list of files in the particular folder
Then use difflib.get_close_matches like so:
difflib.get_close_matches(user_specified_file, list_of_files)
to find "good" matches.
N.B.: Consider passing a small cutoff, e.g. 0.1, as suggested by @tobias_k, so that you almost always get a match; with the default cutoff of 0.6, sometimes nothing will count as a "good" match for what the user entered.
Similarly, if you need only one file name, also pass the optional parameter n=1 to get the closest match; if you don't specify it, you will get up to the 3 best matches.
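Putting those two parameters together, a minimal sketch might look like the following. The helper name closest_file and the sample file list are made up for illustration; in a real program the list would come from something like os.listdir() on the folder in question.

```python
import difflib


def closest_file(requested, candidates, cutoff=0.1):
    """Return the candidate most similar to `requested`, or None if none clears the cutoff."""
    matches = difflib.get_close_matches(requested, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical folder contents.
files = ["99999-file-name.mp3", "notes.txt", "cover.jpg"]
best = closest_file("filename.mp3", files)
print(best)
```

Returning None instead of raising leaves it to the caller to decide what to do when nothing is similar enough.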

To answer this question, you first need to define "closest", because in computing this can mean very different things. If you want to compare strings and find the most similar one, a good way of doing that is checking the edit distance. There are Python libraries out there for that, e.g. https://pypi.python.org/pypi/editdistance.
You give it two strings and it tells you how much you have to change one string to get the other. As per the documentation:
>>> import editdistance
>>> editdistance.eval('banana', 'bahama')
2L
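If installing a package is not an option, the same quantity can be computed with plain Python. The function below is a textbook dynamic-programming Levenshtein distance, written here as an illustration rather than taken from the editdistance library; the name levenshtein is an assumption. (The 2L above is Python 2's long-integer notation; Python 3 prints 2.)

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


distance = levenshtein("banana", "bahama")
```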
PS. Can't help but mention that I think this is a bad idea. If you want to do something with the opened file and the program starts opening random files, then either you're eventually gonna overwrite a file that is not meant to be overwritten, or you'll try to process a file that can't be processed in your intended way. I would recommend using a file select box, which you can easily get with tkinter for example (even though tkinter is cancer).

python parsing file into data structure

So I started looking into it, and I haven't found a good way to parse a file following the format I will show you below. I have taken a data structures course, but it doesn't really help me with what I want to do. Any help will be greatly appreciated!
Goal: Create a tool that can read, create, and manipulate a custom file type
File Format: I'm sure there is a name for this type of format, but I couldn't find it. Anyways, the format is subject to some change since the variable names can be added, removed, or changed. Also, after each variable name the data could be one of several different types. Right now the files do not use sub groups, but I want to be prepared in case they decide to change that. The only things I can think of that will remain constant are the GROUP = groupName, END_GROUP = groupName, and the varName = data.
GROUP = myGroup
name1 = String, datenum, number, list, array
name2 = String, datenum, number, list, array
// . . .
name# = String, datenum, number, list, array
GROUP = mySubGroup
name1 = String, datenum, number, list, array
END_GROUP = mySubGroup
// More names could go here
END_GROUP = myGroup
GROUP = myGroup2
// etc.
END_GROUP = myGroup2
Strings and dates are enclosed in " (ie "myString")
Numbers are written as a raw ascii encoded number. They also use the E format if they are large or small (ie 5.023E-6)
Lists are comma separated and enclosed in parentheses (ie (1,2,3,4) )
Additional Info:
I want to be able to easily read a file and manipulate it as needed. For example, if I read the file and I want to change an attribute of a specific variable within a group I should be able to do something along the lines of dataStructure.groupName.varName = newData.
It should be easy to create my own file (using a default template that I will make myself or a custom template that has been passed in).
I want it to treat numbers as numbers and not strings. I should be able to add, subtract, multiply, etc. values within the data structure that are numbers
The big kicker, I'd like to have this written in vanilla python since our systems have only the most basic modules. It is a huge pain for someone to download another module since they have to create their own virtual environment and import the module to it. This tool should be as system independent as possible
Initial Attempt: I was thinking of using a dictionary to organize the data in levels. I do, however, like the idea of using dot structures (like what one would see using MATLAB structures). I wrote a function that will read all the lines of the file and remove the newline characters from each line. From there I want to check for every GROUP = I can find. I would start adding data to that group until I hit an END_GROUP line. Using regular expressions I should be able to parse out the line to determine whether it is a date, number, string, etc.
I am asking this question because I hope to have some insight on things I may be missing. I'd like for this tool to be used long after I've left the dev team which is why I'm trying to do my best to make it as intuitive and easy to use as possible. Thank you all for your help, I really appreciate it! Let me know if you need any more information to help you help me.
EDIT: To clarify what help I need, here are my two main questions I am hoping to answer:
How should I build a data structure to hold grouped data?
Is there an accepted algorithm for parsing data like this?
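One way the approach from the initial attempt could be sketched with the standard library alone is a stack of open groups, each stored as a nested dictionary. The names parse_value and parse_groups are hypothetical, and the sketch makes several assumptions not stated above: one name = value pair per line, lines starting with // are comments, dates are kept as plain strings, and list items never contain commas themselves.

```python
def parse_value(text):
    """Convert raw text into a str, int, float, or list; unknown forms stay as text."""
    text = text.strip()
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1]  # quoted string (dates are kept as strings here)
    if text.startswith("(") and text.endswith(")"):
        inner = text[1:-1].strip()
        return [parse_value(item) for item in inner.split(",")] if inner else []
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)  # also handles E notation such as 5.023E-6
    except ValueError:
        return text


def parse_groups(lines):
    """Parse GROUP / END_GROUP blocks into nested dictionaries."""
    root = {}
    stack = [("<root>", root)]  # (group name, dict) for every open group
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("//"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'name = value', got {line!r}")
        name, value = name.strip(), value.strip()
        if name == "GROUP":
            group = {}
            stack[-1][1][value] = group  # attach to the enclosing group
            stack.append((value, group))
        elif name == "END_GROUP":
            open_name, _ = stack.pop()
            if open_name != value:
                raise ValueError(f"END_GROUP {value!r} does not close {open_name!r}")
        else:
            stack[-1][1][name] = parse_value(value)
    if len(stack) != 1:
        raise ValueError(f"group {stack[-1][0]!r} was never closed")
    return root


sample = """GROUP = myGroup
name1 = "hello"
name2 = 5.023E-6
GROUP = mySubGroup
name3 = (1,2,3,4)
END_GROUP = mySubGroup
END_GROUP = myGroup"""
parsed = parse_groups(sample.splitlines())
```

Nested dictionaries keep the numbers as numbers, so parsed["myGroup"]["name2"] * 2 works directly; dot-style access could be layered on top later, for example with a small wrapper class.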

use string content as variable - python

I am using a package that has operations inside a class (? not sure what either is really), and normally the data is accessed this way: data[package.operation]. Since I have to do multiple operations, I thought of shortening it by doing the following
list = ["o1", "o2", "o3", "o4", "o5", "o6"]
for i in list:
    print(data[package.i])
but since it's treating i as a string it doesn't do the operation, and if I take away the quotes then it is an undefined variable. Is there a way around this? Or will I just have to write it the long way?
In particular I am using pymatgen, its package Orbital and with the .operation I want to call specific suborbitals. A real example of how it would be used is data[0][Orbital.s], the first [0] denotes the element in question for which to get the orbitals s (that's why I omitted it in the code above).

You can use getattr to dynamically select attributes from objects (Orbital in your case; for example getattr(Orbital, 's')).
So your loop would be rewritten to:
for op in ['o1', 'o2', 'o3', 'o4', 'o5', 'o6']:
    print(data[getattr(package, op)])
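To try this without pymatgen installed, here is a self-contained sketch that uses a small stand-in Enum in place of the real Orbital. The member names and the numbers in data are made up for illustration and are not taken from pymatgen.

```python
from enum import Enum


class Orbital(Enum):
    """Stand-in for the real class; members and values are illustrative only."""
    s = 0
    py = 1
    pz = 2
    px = 3


# Hypothetical per-orbital data keyed by enum member.
data = {Orbital.s: 0.12, Orbital.py: 0.03, Orbital.pz: 0.05, Orbital.px: 0.04}

names = ["s", "py", "pz"]
# getattr turns each string into the attribute of that name on Orbital.
values = [data[getattr(Orbital, name)] for name in names]
print(values)
```

For a standard Enum, Orbital["s"] performs the same lookup by member name as getattr(Orbital, "s").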

Python - wrap same object to make it unique

I have a dictionary that is built while iterating through objects. The same object can be visited multiple times, and I'm using the object itself as the key.
So if the same object is visited more than once, the key is no longer unique and my dictionary is no longer correct.
I still need to access it by object, though, because later on, if someone wants to access the contents, they can request them by the current object. That lookup will be correct, because it accesses the last active object at that time.
So I'm wondering whether it is possible to wrap the object somehow, so that it keeps its state and all its attributes, with the only difference being that this new kind of object is actually unique.
For example:
dct = {}
for obj in some_objects_lst:
    # Well this kind of wraps it, but it loses state, so if I would
    # instantiate I would lose all information that was in that obj.
    wrapped = type('Wrapped', (type(obj),), {})
    dct[wrapped] = ...  # add some content
Now if there are some better alternatives than this, I would like to hear it too.
P.S. objects being iterated would be in different context, so even if object is the same, it would be treated differently.
Update
As requested, to give better example where the problem comes from:
I have this excel reports generator module. Using it, you can generate various excel reports. For that you need to write configuration using python dictionary.
Before a report is generated, the module must do two things: first, get the metadata (here, the position each cell will have when the report is created), and second, parse the configuration to fill the cells with content.
One of the value types that can be used in this module is a formula (an Excel formula). The problem in my question is specifically with one of the ways a formula can be computed: formula values that are retrieved for a parent from cells that belong to its childs.
For example imagine this excel file structure:
  | A            | B       | C
  | Total Childs | Name    | Amount
1 | sum(childs)  |         |
2 |              | child_1 | 10
3 |              | child_2 | 20
4 | sum(childs)  |         |
...
In this example, the sum in cell A1 would need to be 10+20=30 if sum used an expression to add up its childs' column (in this case column C). All of this works until the same object (I call these iterables) is repeated. When building the metadata, I need to store it so I can retrieve it later, and the key is the object being iterated itself. So when that object is iterated again while parsing values, it will not see all the information, because some of it will have been overwritten by the same object.
For example imagine there are invoice objects, then there are partner objects which are related with invoices and there are some other arbitrary objects that given invoice and partner produce specific amounts.
So when extracting such information in excel, it goes like this:
invoice1 -> partner1 -> amount_obj1, amount_obj2
invoice2 -> partner1 -> amount_obj3, amount_obj4
Notice that the partner in this example is the same. This is the problem: I can't store it as a key, because when parsing values I will iterate over this object twice, while the metadata will only hold the values for amount_obj3 and amount_obj4.
P.S Don't know if I explained it better, cause there is lots of code and I don't want to put huge walls of code here.
Update2
I'll try to explain this problem from more abstract angle, because it seems being too specific just confuses everyone even more.
So given objects list and empty dictionary, dictionary is built by iterating over objects. Objects act as a key in dictionary. It contains metadata used later on.
Now same list can be iterated again for different purpose. When its done, it needs to access that dictionary values using iterated object (same objects that are keys in that dictionary). But the problem is, if same object was used more than once, it will have only latest stored value for that key.
It means object is not unique key here. But the problem is the only thing I know is the object (when I need to retrieve the value). But because it is same iteration, specific index of iteration will be the same when accessing same object both times.
So uniqueness I guess then is (index, object).

I'm not sure I understand your problem, so here are two options. If it's the object's content that matters, keep object copies as keys. Something crude like
import copy
new_obj = copy.deepcopy(obj)
dct[new_obj] = whatever_you_need_to_store(new_obj)
If the object doesn't change between the first time it's checked by your code and the next, the operation is just performed the second time with no effect. Not optimal, but probably not a big problem. If it does change, though, you get separate records for old and new ones. To save memory you will probably want to replace the copies with hashes, a __str__() method that writes out the object data, or whatever. But that depends on what your object is; maybe hashing will take too much time for minuscule savings in memory. Run some tests and see what works.
If, on the other hand, it's important to keep the same value for the same object, whether the data within it have changed or not (say, object is a user session that can change its data between login and logoff), use object ids. Not the builtin id() function, because if the object gets GCed or deleted, some other object may get its id. Define an id attribute for your objects and make sure different objects cannot possibly get the same one.
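The (index, object) idea from the question could be sketched roughly as follows. The Partner class, the rows list, and the stored strings are hypothetical stand-ins for the real iterables and metadata; the sketch assumes both passes walk the same sequence in the same order, so the index is stable.

```python
class Partner:
    """Hypothetical stand-in for an object that appears more than once."""

    def __init__(self, name):
        self.name = name


partner1 = Partner("partner1")
# The same partner object shows up under two different invoices.
rows = [("invoice1", partner1), ("invoice2", partner1)]

# First pass: store metadata under (position, object) instead of the object alone.
metadata = {}
for index, (invoice, partner) in enumerate(rows):
    metadata[(index, partner)] = f"amounts for {invoice}"

# Second pass: the same iteration order rebuilds the same keys.
looked_up = [metadata[(index, partner)] for index, (_, partner) in enumerate(rows)]
print(looked_up)
```

Since the dictionary keys hold references to the objects, the objects stay alive for as long as the dictionary does, which sidesteps the id() reuse concern mentioned above.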

How do I compare the contents of two Google Protocol Buffer messages for equality?

I can't seem to find a comparison method in the API. I have these two messages, and they have a lot of different values that sometimes drill down to more values (for example, I have a Message that has a string, an int, and a custom_snapshot, where custom_snapshot is comprised of an int, a string, and so on). I want to see if these two messages are the same. I don't want to compare each value one by one since that will take a while, so I was wondering if there was a quick way to do this in Python?
I tried doing messageA.debugString() == messageB.debugString(), but apparently there is no debugString method that I could access when I tried.

Protocol buffer messages have a method SerializeToString(deterministic=True).
Use it to compare your messages.

google.protobuf.text_format.MessageToString converts the proto message to its text format, so it's probably easier to inspect any diff (if any) than the binary string produced by SerializeToString. It also has many options, e.g. to ignore unknown fields.

You can compare two proto objects using the equals method (this is the Java API; in Python, the == operator compares message contents in the same way).
For example:
Object1.equals(Object2)
It checks whether the content of Object1 equals the content of Object2. If you use an enum in any of the protos, you should keep the sequence the same; otherwise the comparison returns false because the sequences do not match.
