How to use read_msgpack with a DataFrame child class

What I have is a class that inherits from DataFrame but overrides some behavior for business-logic reasons. All is well and good, but I need the ability to import and export these objects. msgpack appears to be a good choice, but doesn't actually work. (Using the standard msgpack library doesn't even work on regular DataFrames; the advice there is to use pandas' own msgpack functions.)
import pandas

class DataFrameWrap(pandas.DataFrame):
    pass

df = DataFrameWrap()
packed_df = df.to_msgpack()
pandas.read_msgpack(packed_df)
This results in the error
File "C:\Users\REDACTED\PROJECT_NAME\lib\site-packages\pandas\io\packers.py", line 627, in decode
return globals()[obj[u'klass']](BlockManager(blocks, axes))
KeyError: u'DataFrameWrap'
when it reaches the read_msgpack() line. This works if I replace the DataFrameWrap() with a regular DataFrame().
Is there a way to tell pandas where to find the DataFrameWrap class? From reading the code, it looks like if I could inject {"DataFrameWrap": DataFrameWrap} into the globals as seen from this file, it would work, but I'm not sure how to actually do that. There also might be a proper way to do this, but it's not obvious.

Figured it out. As usual, it was much less complicated than I assumed:
import pandas
from pandas.io import packers

class DataFrameWrap(pandas.DataFrame):
    pass

# make the subclass visible in the namespace that packers.decode() searches
packers.DataFrameWrap = DataFrameWrap

df = DataFrameWrap()
packed_df = df.to_msgpack()
pandas.read_msgpack(packed_df)


Why won't IntelliSense work with pandas pipe()?

In VS Code, it seems that IntelliSense is not able to infer the return type of calls to pandas.DataFrame.pipe. It is a source of some inconvenience, as I cannot rely on autocompletion after using pipe. But I haven't seen this issue mentioned anywhere, so it makes me wonder if it's just me or if I am missing something.
This is what I do:
import pandas as pd
df = pd.DataFrame({'A': [1,2,3]})
df2 = df.pipe(lambda x: x + 1)
VS Code recognizes df as a DataFrame, but has no clue what df2 might be.
A first thought would be that this is due to the lack of type hinting in the lambda function. But if I try this instead:
def add_one(df: pd.DataFrame) -> pd.DataFrame:
    return df + 1

df3 = df.pipe(add_one)
Still, IntelliSense can't guess the type of df3.
Of course as a last recourse I can add a hint to df3 itself:
df3: pd.DataFrame = df.pipe(add_one)
But it seems like it shouldn't be necessary. IntelliSense seems very capable of inferring return types in other complex scenarios, such as those involving map.
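For example, with the builtin map (reconstructed here, since the original post showed only a screenshot of this case):
nums = map(lambda x: x + 1, [1, 2, 3])  # Pylance infers nums as map[int]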
UPDATE:
I experimented a bit more and found some interesting patterns which narrow down the range of possible causes.
I am not sufficiently familiar with Pylance to really understand why this is happening, but here is what I find:
Finding 1
It is happening to pandas.core.common.pipe if I import it directly. (I know pd.DataFrame.pipe calls pandas.core.generic.pipe, but that internally calls pandas.core.common.pipe, and I can reproduce the issue in pandas.core.common.pipe.)
Finding 2
If I copy the definition of that same function from pandas.core.common, together with the relevant imports of Callable and TypeVar, and declare T as TypeVar('T'), IntelliSense actually does its magic.
(Actually in pandas.core.common, T is not defined as TypeVar('T') but imported from pandas._typing, where it is defined as TypeVar('T'). If I import it instead of defining it myself, it still works fine.)
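For reference, the locally copied version that works looks roughly like this (paraphrased rather than copied verbatim from pandas.core.common; the tuple-handling branch of the real function is omitted):
from typing import Callable, TypeVar

import pandas as pd

T = TypeVar("T")

def pipe(obj, func: Callable[..., T], *args, **kwargs) -> T:
    # simplified: the real pandas function also accepts a (func, target) tuple
    return func(obj, *args, **kwargs)

def add_one(df: pd.DataFrame) -> pd.DataFrame:
    return df + 1

df2 = pipe(pd.DataFrame({'A': [1, 2, 3]}), add_one)  # inferred as DataFrame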
From this I am tempted to conclude that pandas does everything right, but that Pylance is failing to keep track of type information for some unknown reason...
Finding 3
If I just copy pandas.core.common into a local file pandascommon.py and import pipe from that, it works fine too!
I reproduced this in VS Code as well and confirmed that the problem exists. I think it may be related to the return-type annotation of the pipe() method. I have submitted an issue on GitHub and hope to learn more.
I got it!
It was due to the stubs shipped with Pylance. Specifically in ~/.vscode/extensions/ms-python.vscode-pylance-2022.3.2/dist/bundled/stubs/pandas/.
For example in core/common.pyi I found this stub:
def pipe(obj, func, *args, **kwargs): ...
Pylance uses this instead of the annotations in pandas.core.common.pipe, causing the issue.
One heavy-handed solution is to just erase (or rename) the pandas stubs in that folder. Then pipe works again. On the other hand, it breaks some other things; for example, read_csv is no longer correctly inferred to return a DataFrame. I think the better long-run solution would be for the Pylance maintainers to improve those stubs...
A minimally invasive solution to the original pipe issue is to edit ~/.vscode/extensions/ms-python.vscode-pylance-2022.3.2/dist/bundled/stubs/pandas/core/frame.pyi in the following manner:
add the import from pandas._typing import T
replace the line starting with def pipe with:
def pipe(self, func: Callable[..., T], *args, **kwargs) -> T: ...

Using class as functions and "global" variables container: bad design?

I spent the last few months rewriting a new version of my Python algorithm from scratch. One of my goals was to write perfectly documented code, easy to read and understand for "anyone".
In the same project folder I put a lot of different modules, and each module contains a class. I use classes as containers for functions and related variables: a class contains all the functions with a specific task, for example writing all the output results of the algorithm to Excel files.
Here is an example:
Algorithm.py
import os
import pandas as pd
import numpy as np
from Observer import Observer

def main(hdf_path):
    for hdf_file in os.listdir(hdf_path):
        filename = hdf_file.replace('.hdf', '.xlsx')
        Observer.create_workbook(filename)
        dataframe = pd.read_hdf(os.path.join(hdf_path, hdf_file))
        years_array = dataframe.index.levels[0].values
        for year in years_array:
            year_mean = np.mean(dataframe.loc[year].values)
            Observer.mean_values = np.append(Observer.mean_values, year_mean)
        Observer.export_results()

if __name__ == "__main__":
    hdf_path = 'bla/bla/bla/'
    main(hdf_path)
Observer.py
import numpy as np
import openpyxl

class Observer:
    workbook = None
    workbookname = None
    mean_values = np.array([])

    def create_workbook(filename):
        Observer.workbook = openpyxl.Workbook()
        Observer.workbookname = filename
        # do other things

    def save_workbook():
        Observer.workbook.save('results_path' + Observer.workbookname)

    def export_results():
        # print Observer.mean_values values in different workbook cells
        # export results to a specific sheet
        pass
I hope you can understand from this simple example how I use classes in my project. For every class I define a lot of variables (workbook for example) and I call them from other modules as if they were global variables. That way I can easily access them from anywhere, and I don't need to pass them to functions explicitly, because I can simply write Classname.varname.
My question is: is it bad design? Will it create some problems or performance slowdown?
Thanks for your help.
My question is: is it bad design?
Yes.
I can simply write Classname.varname.
You are creating very strong coupling between classes when you enforce calling Classname.varname. The class that accesses this variable is now strongly coupled with Classname. This prevents you from changing the behavior in an OOP way by passing different parameters, and it will complicate testing of the class - since you will be unable to mock Classname and use the mock instead of the "real" class.
This will result in code duplication when you try to run 2 pieces of very similar code in two parts of your app, which only vary in these parameters. You will end up creating two almost identical classes, one using Workbook and the other using Notepad classes.
And remember the vicious cycle:
Hard to test code -> Fear of refactor -> Sloppy code
        ^                                    |
        |                                    |
        --------------------------------------
Using proper objects, with the ability to mock them (and dependency injection), is going to guarantee your code is easily testable, and the rest will follow.
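To make this concrete, here is a minimal sketch (one possible design among many, with illustrative names) of the Observer example rewritten for dependency injection, holding state on an instance that is passed in explicitly:
import numpy as np
import openpyxl

class Observer:
    def __init__(self, results_path):
        self.results_path = results_path
        self.workbook = None
        self.workbookname = None
        self.mean_values = np.array([])

    def create_workbook(self, filename):
        self.workbook = openpyxl.Workbook()
        self.workbookname = filename

    def save_workbook(self):
        self.workbook.save(self.results_path + self.workbookname)

def main(hdf_path, observer):
    # the Observer arrives as a parameter, so a test can pass a mock
    # object exposing the same interface instead of the real one
    ...

main('bla/bla/bla/', Observer('results_path/'))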

How to get complete list of arguments for function from module in python

I am using IPython in the command prompt on Windows 7.
I thought this would be easy to find. I searched and found directions on how to use the inspect module, but it seems like inspect is meant for functions created by the programmer rather than functions that are part of a package.
My main goal is to be able to use the help files from within the IPython command prompt, to look up a function such as csv.reader() and figure out all the possible arguments for it AND all possible values for these arguments.
In R programming this would simply be args(csv.reader())
I have tried googling this, but everything points me to the inspect module; perhaps I'm misunderstanding its use?
For example,
If I wanted to see a list of all possible arguments and the corresponding possible values for these arguments for the csv.reader() function (from the import csv package), how would I go about doing that?
I've tried doing help(csv.reader), but this doesn't provide me with a list of all possible arguments and their potential values. 'Dialect' shows up, but it doesn't tell me the possible values of the dialect argument of the csv.reader function.
I can easily go to the site https://docs.python.org/3/library/csv.html#csv-fmt-params and see that the dialect options are delimiter, doublequote, escapechar, etc., but is there a way to see this in the Python console?
I've also tried dir(csv.reader) but this isn't what I was looking for either.
Going bald trying to figure this out....
There is no way to do this generically; help(<function>) will at a minimum return the function signature (including the argument names). Python is dynamically typed, so you don't get any types, and the arguments by themselves don't tell you what the valid values are. This is where a good docstring comes in.
However, the csv module does have a specific function for listing the dialects:
>>> csv.list_dialects()
['excel', 'excel-tab', 'unix']
>>> help(csv.excel)
Help on class excel in module csv:
class excel(Dialect)
| Describe the usual properties of Excel-generated CSV files.
...
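For pure-Python callables, inspect.signature (Python 3.3+) prints the full argument list directly. A minimal sketch with a made-up sample function (note that csv.reader itself is implemented in C and typically exposes no introspectable signature):
import csv
import inspect

def sample(a, b=2, *args, key=None, **kwargs):
    pass

print(inspect.signature(sample))  # -> (a, b=2, *args, key=None, **kwargs)

try:
    print(inspect.signature(csv.reader))
except ValueError:
    # many C-implemented callables cannot be introspected this way
    print('no signature available for csv.reader')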
The inspect module is extremely powerful. To get a list of classes, for example in the csv module, you could go:
import inspect, csv
from pprint import pprint

module = csv
module_classes = inspect.getmembers(module, inspect.isclass)
for name, myclass in module_classes:
    # could construct whatever query you want about this class here...
    # you'll need to play with this line to get what you want; it will fail as-is
    # line = inspect.formatargspec(*inspect.getfullargspec(myclass))
    pprint(myclass)
Hope this helps get you started!

Python: saving changes made to an imported module

I have a situation where users can pick a method from another class and use it in their own class via .im_func. I give an example below:
import foo1
import foo2
foo1.ClassX.methodX = foo2.ClassX.methodX.im_func
where methodX could be implemented differently in the two modules.
When I instantiate the object, say foo1.ClassX(), methodX from module foo2 is used.
My problem is how to save the changes made, perhaps as foo3.py, to a new source code file.
Saving it as a new .py file could be a problem, but you can easily use serialization for it (the pickle module); see: http://docs.python.org/library/pickle.html
The source code can be retrieved with the inspect module. However, the problem with that is that it returns the original source code, not the source of the dynamically modified object.
Have you considered using the parser module combined with the aforementioned inspect to do this? In this situation it might be better to simply go with text processing rather than attempting to use imported modules.
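A minimal sketch of the inspect route, reusing foo1 from the question:
import inspect
import foo1

# getsource returns the text as written on disk; the runtime .im_func
# swap shown above is NOT reflected in this output
print(inspect.getsource(foo1.ClassX.methodX))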
EDIT: An example of using the parser module to read both files:
import parser

with open('foo1.py', 'r') as fh:
    st = parser.suite(fh.read())
src1 = parser.st2list(st)

with open('foo2.py', 'r') as fh:
    st = parser.suite(fh.read())
src2 = parser.st2list(st)
You'd then have to do some tricky programming to merge the methods from the parsed source and write the result to a file. But then again, I have the strange feeling I'm not quite understanding the question...

How to change the date/time in Python for all modules?

When I write code with business logic, it often depends on the current time. For example, consider an algorithm that looks at each unfinished order and checks whether an invoice should be sent (which depends on the number of days since the job ended). In these cases, creating an invoice is not triggered by an explicit user action but by a background job.
Now this creates a problem for me when it comes to testing:
I can test invoice creation itself easily
However it is hard to create an order in a test and check that the background job identifies the correct orders at the correct time.
So far I found two solutions:
In the test setup, calculate the job dates relative to the current date. Downside: the code becomes quite complicated, as there are no explicit dates written anymore. Sometimes the business logic is pretty complex for edge cases, so it becomes hard to debug because of all these relative dates (a sketch of this style follows after the list).
I have my own date/time accessor functions which I use throughout my code. In the test I just set a current date and all modules get this date. So I can simulate an order creation in February and check that the invoice is created in April easily. Downside: 3rd party modules do not use this mechanism so it's really hard to integrate+test these.
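To make the first approach concrete, a minimal sketch (the variable names are made up for illustration):
import datetime

today = datetime.date.today()
# every fixture date is relative to "now", which keeps the test valid
# on any day but hides the concrete dates being exercised
order_created = today - datetime.timedelta(days=30)
job_ended = today - datetime.timedelta(days=14)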
The second approach was much more successful for me after all. Therefore I'm looking for a way to set the time that Python's datetime and time modules return. Setting the date is usually enough; I don't need to set the current hour or second (even though this would be nice).
Is there such a utility? Is there an (internal) Python API that I can use?
Monkey-patching time.time is probably sufficient, actually, as it provides the basis for almost all the other time-based routines in Python. This appears to handle your use case pretty well, without resorting to more complex tricks, and it doesn't matter when you do it (aside from the few stdlib packages like Queue.py and threading.py that do from time import time in which case you must patch before they get imported):
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2010, 4, 17, 14, 5, 35, 642000)
>>> import time
>>> def mytime(): return 120000000.0
...
>>> time.time = mytime
>>> datetime.datetime.now()
datetime.datetime(1973, 10, 20, 17, 20)
That said, in years of mocking objects for various types of automated testing, I've needed this approach only very rarely, as most of the time it's my own application code that needs the mocking, and not the stdlib routines. After all, you know they work already. If you are encountering situations where your own code has to handle values returned by library routines, you may want to mock the library routines themselves, at least when checking how your own app will handle the timestamps.
The best approach by far is to build your own date/time service routine(s) which you use exclusively in your application code, and build into that the ability for tests to supply fake results as required. For example, I do a more complex equivalent of this sometimes:
# in file apptime.py (for example)
import time as _time

class MyTimeService(object):
    def __init__(self, get_time=None):
        self.get_time = get_time or _time.time

    def __call__(self):
        return self.get_time()

time = MyTimeService()
Now in my app code I just do import apptime as time; time.time() to get the current time value, whereas in test code I can first do apptime.time = MyTimeService(mock_time_func) in my setUp() code to supply fake time results.
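In a unittest-based test, that might look roughly like this (mock_time_func and the test class are hypothetical stand-ins):
import unittest

import apptime

def mock_time_func():
    return 1234567890.0  # any fixed timestamp makes the test deterministic

class InvoiceTest(unittest.TestCase):
    def setUp(self):
        # swap in the fake clock before the code under test runs
        apptime.time = apptime.MyTimeService(mock_time_func)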
Update: Years later there's an alternative, as noted in Dave Forgac's answer.
The freezegun package was made specifically for this purpose. It allows you to change the date for code under test. It can be used directly or via a decorator or context manager. One example:
from freezegun import freeze_time
import datetime

@freeze_time("2012-01-14")
def test():
    assert datetime.datetime.now() == datetime.datetime(2012, 1, 14)
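The context-manager form mentioned above looks like this (a short sketch following freezegun's documented usage):
from freezegun import freeze_time
import datetime

def test_with_context_manager():
    with freeze_time("2012-01-14"):
        assert datetime.datetime.now() == datetime.datetime(2012, 1, 14)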
For more examples see the project: https://github.com/spulec/freezegun
You can patch the system by creating a custom datetime module (even a fake one - see the example below) acting as a proxy, and then inserting it into the sys.modules dictionary. From then on, each import of the datetime module will return your proxy.
There is still the caveat of the datetime class, especially when someone does from datetime import datetime; for that, you can simply add another proxy just for that class.
Here is an example of what I mean - of course it is just something I threw together in 5 minutes, and it may have several issues (for instance, the type of the datetime class is not correct); but hopefully it may already be of use.
import sys
import datetime as datetime_orig

class DummyDateTimeModule(sys.__class__):
    """ Dummy class, for faking datetime module """

    def __init__(self):
        sys.modules["datetime"] = self

    def __getattr__(self, attr):
        if attr == "datetime":
            return DummyDateTimeClass()
        else:
            return getattr(datetime_orig, attr)

class DummyDateTimeClass(object):
    def __getattr__(self, attr):
        return getattr(datetime_orig.datetime, attr)

dt_fake = DummyDateTimeModule()
Finally - is it worth it?
Frankly speaking, I like your second solution much more than this one :-).
Yes, Python is a very dynamic language where you can do quite a lot of interesting things, but patching code in this way always carries a certain degree of risk, even if we are only talking about test code.
But mostly, I think the accessor function would make test patching more explicit, and your code would also be more explicit about what is going to be tested, thus increasing readability.
Therefore, if the change is not too expensive, I would go for your second approach.
I would use the helpers from the 'testfixtures' package to mock out the date, datetime or time calls you're making:
http://packages.python.org/testfixtures/datetime.html
Well, one way to do it is to dynamically patch the time/datetime module. Something like:
import time
import datetime

class MyDatetime:
    def now(self):
        return time.time()

datetime.datetime = MyDatetime
print(datetime.datetime().now())
There might be a few ways of doing this, like creating the orders (with the current timestamp) and then changing that value in the DB directly through some external process (assuming the data is in the DB).
I'll suggest something else. Have you thought about running your application in a virtual machine, setting the time to, say, February, creating orders, and then just changing the VM's time? This approach is the closest you can get to the real-life situation.
