I am trying to create a function which resamples time series data in pandas. I would like to have the option to specify the type of aggregation that occurs depending on what type of data I am sending through (i.e. for some data, taking the sum of each bin is appropriate, while for others, taking the mean is needed, etc.). For example data like these:
import pandas as pd
import numpy as np
dr = pd.date_range('01-01-2020', '01-03-2020', freq='1H')
df = pd.DataFrame(np.random.rand(len(dr)), index=dr)
I could have a function like this:
def process(df, freq='3H', method='sum'):
    r = df.resample(freq)
    if method == 'sum':
        r = r.sum()
    elif method == 'mean':
        r = r.mean()
    # ...
    # more options
    # ...
    return r
For a small amount of aggregation methods, this is fine, but seems like it could be tedious if I wanted to select from all of the possible choices.
I was hoping to use getattr to implement something like this post (under "Putting it to work: generalizing method calls"). However, I can't find a way to do this:
def process2(df, freq='3H', method='sum'):
    r = df.resample(freq)
    foo = getattr(r, method)
    return r.foo()
# fails with:
# AttributeError: 'DatetimeIndexResampler' object has no attribute 'foo'

def process3(df, freq='3H', method='sum'):
    r = df.resample(freq)
    foo = getattr(r, method)
    return foo(r)
# fails with:
# TypeError: __init__() missing 1 required positional argument: 'obj'
I get why process2 fails (calling r.foo() looks for the method foo() of r, not the variable foo). But I don't think I get why process3 fails.
I know another approach would be to pass functions to the parameter method, and then apply those functions on r. My inclination is that this would be less efficient? And it still doesn't allow me to access the built-in Resample methods directly.
Is there a working, more concise way to achieve this? Thanks!
Try df.resample(freq).apply(method); apply accepts the aggregation's name as a string.
But unless you are planning some more computation inside the function, it will probably be easier to just hard-code this line.
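The getattr approach from process2 is actually one small step away from working: foo is already a method bound to r, so the fix is simply to call it with no arguments. A minimal sketch, assuming method names a real Resampler aggregation (lowercase 'h' frequency aliases are used here, which recent pandas prefers):

```python
import numpy as np
import pandas as pd

def process(df, freq='3h', method='sum'):
    r = df.resample(freq)
    # getattr returns the bound resampler method; just call it directly
    return getattr(r, method)()

dr = pd.date_range('01-01-2020', '01-03-2020', freq='1h')
df = pd.DataFrame(np.random.rand(len(dr)), index=dr)

sums = process(df, method='sum')    # 49 hourly rows -> 17 three-hour bins
means = process(df, method='mean')
```

This gives direct access to any built-in Resampler method by name without hard-coding each branch.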
This is what I want to do to drive a computational experiment:
import pandas

def foo(a=1, b=2):
    print("In foo with %s" % str(locals()))
    return a + b

def expand_grid(dictionary):
    from itertools import product
    return pandas.DataFrame([row for row in product(*dictionary.values())], columns=dictionary.keys())

experiment = {"a": [1, 2], "b": [10, 12]}
grid = expand_grid(experiment)
for g in grid.itertuples(index=False):
    foo(**g._asdict())
That works, but the issue is that it relies on the private _asdict() call. The convention in Python is that "_"-prefixed methods are internal and shouldn't be called from external code, so here is my question:
Can you do the above without _asdict(), and if so, how?
Also note that while foo(a=g.a, b=g.b) is a solution, the actual code is a heavily parameterized call, and for my own knowledge I was just trying to figure out how to treat the row as kwargs without the "_" call, if possible.
Thanks for the hint.
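One way to sidestep the private call, sketched under the assumption that a plain dict per row is all foo needs, is DataFrame.to_dict(orient='records'), which yields ordinary dictionaries keyed by column name:

```python
import pandas as pd
from itertools import product

def expand_grid(dictionary):
    return pd.DataFrame(list(product(*dictionary.values())), columns=dictionary.keys())

def foo(a=1, b=2):
    return a + b

grid = expand_grid({"a": [1, 2], "b": [10, 12]})

# to_dict(orient='records') returns one plain dict per row -- no private API
results = [foo(**row) for row in grid.to_dict(orient="records")]
```

Unlike itertuples, this hands back public dicts directly, at the cost of materializing each row as a dict.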
I'm new to python and pandas but I have a problem I cannot wrap my head around.
I'm trying to add a new column to my DataFrame. To achieve that I use the assign() function.
Most of the examples on the internet are painfully trivial and I cannot find a solution for my problem.
What works:
my_dataset.assign(new_col=lambda x: my_custom_long_function(x['long_column']))

def my_custom_long_function(input):
    return input * 2
What doesn't work:
my_dataset.assign(new_col=lambda x: my_custom_string_function(x['string_column']))

def my_custom_string_function(input):
    return input.upper()
What confuses me is that in the debug I can see that even for my_custom_long_function the parameter is a Series, not a long.
I just want to use the lambda function and pass a value of the column to do my already written complicated functions. How do I do this?
Edit: The example here is just for demonstrative purpose, the real code is basically an existing complex function that does not care about panda's types and needs a str as a parameter.
Because the column doesn't have an upper method; in order to use it, you need str.upper:

my_dataset.assign(new_col=lambda x: my_custom_string_function(x['string_column']))

def my_custom_string_function(input):
    return input.str.upper()
That said, for efficiency I would use:
my_dataset['new column'] = my_dataset['string_column'].str.upper()
Edit:
my_dataset['new column'] = my_dataset['string_column'].apply(lambda x: my_custom_string_function(x))

def my_custom_string_function(input):
    return input.upper()
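Putting the pieces together: when the existing function must receive a plain str, Series.apply feeds it one element at a time, so the function never sees a Series. A minimal sketch with a made-up two-row frame:

```python
import pandas as pd

def my_custom_string_function(value):
    # existing code that expects a plain str, untouched
    return value.upper()

my_dataset = pd.DataFrame({'string_column': ['foo', 'bar']})

# apply() calls the function once per element, passing scalars
my_dataset = my_dataset.assign(
    new_col=lambda x: x['string_column'].apply(my_custom_string_function)
)
```

This keeps the complicated function unaware of pandas types, at the cost of a slower element-wise loop compared to vectorized .str methods.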
I have a simple class like this (connection holds a database connection):
import pandas as pd
from animal import kettle

class cat:
    def foo(connection):
        a = pd.read_sql('select * from zoo', connection)
        return1 = kettle.boo1(a)
        return2 = kettle.boo2(a)
        return return1, return2
Now I want to pass a to both boo1 and boo2 of kettle. Am I passing it the correct way in foo() above? I thought this way was correct and tried it, but is it the right way to pass it?
animal.py:

class kettle:
    def boo1(return1):
        print(return1)

    def boo2(return2):
        print(return2)
Sorry if this doesn't make sense; my intention is to pass a to both boo1 and boo2 of the kettle class.
This looks like the correct approach to me: by assigning the return value of pd.read_sql('select * from zoo', connection) to a first and then passing a to kettle.boo1 and kettle.boo2, you ensure you do the potentially time-consuming database IO only once.
One thing to keep in mind with this design pattern when you are passing objects such as lists/dicts/dataframes is the question of whether kettle.boo1 changes the value that is in a. If it does, kettle.boo2 will receive the modified version of a as an input, which can lead to unexpected behavior.
A very minimal example is the following:
>>> def foo(x):
... x[0] = 'b'
...
>>> x = ['a'] # define a list of length 1
>>> foo(x) # call a function that modifies the first element in x
>>> print(x) # the value in x has changed
['b']
There are (many) possible solutions for your problem, whatever that might be. I assume you are just starting out with object-oriented programming in Python and are getting errors along the lines of
unbound method boo1() must be called with kettle instance as first argument
and probably want this solution:
Give your class methods an instance parameter:
def boo1(self, return1):
Instantiate the class kettle in cat.foo:
k = kettle()
Then use it like:
k.boo1(a)
Same for the boo2 method.
Also you probably want to:
return return1 # instead of or after print(return1)
as your methods return None at the moment.
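Putting those steps together, a runnable sketch might look like the following; the in-memory sqlite3 database is a stand-in for the real connection, and the bodies of boo1/boo2 are made up for illustration:

```python
import sqlite3
import pandas as pd

class kettle:
    def boo1(self, frame):
        return len(frame)            # return instead of print

    def boo2(self, frame):
        return list(frame.columns)

class cat:
    def foo(self, connection):
        a = pd.read_sql('select * from zoo', connection)  # query once
        k = kettle()                 # instantiate kettle inside foo
        return k.boo1(a), k.boo2(a)  # pass the same DataFrame to both

# hypothetical in-memory database standing in for the real connection
conn = sqlite3.connect(':memory:')
conn.execute('create table zoo (name text)')
conn.execute("insert into zoo values ('lion')")
r1, r2 = cat().foo(conn)
```

The key points are the self parameter on each method, instantiating kettle before calling it, and returning values rather than printing them.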
For example,
import pandas as pd
weather = pd.read_csv(r'D:\weather.csv')
weather.count
weather.count()
weather is a DataFrame with multiple columns and rows.
Then what's the difference between asking for weather.count and weather.count()?
It depends. In general this question has nothing to do with pandas; the answer comes down to how Python is designed.
In this case, .count is a method, particularly a method of pandas.DataFrame, and printing it will confirm that:
df = pd.DataFrame({'a': []})
print(df.count)
Outputs
<bound method DataFrame.count of Empty DataFrame
Columns: [a]
Index: []>
Adding () will call this method:
print(df.count())
Outputs
a 0
dtype: int64
However, that is not always the case: .count could have been a non-callable attribute (i.e. a string, an int, etc.) or a property.
In this case it's a non-callable attribute:
class Foo:
    def __init__(self, c):
        self.count = c

obj = Foo(42)
print(obj.count)
Will output
42
Adding () in this case will raise an exception because it makes no sense to call an integer:
print(obj.count())
TypeError: 'int' object is not callable
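For completeness, the property case mentioned above behaves the same way from the outside: the body runs on plain attribute access, and adding () then tries to call whatever it returned. A small sketch (the class name Bar is made up):

```python
class Bar:
    @property
    def count(self):
        # runs on plain attribute access, no parentheses needed
        return 42

obj = Bar()
value = obj.count      # the property body executes here

try:
    obj.count()        # the returned int is not callable
    raised = False
except TypeError:
    raised = True
```

So whether .count is a method, a plain attribute, or a property, the rule is the same: the name gives you the object, and () attempts to call it.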
The parentheses indicate a method call.
If you declare a function like this one that sums two numbers:
def sum(a: int, b: int) -> int:
    return a + b
Then you can call the function using: sum(1,2).
The function name without parentheses is used when you don't need to call the function but instead need to pass a reference to it to another method.
For example, passing a reference to a function is useful when handing work to another process:
from multiprocessing import Process
t = Process(target=my_long_running_function)
Please have a look at @roippi's reply for further information:
https://stackoverflow.com/a/21786508/9361998
Suppose I have a function like this:
from toolz.curried import *

@curry
def foo(x, y):
    print(x, y)
Then I can call:
foo(1,2)
foo(1)(2)
Both return the same as expected.
However, I would like to do something like this:
@curry.inverse  # hypothetical
def bar(*args, last):
    print(*args, last)

bar(1, 2, 3)(last)
The idea behind this is that I would like to pre-configure a function and then put it in a pipe like this:
pipe(data,
     f1,           # another function
     bar(1, 2, 3)  # unknown number of arguments
)
Then, bar(1,2,3)(data) would be called as a part of the pipe. However, I don't know how to do this. Any ideas? Thank you very much!
Edit:
A more illustrative example was asked for. Thus, here it comes:
import pandas as pd
from toolz.curried import *

df = pd.DataFrame(data)

def filter_columns(*args, df):
    return df[[*args]]

pipe(df,
     transformation_1,
     transformation_2,
     filter_columns("date", "temperature")
)
As you can see, the DataFrame is piped through the functions, and filter_columns is one of them. However, the function is pre-configured and returns a function that only takes a DataFrame, similar to a decorator. The same behaviour could be achieved with this:
def filter_columns(*args):
    def f(df):
        return df[[*args]]
    return f
However, I would always have to run two calls then, e.g. filter_columns()(df), and that is what I would like to avoid.
Well, I am unfamiliar with the toolz module, but it looks like there is no easy way to curry a function with an arbitrary number of arguments, so let's try something else.
First, as an alternative to
def filter_columns(*args):
    def f(df):
        return df[*args]
    return f
(and by the way, df[*args] is a syntax error )
to avoid filter_columns()(data) you can just grab the last element in args and use the slice notation to grab everything else, for example
def filter_columns(*argv):
    df, columns = argv[-1], argv[:-1]
    return df[list(columns)]  # convert the tuple to a list for column selection
And use as filter_columns(df), filter_columns("date", "temperature", df), etc.
And then use functools.partial to construct your new, partially applied filter to build your pipe, for example:

from functools import partial
from toolz.curried import pipe  # always be explicit with your imports; the last thing you want is to import something you don't need that overwrites something else you use
pipe(df,
     transformation_1,
     transformation_2,
     partial(filter_columns, "date", "temperature")
)
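Since toolz may not be installed everywhere, the partial-based step can be checked on its own with a tiny made-up frame; partial(filter_columns, ...) produces exactly the one-argument callable that pipe would invoke with the piped DataFrame:

```python
from functools import partial
import pandas as pd

def filter_columns(*argv):
    # the DataFrame arrives last; everything before it is a column name
    df, columns = argv[-1], argv[:-1]
    return df[list(columns)]

data = pd.DataFrame({'date': [1, 2], 'temperature': [3, 4], 'humidity': [5, 6]})

step = partial(filter_columns, 'date', 'temperature')
result = step(data)   # what pipe() would call with the piped DataFrame
```

partial freezes the leading column-name arguments, so the resulting callable takes only the DataFrame, which is precisely the shape a pipe stage needs.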