How do I pass a pandas method as a parameter? - python

I have a function which calculates the mode of columns of a pandas dataframe:
def my_func(df):
    for col in df.columns:
        stat = df[col].mode()
        print(stat)
But I would like to make it more generic so that I can change which statistic I calculate e.g. mean, max,... I tried to pass the method mode() as an argument to my function:
def my_func(df, pandas_stat):
    for col in df.columns:
        stat = df[col].pandas_stat()
        print(stat)
having referred to: How do I pass a method as a parameter in Python
However this doesn't seem to work for me.
Using a simple example:
> A
     a    b
0  1.0  2.0
1  2.0  4.0
2  2.0  6.0
3  3.0  NaN
4  NaN  4.0
5  3.0  NaN
6  2.0  6.0
7  4.0  6.0
It doesn't recognise the command mode:
> my_func(A, mode)
Traceback (most recent call last):
File "<ipython-input-332-c137de83a530>", line 1, in <module>
my_func(A, mode)
NameError: name 'mode' is not defined
so I tried pd.DataFrame.mode:
> my_func(A, pd.DataFrame.mode)
Traceback (most recent call last):
File "<ipython-input-334-dd913410abd0>", line 1, in <module>
my_func(A, pd.DataFrame.mode)
File "<ipython-input-329-8acf337bce92>", line 3, in my_func
stat = df[col].pandas_stat()
File "/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/generic.py", line 4376, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'pandas_stat'
Is there a way to pass the mode function?

You can use the getattr built-in and the __name__ attribute to do so, but I guess it makes your code somewhat unclear. Maybe a better approach exists.
df = pd.DataFrame({'col1': list(range(5)), 'col2': list(range(5, 0, -1))})
df
Out:
   col1  col2
0     0     5
1     1     4
2     2     3
3     3     2
4     4     1
Define my_func this way and apply it to df:
def my_func(df, pandas_stat):
    for col in df.columns:
        stat = getattr(df[col], pandas_stat.__name__)()
        print(stat)
my_func(df, pd.DataFrame.mean)
Out
2.0
3.0
Explanation: pd.DataFrame.mean has a __name__ attribute whose value is 'mean'. getattr can fetch the attribute with that name from the Series df[col], and then you can call it.
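As a quick illustration of that lookup (a minimal sketch, not part of the original answer):

```python
import pandas as pd

s = pd.Series([1.0, 3.0])
print(pd.DataFrame.mean.__name__)           # 'mean'
m = getattr(s, pd.DataFrame.mean.__name__)  # same as writing s.mean
print(m())                                  # 2.0
```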
You can even pass arguments, if you need to:
def my_func(df, pandas_stat, *args, **kwargs):
    for col in df.columns:
        stat = getattr(df[col], pandas_stat.__name__)(*args, **kwargs)
        print(stat)
my_func(df, pd.DataFrame.apply, lambda x: x ** 2)
Out:
0 0
1 1
2 4
3 9
4 16
Name: col1, dtype: int64
0 25
1 16
2 9
3 4
4 1
Name: col2, dtype: int64
But I repeat, I guess this approach is a little confusing.
Edit
About an error:
> my_func(A, pd.DataFrame.mode)
Traceback (most recent call last):
File "<ipython-input-334-dd913410abd0>", line 1, in <module>
my_func(A, pd.DataFrame.mode)
File "<ipython-input-329-8acf337bce92>", line 3, in my_func
stat = df[col].pandas_stat()
File "/anaconda3/envs/py36/lib/python3.6/site-packages/pandas/core/generic.py", line 4376, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'pandas_stat'
When df[col].pandas_stat() is executed, the dot . operator invokes the __getattribute__ method of the Series object. It is an analog of getattr, but it gets self as the first argument automatically.
So the second argument is the 'name' of the method, which is literally 'pandas_stat' in your code. That breaks execution, because a pandas Series has no attribute with such a name.
If you provide the correct name of an actual method ('mean', 'apply' and so on) to getattr, it finds that method in pd.DataFrame.__dict__, where all the methods are listed, and returns it. Then you can call it via the (*args, **kwargs) syntax.

You can do this with getattr:
def my_func(df, pandas_stat):
    for col in df.columns:
        # the empty parentheses are required to call the method
        print(getattr(df[col], pandas_stat)())
df_max = my_func(df, "max")
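Another option, not shown in either answer above, is to skip getattr entirely: pass the unbound Series method itself and call it with the column as its argument. A minimal sketch:

```python
import pandas as pd

def my_func(df, pandas_stat):
    # pandas_stat is a plain function (an unbound method), so just call it
    for col in df.columns:
        print(pandas_stat(df[col]))

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
my_func(df, pd.Series.mean)  # prints 1.5 then 3.5
```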

Related

TypeError while formatting pandas.df.pct_change() output to percentage

I'm trying to calculate the daily returns of stock in percentage format from a CSV file by defining a function.
Here's my code:
def daily_ret(ticker):
    return f"{df[ticker].pct_change()*100:.2f}%"
When I call the function, I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-7122588f1289> in <module>()
----> 1 daily_ret('AAPL')
<ipython-input-39-7dd6285eb14d> in daily_ret(ticker)
1 def daily_ret(ticker):
----> 2 return f"{df[ticker].pct_change()*100:.2f}%"
TypeError: unsupported format string passed to Series.__format__
Where am I going wrong?
f-strings can't apply a numeric format spec to an iterable like a Series. Use map or apply on the Series instead:
def daily_ret(ticker):
    return (df[ticker].pct_change() * 100).map("{:.2f}%".format)

def daily_ret(ticker):
    return (df[ticker].pct_change() * 100).apply("{:.2f}%".format)
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.arange(1, 6)})
print(daily_ret('A'))
0 nan%
1 100.00%
2 50.00%
3 33.33%
4 25.00%
Name: A, dtype: object
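To see why the original version fails while formatting a single value works, a minimal reproduction (made-up values, not the stock data):

```python
import pandas as pd

s = pd.Series([1.0, 2.0])
try:
    f"{s:.2f}"  # a Series doesn't accept a numeric format spec
except TypeError as e:
    print("TypeError:", e)

print(f"{s.iloc[0]:.2f}")  # formatting a scalar works: 1.00
```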

TypeError: 'float' object has no attribute '__getitem__' in function

I am trying to pass a dataframe to a function and compute mean and std dev from different columns of the dataframe. When I execute each line of the function step by step (without writing a function as such) it works fine. However, when I try to write a function to compute, I keep getting this error:
TypeError: 'float' object has no attribute '__getitem__'
This is my code:
def computeBias(data):
    meandata = np.array(data['mean'])
    sddata = np.array(data.sd)
    ni = np.array(data.numSamples)
    mean = np.average(meandata, weights=ni)
    pooled_sd = np.sqrt((np.sum(np.multiply((ni - 1), np.array(sddata)**2)))/(np.sum(ni) - 1))
    return mean, pooled_sd
mean,sd = df.apply(computeBias)
This is sample data:
id   type   mean    sd     numSamples
-------------------------------------
1    33     -0.43   0.40   101
2    23     -0.76   0.1    100
3    33     0.89    0.56   101
4    45     1.4     0.9    100
This is the full error traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-134-f4dc392140dd> in <module>()
----> 1 mean,sd = df.apply(computeBias)
C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\src\inference.pyx in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66440)()
<ipython-input-133-2af38e3e29f0> in computeBias(data)
1 def computeBias(data):
2
----> 3 meandata = np.array(data['mean'])
4 sddata = np.array(data.sd)
5 ni = np.array(data.numSamples)
TypeError: 'float' object has no attribute '__getitem__'
Does anyone know of any workaround? TIA!
meandata = np.array(data['mean'])
TypeError: 'float' object has no attribute '__getitem__'
__getitem__ is the method that Python calls when you use indexing. In the marked line, that means data['mean'] is producing the error. Evidently data is a number, a float object. You can't index a number.
data['mean'] looks like you are trying to get an item from a dictionary, or from a dataframe, using a named index. I won't dig into the rest of your code to determine what you intend.
What you need to do is understand what data really is, and what produces it.
You are using this in df.apply(....), and apparently think that it just means
computeBias(df)  # or
computeBias(df.data)
Rather I suspect the apply is iterating, in some dimension, over the dataframe, and passing values or dataseries to your code. It isn't passing the whole dataframe.
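Since the function expects the whole frame, calling it directly (instead of through apply) works. A runnable sketch using the sample data from the question, with the pooled-sd expression simplified but equivalent:

```python
import numpy as np
import pandas as pd

def computeBias(data):
    meandata = np.array(data['mean'])
    sddata = np.array(data.sd)
    ni = np.array(data.numSamples)
    mean = np.average(meandata, weights=ni)
    pooled_sd = np.sqrt(np.sum((ni - 1) * sddata**2) / (np.sum(ni) - 1))
    return mean, pooled_sd

df = pd.DataFrame({'mean': [-0.43, -0.76, 0.89, 1.4],
                   'sd': [0.40, 0.1, 0.56, 0.9],
                   'numSamples': [101, 100, 101, 100]})
mean, sd = computeBias(df)  # no apply: pass the DataFrame itself
```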

Python 3.5 multiprocessing pools && 'numpy.int64 has no attribute .loc'

I'm trying to learn about multiprocessing and pools to process some tweets I've got in a MySQL DB. Here is the code and error messages.
import multiprocessing
import sqlalchemy
import pandas as pd
import config
from nltk import tokenize as token
q = multiprocessing.Queue()
engine = sqlalchemy.create_engine(config.sqlConnectionString)
def getRow(pandasSeries):
    df = pd.DataFrame()
    tweetTokenizer = token.TweetTokenizer()
    print(pandasSeries.loc['BODY'], "\n", type(pandasSeries.loc['BODY']))
    for tokens in tweetTokenizer.tokenize(pandasSeries.loc['BODY']):
        df = df.append(pd.Series(data=[pandasSeries.loc['ID'], tokens, pandasSeries.loc['AUTHOR'],
                                       pandasSeries.loc['RETWEET_COUNT'], pandasSeries.loc['FAVORITE_COUNT'],
                                       pandasSeries.loc['FOLLOWERS_COUNT'], pandasSeries.loc['FRIENDS_COUNT'],
                                       pandasSeries.loc['PUBLISHED_AT']],
                                 index=['id', 'tweet', 'author', 'retweet', 'fav', 'followers', 'friends',
                                        'published_at']), ignore_index=True)
    df.to_sql(name="tweet_tokens", con=engine, if_exists='append')
if __name__ == '__main__':
    ## LOADING SQL INTO DATAFRAME ##
    databaseData = pd.read_sql_table(config.tweetTableName, engine)
    pool = multiprocessing.Pool(6)
    for row in databaseData.iterrows():
        print(row)
        pool.map(getRow, row)
    pool.close()
    q.close()
    q.join_thread()
"""
OUPUT
C:\Users\Def\Anaconda3\python.exe C:/Users/Def/Dropbox/Dissertation/testThreadCopy.py
(0, ID 3247
AUTHOR b'Elon Musk News'
RETWEET_COUNT 0
FAVORITE_COUNT 0
FOLLOWERS_COUNT 20467
FRIENDS_COUNT 14313
BODY Elon Musk Takes an Adorable 5th Grader's Idea ...
PUBLISHED_AT 2017-03-03 00:00:01
Name: 0, dtype: object)
Elon Musk Takes an Adorable 5th Grader's
<class 'str'>
multiprocessing.pool.RemoteTraceback:
Traceback (most recent call last):
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 44, in mapstar
return list(map(*args))
File "C:\Users\Def\Dropbox\Dissertation\testThreadCopy.py", line 16, in getRow
print(pandasSeries.loc['BODY'], "\n", type(pandasSeries.loc['BODY']))
AttributeError: 'numpy.int64' object has no attribute 'loc'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:/Users/Def/Dropbox/Dissertation/testThreadCopy.py", line 34, in <module>
pool.map(getRow, row)
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 608, in get
raise self._value
AttributeError: 'numpy.int64' object has no attribute 'loc'
Process finished with exit code 1
"""
What I don't understand is why it prints out the first Series and then crashes. And why does it say that pandasSeries.loc['BODY'] is of type numpy.int64 when the printout says it is of type string? I'm sure I've gone wrong in a number of other places; if you can see where, please point it out.
Thanks.
When I construct a simple dataframe:
frame
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
and iterate twice I get:
for row in databaseData.iterrows():
    for i in row:
        print(i, type(i))
That inner loop produces 2 items, a row index/label, and a Series with the values.
0 <class 'numpy.int64'>
0 0
1 1
2 2
3 3
Name: 0, dtype: int32 <class 'pandas.core.series.Series'>
Your map does the same, sending a numeric index to one process (which produces the error), and a series to another.
If I use pool.map without the for row:
pool.map(getRow, databaseData.iterrows())
then getRow receives a 2 element tuple.
def getRow(aTuple):
    rowlbl, rowSeries = aTuple
    print(rowSeries)
    ...
Your print(row) shows this tuple; it's just harder to see because the Series part is multiline. If I add a \n it might be clearer
(0, # row label
ID 3247 # multiline Series
AUTHOR b'Elon Musk News'
RETWEET_COUNT 0
....
Name: 0, dtype: object)
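A self-contained demonstration of the (label, Series) pairs that iterrows yields (toy data, not the tweet table):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
label, series = next(df.iterrows())   # each item is a (label, Series) pair
print(label)                          # 0 -- the row label, a scalar
print(isinstance(series, pd.Series))  # True -- the row values
```

Because each item is a 2-tuple, `pool.map(getRow, row)` feeds the scalar label to one worker and the Series to another, which is exactly why one worker fails with the numpy.int64 error.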

Pandas ExcelFile.parse() reading file in as dict instead of dataframe

I am new to python and even newer to pandas, but relatively well versed in R. I am using Anaconda, with Python 3.5 and pandas 0.18.1. I am trying to read in an excel file as a dataframe. The file admittedly is pretty... ugly. There is a lot of empty space, missing headers, etc. (I am not sure if this is the source of any issues)
I create the file object, then find the appropriate sheet, then try to read that sheet as a dataframe:
xl = pd.ExcelFile(allFiles[i])
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
df = xl.parse(sName)
df
Results:
{'Security exposure - 21 day lag': Percent of Total Holdings \
0 KMNFC vs. 3 Month LIBOR AUD
1 04-OCT-16
2 Australian Dollar
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 Long/Short Net Exposure
9 Total
10 NaN
11 Long
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
(This goes on for 20-30 more rows and 5-6 more columns)
I am using Anaconda, and Spyder, which has a 'Variable Explorer'. It shows the variable df to be a dict of the DataFrame type.
However, I cannot use iloc:
df.iloc[:,1]
Traceback (most recent call last):
File "<ipython-input-77-d7b3e16ccc56>", line 1, in <module>
df.iloc[:,1]
AttributeError: 'dict' object has no attribute 'iloc'
Any thoughts? What am I missing?
EDIT:
To be clear, what I am really trying to do is reference the first column of the df. In R this would be df[,1]. Looking around it seems to be not a very popular way to do things, or not the 'correct' way. I understand why indexing by column names, or keys, is better, but in this situation, I really just need to index the dataframes by column numbers. Any working method of doing that would be greatly appreciated.
EDIT (2):
Per a suggestion, I tried 'read_excel', with the same results:
df = pd.ExcelFile(allFiles[i]).parse(sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-90-fc40aa59bd20>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
df = pd.read_excel(allFiles[i], sheetname = sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-91-72b8405c6c42>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
The problem was here:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
which returned a single element list. I changed it to the following:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()][0]
which returns a string, and the code then performs as expected.
All thanks to ayhan for pointing this out.
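For the positional-indexing part of the question: once df really is a DataFrame, the R-style df[,1] translates to iloc. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
first_col = df.iloc[:, 0]   # first column by position, like R's df[, 1]
print(first_col.tolist())   # [1, 2]
```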

python pandas functions with and without parentheses

I notice that many DataFrame functions if used without parentheses seem to behave like 'properties' e.g.
In [200]: df = DataFrame (np.random.randn (7,2))
In [201]: df.head ()
Out[201]:
0 1
0 -1.325883 0.878198
1 0.588264 -2.033421
2 -0.554993 -0.217938
3 -0.777936 2.217457
4 0.875371 1.918693
In [202]: df.head
Out[202]:
<bound method DataFrame.head of 0 1
0 -1.325883 0.878198
1 0.588264 -2.033421
2 -0.554993 -0.217938
3 -0.777936 2.217457
4 0.875371 1.918693
5 0.940440 -2.279781
6 1.152370 -2.733546>
How is this done and is it good practice ?
This is with pandas 0.15.1 on linux
They are different, and the second form is not recommended: one clearly shows that it's a method (whose repr happens to display the data), whilst the other shows the expected output.
Here's why you should not do this:
In [23]:
t = df.head
In [24]:
t.iloc[0]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-24-b523e5ce509d> in <module>()
----> 1 t.iloc[0]
AttributeError: 'function' object has no attribute 'iloc'
In [25]:
t = df.head()
t.iloc[0]
Out[25]:
0 0.712635
1 0.363903
Name: 0, dtype: float64
So if you don't use parentheses, you fail to call the method, yet still see output that appears valid; but if you take a reference to this and try to use it, you are operating on the method rather than on the slice of the df, which is not what you intended.
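The distinction can be checked directly — a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': range(3)})
m = df.head        # a bound method, not data
t = df.head()      # an actual DataFrame
print(callable(m))          # True
print(hasattr(m, 'iloc'))   # False -- methods have no iloc
print(t.iloc[0]['a'])       # 0
```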
