I built the following function to estimate an optimal exponential moving average of a pandas DataFrame column.
from scipy import optimize
from sklearn.metrics import mean_squared_error
import pandas as pd
## Function that finds best alpha and uses it to create ewma
def find_best_ewma(series, eps=10e-5):
    def f(alpha):
        ewm = series.shift().ewm(alpha=alpha, adjust=False).mean()
        return mean_squared_error(series, ewm.fillna(0))
    result = optimize.minimize(f, .3, bounds=[(0 + eps, 1 - eps)])
    return series.shift().ewm(alpha=result.x, adjust=False).mean()
Now I want to apply this function to each of the groups created using pandas-groupby on the following test df:
## test
data1 data2 key1 key2
0 -0.018442 -1.564270 a x
1 -0.038490 -1.504290 b x
2 0.953920 -0.283246 a x
3 -0.231322 -0.223326 b y
4 -0.741380 1.458798 c z
5 -0.856434 0.443335 d y
6 -1.416564 1.196244 c z
To do so, I tried the following two ways:
## First way
test.groupby(["key1","key2"])["data1"].apply(find_best_ewma)
## Output
0 NaN
1 NaN
2 -0.018442
3 NaN
4 NaN
5 NaN
6 -0.741380
Name: data1, dtype: float64
## Second way
test.groupby(["key1","key2"]).apply(lambda g: find_best_ewma(g["data1"]))
## Output
key1 key2
a x 0 NaN
2 -0.018442
b x 1 NaN
y 3 NaN
c z 4 NaN
6 -0.741380
d y 5 NaN
Name: data1, dtype: float64
Both ways produce a pandas.core.series.Series but ONLY the second way provides the expected hierarchical index.
I do not understand why the first way does not produce the hierarchical index and instead returns the original dataframe index. Could you please explain why this happens?
What am I missing?
Thanks in advance for your help.
The first way creates a pandas.core.groupby.DataFrameGroupBy object, which becomes a pandas.core.groupby.SeriesGroupBy object once you select a specific column from it. It is this object that the apply method is called on, hence a Series is returned.
test.groupby(["key1","key2"])["data1"]#.apply(find_best_ewma)
<pandas.core.groupby.SeriesGroupBy object at 0x7fce51fac790>
The second way remains a DataFrameGroupBy object. The function you apply to that object selects the column itself, so find_best_ewma is still applied to each group's 'data1' values, but apply is called on the original DataFrameGroupBy. In that case pandas prepends the group keys (key1, key2) to the index of the combined result; that is the 'magic' that gives you the hierarchical index.
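If you want the hierarchical index while still selecting the column first, one option (an addition of my own, not part of the explanation above) is to build it explicitly from the group keys with pd.concat, since iterating the SeriesGroupBy yields (key-tuple, group) pairs:
grouped = test.groupby(["key1", "key2"])["data1"]
result = pd.concat({key: find_best_ewma(g) for key, g in grouped})
# the (key1, key2) tuples become the outer levels of a MultiIndex,
# and each group's original row labels become the innermost level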
I have a pandas data frame, where some string values are "NA". I want to replace these values in a specific column (i.e. the 'strCol' in the example below) using method chaining.
How do I do this? (I googled quite a bit without success even though this should be easy?! ...)
Here is a minimal example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['val1', 'val2', 'NA', 'val3']})
df = (
    df
    .rename(columns={'A': 'intCol', 'B': 'strCol'})  # method chain example operation 1
    .astype({'intCol': float})                       # method chain example operation 2
    # .where(df['strCol']=='NA', pd.NA)              # how to replace the string 'NA' here? this does not work ...
)
df
You can try replace instead of where:
df.replace({'strCol':{'NA':pd.NA}})
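If you want to keep the method chain from the question, replace slots straight into it (a sketch reusing the question's chain):
df = (
    df
    .rename(columns={'A': 'intCol', 'B': 'strCol'})
    .astype({'intCol': float})
    .replace({'strCol': {'NA': pd.NA}})  # replace the string 'NA' only in 'strCol'
)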
Use lambda in where clause to evaluate the chained dataframe:
df = (df.rename(columns={'A': 'intCol', 'B': 'strCol'})
        .astype({'intCol': float})
        .where(lambda x: x['strCol']=='NA', pd.NA))
Output:
>>> df
intCol strCol
0 NaN <NA>
1 NaN <NA>
2 3.0 NA
3 NaN <NA>
Many methods like where, mask, groupby, and apply can take a callable, so you can pass a lambda function.
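The lambda is what makes the condition usable inside a chain: at the point of the .where call, the name df still refers to the original frame (columns 'A' and 'B'), so df['strCol'] would raise a KeyError, whereas the lambda receives the renamed intermediate frame. A minimal illustration of my own:
df.rename(columns={'B': 'strCol'}).where(df['strCol'] == 'NA', pd.NA)            # KeyError: 'strCol' is not in the original df
df.rename(columns={'B': 'strCol'}).where(lambda x: x['strCol'] == 'NA', pd.NA)   # works: x is the renamed intermediate frame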
pandas.DataFrame.where does the following:
Replace values where the condition is False.
So the condition must not hold where you want to make a replacement. A simple example:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5,6,7,8,9]})
df2 = df.where(df.x%2==0,-1)
print(df2)
gives output
x
0 -1
1 2
2 -1
3 4
4 -1
5 6
6 -1
7 8
8 -1
Observe that the odd values were replaced by -1, while the even values, for which the condition does hold, were kept.
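Translated back to the original question: the condition should hold where values are to be kept, so on the single column you could write (a sketch of my own, not from the answers above):
df['strCol'].where(df['strCol'] != 'NA', pd.NA)   # keep everything except the string 'NA'
# or equivalently, mask replaces where the condition IS True:
df['strCol'].mask(df['strCol'] == 'NA', pd.NA)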
I am doing some computation on a dataset using loops. Then, based on a random event, I compute some float numbers (this means I don't know in advance how many floats I am going to retrieve). I want to save these numbers (results) in some kind of list and then save them to a dataframe column. I want these results for each iteration of my loop, stored in a column so I can compare them; that is, each iteration will produce a "list" of results that will be registered in a df column.
example:
for y in range(1, 10):
    for x in range(1, 100):
        if x > random_number and x < y:  # random_number stands for some randomly drawn value
            result = 2 * x
I want to save all the results in dataframe columns, one per combination of x and y. For example, the results for x=1, y=2 go in one column, then x=2, y=2 in another, and so on. The results are not all the same size, so I guess I'll use fillna.
Now I know that I could create an empty dataframe with a maximal index and then fill it result by result, but I think there's a better way to do it!
Thanks in advance.
You want to take advantage of the efficiency that numpy and pandas give you. If you use numpy.where, you can set the value to nan when the if statement is False, and otherwise you can execute your formula:
import numpy as np
import pandas as pd
np.random.seed(0) # so you can reproduce my result, you can remove this in practice
x = list(range(10))
y = list(range(1, 11))
random_nums = 10 * np.random.random(10)
df = pd.DataFrame({'x' : x, 'y': y})
# the first argument is your if condition
df['new_col'] = np.where((df['x'] > random_nums) & (df['x'] < df['y']), 2*df['x'], np.nan)
print(df)
Here, random_nums is an entire np.ndarray of random numbers to compare against. This gives
x y new_col
0 0 1 NaN
1 1 2 NaN
2 2 3 NaN
3 3 4 NaN
4 4 5 NaN
5 5 6 NaN
6 6 7 12.0
7 7 8 NaN
8 8 9 NaN
9 9 10 18.0
The speedup over an explicit loop is especially large when your formula (here, 2*x) is quick to compute.
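If you do need one column per y value as described in the question, here is a rough sketch of my own (not part of the answer above) that reuses the random_nums array, one draw per y, collects the surviving results into Series of unequal length, and lets pandas pad them with NaN:
results = {
    f"y={y}": pd.Series([2 * x for x in range(1, 100)
                         if x > random_nums[y - 1] and x < y])
    for y in range(1, 10)
}
wide = pd.DataFrame(results)  # columns of different lengths are aligned and NaN-padded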
I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called 'agent' that has 13.7% missing values. My intuition is to just drop the rows with missing values, but considering the number of missing values is not that small, I now want to use Random Sampling Imputation to replace them in proportion to the existing values.
My code is:
new_agent = hotel['agent'].dropna()
agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))
Results:
The first 3 rows were NaN but are now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?
UPDATE:
Thanks to ti7 for helping me solve the problem:
new_agent = hotel['agent'].dropna()  # get a series of just the available values
n_null = hotel['agent'].isnull().sum()  # length of the missing entries
new_agent.sample(n_null, replace=True).values  # sample it with repetition and get values
hotel.loc[hotel['agent'].isnull(), 'agent'] = new_agent.sample(n_null, replace=True).values  # fill and replace
.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!
You probably want to generate a new Series of random values drawn from your current Series (you know how many you need from the difference in lengths) and use that for the missing values.
get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign
quick code
df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
df["agent"].isna().sum(), # get the same number of values as are missing
replace=True # repeat values
).values # throw out the index
demo
>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
agent b
0 1.0 3
1 2.0 4
2 NaN 5
3 NaN 6
4 10.0 7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
... df["agent"].isna().sum(),
... replace=True
... ).values
>>> df
agent b
0 1.0 3
1 2.0 4
2 10.0 5
3 2.0 6
4 10.0 7
I have a dataframe as shown below. The last column shows the sum of values from all the columns i.e. A,B,D,K and T. Please note some of the columns have NaN as well.
word1,A,B,D,K,T,sum
na,,63.0,,,870.0,933.0
sva,,1.0,,3.0,695.0,699.0
a,,102.0,,1.0,493.0,596.0
sa,2.0,487.0,,2.0,15.0,506.0
su,1.0,44.0,,136.0,214.0,395.0
waw,1.0,9.0,,34.0,296.0,340.0
How can I calculate the entropy for each row? I.e., I should compute something like the following:
df['A']/df['sum']*log(df['A']/df['sum']) + df['B']/df['sum']*log(df['B']/df['sum']) + ...... + df['T']/df['sum']*log(df['T']/df['sum'])
The condition is that whenever the value inside the log becomes zero or NaN, the whole value should be treated as zero (by definition, the log will return an error as log 0 is undefined).
I am aware of using a lambda operation applied to individual columns. But I cannot think of a pure pandas solution in which the fixed 'sum' column is applied across the different columns A, B, D, etc. I can only think of a simple loop over the CSV file with hard-coded column values.
I think you can use loc for selecting the columns from A to T, divide them by the sum column with div, multiply by numpy.log of that ratio, and finally use sum:
print (df['A']/df['sum']*np.log(df['A']/df['sum']))
0 NaN
1 NaN
2 NaN
3 -0.021871
4 -0.015136
5 -0.017144
dtype: float64
print (df.loc[:,'A':'T'].div(df['sum'],axis=0)*np.log(df.loc[:,'A':'T'].div(df['sum'],axis=0)))
A B D K T
0 NaN -0.181996 NaN NaN -0.065191
1 NaN -0.009370 NaN -0.023395 -0.005706
2 NaN -0.302110 NaN -0.010722 -0.156942
3 -0.021871 -0.036835 NaN -0.021871 -0.104303
4 -0.015136 -0.244472 NaN -0.367107 -0.332057
5 -0.017144 -0.096134 NaN -0.230259 -0.120651
print((df.loc[:,'A':'T'].div(df['sum'],axis=0)*np.log(df.loc[:,'A':'T'].div(df['sum'],axis=0)))
      .sum(axis=1))
0 -0.247187
1 -0.038471
2 -0.469774
3 -0.184881
4 -0.958774
5 -0.464188
dtype: float64
df1 = df.iloc[:, :-1]              # all value columns, i.e. drop the 'sum' column
df2 = df1.div(df1.sum(1), axis=0)  # normalize each row by its own total
df2.mul(np.log(df2)).sum(1)        # row-wise sum of p * log(p); NaNs are skipped
word1
na -0.247187
sva -0.038471
a -0.469774
sa -0.184881
su -0.958774
waw -0.464188
dtype: float64
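One detail worth noting (my own addition): the expression above is the row-wise sum of p*log(p), which is the negative of the Shannon entropy, so negate it if you want the entropy itself:
entropy = -df2.mul(np.log(df2)).sum(1)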
Setup
from io import StringIO
import numpy as np
import pandas as pd
text = """word1,A,B,D,K,T,sum
na,,63.0,,,870.0,933.0
sva,,1.0,,3.0,695.0,699.0
a,,102.0,,1.0,493.0,596.0
sa,2.0,487.0,,2.0,15.0,506.0
su,1.0,44.0,,136.0,214.0,395.0
waw,1.0,9.0,,34.0,296.0,340.0"""
df = pd.read_csv(StringIO(text), index_col=0)
df
Suppose I have the data set below in a dataframe, df:
import pandas as pd
df = pd.DataFrame({'ID' : ['A','A','A','B','B','B'], 'Date' : ['1-Jan','2-Jan','3-Jan','1-Jan','2-Jan','3-Jan'],'VAL' : [45,23,54,65,76,23]})
I am trying to insert a column, say 'new_col', that calculates the percent change in VAL that is grouped by ID. So, for example, I would want the percent change from 45 to 23, 23 to 54, and then restart for ID 'B'. The below code works but it calculates the percent change regardless of ID.
df['new_col'] = (df['VAL'] - df['VAL'].shift(1)) / df['VAL'].shift(1)
I tried adding the group by function in front of it but I am still getting an error:
df['new_col'] = df.groupby('ID')[(df['VAL'] - df['VAL'].shift(1)) / df['VAL'].shift(1)]
You can't just stick your expression in brackets onto the groupby like that. What you need to do is use apply to apply a function that calculates what you want. What you want can be calculated more simply using the diff method:
>>> df.groupby('ID')['VAL'].apply(lambda g: g.diff()/g.shift())
0 NaN
1 -0.488889
2 1.347826
3 NaN
4 0.169231
5 -0.697368
dtype: float64
As DSM notes in a comment, in this case you can do it directly with the pct_change method:
>>> df.groupby('ID')['VAL'].pct_change()
0 NaN
1 -0.488889
2 1.347826
3 NaN
4 0.169231
5 -0.697368
dtype: float64
However, it is good to be aware of how to do it with apply because you'll need to do things that way if you want to do a more complex operation on the groups (i.e., an operation for which there is no predefined one-shot method).
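For instance (a hypothetical example of my own, not from the original answer), the percent change relative to each group's first value has no dedicated one-shot method, but apply handles it easily:
df.groupby('ID')['VAL'].apply(lambda g: g / g.iloc[0] - 1)  # change relative to the group's first value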