Python: take log difference for each column in a dataframe

I have a list of dataframes and would like to take the log of every element in these dataframes and then take the first difference. In time series econometrics, this procedure gives an approximate growth rate. The following code
for i in [0, 1, 2, 5]:
    df1_list[i] = 100 * np.log(df_list[i]).diff()
gives an error
__main__:7: RuntimeWarning: divide by zero encountered in log
__main__:7: RuntimeWarning: invalid value encountered in log
When I look at the result, many elements of the resulting dataframes are NaN. How can I fix the code? Thanks!

The problem is not with your code, but with your data. You do not get an error but two warnings. The most likely reasons are the following kinds of values within your DataFrames:
Zeros
Negative numbers
Non-numerical values
The logarithm of any of these is not defined: the log of zero evaluates to -inf (the "divide by zero" warning), while the log of a negative number gives NaN (the "invalid value" warning), and the difference then propagates NaN through your result.
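As a quick illustration (a minimal example of my own, not from the question), each warning corresponds to a different kind of input:
import numpy as np
np.log(0.0)   # -inf, RuntimeWarning: divide by zero encountered in log
np.log(-1.0)  # nan,  RuntimeWarning: invalid value encountered in log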

Some test data
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 5))
df = df.mask(np.random.random(df.shape) < .1)
0 1 2 3 4
0 0.579643 0.614592 0.333945 0.241791 0.426162
1 0.576076 0.841264 0.235148 0.577707 0.278260
2 0.735097 0.594789 0.640693 0.913639 0.620021
3 0.015446 NaN 0.062203 0.253076 0.042025
4 0.401775 0.522634 0.521139 0.032310 NaN
Applying your code
for c in df:
    print(100 * np.log(df[c]).diff())
yields output like this (for c = 1):
0 NaN
1 31.394708
2 -34.670002
3 NaN
4 NaN
You can remove nans with .dropna()
for c in df:
    print(100 * np.log(df[c].dropna()).diff())
which yields (for c = 1)
0 NaN
1 31.394708
2 -34.670002
4 -12.932474
As you can see, we have "lost" one row as a consequence of .dropna(), and your 0th row will always be NaN because there is no previous value to take the difference with.
If you are interested in replacing nans with other values, there are different techniques such as fillna or imputation.
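If the zeros or negative values are not meaningful observations, a minimal sketch (assuming the df_list and df1_list from the question) is to mask non-positive entries before taking the log, so they become NaN quietly instead of triggering the warnings, and then decide how to fill any remaining gaps:
import numpy as np
for i in [0, 1, 2, 5]:
    clean = df_list[i].where(df_list[i] > 0)   # non-positive values become NaN, no warnings
    df1_list[i] = 100 * np.log(clean).diff()   # log difference as in the question
    # df1_list[i] = df1_list[i].fillna(0)      # optional: treat missing growth rates as 0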

Related

In pandas: Interpolate between two rows such that the sum of interpolated values is equal to the second row

I am looking for a way to interpolate between two values (A and G) such that the sum of the interpolated values is equal to the second value (G), preferably while the distances between the interpolated values are linear/equally-sized.
What I got is:
Label  Value
A      0
B      NaN
C      NaN
D      NaN
E      NaN
F      NaN
G      10
... and I want to get to this:
Label  Value
A      0
B      2
C      2
D      2
E      2
F      2
G      10
The function pandas.interpolate unfortunately does not allow for this. I could manually create sections in these columns using something like numpy.linspace, but this appears to be a rather makeshift solution and not particularly efficient for larger tables where the indices that require interpolation are irregularly scattered across rows.
What is the most efficient way to do this in Python?
I don't know if this is the most efficient way but it works for any number of breaks, including none, using only numpy and pandas:
df['break'] = np.where(df['Value'].notnull(), 1, 0)
df['group'] = df['break'].shift().fillna(0).cumsum()
df['Value'] = df.groupby('group').Value.apply(lambda x: x.fillna(x.max() / (len(x) - 1)))
You will get a couple of warnings from the underlying numpy calculations due to NaNs and zeroes but the replacement still works.
RuntimeWarning: invalid value encountered in double_scalars
RuntimeWarning: divide by zero encountered in double_scalars
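For completeness, here is a minimal end-to-end sketch. The construction of the frame is my own (assumed from the question's table); the per-group fill is the same idea as above, written with transform so the fill value aligns back onto the original index:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Label': list('ABCDEFG'),
                   'Value': [0, np.nan, np.nan, np.nan, np.nan, np.nan, 10]})
df['break'] = np.where(df['Value'].notnull(), 1, 0)
df['group'] = df['break'].shift().fillna(0).cumsum()
fill = df.groupby('group')['Value'].transform(lambda x: x.max() / (len(x) - 1))
df['Value'] = df['Value'].fillna(fill)
print(df[['Label', 'Value']])   # B..F become 2.0, and their sum equals the 10 at G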

Pandas Groupby with lambda gives some NANs

I have a DF where I'd like to create a new column with the difference of 2 other column values.
name rate avg_rate
A 10 3
B 6 5
C 4 3
I wrote this code to calculate the difference:
result= df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate)
df['rate_diff']=result.reset_index(drop=True)
df.tail(3)
But I notice that some of the calculated values are NaN. What is the best way to handle this?
Output I am getting:
name rate avg_rate rate_diff
A 10 3 NAN
B 6 5 NAN
C 4 3 NAN
If you want to use groupby and apply, then the following should work:
res = df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate).reset_index().set_index('level_1')
df = pd.merge(df,res,on=['name'],left_index = True, right_index=True).rename({0:'rate_diff'},axis=1)
However, as #sacuL suggested in the comments, you don't need groupby to calculate the difference: you get it by simply subtracting the two columns element-wise, and groupby/apply is overkill for this simple task (the NaNs in your attempt most likely come from the index of the groupby result no longer aligning with df when you assign it back).
df["rate_diff"] = df.rate - df.avg_rate

Moving pandas series value by switching column name?

I have a DF, however the last value of some series should be placed in a different one. This happened due to column names not being standardized - i.e., some are "Wx_y_x_PRED" and some are "Wx_x_y_PRED". I'm having difficulty writing a function that will simply find the columns with >= 225 NaN's and change the column the value is assigned to.
I've written a function that for some reason will sometimes work and sometimes won't. When it does, it further creates approx 850 columns in its wake (the OG dataframe is around 420 with the duplicate columns). I'm hoping to have something that just reassigns the value. If it automatically deletes the incorrect column, that's awesome too, but I just used .dropna(thresh = 2) when my function worked originally.
Here's what it looks like originally:
in: df = pd.DataFrame(data={'W10_IND_JAC_PRED': ['NaN','NaN','NaN','NaN','NaN',2],
                            'W10_JAC_IND_PRED': [1,2,1,2,1,'NAN']})
out: df
W10_IND_JAC_PRED W10_JAC_IND_PRED
0 NaN 1
1 NaN 2
2 NaN 1
3 NaN 2
4 NaN 1
W 2 NAN
I wrote this, which occasionally works but most of the time doesn't, and I'm not sure why.
def switch_cols(x):
    """Takes mismatched columns (where only the last value != NaN) and changes order of team column names"""
    if x.isna().sum() == 5:
        col_string = x.name.split('_')
        col_to_switch = ('_').join([col_string[0], col_string[2], col_string[1], 'PRED'])
        df[col_to_switch]['row_name'] = x[-1]
    else:
        pass
    return x
Most of the time it just returns to me the exact same DF, but this is the desired outcome.
W10_IND_JAC_PRED W10_JAC_IND_PRED
0 NaN 1
1 NaN 2
2 NaN 1
3 NaN 2
4 NaN 1
W 2 2
Anyone have any tips or could share why my function works maybe 10% of the time?
Edit:
so this is an ugly "for" loop I wrote that works. I know there has to be a much more pythonic way of doing this while preserving original column names, though.
for i in range(df.shape[1]):
    if df.iloc[:, i].isna().sum() == 5:
        split_nan_col = df.columns[i].split('_')
        correct_col_name = ('_').join([split_nan_col[0], split_nan_col[2], split_nan_col[1], split_nan_col[3]])
        df.loc[5, correct_col_name] = df.loc[5, df.columns[i]]
    else:
        pass
Split the column names on '_' and map them to frozenset, so names that differ only in the order of their parts collapse to the same key, then join them back into a single name. Notice this solution extends to any number of columns:
df.columns=df.columns.str.split('_').map(frozenset).map('_'.join)
df.mask(df=='NaN').groupby(level=0,axis=1).first() # groupby first will return the first not null value
PRED_JAC_W10_IND
0 1
1 2
2 1
3 2
4 1
5 2
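One caveat: frozenset does not preserve the order of the name parts, and groupby(..., axis=1) is deprecated in recent pandas. A sketch of the same idea with deterministic names (my own variant, assuming the df built in the question):
# normalise each name by sorting its parts, so mirrored names collapse to one key
df.columns = ['_'.join(sorted(c.split('_'))) for c in df.columns]
# transpose, take the first non-null value per normalised name, transpose back
out = df.mask(df == 'NaN').T.groupby(level=0).first().T
print(out)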

Python pandas groupby category and integer variables results in pandas last and tail difference

UPDATE:
Please download my full dataset here.
My datatypes are:
>>> df.dtypes
increment int64
spread float64
SYM_ROOT category
dtype: object
I have realized that the problem might have been caused by the fact that my SYM_ROOT is a category variable.
To replicate the issue you might want to do the following first:
df=pd.read_csv("sf.csv")
df['SYM_ROOT']=df['SYM_ROOT'].astype('category')
But I am still puzzled as to why my SYM_ROOT being a category results in the gaps in increment being filled with NA, unless grouping by a category and an integer value produces a balanced panel by default.
I noticed that the behaviour of df.groupby().last() is different from that of df.groupby().tail(1).
For example, suppose I have the following data:
increment is an integer that spans from 0 to 4680. However, for some SYM_ROOT variable, there are gaps in between. For example, 4 could be missing from it.
What I want to do is to keep the last observation per group.
If I do df.groupby(['SYM_ROOT','increment']).last(), the dataframe becomes:
While if I do df.groupby(['SYM_ROOT','increment']).tail(1), the dataframe becomes:
It looks to me that the last() statement will create a balanced time-series data and fill in the gaps with NaN, while the tail(1) statement doesn't. Is it correct?
Update:
One of your grouping columns (SYM_ROOT) is a category dtype. Here is a small reproduction where B plays that role:
df=pd.DataFrame({'A':[1,1,2,2],'B':[1,1,2,3],'C':[1,1,1,1]})
df.B=df.B.astype('category')
df.groupby(['A','B']).last()
Out[590]:
     C
A B
1 1  1.0
  2  NaN
  3  NaN
2 1  NaN
  2  1.0
  3  1.0
When you use tail it will not make up the missing levels, since tail works on the dataframe rows rather than on the individual grouping columns:
df.groupby(['A','B']).tail(1)
Out[593]:
A B C
1 1 1 1
2 2 2 1
3 2 3 1
After changing it back with astype:
df.B=df.B.astype('int')
df.groupby(['A','B']).last()
Out[591]:
     C
A B
1 1  1
2 2  1
  3  1
It is actually a known issue on GitHub, where the problem is mainly caused by groupby expanding categorical groupers to all of their category levels.
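If casting to int is not desirable, a minimal sketch of the usual workaround in recent pandas versions (reusing the small A/B/C frame above) is to pass observed=True so groupby only keeps level combinations that actually occur:
df.B = df.B.astype('category')
df.groupby(['A', 'B'], observed=True).last()   # no NaN rows for unobserved (A, B) pairs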

Applying an operation on multiple columns with a fixed column in pandas

I have a dataframe as shown below. The last column shows the sum of values from all the columns, i.e. A, B, D, K and T. Please note that some of the columns have NaN as well.
word1,A,B,D,K,T,sum
na,,63.0,,,870.0,933.0
sva,,1.0,,3.0,695.0,699.0
a,,102.0,,1.0,493.0,596.0
sa,2.0,487.0,,2.0,15.0,506.0
su,1.0,44.0,,136.0,214.0,395.0
waw,1.0,9.0,,34.0,296.0,340.0
How can I calculate the entropy for each row? i.e. I should find something like following
df['A']/df['sum']*log(df['A']/df['sum']) + df['B']/df['sum']*log(df['B']/df['sum']) + ...... + df['T']/df['sum']*log(df['T']/df['sum'])
The condition is that whenever the value inside the log becomes zero or NaN, the whole value should be treated as zero (by definition, the log will return an error as log 0 is undefined).
I am aware of using a lambda operation to apply on individual columns. Here I am not able to think of a pure pandas solution where the fixed column sum is applied across the different columns A, B, D, etc., though I can think of a simple loop-wise iteration over the CSV file with hard-coded column values.
I think you can use ix for selecting the columns from A to T, divide by the sum column with div, take numpy.log, and finally use sum across the rows:
print (df['A']/df['sum']*np.log(df['A']/df['sum']))
0 NaN
1 NaN
2 NaN
3 -0.021871
4 -0.015136
5 -0.017144
dtype: float64
print (df.ix[:,'A':'T'].div(df['sum'],axis=0)*np.log(df.ix[:,'A':'T'].div(df['sum'],axis=0)))
A B D K T
0 NaN -0.181996 NaN NaN -0.065191
1 NaN -0.009370 NaN -0.023395 -0.005706
2 NaN -0.302110 NaN -0.010722 -0.156942
3 -0.021871 -0.036835 NaN -0.021871 -0.104303
4 -0.015136 -0.244472 NaN -0.367107 -0.332057
5 -0.017144 -0.096134 NaN -0.230259 -0.120651
print((df.ix[:,'A':'T'].div(df['sum'],axis=0)*np.log(df.ix[:,'A':'T'].div(df['sum'],axis=0)))
.sum(axis=1))
0 -0.247187
1 -0.038471
2 -0.469774
3 -0.184881
4 -0.958774
5 -0.464188
dtype: float64
df1 = df.iloc[:, :-1]
df2 = df1.div(df1.sum(1), axis=0)
df2.mul(np.log(df2)).sum(1)
word1
na -0.247187
sva -0.038471
a -0.469774
sa -0.184881
su -0.958774
waw -0.464188
dtype: float64
Setup
from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
import pandas as pd
text = """word1,A,B,D,K,T,sum
na,,63.0,,,870.0,933.0
sva,,1.0,,3.0,695.0,699.0
a,,102.0,,1.0,493.0,596.0
sa,2.0,487.0,,2.0,15.0,506.0
su,1.0,44.0,,136.0,214.0,395.0
waw,1.0,9.0,,34.0,296.0,340.0"""
df = pd.read_csv(StringIO(text), index_col=0)
df
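Note that .ix has since been removed from pandas; an equivalent sketch of the first approach using .loc (assuming the df built in the Setup above):
import numpy as np
probs = df.loc[:, 'A':'T'].div(df['sum'], axis=0)   # per-column proportions
print((probs * np.log(probs)).sum(axis=1))          # NaN terms are skipped by sum, matching the output above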
