Create new column in pandas DataFrame calculated from the previous row - python

Hi, I am new to pandas and learning it for data analysis. I have two columns of data:
A B
1 2
2 3
3 4
4 5
I want to create a third column C, calculated from column B by subtracting the value in the row above from the current value and dividing by the current value.
A B C
1 2
2 3 0.33
3 4 0.25
4 5 0.2
For example, the first value of column C is empty because there is no value above the first B value (2).
0.33 => (3 - 2) / 3
0.25 => (4 - 3) / 4
0.2 => (5 - 4) / 5, and so on
I am stuck on getting the previous row's value of the column. How can I achieve that?

Use shift to shift the column; the remaining operations are the regular ones (sub and div):
df['B'].sub(df['B'].shift()).div(df['B'])
Out:
0 NaN
1 0.333333
2 0.250000
3 0.200000
Name: B, dtype: float64
This can also be done without chaining the methods, if you prefer.
(df['B'] - df['B'].shift()) / df['B']
Out[48]:
0 NaN
1 0.333333
2 0.250000
3 0.200000
Name: B, dtype: float64
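Either way, to store the result as the new column C from the question, assign it back (a minimal sketch):
df['C'] = (df['B'] - df['B'].shift()) / df['B']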

Edit for handling NaN and decimals.
df['C'] = (1 - df.B.shift() / df.B).map(lambda x: '{0:.2f}'.format(round(x,2))).replace('nan','')
Output:
A B C
0 1 2
1 2 3 0.33
2 3 4 0.25
3 4 5 0.20
Let's simplify and use the following with shift to get the previous value:
df['C'] = 1 - df.B.shift() / df.B
Output:
A B C
0 1 2 NaN
1 2 3 0.333333
2 3 4 0.250000
3 4 5 0.200000

Or you can simply use diff:
df.B.diff() / df.B
Out[545]:
0 NaN
1 0.333333
2 0.250000
3 0.200000
Name: B, dtype: float64
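As a usage note: diff() is shorthand for B - B.shift(), so assigning it back gives the same column C as the earlier answers. If you want two decimals while keeping the column numeric (unlike the string-formatting edit above, which turns C into strings), rounding is enough (a sketch):
df['C'] = (df['B'].diff() / df['B']).round(2)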

Related

replace values by condition after group by

So I have a dataframe like the one below.
dff = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3], 'categ':['A','A','A','B','C','A','A','A','B','C','A','A','A','B','C'],'cost':[3,1,1,3,10,1,2,3,4,10,2,2,2,4,13] })
dff
id categ cost
0 1 A 3
1 1 A 1
2 1 A 1
3 1 B 3
4 1 C 10
5 2 A 1
6 2 A 2
7 2 A 3
8 2 B 4
9 2 C 10
10 3 A 2
11 3 A 2
12 3 A 2
13 3 B 4
14 3 C 13
Now I want to group by 'id' and create a new column that is True if, within each id, the sum of category A equals 50% of the cost of C and the sum of B equals 30% of it, and False otherwise. My desired output is below.
new
id
1 True
2 False
3 False
I have tried some things but I can't make it work. Any idea how to get my desired output? Thanks
Try pivoting the data frame first and then check whether columns A, B, C satisfy the condition:
import numpy as np
dff.pivot_table('cost', 'id', 'categ', aggfunc='sum')\
   .assign(new=lambda df: np.isclose(df.A, 0.5 * df.C) & np.isclose(df.B, 0.3 * df.C))
categ A B C new
id
1 5 3 10 True
2 6 4 10 False
3 6 4 13 False
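If you only need the boolean column indexed by id, you can build the Series yourself from the pivoted frame (a sketch; the name pivoted is mine):
pivoted = dff.pivot_table('cost', 'id', 'categ', aggfunc='sum')
new = pd.Series(np.isclose(pivoted.A, 0.5 * pivoted.C) & np.isclose(pivoted.B, 0.3 * pivoted.C),
                index=pivoted.index, name='new')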
Try pd.crosstab with normalize, then a little math. Since the condition is A = 0.5*C and B = 0.3*C, each qualifying row should total 0.5*C + 0.3*C + C = 1.8*C, so the normalized shares must be 0.5/1.8, 0.3/1.8 and 1/1.8.
Notice: we cannot use exact equality here because of floats; we need np.isclose.
s = pd.crosstab(df['id'], df['categ'], df['cost'],aggfunc='sum',normalize = 'index')
s['new'] = np.isclose(s.values.tolist(),[0.5/1.8,0.3/1.8,1/1.8],atol=0.0001).all(1)
s
Out[341]:
categ A B C new
id
1 0.277778 0.166667 0.555556 True
2 0.300000 0.200000 0.500000 False
3 0.260870 0.173913 0.565217 False

Pandas: re-index and interpolate in multi-index dataframe

I'm having trouble understanding pandas reindex. I have a series of measurements, munged into a multi-index df, and I'd like to reindex and interpolate those measurements to align them with some other data.
My actual data has ~7 index levels and several different measurements. I hope the solution for this toy data problem is applicable to my real data. It's "small data"; each individual measurement is a couple KB.
Here's a pair of toy problems, one which shows the expected behavior and one which doesn't seem to do anything.
Single-level index, works as expected:
"""
step,value
1,1
3,2
5,1
"""
df_i = pd.read_clipboard(sep=",").set_index("step")
print(df_i)
new_index = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
df_i = df_i.reindex(new_index).interpolate()
print(df_i)
Outputs, the original df and the re-indexed and interpolated one:
value
step
1 1
3 2
5 1
value
step
1 1.0
2 1.5
3 2.0
4 1.5
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
Works great.
Multi-index, currently not working:
"""
sample,meas_id,step,value
1,1,1,1
1,1,3,2
1,1,5,1
1,2,3,2
1,2,5,2
1,2,7,1
1,2,9,0
"""
df_mi = pd.read_clipboard(sep=",").set_index(["sample", "meas_id", "step"])
print(df_mi)
df_mi = df_mi.reindex(new_index, level="step").interpolate()
print(df_mi)
Output, unchanged after reindex (and therefore after interpolate):
value
sample meas_id step
1 1 1 1
3 2
5 1
2 3 2
5 2
7 1
9 0
value
sample meas_id step
1 1 1 1
3 2
5 1
2 3 2
5 2
7 1
9 0
How do I actually reindex a column in a multi-index df?
Here's the output I'd like, assuming linear interpolation:
value
sample meas_id step
1 1 1 1
2 1.5
3 2
5 1
6 1
7 1
8 1
9 1
2 1 NaN (or 2)
2 NaN (or 2)
3 2
4 2
5 2
6 1.5
7 1
8 0.5
9 0
I spent a fair amount of time looking over SO, and if the answer is in there, I missed it:
Fill multi-index Pandas DataFrame with interpolation
Resampling Within a Pandas MultiIndex
pandas multiindex dataframe, ND interpolation for missing values
https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing
Possibly related GitHub issues:
https://github.com/numpy/numpy/issues/11975
https://github.com/pandas-dev/pandas/issues/23104
https://github.com/pandas-dev/pandas/issues/17132
IIUC, create the full index using MultiIndex.from_product, then just reindex:
idx=pd.MultiIndex.from_product([df_mi.index.levels[0],df_mi.index.levels[1],new_index])
df_mi.reindex(idx).interpolate()
Out[161]:
value
1 1 1 1.000000
2 1.500000
3 2.000000
4 1.500000
5 1.000000
6 1.142857
7 1.285714
8 1.428571
9 1.571429
2 1 1.714286 # bad: the interpolation carries the previous group's values over
2 1.857143
3 2.000000
4 2.000000
5 2.000000
6 1.500000
7 1.000000
8 0.500000
9 0.000000
My take: build a per-group index instead, so the interpolation never crosses groups.
def idx(x):
    return pd.MultiIndex.from_product([x.index.get_level_values(0).unique(),
                                       x.index.get_level_values(1).unique(),
                                       new_index])
pd.concat([y.reindex(idx(y)).interpolate() for _,y in df_mi.groupby(level=[0,1])])
value
1 1 1 1.0
2 1.5
3 2.0
4 1.5
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
2 1 NaN
2 NaN
3 2.0
4 2.0
5 2.0
6 1.5
7 1.0
8 0.5
9 0.0
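For what it's worth, the same per-group reindex can also be written with groupby and apply instead of concat (a sketch, assuming the same new_index as above; groupby.apply prepends the group keys, restoring the (sample, meas_id, step) index):
(df_mi.reset_index('step')
      .groupby(level=['sample', 'meas_id'])
      .apply(lambda g: g.set_index('step').reindex(new_index).interpolate()))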

How to find rate of change across successive rows using time and data columns after grouping by a different column using pandas?

I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is find the rate of change of data_col using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row and the rate_of_change is calculated separately for different IDs.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
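Since droplevel leaves the original row index, the result aligns with df and can be attached directly:
df['rate_of_change'] = s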
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use pandas.diff:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
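If you prefer to skip apply entirely, GroupBy.diff computes the per-group differences directly, which avoids the index juggling (a sketch):
df['rate_of_change'] = (df.groupby('ID_col')['data_col'].diff()
                        / df.groupby('ID_col')['time_in_hours'].diff().abs())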

pandas df.mean for multi-index across axis 0

How do you get the mean across axis 0 for a certain multi-index level (index_col [1])? I have
df:
1 2 3
h a 1 4 8
h b 5 4 6
i a 9 3 6
i b 5 2 5
j a 2 2 2
j b 4 4 4
I would like to create df1: the mean for each 2nd-level index value across axis 0 ('a', 'b', 'a', 'b')
df1:
1 2 3
0 a 4 3 5.3
1 b 4.6 3.3 5
I know that I can select certain rows
df.loc[['a','b']].mean(axis=0)
but I'm not sure how this relates to multi-index dataframes.
I think you need to groupby the second level and take the mean:
print (df.groupby(level=1).mean())
1 2 3
a 4.000000 3.000000 5.333333
b 4.666667 3.333333 5.000000
And if necessary, round the values:
print (df.groupby(level=1).mean().round(1))
1 2 3
a 4.0 3.0 5.3
b 4.7 3.3 5.0
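For reference, a minimal sketch to reproduce the frame above and test this (the values and column names 1, 2, 3 are taken from the question):
import pandas as pd
df = pd.DataFrame({1: [1, 5, 9, 5, 2, 4],
                   2: [4, 4, 3, 2, 2, 4],
                   3: [8, 6, 6, 5, 2, 4]},
                  index=pd.MultiIndex.from_product([['h', 'i', 'j'], ['a', 'b']]))
print(df.groupby(level=1).mean())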

How to get log rate of change between rows in Pandas DataFrame effectively?

Let's say I have some DataFrame (with about 10000 rows in my case, this is just a minimal example)
>>> import pandas as pd
>>> sample_df = pd.DataFrame(
{'col1': list(range(1, 10)), 'col2': list(range(10, 19))})
>>> sample_df
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 6 15
6 7 16
7 8 17
8 9 18
For my purposes, I need to calculate the series represented by ln(col_i(n+1) / col_i(n)) for each col_i in my DataFrame, where n represents a row number.
How can I calculate this?
Background knowledge
I know that I can get the difference between each column in a very simple way using
>>> sample_df.diff()
col1 col2
0 NaN NaN
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
Or the percentage change, which is (col_i(n+1) - col_i(n)) / col_i(n), using
>>> sample_df.pct_change()
col1 col2
0 NaN NaN
1 1.000000 0.100000
2 0.500000 0.090909
3 0.333333 0.083333
4 0.250000 0.076923
5 0.200000 0.071429
6 0.166667 0.066667
7 0.142857 0.062500
8 0.125000 0.058824
I have just been struggling to find a straightforward way to divide each element by the previous one in the same column. If I knew even that, I could just apply the natural logarithm to every element of the resulting series.
Currently, to solve my problem, I'm resorting to creating another column shifted down by one row for each column and then applying the formula between the two columns. It seems messy and sub-optimal to me, though.
Any help would be greatly appreciated!
IIUC, the log of a ratio is the difference of logs:
import numpy as np
sample_df.apply(np.log).diff()
Or better still:
np.log(sample_df).diff()
Just use np.log:
np.log(sample_df.col1 / sample_df.col1.shift())
You can also use apply as suggested by @nikita, but that will be slower.
In addition, if you want to do it for the entire DataFrame, you can just do:
np.log(sample_df / sample_df.shift())
You can use shift for that, which does what you have proposed.
>>> sample_df['col1'].shift()
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
Name: col1, dtype: float64
The final answer would be:
import math
(sample_df['col1'] / sample_df['col1'].shift()).apply(math.log)
0 NaN
1 0.693147
2 0.405465
3 0.287682
4 0.223144
5 0.182322
6 0.154151
7 0.133531
8 0.117783
Name: col1, dtype: float64
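Incidentally, since ln(x[n+1] / x[n]) = ln(1 + pct_change), the same series also falls out of pct_change via log1p (a sketch):
np.log1p(sample_df.pct_change())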
