Modify data frame based on the row index

Given a pandas DataFrame of shape (20, 40), I would like to modify the first 10 rows of the first 20 columns using the row index.
For example, if
df.iloc[5,6] = 0.98,
I would like to modify the value in the following way:
new df.iloc[5,6] = 0.98 ** -(1/5)
where 5 is the row index.
The same should be done for every value in the first 10 rows and the first 20 columns.
Can anyone help me?
Thank you very much in advance.

Can you explain what you want to do in a more general way?
I don't understand why you chose 5 here.
The way to build new columns from other columns is
df["new column"] = df["column1"] ** (-1/df["column2"])
and the way you do it with the index is the same:
df["new column"] = df["column1"] ** (-1/df.index)

You can do this operation in-place with the following snippet.
from numpy.random import default_rng
from pandas import DataFrame
from string import ascii_lowercase
rng = default_rng(0)
df = DataFrame(
    (data := rng.integers(1, 10, size=(4, 5))),
    columns=[*ascii_lowercase[:data.shape[1]]],
)
print(df)
a b c d e
0 8 6 5 3 3
1 1 1 1 2 8
2 6 9 5 6 9
3 7 6 5 6 9
# you would replace :3, :4 with :10, :20 for your data
df.iloc[:3, :4] **= (-1 / df.index)
print(df)
a b c d e
0 0 0.166667 0.447214 0.693361 3
1 1 1.000000 1.000000 0.793701 8
2 0 0.111111 0.447214 0.550321 9
3 7 6.000000 5.000000 6.000000 9
In the event your index is not a simple RangeIndex you can use numpy.arange to mimic this:
from numpy import arange
df.iloc[:3, :4] **= (-1 / arange(df.shape[0]))
print(df)
a b c d e
0 0 0.166667 0.447214 0.693361 3
1 1 1.000000 1.000000 0.793701 8
2 0 0.111111 0.447214 0.550321 9
3 7 6.000000 5.000000 6.000000 9
Note: If 0 is in your index, as it is in this example, you'll encounter a RuntimeWarning about division by zero.
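If the exponent should follow the row index, as in the 0.98 ** -(1/5) example from the question, a minimal sketch (assuming a default RangeIndex and float data; np.errstate only silences the divide-by-zero warning for row 0) could be:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# stand-in for the (20, 40) frame from the question, cast to float so the
# fractional results can be written back without dtype warnings
df = pd.DataFrame(rng.integers(1, 10, size=(20, 40))).astype(float)

with np.errstate(divide='ignore'):               # row 0 gives -1/0 -> -inf
    exponents = pd.Series(-1.0 / np.arange(10))  # one exponent per row, indexed 0..9
df.iloc[:10, :20] = df.iloc[:10, :20].pow(exponents, axis=0)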

Related

replace values by condition after group by

So I have a dataframe like the one below.
dff = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3], 'categ':['A','A','A','B','C','A','A','A','B','C','A','A','A','B','C'],'cost':[3,1,1,3,10,1,2,3,4,10,2,2,2,4,13] })
dff
id categ cost
0 1 A 3
1 1 A 1
2 1 A 1
3 1 B 3
4 1 C 10
5 2 A 1
6 2 A 2
7 2 A 3
8 2 B 4
9 2 C 10
10 3 A 2
11 3 A 2
12 3 A 2
13 3 B 4
14 3 C 13
Now I want to group by 'id' and create a new column that is True if, within each id, the summed cost of category A equals 50% of the cost of C and the summed cost of B equals 30% of the cost of C, and False otherwise. My desired output is the one below.
new
id
1 True
2 False
3 False
I have tried some things but I can't make it work. Any idea how to get my desired output? Thanks.
Try pivoting the data frame first, then check whether columns A, B, C satisfy the condition:
import numpy as np
dff.pivot_table('cost', 'id', 'categ', aggfunc='sum')\
   .assign(new=lambda df: np.isclose(df.A, 0.5 * df.C) & np.isclose(df.B, 0.3 * df.C))
categ A B C new
id
1 5 3 10 True
2 6 4 10 False
3 6 4 13 False
Try pd.crosstab with normalize='index', then a little math: if A equals 0.5*C and B equals 0.3*C, each row total is 1.8*C, so the normalized shares must be 0.5/1.8, 0.3/1.8 and 1/1.8.
Note: because these are floats we cannot test with equality; we need np.isclose.
s = pd.crosstab(dff['id'], dff['categ'], dff['cost'], aggfunc='sum', normalize='index')
s['new'] = np.isclose(s.values.tolist(), [0.5/1.8, 0.3/1.8, 1/1.8], atol=0.0001).all(1)
s
Out[341]:
categ A B C new
id
1 0.277778 0.166667 0.555556 True
2 0.300000 0.200000 0.500000 False
3 0.260870 0.173913 0.565217 False
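If you would rather keep a plain groupby, a minimal sketch along the same lines (the check function name is just for illustration) could be:
import numpy as np

def check(g):
    # per-category cost totals within one id
    sums = g.groupby('categ')['cost'].sum()
    return bool(np.isclose(sums['A'], 0.5 * sums['C']) and
                np.isclose(sums['B'], 0.3 * sums['C']))

new = dff.groupby('id')[['categ', 'cost']].apply(check)
This gives a boolean Series indexed by id, matching the desired output above.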

Rolling Pearson correlation between 2 pandas columns of different length

How can I calculate the rolling Pearson correlation between 2 pandas columns, please?
As shown below, I have column A and column B, and I want to get the corr column as the result.
import pandas as pd
d = {
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'B': [2, 4, 6, 8, 6, 4, 2, 1, 4],
}
df = pd.DataFrame(data=d)
df['corr'] = df.index.map(lambda x: df['A'].corr(df.loc[:x, 'B']))
print(df)
A B corr
0 1 2 NaN
1 2 4 1.000000
2 3 6 1.000000
3 4 8 1.000000
4 5 6 0.832050
5 6 4 0.458682
6 7 2 0.000000
7 8 1 -0.301687
8 9 4 -0.262461
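The built-in window methods can express the same idea without the index.map lambda; a sketch, assuming an expanding window is what you are after (rolling(n) would use a fixed window of n rows instead):
# expanding correlation; should reproduce the corr column above
df['corr_expanding'] = df['A'].expanding().corr(df['B'])
# fixed-size rolling correlation, e.g. over the last 3 rows
df['corr_rolling3'] = df['A'].rolling(3).corr(df['B'])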

Python pandas: creating a discrete series from a cumulative

I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A','A','A', 'B','B','B','B'], 'Indicator': [1,2,3,4,1,2,3,4], 'Cumulative1': [1,3,6,7,2,4,6,9], 'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]:df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with the starting point being the earliest entry in 'Indicator'.
My approach is to use diff():
In[82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
Out[82]: df
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaN in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
Secondly, how do I elegantly apply this computation to all series, say -
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or selected) columns, broadcast it over the columns of interest:
sources = ['Cumulative1', 'Cumulative2']
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to fill the NaNs in a loop:
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
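If you want to avoid the loop entirely, one alternative (a sketch, relying on diff keeping the original column names so that fillna can align on them) is to fill the NaNs on the diffed block before assigning it:
cols = ['Cumulative1', 'Cumulative2']
discrete = df.groupby('Cat')[cols].diff().fillna(df[cols])
df[['Discrete1', 'Discrete2']] = discrete.to_numpy()
This stays vectorized, which also helps with the efficiency concern in the third question.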

How to fuse a small pandas.dataframe into a larger one based on values of a column?

I have two pandas DataFrames, df1 and df2:
>>>import pandas as pd
>>>import numpy as np
>>>from random import random
>>>df1=pd.DataFrame({'x1':range(10), 'y1':np.repeat(0,10).tolist()})
>>>df2=pd.DataFrame({'x2':range(0,10,2), 'y2':[random() for _ in range(5)]})
>>>df1
x1 y1
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
>>>df2
x2 y2
0 0 0.075922
1 2 0.606703
2 4 0.272918
3 6 0.842641
4 8 0.576636
Now I want to fuse df2 into df1. That is to say, I want to replace the values of y1 in df1 with the values of y2 from df2 whenever the value of x1 in df1 is equal to the value of x2 in df2. The final result I need is like the following:
>>>df1
x1 y1
0 0 0.075922
1 1 0
2 2 0.606703
3 3 0
4 4 0.272918
5 5 0
6 6 0.842641
7 7 0
8 8 0.576636
9 9 0
Although I can use the following code to get the above result:
>>> for i in range(df1.shape[0]):
...     for j in range(df2.shape[0]):
...         if df1.iloc[i, 0] == df2.iloc[j, 0]:
...             df1.iloc[i, 1] = df2.iloc[j, 1]
...
I think there must be better ways to achieve this. Do you know what they are? Thank you in advance.
You can use df.update to update your df1 in place, e.g.:
df1.update({'y1': df2.set_index('x2')['y2']})
Gives you:
x1 y1
0 0 0.075922
1 1 0.000000
2 2 0.606703
3 3 0.000000
4 4 0.272918
5 5 0.000000
6 6 0.842641
7 7 0.000000
8 8 0.576636
9 9 0.000000
Use map and then replace the missing values with the original ones via fillna:
df1['y1'] = df1['x1'].map(df2.set_index('x2')['y2']).fillna(df1['y1'])
print(df1)
x1 y1
0 0 0.696469
1 1 0.000000
2 2 0.286139
3 3 0.000000
4 4 0.226851
5 5 0.000000
6 6 0.551315
7 7 0.000000
8 8 0.719469
9 9 0.000000
You can also use update after setting indices of both dataframes:
import pandas as pd
import numpy as np
from random import random
df1 = pd.DataFrame({'x1': range(10), 'y1': np.repeat(0, 10).tolist()})
# set index of the first dataframe to 'x1'
df1.set_index('x1', inplace=True)
# note: the value column is named 'y1' here so that update can match it by name
df2 = pd.DataFrame({'x2': range(0, 10, 2), 'y1': [random() for _ in range(5)]})
# set index of the second dataframe to 'x2'
df2.set_index('x2', inplace=True)
# update values in df1 with values from df2
df1.update(df2)
# reset index if necessary (though the index will look exactly like the x1 column)
df1 = df1.reset_index()
update() seems to be the best option here!
import pandas as pd
import numpy as np
from random import random
# your dataframes
df1 = pd.DataFrame({'x1': range(10), 'y1': np.repeat(0, 10).tolist()})
df2 = pd.DataFrame({'x2': range(0, 10, 2), 'y2': [random() for _ in range(5)]})
# printing df1 and df2 values before update
print(df1)
print(df2)
df1.update({'y1': df2.set_index('x2')['y2']})
# printing df1 after update was performed
print(df1)
Another method, adding the two dataframes together:
# first give df2 the same column names as df1
df2.columns = ['x1','y1']
#now set 'x1' as the index for both dfs (since this is what you want to 'join' on)
df1 = df1.set_index('x1')
df2 = df2.set_index('x1')
print(df1)
y1
x1
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
print(df2)
y1
x1
0 0.525232
2 0.907628
4 0.612100
6 0.497420
8 0.656509
#now you can simply add the two df's to eachother
df_new = df1 + df2
print(df_new)
y1
x1
0 0.317418
1 NaN
2 0.581443
3 NaN
4 0.728766
5 NaN
6 0.495450
7 NaN
8 0.171131
9 NaN
Two problems:
The dataframe has NaNs where you want 0s. These are the positions where df2 was not defined; they were effectively NaN in df2, and NaN + anything = NaN. This can be fixed with fillna.
You want 'x1' to be a column, not the index, so just reset the index:
df_new = df_new.reset_index().fillna(0)
print(df_new)
x1 y1
0 0 0.118903
1 1 0.000000
2 2 0.465557
3 3 0.000000
4 4 0.533266
5 5 0.000000
6 6 0.518484
7 7 0.000000
8 8 0.308733
9 9 0.000000
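For completeness, a merge-based sketch is also possible, assuming the original df1 and df2 from the question (before the renaming and re-indexing above):
# left-merge df2 onto df1 by key, then fall back to the old y1 values
merged = df1.merge(df2, how='left', left_on='x1', right_on='x2')
df1['y1'] = merged['y2'].fillna(df1['y1']).to_numpy()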

How to get log rate of change between rows in Pandas DataFrame effectively?

Let's say I have some DataFrame (with about 10000 rows in my case, this is just a minimal example)
>>> import pandas as pd
>>> sample_df = pd.DataFrame(
...     {'col1': list(range(1, 10)), 'col2': list(range(10, 19))})
>>> sample_df
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 6 15
6 7 16
7 8 17
8 9 18
For my purposes, I need to calculate the series represented by ln(col_i(n+1) / col_i(n)) for each col_i in my DataFrame, where n represents a row number.
How can I calculate this?
Background knowledge
I know that I can get the difference between each column in a very simple way using
>>> sample_df.diff()
col1 col2
0 NaN NaN
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
Or the percentage change, which is (col_i(n+1) - col_i(n))/col_i(n), using
>>> sample_df.pct_change()
col1 col2
0 NaN NaN
1 1.000000 0.100000
2 0.500000 0.090909
3 0.333333 0.083333
4 0.250000 0.076923
5 0.200000 0.071429
6 0.166667 0.066667
7 0.142857 0.062500
8 0.125000 0.058824
I have just been struggling to find a straightforward way to divide each row by the previous one directly. If I knew how to do that, I could just apply the natural logarithm to every element of the result afterwards.
Currently, to solve my problem, I'm creating, for each column, another column shifted down by one row and then applying the formula between the two columns. It seems messy and sub-optimal to me, though.
Any help would be greatly appreciated!
IIUC, the log of a ratio is the difference of logs:
sample_df.apply(np.log).diff()
Or better still:
np.log(sample_df).diff()
Just use np.log:
np.log(df.col1 / df.col1.shift())
You can also use apply as suggested by @nikita, but that will be slower.
In addition, if you want to do it for the entire dataframe, you can just do:
np.log(df / df.shift())
You can use shift for that, which does what you have proposed.
>>> sample_df['col1'].shift()
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
Name: col1, dtype: float64
The final answer would be:
import math
(sample_df['col1'] / sample_df['col1'].shift()).apply(lambda row: math.log(row))
0 NaN
1 0.693147
2 0.405465
3 0.287682
4 0.223144
5 0.182322
6 0.154151
7 0.133531
8 0.117783
Name: col1, dtype: float64
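Since pct_change (mentioned in the question) computes (col_i(n+1) - col_i(n)) / col_i(n), the same log rate can also be written with log1p; a small sketch for the whole frame at once:
import numpy as np

# ln(x[n+1] / x[n]) == ln(1 + pct_change)
log_rates = np.log1p(sample_df.pct_change())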
