Idiomatic clip on quantile for DataFrame

I am trying to clip outliers in a DataFrame based on the quantiles of each column. Let's say
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 2))
0 1
0 0.734355 0.594992
1 -0.745949 0.597601
2 0.295606 0.972196
3 0.474539 1.462364
4 0.238838 0.684790
5 -0.659094 0.451718
6 0.675360 -1.286660
7 0.713914 0.135179
8 -0.435309 -0.344975
9 1.200617 -0.392945
I currently use
df_clipped = df.apply(lambda col: col.clip(*col.quantile([0.05,0.95]).values))
0 1
0 0.734355 0.594992
1 -0.706865 0.597601
2 0.295606 0.972196
3 0.474539 1.241788
4 0.238838 0.684790
5 -0.659094 0.451718
6 0.675360 -0.884488
7 0.713914 0.135179
8 -0.435309 -0.344975
9 0.990799 -0.392945
This works, but I am wondering if there is a more elegant pandas/NumPy-based approach.

You can pass the per-column quantiles straight to clip and align them with the columns using axis=1:
df.clip(df.quantile(0.05), df.quantile(0.95), axis=1)
Out:
0 1
0 0.734355 0.594992
1 -0.706864 0.597601
2 0.295606 0.972196
3 0.474539 1.241788
4 0.238838 0.684790
5 -0.659094 0.451718
6 0.675360 -0.884488
7 0.713914 0.135179
8 -0.435309 -0.344975
9 0.990799 -0.392945
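As a minimal sketch of the same idea, the snippet below wraps the call in a small helper and compares it with the apply-based version from the question. The helper name clip_to_quantiles, the fixed seed, and the column names "a" and "b" are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def clip_to_quantiles(frame, lower_q=0.05, upper_q=0.95):
    """Clip every column of `frame` to its own [lower_q, upper_q] quantile range."""
    lower = frame.quantile(lower_q)  # Series indexed by column name
    upper = frame.quantile(upper_q)
    # axis=1 aligns the two Series with the columns of the frame
    return frame.clip(lower, upper, axis=1)


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=["a", "b"])

clipped = clip_to_quantiles(df)

# The apply-based version from the question, used here only as a cross-check
expected = df.apply(lambda col: col.clip(*col.quantile([0.05, 0.95]).values))
print(np.allclose(clipped, expected))
```

Both versions compute the bounds per column; the clip call simply avoids the Python-level lambda.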

Related

How do I perform rolling division with several columns in Pandas?

I'm having problems with the rolling() method: it returns several outputs even though my function returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames with 3 columns in each df.
Sum all values
I can do this using iterrows(), but that becomes too slow on larger datasets.
This is the test data I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method produces the output I want by using iterrows():
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values)-1)*100))
        Average = Div.sum(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attempt to use rolling() so that I don't need to loop through each row.
def SumOfAverageFunction(vals):
    Div = abs((((df2.values / vals.reset_index(drop=True).values)-1)*100))
    Average = Div.sum()
    SumOfAverages = np.sum(Average)
    return SumOfAverages
RunningSums = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
Here is my problem: printing RunningSums from above outputs several values per row, and they are not close to the results I get with the iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667

That's just how rolling behaves: apply is called on each column separately, and I don't know of a way around it. One solution is to apply rolling to a single column and use the index of each window to slice the full dataframe inside your function. It's still expensive, but probably not as bad as what you're doing.
Also, the output of your first method looks wrong: the condition index > 3 skips the first two complete windows (the ones ending at rows 2 and 3), so your calculations start too late.
import numpy as np
def SumOfAverageFunction(vals):
    # vals is one window of column1; its index selects the matching rows of df1
    return (abs(np.divide(df2.values, df1.loc[vals.index].values)-1)*100).sum()
vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)
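If the per-window Python call is still too slow, one possible alternative (not the rolling-based method above) is to build all windows at once with NumPy. The sketch below assumes NumPy 1.20 or later for sliding_window_view, and it labels each total with the last row of its window; the names windows, totals, and running are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df1 = pd.DataFrame({
    "column1": [7, 2, 3, 1, 3, 2, 5, 3, 2, 4, 6, 8, 1, 3, 7, 3, 7, 2, 6, 3, 8],
    "column2": [1, 5, 2, 4, 1, 5, 5, 3, 1, 5, 3, 5, 8, 1, 6, 4, 2, 3, 9, 1, 4],
    "column3": [3, 6, 3, 9, 7, 1, 2, 3, 7, 5, 4, 1, 4, 2, 9, 6, 5, 1, 4, 1, 3],
})
df2 = pd.DataFrame([[2, 3, 4], [3, 4, 1], [3, 6, 1]])

window = len(df2)
# Result has shape (n_windows, 1, window, n_columns); drop the singleton axis
windows = sliding_window_view(df1.to_numpy(), (window, df1.shape[1]))[:, 0]

# df2 (3x3) broadcasts against every 3x3 window
pct_diff = np.abs(df2.to_numpy() / windows - 1) * 100
totals = pct_diff.sum(axis=(1, 2))

# Label each total with the index of the last row in its window
running = pd.Series(totals, index=df1.index[window - 1:])
print(running.loc[4])  # same window as the first value of the iterrows loop
```

The values from row 4 onward should agree with the iterrows output in the question, with two extra entries for the windows ending at rows 2 and 3.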

How to subtract all row values of another dataframe from each row value

I have these two DataFrames:
df1:
lon lat
0 -60.7 -2.8333333333333335
1 -55.983333333333334 -2.4833333333333334
2 -51.06666666666667 -0.05
3 -66.96666666666667 -0.11666666666666667
4 -48.483333333333334 -1.3833333333333333
5 -54.71666666666667 -2.4333333333333336
6 -44.233333333333334 -2.6
7 -59.983333333333334 -3.15
df2:
lon lat
0 -24.109 -2.0035
1 -17.891 -1.70911
2 -14.5822 -1.7470700000000001
3 -12.8138 -1.72322
4 -14.0688 -1.5028700000000002
5 -13.8406 -1.44416
6 -12.1292 -0.671266
7 -13.8406 -0.8824270000000001
8 -15.12 -18.223
I want to subtract every value of df2['lat'] from each value of df1['lat'].
Something like this:
results0=df1.loc[0,'lat']-df2.loc[:,'lat']
results1=df1.loc[1,'lat']-df2.loc[:,'lat']
#etc etc....
So I tried this:
for i,j in zip(range(len(df1)), range(len(df2))):
    exec(f"result{i}=df1.loc[{i},'lat']-df2.loc[{j},'lat']")
But it only gave me one result value for each result, instead of 8 values for each result.
I will appreciate any possible solution. Thanks!

You can create list of Series:
L = [df1.loc[i,'lat']-df2['lat'] for i in df1.index]
Or you can use NumPy broadcasting to build a new DataFrame:
arr = df1['lat'].to_numpy() - df2['lat'].to_numpy()[:, None]
df3 = pd.DataFrame(arr, index=df2.index, columns=df1.index)
print (df3)
0 1 2 3 4 5 \
0 -0.829833 -0.479833 1.953500 1.886833 0.620167 -0.429833
1 -1.124223 -0.774223 1.659110 1.592443 0.325777 -0.724223
2 -1.086263 -0.736263 1.697070 1.630403 0.363737 -0.686263
3 -1.110113 -0.760113 1.673220 1.606553 0.339887 -0.710113
4 -1.330463 -0.980463 1.452870 1.386203 0.119537 -0.930463
5 -1.389173 -1.039173 1.394160 1.327493 0.060827 -0.989173
6 -2.162067 -1.812067 0.621266 0.554599 -0.712067 -1.762067
7 -1.950906 -1.600906 0.832427 0.765760 -0.500906 -1.550906
8 15.389667 15.739667 18.173000 18.106333 16.839667 15.789667
6 7
0 -0.596500 -1.146500
1 -0.890890 -1.440890
2 -0.852930 -1.402930
3 -0.876780 -1.426780
4 -1.097130 -1.647130
5 -1.155840 -1.705840
6 -1.928734 -2.478734
7 -1.717573 -2.267573
8 15.623000 15.073000
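To make the broadcasting step easier to follow, here is a small sketch with made-up numbers. The Series a and b stand in for df1['lat'] and df2['lat']; their values are hypothetical and chosen only so the result is easy to check by hand.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], name="lat")  # stands in for df1['lat']
b = pd.Series([0.5, 1.5], name="lat")       # stands in for df2['lat']

# b[:, None] has shape (2, 1) and a has shape (3,), so the result has shape (2, 3)
diff = a.to_numpy() - b.to_numpy()[:, None]
table = pd.DataFrame(diff, index=b.index, columns=a.index)

# Column i holds a[i] minus every value of b
print(table)
```

Each column of the result plays the role of one of the result0, result1, ... variables from the question, without needing exec.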

Since df1 has one fewer row than df2, you can trim df2 to the length of df1 and subtract row by row:
df1['lat'] = df1['lat'] - df2.loc[:df1.shape[0]-1, 'lat']
Output:
0 -0.829833
1 -0.774223
2 1.697070
3 1.606553
4 0.119537
5 -0.989173
6 -1.928734
7 -2.267573
Name: lat, dtype: float64

How to pass a value from one row to the next one in pandas + python and use it to calculate the same following value recursively

This is my desired output:
I am trying to calculate the columns df[Value] and df[Value_Compensed]. To do that, I need the previous row's value of df[Value_Compensed]. In terms of my table:
In the first row, all the values are 0.
In the following rows: df[Remained] = previous df[Value_Compensed]. Then df[Value] = df[Initial_value] + df[Remained]. Then df[Value_Compensed] = df[Value] - df[Compensation].
...and so on.
I am struggling to pass the value of Value_Compensed from one row to the next. I tried the shift() function, but as you can see in the following image, the values in df[Value_Compensed] are not correct: the column is not static and changes after each row, so shift() did not work. Any ideas?
Thanks.
Manuel.

You can use apply to create your customised operations. I've made a dummy dataset as you didn't provide the initial dataframe.
from itertools import zip_longest
import numpy as np
import pandas as pd
# dummy data
df = pd.DataFrame(np.random.randint(1, 10, (8, 5)),
                  columns=['compensation', 'initial_value',
                           'remained', 'value', 'value_compensed'])
df.loc[0] = 0,0,0,0,0
>>> print(df)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 2 9 1 9 7
2 1 4 9 8 3
3 3 4 5 7 6
4 3 2 5 5 6
5 9 1 5 2 4
6 4 5 9 8 2
7 1 6 9 6 8
Use apply (axis=1) to do row-wise iteration, where you use the initial dataframe as an argument, from which you can then get the previous row x.name-1 and do your calculations. Not sure if I fully understood the intended result, but you can adjust the individual calculations of the different columns in the function.
def f(x, data):
    if x.name == 0:
        return [0,]*data.shape[1]
    else:
        # values are read from the previous row of the original frame
        x_remained = data.loc[x.name-1]['value_compensed']
        x_value = data.loc[x.name-1]['initial_value'] + x_remained
        x_compensed = x_value - x['compensation']
        return [x['compensation'], x['initial_value'], x_remained,
                x_value, x_compensed]
adj = df.apply(f, args=(df,), axis=1)
adj = pd.DataFrame.from_records(zip_longest(*adj.values), index=df.columns).T
# Note: the frame printed below comes from a different random draw than the df shown above
>>> print(adj)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 5 9 0 0 -5
2 5 7 4 13 8
3 7 9 1 8 1
4 6 6 5 14 8
5 4 9 6 12 8
6 2 4 2 11 9
7 9 2 6 10 1
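For comparison, here is a minimal sketch that follows the rules exactly as the question states them, using a plain Python loop instead of apply so that each row sees the freshly computed value_compensed of the row before it. The input numbers are made up, and the lower-case column names are an assumption borrowed from the dummy data above.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs; the first row is all zeros, as in the question
df = pd.DataFrame({
    "initial_value": [0, 9, 4, 4],
    "compensation": [0, 2, 1, 3],
})

remained = [0]
value = [0]
value_compensed = [0]
for init, comp in zip(df["initial_value"].iloc[1:], df["compensation"].iloc[1:]):
    r = value_compensed[-1]           # remained = previous value_compensed
    v = init + r                      # value = initial_value + remained
    remained.append(r)
    value.append(v)
    value_compensed.append(v - comp)  # value_compensed = value - compensation

df["remained"] = remained
df["value"] = value
df["value_compensed"] = value_compensed

# Under these exact rules the recurrence collapses to a running sum of
# (initial_value - compensation), with the first row forced to zero
net = (df["initial_value"] - df["compensation"]).to_numpy().copy()
net[0] = 0
closed_form = np.cumsum(net)
print(df)
```

The cumulative-sum shortcut only holds while the recurrence stays this simple; if the real rules include caps or conditions, the loop is the safer starting point.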

Set value to slice of a Pandas dataframe

I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2=df.iloc[i:j].sort_values(by=...)
df.iloc[i:j]=df2
The first line runs without a problem, but nothing happens when I run the second one (not even an error). How should I do this? (I also tried the update function, but that didn't work either.)

I believe you need to assign to the sliced DataFrame after converting the sorted data to a NumPy array with .values, to avoid index alignment:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print (df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print (df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
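The same pattern extends to frames with several columns. The sketch below is an illustration with a made-up second column B; it uses to_numpy(), which returns the same kind of array as .values, and sorts by both columns so that ties in A have a deterministic order.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 3, 2, 1, 4, 1, 2],
    "B": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],  # hypothetical extra column
})
i, j = 2, 7

# Sort only rows i..j-1; ties in A are broken by B
sorted_block = df.iloc[i:j].sort_values(by=["A", "B"])

# Assign the raw array so the values are written by position, not by index label
df.iloc[i:j] = sorted_block.to_numpy()
print(df)
```

Rows outside the slice are left untouched, and whole rows move together, so each B value stays paired with its A value.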

Iterative subtraction of each row using pandas?

I have a dataframe like this:
abc
9 32.242063
3 24.419279
8 25.464011
6 25.029761
10 18.851918
2 26.027582
1 27.885187
4 20.141231
5 31.179138
7 22.893074
11 31.640625
0 33.150434
I want to subtract the first row from 100, then subtract the 2nd row from the remaining value from (100 - first row) and so on.
I tried:
a = 100 - df["abc"]
but every time it subtracts the value from 100.
Can anybody suggest the correct way to do it?

It seems you need:
df['new'] = 100 - df['abc'].cumsum()
print (df)
abc new
9 32.242063 67.757937
3 24.419279 43.338658
8 25.464011 17.874647
6 25.029761 -7.155114
10 18.851918 -26.007032
2 26.027582 -52.034614
1 27.885187 -79.919801
4 20.141231 -100.061032
5 31.179138 -131.240170
7 22.893074 -154.133244
11 31.640625 -185.773869
0 33.150434 -218.924303
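To see why the cumulative sum gives the running remainder, the sketch below compares it with an explicit loop on three made-up values. The Series s and its shuffled index are hypothetical stand-ins for df['abc'].

```python
import pandas as pd

s = pd.Series([30.0, 25.0, 50.0], index=[9, 3, 8], name="abc")

# Vectorised: subtract the running total from the starting amount
remaining = 100 - s.cumsum()

# Explicit loop doing the same thing step by step
loop = []
balance = 100.0
for amount in s:
    balance -= amount
    loop.append(balance)

print(remaining.tolist())
print(loop)
```

Both lists should be identical, because subtracting each row in turn is the same as subtracting the sum of all rows seen so far.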

Option 1
np.cumsum -
df["abc"] = 100 - np.cumsum(df.abc.values)
df
abc
9 67.757937
3 43.338658
8 17.874647
6 -7.155114
10 -26.007032
2 -52.034614
1 -79.919801
4 -100.061032
5 -131.240170
7 -154.133244
11 -185.773869
0 -218.924303
This is faster than pd.Series.cumsum in the other answer.
Option 2
Loopy equivalent, cythonized. Pass the underlying array rather than the Series, so that r[0] is the first element and not the row labelled 0.
%load_ext Cython
%%cython
def foo(r):
    x = [100 - r[0]]
    for i in r[1:]:
        x.append(x[-1] - i)
    return x
df['abc'] = foo(df['abc'].to_numpy())
df
abc
9 67.757937
3 43.338658
8 17.874647
6 -7.155114
10 -26.007032
2 -52.034614
1 -79.919801
4 -100.061032
5 -131.240170
7 -154.133244
11 -185.773869
0 -218.924303
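One thing worth checking with a loop such as foo: the Series in the question has a shuffled integer index, and on such a Series r[0] looks up the label 0 rather than the first element. The sketch below shows the difference on three made-up values; the Series s is hypothetical.

```python
import pandas as pd

s = pd.Series([32.2, 24.4, 33.1], index=[9, 3, 0], name="abc")

by_label = s[0]          # integer index: this is a label lookup (the last row here)
by_position = s.iloc[0]  # always the first element
values = s.to_numpy()    # plain array: values[0] is positional as well

print(by_label, by_position, values[0])
```

Using iloc or the underlying array keeps a hand-written loop consistent with the cumsum-based results.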
