I'd like to replicate this method of winsorizing a dataframe with specified percentile regions in Python. I tried using the SciPy winsorize function, but that didn't give the results I was looking for.
Example expected output for a dataframe winsorized at a low quantile of 0.01 and a high quantile of 0.99 across each date:
Original df:
A B C D E
2020-06-30 0.033 -0.182 -0.016 0.665 0.025
2020-07-31 0.142 -0.175 -0.016 0.556 0.024
2020-08-31 0.115 -0.187 -0.017 0.627 0.027
2020-09-30 0.032 -0.096 -0.022 0.572 0.024
Winsorized data:
A B C D E
2020-06-30 0.033 -0.175 -0.016 0.64 0.025
2020-07-31 0.142 -0.169 -0.016 0.54 0.024
2020-08-31 0.115 -0.18 -0.017 0.606 0.027
2020-09-30 0.032 -0.093 -0.022 0.55 0.024
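For reference, a minimal pandas sketch of this kind of row-wise winsorization (an assumption of what's wanted: treat 0.01 and 0.99 as quantile levels and clip each date's values to its own quantile range; with only five columns the default linear interpolation won't reproduce the exact numbers above):
import pandas as pd
df = pd.DataFrame({'A': [0.033, 0.142, 0.115, 0.032],
                   'B': [-0.182, -0.175, -0.187, -0.096],
                   'C': [-0.016, -0.016, -0.017, -0.022],
                   'D': [0.665, 0.556, 0.627, 0.572],
                   'E': [0.025, 0.024, 0.027, 0.024]},
                  index=pd.to_datetime(['2020-06-30', '2020-07-31',
                                        '2020-08-31', '2020-09-30']))
# Each row's low and high quantiles (axis=1 computes across the columns)
lo = df.quantile(0.01, axis=1)
hi = df.quantile(0.99, axis=1)
# Clip every row to its own [lo, hi] range (axis=0 aligns on the row index)
winsorized = df.clip(lower=lo, upper=hi, axis=0)
print(winsorized)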
I'm new to Python. I'm looking for a way to generate the mean of row values based on column names (the column names are dates, running from January to December). I want to generate a mean for every 10 days over a period of a year. My dataframe (2000 rows) is in the format below:
import pandas as pd
df = pd.DataFrame({'A': [81, 80.09, 83, 85, 88],
                   'B': [21.8, 22.04, 21.8, 21.7, 22.06],
                   '20210113': [0, 0.05, 0, 0, 0.433],
                   '20210122': [0, 0.13, 0, 0, 0.128],
                   '20210125': [0.056, 0, 0.043, 0.062, 0.16],
                   '20210213': [0.9, 0.56, 0.32, 0.8, 0],
                   '20210217': [0.7, 0.99, 0.008, 0.23, 0.56],
                   '20210219': [0.9, 0.43, 0.76, 0.98, 0.5]})
Expected Output:
In [2]: df
Out[2]:
A      B      C (mean of 20210111..20210119)   D (mean of 20210120..20210129)  ...
0 81 21.8
1 80.09 22.04
2 83 21.8
3 85 21.7
4 88 22.06
One way would be to isolate the date columns from the rest of the DataFrame, transpose so that normal grouping operations can be used, then transpose back and merge onto the unaffected portion of the DataFrame.
import pandas as pd
df = pd.DataFrame({'A': [81, 80.09, 83, 85, 88],
                   'B': [21.8, 22.04, 21.8, 21.7, 22.06],
                   '20210113A.2': [0, 0.05, 0, 0, 0.433],
                   '20210122B.1': [0, 0.13, 0, 0, 0.128],
                   '20210125C.3': [0.056, 0, 0.043, 0.062, 0.16],
                   '20210213': [0.9, 0.56, 0.32, 0.8, 0],
                   '20210217': [0.7, 0.99, 0.008, 0.23, 0.56],
                   '20210219': [0.9, 0.43, 0.76, 0.98, 0.5]})
# Unaffected Columns Go Here
keep_columns = ['A', 'B']
# Get All Affected Columns
new_df = df.loc[:, ~df.columns.isin(keep_columns)]
# Strip extra information from column names (assumes each date column starts with an 8-character YYYYMMDD stamp)
new_df.columns = new_df.columns.map(lambda c: c[0:8])
# Transpose
new_df = new_df.T
# Convert index to DateTime for easy use
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
# Resample every 10 Days on new DT index (Drop any rows with no values)
new_df = new_df.resample("10D").mean().dropna(how='all')
# Transpose and Merge Back on DF
df = df[keep_columns].merge(new_df.T, left_index=True, right_index=True)
# For Display
print(df.to_string())
Output:
A B 2021-01-13 00:00:00 2021-01-23 00:00:00 2021-02-12 00:00:00
0 81.00 21.80 0.0000 0.056 0.833333
1 80.09 22.04 0.0900 0.000 0.660000
2 83.00 21.80 0.0000 0.043 0.362667
3 85.00 21.70 0.0000 0.062 0.670000
4 88.00 22.06 0.2805 0.160 0.353333
Step by step, the intermediate frames look like this. First, isolate the affected columns:
new_df = df.loc[:, ~df.columns.isin(keep_columns)]
new_df
0 1 2 3 4
20210113 0.000 0.05 0.000 0.000 0.433
20210122 0.000 0.13 0.000 0.000 0.128
20210125 0.056 0.00 0.043 0.062 0.160
20210213 0.900 0.56 0.320 0.800 0.000
20210217 0.700 0.99 0.008 0.230 0.560
20210219 0.900 0.43 0.760 0.980 0.500
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
new_df
0 1 2 3 4
2021-01-13 0.000 0.05 0.000 0.000 0.433
2021-01-22 0.000 0.13 0.000 0.000 0.128
2021-01-25 0.056 0.00 0.043 0.062 0.160
2021-02-13 0.900 0.56 0.320 0.800 0.000
2021-02-17 0.700 0.99 0.008 0.230 0.560
2021-02-19 0.900 0.43 0.760 0.980 0.500
new_df = new_df.resample("10D").mean().dropna(how='all')
new_df
0 1 2 3 4
2021-01-13 0.000000 0.09 0.000000 0.000 0.280500
2021-01-23 0.056000 0.00 0.043000 0.062 0.160000
2021-02-12 0.833333 0.66 0.362667 0.670 0.353333
new_df.T
2021-01-13 2021-01-23 2021-02-12
0 0.0000 0.056 0.833333
1 0.0900 0.000 0.660000
2 0.0000 0.043 0.362667
3 0.0000 0.062 0.670000
4 0.2805 0.160 0.353333
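One caveat worth noting (an observation, not part of the original answer): resample("10D") starts its bins at midnight of the first day in the index by default (origin='start_day'), which is why the bins above begin on 2021-01-13. If calendar-anchored bins are needed, resample accepts an origin argument; for example, applied to the datetime-indexed frame before the resample step:
# Anchor the 10-day bins to a fixed calendar date instead of the
# first timestamp in the index
new_df = new_df.resample("10D", origin=pd.Timestamp("2021-01-01")).mean().dropna(how="all")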
I have an extract of a dataframe below:
ticker date open high low close
0 A2M 2020-08-28 18.45 18.71 17.39 17.47
1 A2M 2020-09-04 17.47 17.52 16.53 16.70
2 A2M 2020-09-11 16.70 16.97 16.13 16.45
3 A2M 2020-09-18 16.54 16.77 16.25 16.39
4 A2M 2020-09-25 16.36 17.13 16.32 17.02
5 AAN 2007-06-08 15.29 15.33 14.93 15.07
6 AAN 2007-06-15 15.10 15.23 14.95 15.18
7 AAN 2007-06-22 15.18 15.25 15.12 15.16
8 AAN 2007-06-29 15.14 15.25 15.11 15.22
9 AAN 2007-07-06 15.11 15.33 15.07 15.33
10 AAN 2007-07-13 15.29 15.35 15.12 15.26
11 AAN 2007-07-20 15.25 15.27 15.02 15.10
12 AAN 2007-07-27 15.05 15.15 14.00 14.82
13 AAN 2007-08-03 14.72 14.85 14.47 14.69
14 AAN 2007-08-10 14.56 14.90 14.22 14.54
15 AAN 2007-08-17 14.55 14.79 13.71 14.42
16 AAP 2000-10-06 7.11 7.14 7.10 7.12
17 AAP 2000-10-13 7.13 7.17 7.12 7.17
18 AAP 2000-10-20 7.16 7.25 7.16 7.23
19 AAP 2000-10-27 7.23 7.24 7.22 7.23
20 AAP 2000-11-03 7.16 7.25 7.12 7.25
21 AAP 2000-11-10 7.24 7.24 7.12 7.12
22 ABB 2002-07-26 2.70 3.05 2.60 2.95
23 ABB 2002-08-02 2.92 2.95 2.75 2.80
24 ABB 2002-08-09 2.80 2.84 2.70 2.70
25 ABB 2002-08-16 2.72 2.75 2.70 2.75
26 ABB 2002-08-23 2.71 2.85 2.71 2.75
27 ABB 2002-08-30 2.75 2.75 2.75 2.75
I've created the following code to find upPrices vs. downPrices:
i = 0
upPrices = []
downPrices = []
while i < len(df['close']):
    if i == 0:
        upPrices.append(0)
        downPrices.append(0)
    else:
        if (df['close'][i] - df['close'][i-1]) > 0:
            upPrices.append(df['close'][i] - df['close'][i-1])
            downPrices.append(0)
        else:
            downPrices.append(df['close'][i] - df['close'][i-1])
            upPrices.append(0)
    i += 1
df['upPrices'] = upPrices
df['downPrices'] = downPrices
The result is the following dataframe:
ticker date open high low close upPrices downPrices
0 A2M 2020-08-28 18.45 18.71 17.39 17.47 0.00 0.00
1 A2M 2020-09-04 17.47 17.52 16.53 16.70 0.00 -0.77
2 A2M 2020-09-11 16.70 16.97 16.13 16.45 0.00 -0.25
3 A2M 2020-09-18 16.54 16.77 16.25 16.39 0.00 -0.06
4 A2M 2020-09-25 16.36 17.13 16.32 17.02 0.63 0.00
5 AAN 2007-06-08 15.29 15.33 14.93 15.07 0.00 -1.95
6 AAN 2007-06-15 15.10 15.23 14.95 15.18 0.11 0.00
7 AAN 2007-06-22 15.18 15.25 15.12 15.16 0.00 -0.02
8 AAN 2007-06-29 15.14 15.25 15.11 15.22 0.06 0.00
9 AAN 2007-07-06 15.11 15.33 15.07 15.33 0.11 0.00
10 AAN 2007-07-13 15.29 15.35 15.12 15.26 0.00 -0.07
11 AAN 2007-07-20 15.25 15.27 15.02 15.10 0.00 -0.16
12 AAN 2007-07-27 15.05 15.15 14.00 14.82 0.00 -0.28
13 AAN 2007-08-03 14.72 14.85 14.47 14.69 0.00 -0.13
14 AAN 2007-08-10 14.56 14.90 14.22 14.54 0.00 -0.15
15 AAN 2007-08-17 14.55 14.79 13.71 14.42 0.00 -0.12
16 AAP 2000-10-06 7.11 7.14 7.10 7.12 0.00 -7.30
17 AAP 2000-10-13 7.13 7.17 7.12 7.17 0.05 0.00
18 AAP 2000-10-20 7.16 7.25 7.16 7.23 0.06 0.00
19 AAP 2000-10-27 7.23 7.24 7.22 7.23 0.00 0.00
20 AAP 2000-11-03 7.16 7.25 7.12 7.25 0.02 0.00
21 AAP 2000-11-10 7.24 7.24 7.12 7.12 0.00 -0.13
22 ABB 2002-07-26 2.70 3.05 2.60 2.95 0.00 -4.17
23 ABB 2002-08-02 2.92 2.95 2.75 2.80 0.00 -0.15
24 ABB 2002-08-09 2.80 2.84 2.70 2.70 0.00 -0.10
25 ABB 2002-08-16 2.72 2.75 2.70 2.75 0.05 0.00
26 ABB 2002-08-23 2.71 2.85 2.71 2.75 0.00 0.00
27 ABB 2002-08-30 2.75 2.75 2.75 2.75 0.00 0.00
Unfortunately the logic is not correct. The upPrices and downPrices need to be computed per ticker. At the moment, as you can see in rows 5, 16 and 22, the comparison uses the previous close of a different ticker. Essentially, I need this calculation to restart at each ticker, via groupby or some other means. However, when I try to add a groupby it returns index length mismatch errors.
Please help!
Your intuition about groupby is correct: group by ticker, then diff the closing prices. You can use where to split the result into the up and down columns you wanted. Plus, no more loop! For something that only requires "basic" math operations, a vectorized approach is much better.
import pandas as pd
data = {"ticker":["A2M","A2M","A2M","A2M","A2M","AAN","AAN","AAN","AAN"], "close":[17.47,16.7,16.45,16.39,17.02,15.07,15.18,15.16,15.22]}
df = pd.DataFrame(data)
df["diff"] = df.groupby("ticker")["close"].diff()
df["upPrice"] = df["diff"].where(df["diff"] > 0, 0)
df["downPrice"] = df["diff"].where(df["diff"] < 0, 0)
del df["diff"]
print(df)
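An equivalent spelling with clip, in case that reads more naturally (a variation on the answer above, not from it): keep one side of each per-ticker move and zero out the other, filling the leading NaN of each ticker with 0.
# Same idea using clip instead of two where calls
diff = df.groupby("ticker")["close"].diff()
df["upPrice"] = diff.clip(lower=0).fillna(0)    # negatives -> 0
df["downPrice"] = diff.clip(upper=0).fillna(0)  # positives -> 0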
I have a dataframe that looks like this:
scn cl_name lqd_mp lqd_wp gas_mp gas_wp res_mp res_wp
12 C6 Hexanes 3.398 1.723 2.200 5.835 2.614 2.775
13 NaN Me-Cyclo-pentane 1.193 0.591 0.439 1.146 0.707 0.733
14 NaN Benzene 0.037 0.017 0.013 0.030 0.021 0.020
15 NaN Cyclo-hexane 1.393 0.690 0.697 1.820 0.944 0.979
16 C7 Heptanes 6.359 3.748 1.122 3.477 2.980 3.679
17 NaN Me-Cyclo-hexane 4.355 2.515 0.678 2.068 1.985 2.401
18 NaN Toluene 0.407 0.220 0.061 0.174 0.183 0.208
19 C8 Octanes 10.277 6.901 0.692 2.438 4.092 5.759
20 NaN Ethyl-benzene 0.146 0.091 0.010 0.032 0.058 0.076
21 NaN Meta/Para-xylene 0.885 0.553 0.029 0.095 0.333 0.436
22 NaN Ortho-xylene 0.253 0.158 0.002 0.007 0.091 0.119
23 C9 Nonanes 8.683 6.552 0.280 1.113 3.266 5.160
24 NaN Tri-Me-benzene 0.496 0.351 0.000 0.000 0.176 0.261
25 C10 Decanes 8.216 6.877 0.108 0.451 2.985 5.233
I'd like to replace all the NaN values in the 'scn' column with the value from the previous row, and then reindex the dataframe using a MultiIndex on the two columns 'scn' and 'cl_name'.
I do it with these two lines of code:
df['scn'] = df['scn'].ffill()
df.set_index(['scn', 'cl_name'], inplace=True)
The first line with ffill() does what I want, replacing the NaNs with the values above them. But after set_index() those values seem to disappear, leaving blank cells:
lqd_mp lqd_wp gas_mp gas_wp res_mp res_wp
scn cl_name
C6 Hexanes 3.398 1.723 2.200 5.835 2.614 2.775
Me-Cyclo-pentane 1.193 0.591 0.439 1.146 0.707 0.733
Benzene 0.037 0.017 0.013 0.030 0.021 0.020
Cyclo-hexane 1.393 0.690 0.697 1.820 0.944 0.979
C7 Heptanes 6.359 3.748 1.122 3.477 2.980 3.679
Me-Cyclo-hexane 4.355 2.515 0.678 2.068 1.985 2.401
Toluene 0.407 0.220 0.061 0.174 0.183 0.208
C8 Octanes 10.277 6.901 0.692 2.438 4.092 5.759
Ethyl-benzene 0.146 0.091 0.010 0.032 0.058 0.076
Meta/Para-xylene 0.885 0.553 0.029 0.095 0.333 0.436
Ortho-xylene 0.253 0.158 0.002 0.007 0.091 0.119
C9 Nonanes 8.683 6.552 0.280 1.113 3.266 5.160
Tri-Me-benzene 0.496 0.351 0.000 0.000 0.176 0.261
C10 Decanes 8.216 6.877 0.108 0.451 2.985 5.233
I'd like no blanks in the 'scn' part of the index. What am I doing wrong?
Thanks
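For what it's worth, nothing is actually lost here: by default pandas "sparsifies" the display of a MultiIndex, hiding repeated labels in the outer level, while the underlying index still holds the ffilled values. A minimal check (assuming default display options):
import pandas as pd
# Show the repeated outer-level labels instead of blanks
pd.set_option('display.multi_sparse', False)
print(df)
# Or for a single print only, without changing the global option:
print(df.to_string(sparsify=False))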
I have three columns: a BatchID, a UnitID, and a Score.
At the moment, the data set looks something like this:
BatchID UnitID Score
A123 A123-100 0.111
A123 A123-101 0.121
A123 A123-102 0.101
A123 A123-103 0.102
B456 B456-200 0.211
B456 B456-201 0.221
C789 C789-001 0.199
C789 C789-002 0.189
C789 C789-003 0.192
C789 C789-004 0.201
... ... ...
I want to add a column "Median" that holds the median of Score for each Batch, placed next to the rest of the data (repeating the same median value for every Unit within a Batch). Something like this:
BatchID UnitID Score Median
A123 A123-100 0.111 0.1065
A123 A123-101 0.121 0.1065
A123 A123-102 0.101 0.1065
A123 A123-103 0.102 0.1065
B456 B456-200 0.211 0.2160
B456 B456-201 0.221 0.2160
C789 C789-001 0.199 0.1955
C789 C789-002 0.189 0.1955
C789 C789-003 0.192 0.1955
C789 C789-004 0.201 0.1955
... ... ... ...
I tried groupby, among other things, but since I don't really know how to use it in this case, it isn't giving me the desired output.
Thank you!
Use groupby with transform:
df['Median'] = df.groupby('BatchID')['Score'].transform('median')
Output:
BatchID UnitID Score Median
0 A123 A123-100 0.111 0.1065
1 A123 A123-101 0.121 0.1065
2 A123 A123-102 0.101 0.1065
3 A123 A123-103 0.102 0.1065
4 B456 B456-200 0.211 0.2160
5 B456 B456-201 0.221 0.2160
6 C789 C789-001 0.199 0.1955
7 C789 C789-002 0.189 0.1955
8 C789 C789-003 0.192 0.1955
9 C789 C789-004 0.201 0.1955
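The design choice here: a plain aggregation collapses the frame to one value per batch, while transform('median') broadcasts each group's median back onto the original rows, so the result aligns with df.index and can be assigned directly as a column. A quick contrast using the same df:
# Aggregation: one value per BatchID, indexed by BatchID
per_batch = df.groupby('BatchID')['Score'].median()
# Transform: one value per original row, aligned with df.index
aligned = df.groupby('BatchID')['Score'].transform('median')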