Winsorizing data by column in pandas with NaN - python

I'd like to winsorize several columns of data in a pandas DataFrame. Each column has some NaNs, which affect the winsorization, so they need to be excluded. The only way I know how to do this is to drop the rows containing NaNs from all of the data at once, rather than excluding them column by column.
MWE:
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize
# Create Dataframe
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one','two','three','four']
# Now scale them differently so you can see the winsorization
df['four'] = df['four']*(10**5)
df['three'] = df['three']*(10**2)
df['two'] = df['two']*(10**-1)
df['one'] = df['one']*(10**-4)
# Create NaN
df.loc[df.index.get_level_values(0).year == 2002,'three'] = np.nan
df.loc[df.index.get_level_values(0).month == 2,'two'] = np.nan
df.loc[df.index.get_level_values(0).month == 1,'one'] = np.nan
Here is the baseline distribution:
df.quantile([0, 0.01, 0.5, 0.99, 1])
output:
one two three four
0.00 2.336618e-10 2.294259e-07 0.002437 2.305353
0.01 9.862626e-07 9.742568e-04 0.975807 1003.814520
0.50 4.975859e-05 4.981049e-02 50.290946 50374.548980
0.99 9.897463e-05 9.898590e-02 98.978263 98991.438985
1.00 9.999983e-05 9.999966e-02 99.996793 99999.437779
This is how I'm winsorizing:
def using_mstats(s):
    return winsorize(s, limits=[0.01, 0.01])
wins = df.apply(using_mstats, axis=0)
wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Which gives this:
Out[356]:
one two three four
0.00 0.000001 0.001060 1.536882 1003.820149
0.01 0.000001 0.001060 1.536882 1003.820149
0.25 0.000025 0.024975 25.200378 25099.994780
0.50 0.000050 0.049810 50.290946 50374.548980
0.75 0.000075 0.074842 74.794537 75217.343920
0.99 0.000099 0.098986 98.978263 98991.436957
1.00 0.000100 0.100000 99.996793 98991.436957
Column four is correct because it has no NaNs, but the others are incorrect: after winsorizing, the 99th percentile and the max should be the same. (Apparently the NaNs sort above every real value, so the upper tail is never clipped and the lower cutoff is computed from the full column length, NaNs included.) The observation counts are identical for both:
In [357]: df.count()
Out[357]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
In [358]: wins.count()
Out[358]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
This is how I can 'solve' it, but at the cost of losing a lot of my data:
wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Output:
Out[360]:
one two three four
0.00 9.686203e-07 0.000928 0.965702 1005.209503
0.01 9.686203e-07 0.000928 0.965702 1005.209503
0.25 2.486052e-05 0.024829 25.204032 25210.837443
0.50 4.980946e-05 0.049894 50.299004 50622.227179
0.75 7.492750e-05 0.075059 74.837900 75299.906415
0.99 9.895563e-05 0.099014 98.972310 99014.311761
1.00 9.895563e-05 0.099014 98.972310 99014.311761
In [361]: wins2.count()
Out[361]:
one 51700
two 51700
three 51700
four 51700
dtype: int64
How can I winsorize the data, by column, that is not NaN, while maintaining the data shape (i.e. not removing rows)?

As often happens, simply creating the MWE helped clarify. I need to use clip() in combination with quantile() as below:
df2 = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)
df2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Output:
one two three four
0.00 9.862626e-07 0.000974 0.975807 1003.814520
0.01 9.862666e-07 0.000974 0.975816 1003.820092
0.25 2.485043e-05 0.024975 25.200378 25099.994780
0.50 4.975859e-05 0.049810 50.290946 50374.548980
0.75 7.486737e-05 0.074842 74.794537 75217.343920
0.99 9.897462e-05 0.098986 98.978245 98991.436977
1.00 9.897463e-05 0.098986 98.978263 98991.438985
In [384]: df2.count()
Out[384]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
The numbers are different from above because I have retained all of the non-missing data in each column.
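For completeness, if you specifically want scipy's winsorize rather than clip, a minimal sketch (not part of the original solution) is to apply it per column to the non-NaN values only, writing the results back in place so the NaNs and the shape are preserved:
import numpy as np
from scipy.stats.mstats import winsorize

def winsorize_ignoring_nan(s, limits=(0.01, 0.01)):
    # Winsorize only the non-missing values of a Series; NaNs stay where they are.
    out = s.copy()
    mask = s.notna()
    out.loc[mask] = np.asarray(winsorize(s.loc[mask].to_numpy(), limits=limits))
    return out

wins3 = df.apply(winsorize_ignoring_nan, axis=0)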

Related

How to create multiple DataFrames from a single DataFrame based on a condition in values in the columns that start with a string?

I have a DataFrame:
import numpy as np
import pandas as pd
main_df = pd.DataFrame([(0.12, 0.00, 1.0), (0.96, 0.04, 0.96), (0.54, 0.55, .45), (0.18, 1.0, 0.00)], columns=['Adj_R2', 'Feature importance of x1', 'Feature importance of x2'])
display(main_df)
   Adj_R2  Feature importance of x1  Feature importance of x2
0    0.12                       0.00                      1.00
1    0.96                       0.04                      0.96
2    0.54                       0.55                      0.45
3    0.18                       1.00                      0.00
I have filtered the columns that start with a certain string:
filter_col = [col for col in main_df if col.startswith('Feature importance of')]
filter_col
I want to create three separate DataFrames with the output shown below. The condition is: if a cell in a column starting with 'Feature importance of' has a value of 0, then that row should go to a separate DataFrame that excludes the column in which the 0 was encountered. The remaining rows should be put into separate DataFrames based on the same condition. In general, the length of the list containing the separate DataFrames should be at most (2^n) - 1, where n is the number of columns starting with 'Feature importance of'.
df1 = pd.DataFrame([(0.12, 0.04, 0.96), (0.12, 0.55, .45)], columns=['Adj_R2', 'Feature importance of x1', 'Feature importance of x2'])
display(df1)
# [df1](https://i.stack.imgur.com/CbrHR.png)
df2 = pd.DataFrame([(0.12, 1.0)], columns=['Adj_R2', 'Feature importance of x2'])
display(df2)
#[df2](https://i.stack.imgur.com/x0uY4.png)
df3 = pd.DataFrame([(0.18, 1.0)], columns=['Adj_R2', 'Feature importance of x1'])
display(df3)
# [df3](https://i.stack.imgur.com/z1JLq.png)
   Adj_R2  Feature importance of x1  Feature importance of x2
0    0.12                       0.04                      0.96
1    0.12                       0.55                      0.45

   Adj_R2  Feature importance of x2
0    0.12                      1.00

   Adj_R2  Feature importance of x1
0    0.18                      1.00
How do I go about this in pandas or numpy, by looping or vectorizing (either is fine)? Since I might need to extend it to more features and create more DataFrames, I need to generalize the code to loop over the filtered columns.
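No answer is recorded here, but one hypothetical sketch (an assumption, not from the original thread) is to group the rows by which of the filtered columns are zero, then drop those zero-valued columns within each group; rows with no zeros end up together with all columns kept:
# Sketch: split main_df by the pattern of zeros across the filtered columns
zero_pattern = (main_df[filter_col] == 0).apply(tuple, axis=1)
dfs = []
for pattern, group in main_df.groupby(zero_pattern):
    # pattern is a tuple of booleans, one per column in filter_col
    drop_cols = [c for c, is_zero in zip(filter_col, pattern) if is_zero]
    dfs.append(group.drop(columns=drop_cols).reset_index(drop=True))
Note this produces at most 2^n groups and does not reproduce the asker's exact expected values; it only illustrates the grouping idea.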

Winsorize dataframe columns per month while ignoring NaN's

I have a dataframe with monthly data and the following columns: date, bm and cash.
date bm cash
1981-09-30 0.210308 2.487146
1981-10-31 0.241291 2.897529
1981-11-30 0.221529 2.892758
1981-12-31 0.239002 2.726372
1981-09-30 0.834520 4.387087
1981-10-31 0.800472 4.297658
1981-11-30 0.815778 4.459382
1981-12-31 0.836681 4.895269
Now I want to winsorize my data per month while keeping NaN values in the data, i.e. I want to group the data per month and overwrite observations above the 0.99 percentile and below the 0.01 percentile with the 0.99 and 0.01 percentile values respectively. From Winsorizing data by column in pandas with NaN I found that I should do this with the clip function. My code looks as follows:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
df_grouped = df.groupby(pd.Grouper(freq='M'))
cols = df.columns
for c in cols:
    df[c] = df_grouped[c].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))
Instead of the expected output, I get the following error: ValueError: cannot reindex from a duplicate axis
P.S. I realize that I have not included my required output, but I hope that the required output is clear. Otherwise I can try to put something together.
Edit: This solution from @Allolz is already of great help, but it does not work exactly as it is supposed to. Before running the code from @Allolz, I ran:
df_in.groupby(pd.Grouper(freq='M', key='date'))['secured'].quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Which returned:
date
1980-01-31 0.00 1.580564e+00
0.01 1.599805e+00
0.25 2.388106e+00
0.50 6.427071e+00
0.75 1.200685e+01
0.99 5.133111e+01
1.00 5.530329e+01
After winsorizing I get:
date
1980-01-31 0.00 1.599805
0.01 1.617123
0.25 2.388106
0.50 6.427071
0.75 12.006854
0.99 47.756152
1.00 51.331114
It is clear that the new 0.00 and 1.00 quantiles are equal to the original 0.01 and 0.99 quantiles, which is what we would expect. However, the new 0.01 and 0.99 quantiles are not equal to the original 0.01 and 0.99 quantiles, whereas I would expect these to remain the same. What can cause this and what could solve it? My hunch is that it might have to do with NaNs in the data, but I'm not sure if that is really the cause.
One method which will be faster requires you to create helper columns. We will use groupby + transform to broadcast the 0.01 and 0.99 quantiles (for each month group) back to the DataFrame, and then you can use those Series to clip the original all at once. (clip leaves NaN alone, so it satisfies that requirement too.) Then, if you want, remove the helper columns (I'll leave them in for clarity).
Sample Data
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
df = pd.DataFrame({'date': np.random.choice(pd.date_range('2010-01-01', freq='MS', periods=12), N),
                   'val': np.random.normal(1, 0.95, N)})
Code
gp = df.groupby(pd.Grouper(freq='M', key='date'))['val']
# Assign the lower-bound ('lb') and upper-bound ('ub') for Winsorizing
df['lb'] = gp.transform('quantile', 0.01)
df['ub'] = gp.transform('quantile', 0.99)
# Winsorize
df['val_wins'] = df['val'].clip(upper=df['ub'], lower=df['lb'])
Output
The majority of rows will not be changed (only those outside the 1st-99th percentile range), so we can check the small subset of rows that did change to see that it works. You can see that rows from the same month have the same bounds, and the winsorized value ('val_wins') is properly clipped to the bound it exceeds.
df[df['val'] != df['val_wins']]
# date val lb ub val_wins
#42 2010-09-01 -1.686566 -1.125862 3.206333 -1.125862
#96 2010-04-01 -1.255322 -1.243975 2.995711 -1.243975
#165 2010-08-01 3.367880 -1.020273 3.332030 3.332030
#172 2010-09-01 -1.813011 -1.125862 3.206333 -1.125862
#398 2010-09-01 3.281198 -1.125862 3.206333 3.206333
#... ... ... ... ... ...
#9626 2010-12-01 3.626950 -1.198967 3.249161 3.249161
#9746 2010-11-01 3.472490 -1.259557 3.261329 3.261329
#9762 2010-09-01 3.460467 -1.125862 3.206333 3.206333
#9768 2010-06-01 -1.625013 -1.482529 3.295520 -1.482529
#9854 2010-12-01 -1.475515 -1.198967 3.249161 -1.198967
#
#[214 rows x 5 columns]
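As the answer notes, the helper columns can be dropped afterwards; and if you prefer not to create them at all, the same transform results can be fed straight into clip. A small variation on the code above, reusing the gp object:
# Drop the helpers once you are done with them
df = df.drop(columns=['lb', 'ub'])

# Or skip the helper columns entirely and clip in one step
df['val_wins'] = df['val'].clip(lower=gp.transform('quantile', 0.01),
                                upper=gp.transform('quantile', 0.99))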

Python can't write to dataframe from dict. Simple

dict1 = {'0-10': -0.04,
'10-20': -0.01,
'20-30': -0.03,
'30-40': -0.04,
'40-50': -0.02,
'50-60': 0.01,
'60-70': 0.05,
'70-80': 0.01,
'80-90': 0.09,
'90-100': 0.04}
stat = pd.DataFrame()
for x, y in dict1.items():
    stat[x] = y
I am trying to write the dict values to my dataframe and associate the column names with the keys. But my output is this:
Empty DataFrame
Columns: [0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-100]
Index: []
Tried it multiple times. No syntax errors. What am I missing? Thanks.
Try this:
df = pd.DataFrame(dict1, index=[0])
or
df = pd.DataFrame([dict1])
print(df)
   0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
0 -0.04  -0.01  -0.03  -0.04  -0.02   0.01   0.05   0.01   0.09    0.04
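The original loop produces an empty frame because stat has an empty index, so assigning a scalar to a column has no rows to broadcast over. If you do want to keep the loop, one option (a sketch, not from the original answer) is to give the frame a one-row index first:
import pandas as pd

stat = pd.DataFrame(index=[0])  # one row for the scalars to broadcast onto
for x, y in dict1.items():
    stat[x] = y
print(stat)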

Conditional If Statement applied to multiple columns of dataframe

I have a dataframe of minute stock returns and I would like to create a new column that is conditional on whether a return exceeded the limit (positive or negative): if so, that row's value should equal the limit (positive or negative); otherwise it should equal the last return column that was checked. The example below illustrates this:
import pandas as pd
dict = [
{'ticker':'jpm','date': '2016-11-28','returns1': 0.02,'returns2': 0.03,'limit': 0.1},
{ 'ticker':'ge','date': '2016-11-28','returns1': 0.2,'returns2': -0.3,'limit': 0.1},
{'ticker':'fb', 'date': '2016-11-28','returns1': -0.2,'returns2': 0.5,'limit': 0.1},
]
df = pd.DataFrame(dict)
df['date'] = pd.to_datetime(df['date'])
df=df.set_index(['date','ticker'], drop=True)
The target would be this:
fin_return limit returns1 returns2
date ticker
2016-11-28 jpm 0.03 0.1 0.02 0.03
ge 0.10 0.1 0.20 -0.30
fb -0.10 0.1 -0.20 0.50
So in the first row, the returns never exceeded the limit, so the value equals the value in returns2 (0.03). In row 2, the returns exceeded the limit on the upside, so the value should be the positive limit. In row 3, the returns exceeded the limit on the downside first, so the value should be the negative limit.
My actual dataframe has a couple thousand columns, so I am not quite sure how to do this (maybe a loop?). I appreciate any suggestions.
The idea is to test a stop-loss or limit trading algorithm. Whenever the lower limit is triggered, the final column should be set to the lower limit; the same goes for the upper limit, whichever comes first for that row. Once either one is triggered, the rest of the row is ignored and the next row is tested.
I am adding a different example with one more column here to make this a bit clearer (the limit is +/- 0.1)
fin_return limit returns1 returns2 returns3
date ticker
2016-11-28 jpm 0.02 0.1 0.01 0.04 0.02
ge 0.10 0.1 0.20 -0.30 0.6
fb -0.10 0.1 -0.02 -0.20 0.7
In the first row, the limit was never triggered, so the final return comes from returns3 (0.02). In row 2, the limit was triggered on the upside in returns1, so fin_return is equal to the upper limit (anything that happens in returns2 and returns3 is irrelevant for this row). In row 3, the limit was exceeded on the downside in returns2, so fin_return becomes -0.1, and anything in returns3 is irrelevant.
Use:
import numpy as np

dict = [
{'ticker':'jpm','date': '2016-11-28','returns1': 0.02,'returns2': 0.03,'limit': 0.1,'returns3':0.02},
{ 'ticker':'ge','date': '2016-11-28','returns1': 0.2,'returns2': -0.3,'limit': 0.1,'returns3':0.6},
{'ticker':'fb', 'date': '2016-11-28','returns1': -0.02,'returns2': -0.2,'limit': 0.1,'returns3':0.7},
]
df = pd.DataFrame(dict)
df['date'] = pd.to_datetime(df['date'])
df=df.set_index(['date','ticker'], drop=True)
# select all columns except the first (the 'limit' column)
df1 = df.iloc[:, 1:]
# check whether each return column is within +/- limit
mask = df1.lt(df['limit'], axis=0) & df1.gt(-df['limit'], axis=0)
m1 = mask.all(axis=1)
print (m1)
date ticker
2016-11-28 jpm True
ge False
fb False
dtype: bool
# mask values inside the limit as NaN, back-fill so the first breach moves to the front,
# then take the first column and test whether that first breach is above the upper limit
m2 = df1.mask(mask).bfill(axis=1).iloc[:, 0].gt(df['limit'])
print (m2)
date ticker
2016-11-28 jpm False
ge True
fb False
dtype: bool
arr = np.select([m1,m2, ~m2], [df1.iloc[:, -1], df['limit'], -df['limit']])
#set first column in DataFrame by insert
df.insert(0, 'fin_return', arr)
print (df)
fin_return limit returns1 returns2 returns3
date ticker
2016-11-28 jpm 0.02 0.1 0.02 0.03 0.02
ge 0.10 0.1 0.20 -0.30 0.60
fb -0.10 0.1 -0.02 -0.20 0.70
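Since the real DataFrame has a couple thousand return columns, the same steps can be wrapped in a small helper. This is a sketch under the assumption that every column except 'limit' is a return column, ordered left to right in time:
import numpy as np

def final_return(df):
    # All columns except 'limit' are assumed to be return columns in chronological order.
    returns = df.drop(columns='limit')
    within = returns.lt(df['limit'], axis=0) & returns.gt(-df['limit'], axis=0)
    never_hit = within.all(axis=1)
    # First value outside the band per row; True if that first breach was on the upside.
    first_breach_up = returns.mask(within).bfill(axis=1).iloc[:, 0].gt(df['limit'])
    return np.select([never_hit, first_breach_up, ~first_breach_up],
                     [returns.iloc[:, -1], df['limit'], -df['limit']])

# e.g. on the original frame, before 'fin_return' has been inserted:
# df.insert(0, 'fin_return', final_return(df))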

How to calculate average of numbers from multiple csv files?

I have files like the following as replicates from a simulation experiment I've been doing:
generation, ratio_of_player_A, ratio_of_player_B, ratio_of_player_C
So, the data is something like
0, 0.33, 0.33, 0.33
1, 0.40, 0.40, 0.20
2, 0.50, 0.40, 0.10
etc
Now, since I run this multiple times, I have around ~1000 files for each experiment, each containing such numbers. My problem is to average them all for one set of experiments.
Thus, I would like to have a file that contains the average ratio after each generation (averaged over the replicates, i.e. over files).
All the replicate output files which need to be averaged are named like output1.csv, output2.csv, output3.csv ... output1000.csv
I'd be obliged if someone could help me out with a shell script, or a python script.
If I understood correctly, let's say you have 2 files like these:
$ cat file1
0, 0.33, 0.33, 0.33
1, 0.40, 0.40, 0.20
2, 0.50, 0.40, 0.10
$ cat file2
0, 0.99, 1, 0.02
1, 0.10, 0.90, 0.90
2, 0.30, 0.10, 0.30
And you want to take the mean of the corresponding columns across both files. So here is a way for the first column.
Edit: I found a better way, using pd.concat:
all_files = pd.concat([file1,file2]) # you can easily put your 1000 files here
result = {}
for i in range(3):  # 3 being the number of generations
    result[i] = all_files[i::3].mean()
result_df = pd.DataFrame(result)
result_df
0 1 2
ratio_of_player_A 0.660 0.25 0.40
ratio_of_player_B 0.665 0.65 0.25
ratio_of_player_C 0.175 0.55 0.20
Another way is with merge, but one needs to perform multiple merges:
import pandas as pd
In [1]: names = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]
In [2]: file1 = pd.read_csv("file1", index_col=0, names=names)
In [3]: file2 = pd.read_csv("file2", index_col=0, names=names)
In [3]: file1
Out[3]:
ratio_of_player_A ratio_of_player_B ratio_of_player_C
generation
0 0.33 0.33 0.33
1 0.40 0.40 0.20
2 0.50 0.40 0.10
In [4]: file2
Out[4]:
ratio_of_player_A ratio_of_player_B ratio_of_player_C
generation
0 0.99 1.0 0.02
1 0.10 0.9 0.90
2 0.30 0.1 0.30
In [5]: merged_file = file1.merge(file2, right_index=True, left_index=True, suffixes=["_1","_2"])
In [6]: merged_file.filter(regex="ratio_of_player_A_*").mean(axis=1)
Out[6]
generation
0 0.66
1 0.25
2 0.40
dtype: float64
Or this way (a bit faster, I guess):
merged_file.iloc[:, ::3].mean(axis=1)  # player A (iloc replaces the deprecated .ix)
You can merge recursively before applying the mean() method if you have more than one file.
If I misunderstood the question, please show us what you expect from file1 and file2.
Ask if there is something you don't understand.
Hope this helps !
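For the full ~1000 files named in the question (output1.csv ... output1000.csv), here is a sketch that reads them all and averages per generation with groupby, assuming the files have no header row, like file1/file2 above:
import pandas as pd

names = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]
files = ["output%d.csv" % i for i in range(1, 1001)]
all_files = pd.concat(pd.read_csv(f, header=None, names=names) for f in files)
# Average each ratio over the replicates, generation by generation
averages = all_files.groupby("generation").mean()
averages.to_csv("averages.csv")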
The following should work:
from numpy import genfromtxt
files = ["file1", "file2", ...]
data = genfromtxt(files[0], delimiter=',')
for f in files[1:]:
    data += genfromtxt(f, delimiter=',')
data /= len(files)
You can load each of the 1000 experiments in a dataframe, sum them all, then calculate the mean.
import tkinter.filedialog
import pandas as pd

dfs = []
filepath = tkinter.filedialog.askopenfilenames(filetypes=[('CSV', '*.csv')])  # select your files
for file in filepath:
    df = pd.read_csv(file, sep=';', decimal=',')
    dfs.append(df)
temp = dfs[0]  # temporary variable to hold the running sum
for i in range(1, len(dfs)):  # starts from 1 because dfs[0] is already stored in temp
    temp = temp + dfs[i]
result = temp / len(dfs)
Your problem is not very clear, but if I understand it right:
> temp
for i in *.csv; do
    cat "$i" >> temp
done
Then you have all the data from the different files in one big file. Try loading it into an sqlite database (1. create a table, 2. insert the data). After that you can query your data like:
select sum(columns)/count(columns) from yourtablehavingtempdata
Have a look at sqlite, since your data is tabular; sqlite will be better suited in my opinion.
