Winsorizing does not change the max value [duplicate] - python

Please note that a similar question was asked a while back but never answered (see Winsorizing does not change the max value).
I am trying to winsorize a column in a dataframe using winsorize from scipy.stats.mstats. If there are no NaN values in the column, the process works correctly.
However, NaN values seem to prevent the process from working on the top (but not the bottom) of the distribution. Regardless of what value I set for nan_policy, the NaN values are set to the maximum value in the distribution. I feel like I must be setting the option incorrectly somehow.
Below is an example that reproduces both the correct winsorizing when there are no NaN values and the problem behavior I see when NaN values are present. Any help sorting this out would be appreciated.
#Import
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize
# initialise data of lists.
data = {'Name':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'], 'Age':[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]}
# Create 2 DataFrames
df = pd.DataFrame(data)
df2 = pd.DataFrame(data)
# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan
# Winsorize Age in both DataFrames
winsorize(df['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')
winsorize(df2['Age'], limits=[0.1, 0.1], inplace = True, nan_policy='omit')
# Check min and max values of Age in both DataFrames
print('Max/min value of Age from dataframe without NaN values')
print(df['Age'].max())
print(df['Age'].min())
print()
print('Max/min value of Age from dataframe with NaN values')
print(df2['Age'].max())
print(df2['Age'].min())

It looks like the nan_policy is being ignored. But winsorization is just clipping, so you can handle this with pandas: Series.quantile skips NaN when computing the cut points and clip() leaves NaN untouched, so missing values pass through unchanged.
def winsorize_with_pandas(s, limits):
    """
    s : pd.Series
        Series to winsorize
    limits : tuple of float
        Tuple of the percentages to cut on each side of the array,
        with respect to the number of unmasked data, as floats between 0. and 1.
    """
    return s.clip(lower=s.quantile(limits[0], interpolation='lower'),
                  upper=s.quantile(1 - limits[1], interpolation='higher'))
winsorize_with_pandas(df['Age'], limits=(0.1, 0.1))
0 3.0
1 3.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 18.0
19 18.0
Name: Age, dtype: float64
winsorize_with_pandas(df2['Age'], limits=(0.1, 0.1))
0 2.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
6 7.0
7 8.0
8 NaN
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 19.0
19 19.0
Name: Age, dtype: float64

You can consider filling the missing values with the column mean, winsorizing, and then assigning back only the rows that were originally not NaN:
df2 = pd.DataFrame(data)
# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan
# mask of non nan
_m = df2['Age'].notna()
df2.loc[_m, 'Age'] = winsorize(df2['Age'].fillna(df2['Age'].mean()), limits=[0.1, 0.1])[_m]
print(df2['Age'].max())
print(df2['Age'].min())
# 18.0
# 3.0
Or, the other option: remove the NaNs before winsorizing. The cut points differ slightly from the first option because they are now computed from the 18 observed values rather than 20 values that include the filled-in means.
df2.loc[_m, 'Age'] = winsorize(df2['Age'].loc[_m], limits=[0.1, 0.1])
print(df2['Age'].max())
print(df2['Age'].min())
# 19.0
# 2.0

I used the following code snippet as the basis for my problem (I needed to winsorize on a yearly basis, so I introduced two categories (A, B) in my toy data).
I ran into the same issue of the top-quantile values not being replaced because of the NaNs.
import pandas as pd
import numpy as np
# Getting the toy data
# To see all columns and 100 rows
pd.options.display.max_columns = None
pd.set_option('display.max_rows', 100)
df = pd.DataFrame({"Zahl":np.arange(100),"Group":[i for i in "A"*50+"B"*50]})
# Getting NaN Values for first 4 rows
df.loc[0:3, "Zahl"] = np.nan
# Define upper/lower percentile cut-offs per group (the 90th and 10th percentiles, named p99/p1 below)
p99 = df.groupby("Group")["Zahl"].quantile(.9).rename("99%-Quantile")
p1 = df.groupby("Group")["Zahl"].quantile(.1).rename("1%-Quantile")
# Defining the winsorize function
def winsor(value, p99, p1):
    if (value < p99) & (value > p1):
        return value
    elif (value > p99) & (value > p1):
        return p99
    elif (value < p99) & (value < p1):
        return p1
    else:
        return value
df["New"] = df.apply(lambda row: winsor(row["Zahl"],p99[row["Group"]],p1[row["Group"]]),axis=1)
A nice property of the winsor function is that it naturally leaves NaN values untouched: every comparison with NaN evaluates to False, so the final else branch returns the value unchanged.
Hope this idea helps with your problem.
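As a side note (not from the original answers), the same per-group winsorizing can also be done with the pandas-only helper winsorize_with_pandas from the first answer combined with groupby().transform(). A minimal sketch, assuming that helper and the Zahl/Group dataframe above are both defined:
# Per-group winsorizing with the pandas-only helper; Series.quantile skips NaN
# and clip() leaves NaN as NaN, so the missing rows pass through unchanged.
df["New_pandas"] = df.groupby("Group")["Zahl"].transform(
    lambda s: winsorize_with_pandas(s, limits=(0.1, 0.1))
)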

Related

Winsorizing on column with NaN does not change the max value


How to convert string representation of tensors into numpy array in pandas

I have a pandas df containing prediction probability distributions from a torch model. It looks like this:
Framecount Expression Probability
0 0.0 8.0 tensor([6.9263e-06, 6.6337e-10, 8.2442e-03, 4....
11 10.0 8.0 tensor([6.4393e-05, 4.4693e-07, 8.2253e-02, 1....
22 20.0 9.0 tensor([7.5355e-05, 2.4437e-07, 9.7638e-02, 1....
33 30.0 3.0 tensor([4.9751e-05, 1.1386e-06, 4.7163e-03, 7....
44 40.0 9.0 tensor([1.3237e-05, 1.3779e-07, 2.8534e-03, 1....
When I run type(df.Probability.tolist()[0]), I get str. Why is this?
How can I convert this column to contain arrays of floats so that I can do numerical operations on them?
edit
When I create the df, I essentially do the following:
d = []
framecount = 0
for x in data:
    Expression = model.predict(x)[0]   # 8.0
    Probability = model.predict(x)[1]  # tensor([6.9263e-06, 6.6337e-10,
    ...
    d.append([framecount, Expression, Probability])
    framecount += 10
df = pd.DataFrame(d)
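No answer was posted for this one. As a hedged side note (not from the original thread): a common cause of ending up with strings is that the dataframe was round-tripped through a text format such as CSV, which stores only the tensor's repr. Two possible ways to get arrays of floats, assuming the stored strings are full tensor reprs such as 'tensor([6.9263e-06, 6.6337e-10, ...])':
# Option 1: avoid the problem at the source by storing a numpy array instead of
# the raw tensor when building the dataframe, e.g.
#     Probability = model.predict(x)[1].detach().cpu().numpy()
# Option 2: parse the existing string representation back into an array of floats.
import ast
import numpy as np

def parse_tensor_string(s):
    # keep only the bracketed list inside "tensor([...])" and parse it as a literal
    inner = s[s.find('['):s.rfind(']') + 1]
    return np.array(ast.literal_eval(inner), dtype=float)

# df['Probability'] = df['Probability'].apply(parse_tensor_string)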

Python pandas show repeated values

I'm trying to read data from a txt file with pandas.read_csv, but it doesn't show the repeated (same) values in the file; for example, I have 2043 in every row of a column, but it is shown only once rather than on every row.
My file sample and result set (screenshots not included here): the cells I've circled should all be 2043 as well, but they are empty.
My code is:
import pandas as pd
df = pd.read_csv('samplefile.txt', sep='\t', header=None,
                 names=["234", "235", "236"])
You get a MultiIndex, so repeated values in the first level are simply not shown in the display.
You can convert MultiIndex to columns by reset_index:
df = df.reset_index()
Or specify each column in the names parameter to avoid the MultiIndex:
df = pd.read_csv('samplefile.txt', sep='\t', names=["one", "two", "next", "234", "235", "236"])
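As a side note (not from the original answer), the blank cells are purely a display feature of MultiIndex. A minimal sketch with synthetic data showing that the values are still there, plus two ways to make them visible:
import pandas as pd

# Synthetic stand-in for the data: 'a' plays the role of the repeated 2043 column.
df = pd.DataFrame({'a': [2043, 2043, 2043], 'b': [1, 2, 3], 'c': [4, 5, 6]})
df = df.set_index(['a', 'b'])
print(df)                                   # repeated 'a' values appear blank

pd.set_option('display.multi_sparse', False)
print(df)                                   # repeated values now printed on every row

print(df.reset_index())                     # or turn the index back into ordinary columns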
A word of warning about MultiIndex, as I was bitten by this yesterday and wasted time trying to troubleshoot a non-existent problem.
If one of your index levels is of type float64, you may find that the index values are not shown in full. I had a dataframe I was running df.groupby().describe() on, and the variable I was grouping by was originally a long int; at some point it was converted to a float, and when printed this index was rounded. There were a number of values very close to each other, so on printing it appeared that the groupby() had found multiple levels of the second index under a single value of the first.
That's not very clear, so here is an illustrative example...
import numpy as np
import pandas as pd
index = np.random.uniform(low=89908893132829,
                          high=89908893132929,
                          size=(50,))
df = pd.DataFrame({'obs': np.arange(100)},
                  index=np.append(index, index)).sort_index()
df.index.name = 'index1'
df['index2'] = [1, 2] * 50
df.reset_index(inplace=True)
df.set_index(['index1', 'index2'], inplace=True)
Look at the dataframe and it appears that there is only one level of index1...
df.head(10)
obs
index1 index2
8.990889e+13 1 4
2 54
1 61
2 11
1 89
2 39
1 65
2 15
1 60
2 10
Run groupby(['index1', 'index2']).describe() and it also looks like there is only one level of index1...
summary = df.groupby(['index1', 'index2']).describe()
summary.head()
obs
count mean std min 25% 50% 75% max
index1 index2
8.990889e+13 1 1.0 4.0 NaN 4.0 4.0 4.0 4.0 4.0
2 1.0 54.0 NaN 54.0 54.0 54.0 54.0 54.0
1 1.0 61.0 NaN 61.0 61.0 61.0 61.0 61.0
2 1.0 11.0 NaN 11.0 11.0 11.0 11.0 11.0
1 1.0 89.0 NaN 89.0 89.0 89.0 89.0 89.0
But if you look at the actual values of index1 in either you see that there are multiple unique values. In the original dataframe...
df.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132848.5,
89908893132848.5, 89908893132855.17, 89908893132855.17,
89908893132855.45, 89908893132855.45, 89908893132864.62,
89908893132864.62, 89908893132868.61, 89908893132868.61,
89908893132873.16, 89908893132873.16, 89908893132875.6,
89908893132875.6, 89908893132875.83, 89908893132875.83,
89908893132878.73, 89908893132878.73, 89908893132879.9,
89908893132879.9, 89908893132880.67, 89908893132880.67,
89908893132880.69, 89908893132880.69, 89908893132881.31,
89908893132881.31, 89908893132881.69, 89908893132881.69,
89908893132884.45, 89908893132884.45, 89908893132887.27,
89908893132887.27, 89908893132887.83, 89908893132887.83,
89908893132892.8, 89908893132892.8, 89908893132894.34,
89908893132894.34, 89908893132894.5, 89908893132894.5,
89908893132901.88, 89908893132901.88, 89908893132903.27,
89908893132903.27, 89908893132904.53, 89908893132904.53,
89908893132909.27, 89908893132909.27, 89908893132910.38,
89908893132910.38, 89908893132911.86, 89908893132911.86,
89908893132913.4, 89908893132913.4, 89908893132915.73,
89908893132915.73, 89908893132916.06, 89908893132916.06,
89908893132922.48, 89908893132922.48, 89908893132923.44,
89908893132923.44, 89908893132924.66, 89908893132924.66,
89908893132925.14, 89908893132925.14, 89908893132928.28,
89908893132928.28],
dtype='float64', name='index1')
...and in the summarised dataframe...
summary.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132855.17,
89908893132855.17, 89908893132855.45, 89908893132855.45,
89908893132864.62, 89908893132864.62, 89908893132868.61,
89908893132868.61, 89908893132873.16, 89908893132873.16,
89908893132875.6, 89908893132875.6, 89908893132875.83,
89908893132875.83, 89908893132878.73, 89908893132878.73,
89908893132879.9, 89908893132879.9, 89908893132880.67,
89908893132880.67, 89908893132880.69, 89908893132880.69,
89908893132881.31, 89908893132881.31, 89908893132881.69,
89908893132881.69, 89908893132884.45, 89908893132884.45,
89908893132887.27, 89908893132887.27, 89908893132887.83,
89908893132887.83, 89908893132892.8, 89908893132892.8,
89908893132894.34, 89908893132894.34, 89908893132894.5,
89908893132894.5, 89908893132901.88, 89908893132901.88,
89908893132903.27, 89908893132903.27, 89908893132904.53,
89908893132904.53, 89908893132909.27, 89908893132909.27,
89908893132910.38, 89908893132910.38, 89908893132911.86,
89908893132911.86, 89908893132913.4, 89908893132913.4,
89908893132915.73, 89908893132915.73, 89908893132916.06,
89908893132916.06, 89908893132922.48, 89908893132922.48,
89908893132923.44, 89908893132923.44, 89908893132924.66,
89908893132924.66, 89908893132925.14, 89908893132925.14,
89908893132928.28, 89908893132928.28],
dtype='float64', name='index1')
I wasted time scratching my head wondering why my groupby(['index1', 'index2']) had produced only one level of index1!

convert multiple columns to datetime without the date in pandas

I have a dataframe with 3 columns, one for hour, one for minute, and one for second, like this:
df = pd.DataFrame({'hour': [9.0, 9.0, 9.0, 10.0],
                   'min': [12.0, 13.0, 55.0, 2.0],
                   'sec': [42.0, 30.0, 12.0, 5.0]})
>>> df
hour min sec
0 9.0 12.0 42.0
1 9.0 13.0 30.0
2 9.0 55.0 12.0
3 10.0 2.0 5.0
I'm trying to combine the three columns into a new column made up of a datetime series. The goal would be to have this dataframe:
hour min sec time
0 9.0 12.0 42.0 9:12:42
1 9.0 13.0 30.0 9:13:30
2 9.0 55.0 12.0 9:55:12
3 10.0 2.0 5.0 10:02:05
So far I'm trying to use pd.to_datetime, as such:
df['time'] = pd.to_datetime(df[['hour', 'min', 'sec']],
                            format='%H:%M:S')
But I'm getting the following ValueError:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing.
I was trying to avoid this by including the format argument with only hour minute second, but apparently that doesn't work.
A similar question was asked here, but the solutions proposed there do not seem to work in this case; I'm still getting this ValueError.
Any ideas to solve this would be appreciated!
Thanks!
[EDIT]: I also needed the method to be able to deal with NaNs, so a dataframe such as this:
df = pd.DataFrame({'hour': [9.0, 9.0, 9.0, 10.0, np.nan],
                   'min': [12.0, 13.0, 55.0, 2.0, np.nan],
                   'sec': [42.0, 30.0, 12.0, 5.0, np.nan]})
The solution proposed by @piRSquared works.
Not sure if there is a more direct way, but this works (note that .astype(int) will raise on rows containing NaN, since NaN cannot be cast to an integer):
df['time'] = pd.to_datetime(df['hour'].astype(int).astype(str) + ':'
                            + df['min'].astype(int).astype(str) + ':'
                            + df['sec'].astype(int).astype(str),
                            format='%H:%M:%S').dt.time
hour min sec time
0 9.0 12.0 42.0 09:12:42
1 9.0 13.0 30.0 09:13:30
2 9.0 55.0 12.0 09:55:12
3 10.0 2.0 5.0 10:02:05
We can use pd.to_datetime on a dataframe with the requisite column names to create a series of datetimes.
However, the OP's initial dataframe has a 'min' column that needs to be renamed 'minute' and a 'sec' column that needs to be renamed 'second'.
In addition, I'll add the missing columns 'year', 'month', and 'day' using pd.DataFrame.assign.
Finally, I'll add the 'time' column with pd.DataFrame.assign again.
new = dict(year=2017, month=1, day=1)
rnm = dict(min='minute', sec='second')
df.assign(
    time=pd.to_datetime(
        df.rename(columns=rnm).assign(**new)
    ).dt.time
)
hour min sec time
0 9.0 12.0 42.0 09:12:42
1 9.0 13.0 30.0 09:13:30
2 9.0 55.0 12.0 09:55:12
3 10.0 2.0 5.0 10:02:05

pandas read_csv skiprows not working

I am trying to skip some rows that have incorrect values in them.
Here is the data when I read it in from a file without using the skiprows argument.
>> df
MstrRecNbrTxt UnitIDNmb PersonIDNmb PersonTypeCde
2194593 P NaN NaN NaN
2194594 300146901 1.0 1.0 1.0
4100689 DAT NaN NaN NaN
4100690 300170330 1.0 1.0 1.0
5732515 DA NaN NaN NaN
5732516 300174170 2.0 1.0 1.0
I want to skip rows 2194593, 4100689, and 5732515. I would expect to not see those rows in the table that I have read in.
>> df = pd.read_csv(file, sep='|', low_memory=False,
                    usecols=cols_to_use,
                    skiprows=[2194593, 4100689, 5732515])
Yet when I print it again, those rows are still there.
>> df
MstrRecNbrTxt UnitIDNmb PersonIDNmb PersonTypeCde
2194593 P NaN NaN NaN
2194594 300146901 1.0 1.0 1.0
4100689 DAT NaN NaN NaN
4100690 300170330 1.0 1.0 1.0
5732515 DA NaN NaN NaN
5732516 300174170 2.0 1.0 1.0
Here is the data:
{'PersonIDNmb': {2194593: nan,
2194594: 1.0,
4100689: nan,
4100690: 1.0,
5732515: nan,
5732516: 1.0},
'PersonTypeCde': {2194593: nan,
2194594: 1.0,
4100689: nan,
4100690: 1.0,
5732515: nan,
5732516: 1.0},
'UnitIDNmb': {2194593: nan,
2194594: 1.0,
4100689: nan,
4100690: 1.0,
5732515: nan,
5732516: 2.0},
'\ufeffMstrRecNbrTxt': {2194593: 'P',
2194594: '300146901',
4100689: 'DAT',
4100690: '300170330',
5732515: 'DA',
5732516: '300174170'}}
What am I doing wrong?
My end goal is to get rid of the NaN values in my dataframe so that the data can be read in as integers and not as floats (because it makes it difficult to join this table to other non-float tables).
Note that skiprows refers to line numbers in the file (0-based, and counting the header line), not to the dataframe's index labels, so the positions you pass need to account for the header. Working example... hope this helps!
from io import StringIO
import pandas as pd
import numpy as np
txt = """index,col1,col2
0,a,b
1,c,d
2,e,f
3,g,h
4,i,j
5,k,l
6,m,n
7,o,p
8,q,r
9,s,t
10,u,v
11,w,x
12,y,z"""
indices_to_skip = np.array([2, 6, 11])
# I offset `indices_to_skip` by one in order to account for header
df = pd.read_csv(StringIO(txt), index_col=0, skiprows=indices_to_skip + 1)
print(df)
col1 col2
index
0 a b
1 c d
3 g h
4 i j
5 k l
7 o p
8 q r
9 s t
10 u v
12 y z
