Transforming outliers in Pandas DataFrame using .apply, .applymap, .groupby - python

I'm attempting to transform a pandas DataFrame object into a new object that contains a classification of the points based upon some simple thresholds:
Value transformed to 0 if the point is NaN
Value transformed to 1 if the point is negative or 0
Value transformed to 2 if it falls outside certain criteria based on the entire column
Value is 3 otherwise
Here is a very simple self-contained example:
import pandas as pd
import numpy as np
df=pd.DataFrame({'a':[np.nan,1000000,3,4,5,0,-7,9,10],'b':[2,3,-4,5,6,1000000,7,9,np.nan]})
print(df)
The transformation process I've created so far:
#Loop through and find points greater than the mean -- in this simple example, these are the 'outliers'
outliers = pd.DataFrame()
for datapoint in df.columns:
    tempser = pd.DataFrame(df[datapoint][np.abs(df[datapoint]) > (df[datapoint].mean())])
    outliers = pd.merge(outliers, tempser, right_index=True, left_index=True, how='outer')
outliers[outliers.isnull() == False] = 2
#Classify everything else as "3"
df[df > 0] = 3
#Classify negative and zero points as a "1"
df[df <= 0] = 1
#Update with the outliers
df.update(outliers)
#Everything else is a "0"
df.fillna(value=0, inplace=True)
Resulting in:
I have tried to use .applymap() and/or .groupby() to speed up the process, with no luck. I found some guidance in this answer; however, I'm still unsure how .groupby() is useful when you're not grouping within a pandas column.

Here's a replacement for the outliers part. It's about 5x faster for your sample data on my computer.
>>> pd.DataFrame( np.where( np.abs(df) > df.mean(), 2, df ), columns=df.columns )
     a    b
0  NaN    2
1    2    3
2    3   -4
3    4    5
4    5    6
5    0    2
6   -7    7
7    9    9
8   10  NaN
You could also do it with apply. It's much simpler, but it will be slower than the np.where approach (roughly the same speed as what you are currently doing). That's probably a good example of why you should avoid apply wherever possible when you care about speed.
>>> df[ df.apply( lambda x: abs(x) > x.mean() ) ] = 2
You could also do this, which is faster than apply but slower than np.where:
>>> mask = np.abs(df) > df.mean()
>>> df[mask] = 2
Of course, these things don't always scale linearly, so test them on your real data and see how that compares.
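For completeness, here is a sketch (not from the original answer) that folds the whole 0/1/2/3 classification from the question into a single np.select pass. Condition order matters, since earlier conditions take precedence: outliers win over the negative-or-zero rule, matching the original update step.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1000000, 3, 4, 5, 0, -7, 9, 10],
                   'b': [2, 3, -4, 5, 6, 1000000, 7, 9, np.nan]})

conditions = [
    df.isnull(),               # -> 0: missing
    np.abs(df) > df.mean(),    # -> 2: outlier relative to the column mean
    df <= 0,                   # -> 1: negative or zero
]
classified = pd.DataFrame(np.select(conditions, [0, 2, 1], default=3),
                          index=df.index, columns=df.columns)
print(classified)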

Related

How do I replace a string-value in a specific column using method chaining?

I have a pandas data frame, where some string values are "NA". I want to replace these values in a specific column (i.e. the 'strCol' in the example below) using method chaining.
How do I do this? (I googled quite a bit without success even though this should be easy?! ...)
Here is a minimal example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['val1', 'val2', 'NA', 'val3']})
df = (
    df
    .rename(columns={'A': 'intCol', 'B': 'strCol'})  # method chain example operation 1
    .astype({'intCol': float})                       # method chain example operation 2
    # .where(df['strCol']=='NA', pd.NA)              # how to replace the string 'NA' here? this does not work ...
)
df
You can try replace instead of where:
df.replace({'strCol':{'NA':pd.NA}})
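For instance, a sketch of the same call dropped into the chain from the question:
import pandas as pd

df = (
    pd.DataFrame({'A': [1, 2, 3, 4],
                  'B': ['val1', 'val2', 'NA', 'val3']})
      .rename(columns={'A': 'intCol', 'B': 'strCol'})
      .astype({'intCol': float})
      .replace({'strCol': {'NA': pd.NA}})  # nested dict: column -> {old value: new value}
)
print(df)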
Use a lambda in the where clause to evaluate the chained dataframe:
df = (df.rename(columns={'A':'intCol', 'B':'strCol'})
.astype({'intCol':float})
.where(lambda x: x['strCol']=='NA', pd.NA))
Output:
>>> df
  intCol strCol
0    NaN   <NA>
1    NaN   <NA>
2    3.0     NA
3    NaN   <NA>
Many methods like where, mask, groupby, apply can take a callable or a function so you can pass a lambda function.
pandas.DataFrame.where does the following:
Replace values where the condition is False.
So you need the condition to not hold where you want to make the replacement. A simple example:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5,6,7,8,9]})
df2 = df.where(df.x%2==0,-1)
print(df2)
gives output
   x
0 -1
1  2
2 -1
3  4
4 -1
5  6
6 -1
7  8
8 -1
Observe that the odd values were replaced by -1, whilst the condition holds for the even values, so they are kept.
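For comparison, DataFrame.mask is the inverse: it replaces values where the condition is True, so this sketch gives the same result as the .where call above:
df3 = df.mask(df.x % 2 != 0, -1)
print(df3)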

Vector arithmetic by conditional selection from multiple columns in a dataframe

I'm trying to do arithmetic among different cells in my dataframe and can't figure out how to operate on each of my groups. I'm trying to find the difference in energy_use between a baseline building (in this example upgrade_name == b is the baseline case) and each upgrade, for each building. I have an arbitrary number of building_id's and arbitrary number of upgrade_names.
I can do this successfully for a single building_id. Now I need to expand this out to a full dataset and am stuck. I will have 10's of thousands of buildings and dozens of upgrades for each building.
The answer to this question Iterating within groups in Pandas may be related, but I'm not sure how to apply it to my problem.
I have a dataframe like this:
df = pd.DataFrame({'building_id': [1,2,1,2,1], 'upgrade_name': ['a', 'a', 'b', 'b', 'c'], 'energy_use': [100.4, 150.8, 145.1, 136.7, 120.3]})
In [4]: df
Out[4]:
   building_id upgrade_name  energy_use
0            1            a       100.4
1            2            a       150.8
2            1            b       145.1
3            2            b       136.7
4            1            c       120.3
For a single building_id I have the following code:
upgrades = df.loc[df.building_id == 1, ['upgrade_name', 'energy_use']]
starting_point = upgrades.loc[upgrades.upgrade_name == 'b', 'energy_use']
upgrades['diff'] = upgrades.energy_use - starting_point.values[0]
In [8]: upgrades
Out[8]:
  upgrade_name  energy_use  diff
0            a       100.4 -44.7
2            b       145.1   0.0
4            c       120.3 -24.8
How do I write this for arbitrary numbers of building_id's, instead of my hard-coded building_id == 1?
The ideal solution looks like this (doesn't matter if the baseline differences are 0 or NaN):
In [17]: df
Out[17]:
   building_id upgrade_name  energy_use  ideal
0            1            a       100.4  -44.7
1            2            a       150.8   14.1
2            1            b       145.1    0.0
3            2            b       136.7    0.0
4            1            c       120.3  -24.8
Define the function counting the difference in energy usage (for
a group of rows for the current building) as follows:
def euDiff(grp):
    euBase = grp[grp.upgrade_name == 'b'].energy_use.values[0]
    return grp.energy_use - euBase
Then compute the difference (for all buildings), applying it to each group:
df['ideal'] = df.groupby('building_id').apply(euDiff)\
                .reset_index(level=0, drop=True)
The result is just as you expected.
thanks for sharing that example data! Made things a lot easier.
I suggest solving this in two parts:
1. Make a dictionary from your dataframe that contains the baseline energy use for each building
2. Apply a lambda function to your dataframe that subtracts the baseline value associated with each building from that row's energy use value.
# set index to building_id, turn into a dictionary, keep only energy_use
building_baseline = df[df['upgrade_name'] == 'b'].set_index('building_id').to_dict()['energy_use']
# apply a lambda to the dataframe, using axis=1 to access rows
df['diff'] = df.apply(lambda row: row['energy_use'] - building_baseline[row['building_id']], axis=1)
You could also write a function to do this. You also don't necessarily need the dictionary, it just makes things easier. If you're curious about these alternative solutions let me know and I can add them for you.
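For instance, a fully vectorized sketch of the same idea: build a baseline Series indexed by building_id and use .map() instead of a row-wise apply (this assumes exactly one 'b' row per building, as in the question's data):
baseline = df.loc[df['upgrade_name'] == 'b'].set_index('building_id')['energy_use']
df['diff'] = df['energy_use'] - df['building_id'].map(baseline)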

Python PANDAS: Groupby Transform Sum Unique

I have a situation where I am creating a pivot table in pandas where it makes more sense to calculate the fields separately and just use .pivot_table() for the pivot step. However, I am running into some difficulty trying to calculate the denominator for my percentages. Essentially, due to the data format, I appear to need to do something like a "groupby transform unique sum" on the second line below (which is where I am stuck):
df['numerator'] = df.groupby(['category1','category2'])['customer_id'].transform('nunique')
df['denominator'] = df.groupby(['category2'])['numerator'].nunique().transform('sum')
df['percentage'] = (df['numerator'] / df['denominator'])
df_pivot = df.pivot_table(index='category1',
                          columns=['category2'],
                          values=['numerator','percentage']) \
             .swaplevel(0,1,axis=1)
df_pivot.loc['total', :] = df_pivot.sum().values
My apologies for not being able to provide any dummy data, but I would appreciate any tips; I hope I have provided enough detail to reason about the problem.
I believe you need a lambda function with unique and sum:
df = pd.DataFrame({'numerator':[3,1,1,9,2,2],
                   'category2':list('aaabbb')})
#print (df)
df['denominator']=df.groupby(['category2'])['numerator'].transform(lambda x: x.unique().sum())
Alternative solution with sets and sums:
df['denominator']=df.groupby(['category2'])['numerator'].transform(lambda x: sum(set(x)))
print (df)
  category2  numerator  denominator
0         a          3            4
1         a          1            4
2         a          1            4
3         b          9           11
4         b          2           11
5         b          2           11
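For context, a rough sketch of how this denominator could feed back into the percentage and pivot steps from the question; the category1 values here are made up, since no dummy data was provided:
df['category1'] = list('xyzxyz')  # invented grouping, just for illustration
df['percentage'] = df['numerator'] / df['denominator']
df_pivot = (df.pivot_table(index='category1',
                           columns='category2',
                           values=['numerator', 'percentage'])
              .swaplevel(0, 1, axis=1))
df_pivot.loc['total', :] = df_pivot.sum().values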

How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.
Initially I tried a for-loop over each value of the dataframe, which was taking too much time.
Then I used data_new=data.subtract(data), which was meant to subtract all the values of the dataframe from itself so that I could make all the non-null values 0.
But an error occurred as the dataframe had multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull and cast the boolean result to int with astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
     a    b
0  NaN  1.0
1  4.0  NaN
2  NaN  3.0

print (df.notnull())
       a      b
0  False   True
1   True  False
2  False   True

print ((df.notnull()).astype('int'))
   a  b
0  0  1
1  1  0
2  0  1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to have a source for a column populated via an operation on another.
e.g. if df['col1'] is the existing column
df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else 0)
where col2 is the new column. This should also work if col1 has string entries.
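A vectorized sketch of the same idea using np.where (an alternative, not the original poster's code), again writing into the hypothetical col2:
import numpy as np
df['col2'] = np.where(df['col1'].notnull(), 1, 0)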
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import datetime as dt
# create dataframe with randomly place NaN's
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)
trials = np.arange(100)
d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print((dt.datetime.now() - d1).total_seconds() / trials.size)
# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()
d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1
print((dt.datetime.now() - d1).total_seconds() / trials.size)
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
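For reference, a sketch of the same comparison using timeit instead of manual datetime arithmetic (the indexed_replace helper name is made up here; absolute numbers will vary by machine and pandas version):
import timeit

print(timeit.timeit(lambda: df.notnull().astype(int), number=100) / 100)

def indexed_replace():
    # the indexing approach, wrapped so timeit can call it repeatedly
    out = df.copy()
    out[df.isnull()] = 0
    out[df.isnull() == False] = 1
    return out

print(timeit.timeit(indexed_replace, number=100) / 100)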
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
The code above does not work for me; the code below does.
df[~df.isnull()] = 1 # not nan
df[df.isnull()] = 0 # nan
This is with pandas 0.25.3.
And if you want to just change values in specific columns, you may need to create a temp dataframe and assign it to the columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col]
tmp[tmp.isnull()]='xxx'
df[change_col]=tmp
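A sketch of the same column-subset replacement without the temporary frame, assuming the goal is just to fill NaNs in those columns (same hypothetical change_col and 'xxx' placeholder as above):
change_col = ['a', 'b']
df[change_col] = df[change_col].fillna('xxx')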
Try this one:
df.notnull().mul(1)
Here I will give a suggestion for a particular column: if a row in that column is NaN, replace it with 0; if there is a value in that column, replace it with 1.
The line below will set the NaN entries in your column to 0:
df.YourColumnName.fillna(0,inplace=True)
Now the rest (the non-NaN part) will be replaced with 1 by the code below:
df["YourColumnName"]=df["YourColumnName"].apply(lambda x: 1 if x!=0 else 0)
The same can be applied to the whole dataframe by not specifying a column name.
Use: df.fillna(0)
to fill NaN with 0.
Generally there are two steps: substitute all non-NaN values, and then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line replaces all non-NaN values with 1.
dataframe.fillna(0) - this line replaces all NaNs with 0.
Side note: if you take a look at the pandas documentation, .where replaces all values where the condition is False; this is the important point. That is why we use inversion to create the mask ~dataframe.notna(), by which .where() decides which values to replace.
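A minimal sketch putting the two steps together on a made-up frame (not from the original post):
import numpy as np
import pandas as pd

dataframe = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1, np.nan, 3]})
result = dataframe.where(~dataframe.notna(), 1).fillna(0)
print(result)
#      a    b
# 0  0.0  1.0
# 1  1.0  0.0
# 2  0.0  1.0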

Normalize data in pandas

Suppose I have a pandas data frame df:
I want to calculate the column-wise mean of a data frame.
This is easy:
df.apply(average)
Then the column-wise range, max(col) - min(col). This is easy again:
df.apply(max) - df.apply(min)
Now for each element I want to subtract its column's mean and divide by its column's range. I am not sure how to do that.
Any help/pointers are much appreciated.
In [92]: df
Out[92]:
           a         b          c         d
A  -0.488816  0.863769   4.325608 -4.721202
B -11.937097  2.993993 -12.916784 -1.086236
C  -5.569493  4.672679  -2.168464 -9.315900
D   8.892368  0.932785   4.535396  0.598124

In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())

In [94]: df_norm
Out[94]:
          a         b         c         d
A  0.085789 -0.394348  0.337016 -0.109935
B -0.463830  0.164926 -0.650963  0.256714
C -0.158129  0.605652 -0.035090 -0.573389
D  0.536170 -0.376229  0.349037  0.426611

In [95]: df_norm.mean()
Out[95]:
a   -2.081668e-17
b    4.857226e-17
c    1.734723e-17
d   -1.040834e-17

In [96]: df_norm.max() - df_norm.min()
Out[96]:
a    1
b    1
c    1
d    1
If you don't mind importing the sklearn library, I would recommend the method discussed on this blog.
import pandas as pd
from sklearn import preprocessing
data = {'score': [234,24,14,27,-74,46,73,-18,59,160]}
df = pd.DataFrame(data)
cols = df.columns
df
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns = cols)
df_normalized
You can use apply for this, and it's a bit neater:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randn(4,4)* 4 + 3)
          0         1         2         3
0  9.497381  0.552974  0.887313 -1.291874
1  6.461631 -6.206155  9.979247 -0.044828
2  4.276156  2.002518  8.848432 -5.240563
3  1.710331  1.463783  7.535078 -1.399565
df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
          0         1         2         3
0  0.515087  0.133967 -0.651699  0.135175
1  0.125241 -0.689446  0.348301  0.375188
2 -0.155414  0.310554  0.223925 -0.624812
3 -0.484913  0.244924  0.079473  0.114448
Also, it works nicely with groupby, if you select the relevant columns:
df['grp'] = ['A', 'A', 'B', 'B']
          0         1         2         3 grp
0  9.497381  0.552974  0.887313 -1.291874   A
1  6.461631 -6.206155  9.979247 -0.044828   A
2  4.276156  2.002518  8.848432 -5.240563   B
3  1.710331  1.463783  7.535078 -1.399565   B
df.groupby(['grp'])[[0,1,2,3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
     0    1    2    3
0  0.5  0.5 -0.5 -0.5
1 -0.5 -0.5  0.5  0.5
2  0.5  0.5  0.5 -0.5
3 -0.5 -0.5 -0.5  0.5
Slightly modified from Python Pandas Dataframe: Normalize data between 0.01 and 0.99?, but from some of the comments I thought it was relevant here (sorry if it's considered a repost though...).
I wanted customized normalization, where the regular percentile of datum or z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define them from something other than my sample, or a different midpoint, or whatever! This can often be useful for rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1, but some of your data may need to be scaled in a more customized way... because percentiles and stdevs assume your sample covers the population, but sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (with extra steps in the code to make it as readable as possible):
import numpy as np  # used for the 'avg' and 'median' center options

def NormData(s, low='min', center='mid', hi='max', insideout=False, shrinkfactor=0.):
    # resolve the low and high anchor points
    if low == 'min':
        low = min(s)
    elif low == 'abs':
        low = max(abs(min(s)), abs(max(s))) * -1.  # sign(min(s))
    if hi == 'max':
        hi = max(s)
    elif hi == 'abs':
        hi = max(abs(min(s)), abs(max(s))) * 1.    # sign(max(s))

    # resolve the midpoint
    if center == 'mid':
        center = (max(s) + min(s)) / 2
    elif center == 'avg':
        center = np.mean(s)
    elif center == 'median':
        center = np.median(s)

    # shift everything so the center sits at 0
    s2 = [x - center for x in s]
    hi = hi - center
    low = low - center
    center = 0.

    # map each value into [0, 1], clipping anything outside [low, hi]
    r = []
    for x in s2:
        if x < low:
            r.append(0.)
        elif x > hi:
            r.append(1.)
        else:
            if x >= center:
                r.append((x - center) / (hi - center) * 0.5 + 0.5)
            else:
                r.append((x - low) / (center - low) * 0.5 + 0.)

    # optionally flip the scale so values near the center map to 1
    if insideout == True:
        ir = [(1. - abs(z - 0.5) * 2.) for z in r]
        r = ir

    # pull the results away from the 0 and 1 endpoints by shrinkfactor
    rr = [x - (x - 0.5) * shrinkfactor for x in r]
    return rr
This will take in a pandas series, or even just a list, and normalize it to your specified low, center, and high points. There is also a shrink factor, to let you scale the data away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib: Single pcolormesh with more than one colormap using Matplotlib). You can likely see how the code works, but basically: say you have the values [-5,2,10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, our "10", is effectively treated as a 7) with a midpoint of 1, but shrunk to fit a 256-color RGB colormap:
#In[1]
NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]
It can also turn your data inside out... this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than hi/low. You could heatmap based on normalized data where insideout=True:
#In[2]
NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]
So now "2" which is closest to the center, defined as "1" is the highest value.
Anyways, I thought my application was relevant if you're looking to rescale data in other ways that could have useful applications to you.
This is how you do it column-wise:
[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]
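For comparison, a sketch of the same column-wise min-max scaling done with whole-frame arithmetic instead of a list comprehension with side effects (this relies on pandas broadcasting min/max per column):
df_scaled = (df - df.min()) / (df.max() - df.min())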
