Normalize data in pandas - python

Suppose I have a pandas data frame df.
I want to calculate the column-wise mean of the data frame. This is easy:
df.apply(np.mean)
Then the column-wise range, max(col) - min(col). This is easy again:
df.apply(max) - df.apply(min)
Now, for each element, I want to subtract its column's mean and divide by its column's range. I am not sure how to do that.
Any help/pointers are much appreciated.

In [92]: df
Out[92]:
a b c d
A -0.488816 0.863769 4.325608 -4.721202
B -11.937097 2.993993 -12.916784 -1.086236
C -5.569493 4.672679 -2.168464 -9.315900
D 8.892368 0.932785 4.535396 0.598124
In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())
In [94]: df_norm
Out[94]:
a b c d
A 0.085789 -0.394348 0.337016 -0.109935
B -0.463830 0.164926 -0.650963 0.256714
C -0.158129 0.605652 -0.035090 -0.573389
D 0.536170 -0.376229 0.349037 0.426611
In [95]: df_norm.mean()
Out[95]:
a -2.081668e-17
b 4.857226e-17
c 1.734723e-17
d -1.040834e-17
In [96]: df_norm.max() - df_norm.min()
Out[96]:
a 1
b 1
c 1
d 1
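If you instead want each column scaled into the [0, 1] range rather than centered on zero, the same vectorized pattern works; a minimal sketch (df_minmax is just an illustrative name):
df_minmax = (df - df.min()) / (df.max() - df.min())
Each column of df_minmax then has minimum 0 and maximum 1.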

If you don't mind importing the sklearn library, I would recommend the method discussed on this blog:
import pandas as pd
from sklearn import preprocessing
data = {'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]}
df = pd.DataFrame(data)
cols = df.columns
df
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns=cols)
df_normalized
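If you want zero-mean, unit-variance scaling instead, sklearn's StandardScaler follows the same fit/transform pattern — a minimal sketch (df_standardized is just an illustrative name):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)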

You can use apply for this, and it's a bit neater:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randn(4,4)* 4 + 3)
0 1 2 3
0 9.497381 0.552974 0.887313 -1.291874
1 6.461631 -6.206155 9.979247 -0.044828
2 4.276156 2.002518 8.848432 -5.240563
3 1.710331 1.463783 7.535078 -1.399565
df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
0 1 2 3
0 0.515087 0.133967 -0.651699 0.135175
1 0.125241 -0.689446 0.348301 0.375188
2 -0.155414 0.310554 0.223925 -0.624812
3 -0.484913 0.244924 0.079473 0.114448
Also, it works nicely with groupby, if you select the relevant columns:
df['grp'] = ['A', 'A', 'B', 'B']
0 1 2 3 grp
0 9.497381 0.552974 0.887313 -1.291874 A
1 6.461631 -6.206155 9.979247 -0.044828 A
2 4.276156 2.002518 8.848432 -5.240563 B
3 1.710331 1.463783 7.535078 -1.399565 B
df.groupby(['grp'])[[0,1,2,3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
0 1 2 3
0 0.5 0.5 -0.5 -0.5
1 -0.5 -0.5 0.5 0.5
2 0.5 0.5 0.5 -0.5
3 -0.5 -0.5 -0.5 0.5
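With only two rows per group, every value is either its group's minimum or maximum, hence the ±0.5 pattern above. If you want the result aligned to the original index, groupby.transform does the same per-column computation and is arguably more idiomatic — a minimal sketch:
df.groupby('grp')[[0, 1, 2, 3]].transform(lambda x: (x - x.mean()) / (x.max() - x.min()))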

Slightly modified from Python Pandas Dataframe: Normalize data between 0.01 and 0.99?, but from some of the comments I thought it was relevant here (sorry if it's considered a repost...).
I wanted customized normalization, because a regular percentile or z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define them from something other than my sample, or use a different midpoint, or whatever! This can often be useful for rescaling and normalizing data for neural nets, where you may want all inputs between 0 and 1, but some of your data may need to be scaled in a more customized way... because percentiles and stdevs assume your sample covers the population, but sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So I built a custom function (with extra steps in the code to make it as readable as possible):
import numpy as np

def NormData(s, low='min', center='mid', hi='max', insideout=False, shrinkfactor=0.):
    if low == 'min':
        low = min(s)
    elif low == 'abs':
        low = max(abs(min(s)), abs(max(s))) * -1.  # sign(min(s))
    if hi == 'max':
        hi = max(s)
    elif hi == 'abs':
        hi = max(abs(min(s)), abs(max(s))) * 1.  # sign(max(s))
    if center == 'mid':
        center = (max(s) + min(s)) / 2
    elif center == 'avg':
        center = np.mean(s)
    elif center == 'median':
        center = np.median(s)
    s2 = [x - center for x in s]
    hi = hi - center
    low = low - center
    center = 0.
    r = []
    for x in s2:
        if x < low:
            r.append(0.)
        elif x > hi:
            r.append(1.)
        else:
            if x >= center:
                r.append((x - center) / (hi - center) * 0.5 + 0.5)
            else:
                r.append((x - low) / (center - low) * 0.5 + 0.)
    if insideout:
        r = [(1. - abs(z - 0.5) * 2.) for z in r]
    rr = [x - (x - 0.5) * shrinkfactor for x in r]
    return rr
This will take in a pandas Series, or even just a list, and normalize it to your specified low, center, and high points. There is also a shrink factor, to let you scale the data away from the endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib: Single pcolormesh with more than one colormap using Matplotlib). You can likely see how the code works, but basically: say you have values [-5,2,10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, like our "10", is effectively treated as a 7) with a midpoint of 1, then shrink it to fit a 256-color RGB colormap:
#In[1]
NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]
It can also turn your data inside out... this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than to hi/low. You could heatmap based on normalized data with insideout=True:
#In[2]
NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]
So now "2" which is closest to the center, defined as "1" is the highest value.
Anyways, I thought my application was relevant if you're looking to rescale data in other ways that could have useful applications to you.
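As an aside, the piecewise-linear core of this mapping can also be written with np.interp, which clips values outside the xp range by design. A compact sketch of the same idea (norm_interp is just an illustrative name, and it assumes low < center < hi):
import numpy as np

def norm_interp(s, low, center, hi, insideout=False, shrinkfactor=0.):
    # map low -> 0, center -> 0.5, hi -> 1; values outside [low, hi] are clipped
    r = np.interp(s, [low, center, hi], [0., 0.5, 1.])
    if insideout:
        r = 1. - np.abs(r - 0.5) * 2.
    return r - (r - 0.5) * shrinkfactor

norm_interp([-5, 2, 10], low=-7, center=1, hi=7, shrinkfactor=2./256)
# -> array([0.12792969, 0.58268229, 0.99609375]), matching NormData above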

This is how you do it column-wise:
for col in df.columns:
    df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min()))

Related

Pandas: Calculating a Z-score to avoid "look ahead" bias

I have time series data in a dataframe named "df", and my code for calculating the z-score is given below:
mean = df.mean()
standard_dev = df.std()
z_score = (df - mean) / standard_dev
I would like to calculate the z-score for each observation using the respective observation and the data that was known at the point of recording the observation. i.e. I do not want to use a standard deviation and mean that incorporate data occurring after a specific point in time. I just want to use data from time t, t-1, t-2...
How do I do this?
Use .expanding(), with col being the column you want to compute the statistics for (drop [col] if you wish to compute it for the whole dataframe).
You might need to sort by the time column first, denoted here as time_col, in case it isn't sorted already:
df = df.sort_values("time_col", axis=0)
Then:
df[col].sub(df[col].expanding().mean()).div(df[col].expanding().std())
Ref:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.expanding.html
For the sample data:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqrstuv"), "b": [6,5,7,1,-9,0,3,5,2,8]})
df["c"]=df["b"].sub(df["b"].expanding().mean()).div(df["b"].expanding().std())
Outputs:
a b c
0 x 6 NaN
1 y 5 -0.707107
2 z 7 1.000000
3 p 1 -1.425880
4 q -9 -1.677484
5 r 0 -0.281450
6 s 3 0.210502
7 t 5 0.534207
8 u 2 -0.046142
9 v 8 1.062430
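If you ever need the statistics to exclude the current observation itself (i.e. use only strictly earlier data), you can shift the expanding statistics by one row — a sketch (c_prior is just an illustrative column name):
df["c_prior"] = df["b"].sub(df["b"].expanding().mean().shift(1)).div(df["b"].expanding().std().shift(1))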
You could assign two new columns containing the mean and std of the previous items. Here I assume that your time series data is in the column 'time_series_data'. (Note that np.std defaults to the population standard deviation, ddof=0, whereas pandas' .std() uses ddof=1, so this differs slightly from the answer above.)
import numpy as np
len_ = len(df)
df['mean_past'] = [np.mean(df['time_series_data'][0:lv+1]) for lv in range(len_)]
df['std_past'] = [np.std(df['time_series_data'][0:lv+1]) for lv in range(len_)]
df['z_score'] = (df['time_series_data'] - df['mean_past']) / df['std_past']
Edit: if you want to z-score all columns, you can define a function that computes the z-score and apply it to all columns of your dataframe:
def z_score_column(column):
    len_ = len(column)
    mean = [np.mean(column[0:lv+1]) for lv in range(len_)]
    std = [np.std(column[0:lv+1]) for lv in range(len_)]
    return [(c - m) / s for c, m, s in zip(column, mean, std)]

df = pd.DataFrame(np.random.rand(10, 5))
df.apply(z_score_column)
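The same running statistics can also be written with pandas' expanding windows, which avoids the Python-level list comprehensions — a sketch (ddof=0 matches np.std above; pandas defaults to ddof=1):
col = df['time_series_data']
df['z_score'] = (col - col.expanding().mean()) / col.expanding().std(ddof=0)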

Vector arithmetic by conditional selection from multiple columns in a dataframe

I'm trying to do arithmetic among different cells in my dataframe and can't figure out how to operate on each of my groups. I'm trying to find the difference in energy_use between a baseline building (in this example upgrade_name == 'b' is the baseline case) and each upgrade, for each building. I have an arbitrary number of building_ids and an arbitrary number of upgrade_names.
I can do this successfully for a single building_id. Now I need to expand this out to the full dataset and am stuck. I will have tens of thousands of buildings and dozens of upgrades for each building.
The answer to this question Iterating within groups in Pandas may be related, but I'm not sure how to apply it to my problem.
I have a dataframe like this:
df = pd.DataFrame({'building_id': [1,2,1,2,1], 'upgrade_name': ['a', 'a', 'b', 'b', 'c'], 'energy_use': [100.4, 150.8, 145.1, 136.7, 120.3]})
In [4]: df
Out[4]:
building_id upgrade_name energy_use
0 1 a 100.4
1 2 a 150.8
2 1 b 145.1
3 2 b 136.7
4 1 c 120.3
For a single building_id I have the following code:
upgrades = df.loc[df.building_id == 1, ['upgrade_name', 'energy_use']]
starting_point = upgrades.loc[upgrades.upgrade_name == 'b', 'energy_use']
upgrades['diff'] = upgrades.energy_use - starting_point.values[0]
In [8]: upgrades
Out[8]:
upgrade_name energy_use diff
0 a 100.4 -44.7
2 b 145.1 0.0
4 c 120.3 -24.8
How do I write this for arbitrary numbers of building_id's, instead of my hard-coded building_id == 1?
The ideal solution looks like this (doesn't matter if the baseline differences are 0 or NaN):
In [17]: df
Out[17]:
building_id upgrade_name energy_use ideal
0 1 a 100.4 -44.7
1 2 a 150.8 14.1
2 1 b 145.1 0.0
3 2 b 136.7 0.0
4 1 c 120.3 -24.8
Define a function that computes the difference in energy use (for the group of rows belonging to the current building) as follows:
def euDiff(grp):
    euBase = grp[grp.upgrade_name == 'b'].energy_use.values[0]
    return grp.energy_use - euBase
Then compute the difference (for all buildings), applying it to each group:
df['ideal'] = df.groupby('building_id').apply(euDiff)\
    .reset_index(level=0, drop=True)
The result is just as you expected.
Thanks for sharing that example data! It made things a lot easier.
I suggest solving this in two parts:
1. Make a dictionary from your dataframe that contains the baseline energy use for each building.
2. Apply a lambda function to your dataframe to subtract the baseline value associated with each building from its energy use value.
# set index to building_id, turn into a dictionary, keep only energy_use
building_baseline = df[df['upgrade_name'] == 'b'].set_index('building_id').to_dict()['energy_use']
# apply a lambda to the dataframe; axis=1 passes whole rows to the lambda
df['diff'] = df.apply(lambda row: row['energy_use'] - building_baseline[row['building_id']], axis=1)
You could also write a regular function to do this. You don't necessarily need the dictionary either; it just makes things easier. If you're curious about these alternative solutions, let me know and I can add them for you.
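As a purely vectorized variant of the same idea (a sketch; baseline is just an illustrative name), you can skip the dict and the row-wise apply by mapping a baseline Series onto the building_id column:
baseline = df.loc[df['upgrade_name'] == 'b'].set_index('building_id')['energy_use']
df['diff'] = df['energy_use'] - df['building_id'].map(baseline)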

Conditionally replace values in pandas.DataFrame with previous value

I need to filter outliers in a dataset. Replacing the outlier with the previous value in the column makes the most sense in my application.
I was having considerable difficulty doing this with the pandas tools available (mostly to do with copies on slices, or type conversions occurring when setting to NaN).
Is there a fast and/or memory efficient way to do this? (Please see my answer below for the solution I am currently using, which also has limitations.)
A simple example:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,1000,6,7,8],'B':list('abcdefgh')})
>>> df
A B
0 1 a
1 2 b
2 3 c
3 4 d
4 1000 e # '1000 e' --> '4 e'
5 6 f
6 7 g
7 8 h
You can simply mask values over your threshold and use ffill:
df.assign(A=df.A.mask(df.A.gt(10)).ffill())
A B
0 1.0 a
1 2.0 b
2 3.0 c
3 4.0 d
4 4.0 e
5 6.0 f
6 7.0 g
7 8.0 h
Using mask is necessary rather than something like shift, because it guarantees non-outlier output even in the case that the previous value is also above the threshold.
I circumvented some of the issues with pandas copies and slices by converting to a numpy array first, doing the operations there, and then re-inserting the column. I'm not certain, but as far as I can tell, the datatype is the same once it is put back into the pandas.DataFrame.
import numpy as np

def df_replace_with_previous(df, col, maskfunc, inplace=False):
    arr = np.array(df[col])
    mask = maskfunc(arr)
    # shift the mask by one so the True values select the previous entries
    arr[mask] = arr[list(mask)[1:] + [False]]
    if inplace:
        df[col] = arr
        return
    else:
        df2 = df.copy()
        df2[col] = arr
        return df2
This creates a mask, shifts it by one so that the True values point at the previous entries, and updates the array. Of course, this will need to run repeatedly if there are multiple adjacent outliers (N times for N consecutive outliers), which is not ideal.
Usage in the case given in OP:
df_replace_with_previous(df, 'A', lambda x: x > 10, False)
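For comparison, the mask/ffill approach above handles consecutive outliers in a single pass — a quick check on a hypothetical frame:
df2 = pd.DataFrame({'A': [1, 2, 1000, 2000, 5]})
df2['A'] = df2['A'].mask(df2['A'].gt(10)).ffill()
# both adjacent outliers are replaced by 2, the last non-outlier value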

How to filter very sparse features from a data set

I am trying to preprocess a data set, and I'd like to delete very sparse columns by setting a threshold: columns whose fraction of nonzero entries falls below the threshold should be removed.
The code below should get the job done, but I do not understand how it works. Kindly assist with an explanation or suggestions on how I can get this done. Thanks!
sparse_col_idx = ((x_sparse > 0).mean(0) > 0.05).A.ravel()
x_sparse has dimensions (12060, 272776).
Let's break this down into steps. Assuming x_sparse is a DataFrame, x_sparse > 0 will return a DataFrame with exactly the same dimensions, index, and columns, with each value True or False depending on whether the original value is > 0.
.mean(0)
This takes the mean of each column. Since False evaluates as 0 and True evaluates as 1, mean() returns the fraction of each column that meets the criterion. You are down to a Series at this point, where the column names are the index and the values are those fractions.
> 0.05
This turns the previous Series into a boolean Series indicating which columns meet the criterion.
.A.ravel()
This isn't necessary for a DataFrame. (If x_sparse is actually a scipy sparse matrix, though, .mean(0) returns a numpy matrix, and .A.ravel() converts it into a flat 1-D array.) I will work through a simple example below to show the steps.
Create a DataFrame with random normal values
np.random.seed(3)
x_sparse = pd.DataFrame(data=np.random.randn(100, 5), columns=list('abcde'))
print(x_sparse.head())
output:
a b c d e
0 1.788628 0.436510 0.096497 -1.863493 -0.277388
1 -0.354759 -0.082741 -0.627001 -0.043818 -0.477218
2 -1.313865 0.884622 0.881318 1.709573 0.050034
3 -0.404677 -0.545360 -1.546477 0.982367 -1.101068
4 -1.185047 -0.205650 1.486148 0.236716 -1.023785
# the argument 0 is unnecessary; the default is to take the mean of each column
(x_sparse > 0).mean()
Output
a 0.48
b 0.52
c 0.44
d 0.55
e 0.45
# create a threshold
threshold = .5
(x_sparse > 0).mean() > threshold
Output
a False
b True
c False
d True
e False
Keep specific columns
threshold = .5
keep = (x_sparse > 0).mean() > threshold
x_sparse[x_sparse.columns[keep]]
Output
b d
0 0.436510 -1.863493
1 -0.082741 -0.043818
2 0.884622 1.709573
3 -0.545360 0.982367
4 -0.205650 0.236716
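If x_sparse is in fact a scipy sparse matrix rather than a DataFrame (which the .A.ravel() suggests), the same logic applies; a sketch under that assumption, with a small random matrix standing in for the real data:
import numpy as np
import scipy.sparse as sp
# small random sparse matrix standing in for the real x_sparse
x_sparse = sp.random(100, 50, density=0.1, format='csr', random_state=0)
# fraction of positive entries per column, flattened to a 1-D boolean array
keep = np.asarray((x_sparse > 0).mean(axis=0)).ravel() > 0.05
x_filtered = x_sparse[:, np.where(keep)[0]]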

Transforming outliers in Pandas DataFrame using .apply, .applymap, .groupby

I'm attempting to transform a pandas DataFrame object into a new object that contains a classification of the points based upon some simple thresholds:
Value transformed to 0 if the point is NaN
Value transformed to 1 if the point is negative or 0
Value transformed to 2 if it falls outside certain criteria based on the entire column
Value is 3 otherwise
Here is a very simple self-contained example:
import pandas as pd
import numpy as np
df=pd.DataFrame({'a':[np.nan,1000000,3,4,5,0,-7,9,10],'b':[2,3,-4,5,6,1000000,7,9,np.nan]})
print(df)
The transformation process created so far:
#Loop through and find points greater than the mean -- in this simple example, these are the 'outliers'
outliers = pd.DataFrame()
for datapoint in df.columns:
    tempser = pd.DataFrame(df[datapoint][np.abs(df[datapoint]) > (df[datapoint].mean())])
    outliers = pd.merge(outliers, tempser, right_index=True, left_index=True, how='outer')
outliers[outliers.isnull() == False] = 2
#Classify everything else as "3"
df[df > 0] = 3
#Classify negative and zero points as a "1"
df[df <= 0] = 1
#Update with the outliers
df.update(outliers)
#Everything else is a "0"
df.fillna(value=0, inplace=True)
Resulting in:
     a    b
0  0.0  3.0
1  2.0  3.0
2  3.0  1.0
3  3.0  3.0
4  3.0  3.0
5  1.0  2.0
6  1.0  3.0
7  3.0  3.0
8  3.0  0.0
I have tried to use .applymap() and/or .groupby() in order to speed up the process, with no luck. I found some guidance in this answer; however, I'm still unsure how .groupby() is useful when you're not grouping within a pandas column.
Here's a replacement for the outliers part. It's about 5x faster for your sample data on my computer.
>>> pd.DataFrame( np.where( np.abs(df) > df.mean(), 2, df ), columns=df.columns )
a b
0 NaN 2
1 2 3
2 3 -4
3 4 5
4 5 6
5 0 2
6 -7 7
7 9 9
8 10 NaN
You could also do it with apply, which is much simpler, but it will be slower than the np.where approach (approximately the same speed as what you are currently doing). That's probably a good example of why you should avoid apply where possible when you care about speed.
>>> df[ df.apply( lambda x: abs(x) > x.mean() ) ] = 2
You could also do this, which is faster than apply but slower than np.where:
>>> mask = np.abs(df) > df.mean()
>>> df[mask] = 2
Of course, these things don't always scale linearly, so test them on your real data and see how that compares.
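If you want the whole 0/1/2/3 classification in one vectorized step, np.select can encode all four rules with the same precedence as the sequential version (outliers override the sign rules) — a sketch, not from the original answer:
conditions = [df.isna(),                # rule 1: NaN -> 0
              np.abs(df) > df.mean(),   # rule 3: outlier -> 2
              df.le(0)]                 # rule 2: negative or zero -> 1
result = pd.DataFrame(np.select(conditions, [0, 2, 1], default=3),
                      index=df.index, columns=df.columns)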
