Pandas DataFrame apply function doubling size of DataFrame

I have a Pandas DataFrame with numeric data. For each non-binary column, I want to identify the values larger than its 99th percentile and create a boolean mask that I will later use to remove the rows with outliers.
I am trying to create this boolean mask using the apply method, where df is a DataFrame with numeric data of size a*b, as follows.
def make_mask(s):
    if s.unique().shape[0] == 2:  # If binary, return all-false mask
        return pd.Series(np.zeros(s.shape[0]), dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)

s_bool = df.apply(make_mask, axis=1)
Unfortunately, s_bool is output as a DataFrame with twice as many columns (i.e., size a*(b*2)). The first b columns are named 1, 2, 3, etc. and are full of null values. The second b columns seem to be the intended mask.
Why is the apply method doubling the size of the DataFrame? Unfortunately, the Pandas apply documentation does not offer helpful clues.

It seems the problem is that you are returning a Series. With axis=1, pandas aligns the returned Series on the union of their indexes: the all-false branch builds a Series with a default integer index (0, 1, 2, ...), while the outlier branch returns a Series indexed by the original column labels, so you end up with b integer-named columns full of nulls plus the b intended columns. Returning a plain NumPy array instead seems to work in your given example:
def make_mask(s):
    if s.unique().shape[0] == 2:  # If binary, return all-false mask
        return np.zeros(s.shape[0], dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)
You can further simplify the code like so, and use raw=True:
def make_mask(s):
    if np.unique(s).size == 2:  # If binary, return all-false mask
        return np.zeros_like(s, dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)
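For reference, the call itself would then look something like this (my sketch, not part of the original answer); with raw=True each row is passed to make_mask as a plain NumPy array instead of a Series, which also avoids the index-alignment issue:
s_bool = df.apply(make_mask, axis=1, raw=True)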


Checking multiple condition with numpy array

I have a DataFrame with several rows and columns, and I have transformed it into a NumPy array to speed up the calculations.
The first five columns of the DataFrame looked like this:
par1 par2 par3 par4 par5
1.502366 2.425301 0.990374 1.404174 1.929536
1.330468 1.460574 0.917349 1.172675 0.766603
1.212440 1.457865 0.947623 1.235930 0.890041
1.222362 1.348485 0.963692 1.241781 0.892205
...
These columns are now stored in a numpy array a = df.values
I need to check whether at least two of the five columns satisfy a condition (i.e., their value is larger than a certain threshold). Initially I wrote a function that performed the operation directly on the dataframe. However, because I have a very large amount of data and need to repeat the calculations over and over, I switched to numpy to take advantage of the vectorization.
To check the condition I was thinking to use
df['Result'] = np.where(condition_on_parameters > 2, True, False)
However, I cannot figure out how to write condition_on_parameters such that it returns True or False when at least 2 out of the 5 parameters are larger than the threshold. I thought of using sum() on the condition, but I am not sure how to write such a condition.
EDIT
It is important to specify that the thresholds are different for each parameter. For example thr1=1.2, thr2=2.0, thr3=1.5, thr4=2.2, thr5=3.0. So I need to check that par1 > thr1, par2 > thr2, ..., par5 > thr5.
Assuming condition_on_parameters returns an array the same size as a with True/False entries, you can use np.sum(condition_on_parameters, axis=1) to sum the true values of each row (True has a numerical value of 1). This gives a 1D array whose entries are the number of columns that meet the condition, which you can compare against 2 (or pass to np.where) to flag the rows you are looking for.
df['result'] = np.where(np.sum(condition_on_parameters, axis=1) >= 2, True, False)
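Putting it together with the per-parameter thresholds from the EDIT (a minimal sketch of mine; the threshold values are just the examples given in the question), broadcasting a against a 1D array of thresholds builds the condition in one step:
import numpy as np

thresholds = np.array([1.2, 2.0, 1.5, 2.2, 3.0])               # thr1..thr5 from the EDIT
condition_on_parameters = a > thresholds                        # boolean array, one column per parameter
df['Result'] = np.sum(condition_on_parameters, axis=1) >= 2     # at least 2 of the 5 exceed their threshold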
Can you exploit pandas functionality? For example, you can efficiently check conditions on multiple rows/columns with .apply and then .sum(axis=1).
Here is some sample code:
import pandas as pd

df = pd.DataFrame([[1.50, 2.42, 0.88], [0.98, 1.3, 0.56]], columns=['par1', 'par2', 'par3'])

# custom condition, e.g. value less than or equal to a threshold
def leq(x, t):
    return x <= t

condition = df.apply(lambda x: leq(x, 1)).sum(axis=1)
# filter
df.loc[condition >= 2]
I think this should be roughly equivalent to NumPy in terms of efficiency, as pandas is ultimately built on top of it, however I'm not entirely sure...
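Since the EDIT specifies a different threshold for each parameter, the same pandas idea can also be written with DataFrame.gt and a Series of thresholds (a sketch of mine, assuming the five par columns and the example threshold values from the question):
import pandas as pd

thresholds = pd.Series({'par1': 1.2, 'par2': 2.0, 'par3': 1.5, 'par4': 2.2, 'par5': 3.0})
counts = df.gt(thresholds).sum(axis=1)   # per row: how many parameters exceed their own threshold
df.loc[counts >= 2]                      # rows where at least two parameters do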
It seems you are looking for numpy.any:
import numpy as np
import pandas as pd

a = np.array([[1.502366, 2.425301, 0.990374, 1.404174, 1.929536],
              [1.330468, 1.460574, 0.917349, 1.172675, 0.766603],
              [1.212440, 1.457865, 0.947623, 1.235930, 0.890041],
              [1.222362, 1.348485, 0.963692, 1.241781, 0.892205]])

df = pd.DataFrame(a, columns=[f'par{i}' for i in range(1, 6)])
df['Result'] = np.any(df > 1.46, axis=1)  # append the result column
Gives the following dataframe

Multiple check for empty dataframe

I have a situation where I need to pass the DataFrame forward in the code only if it is not empty, as illustrated below:
----- Filter 1 -------
Check if df.empty then return emptydf
else
----- Filter 2 ------
Check if df.empty then return emptydf
else
----- Filter 3 ------
return df
The code for the above is written as below (just a part of the code):
def filter_df(df):
    df = df[df.somecolumn > 2].copy()
    if df.empty:
        return df
    df = df[df.someother == 2].copy()
    if df.empty:
        return df
    df = df[df.all <= 10].copy()
    return df
If I have many such filters that expect the DataFrame not to be empty, I need to check for emptiness after each filter. Is there a better way of checking whether the DataFrame is empty than checking at each level?
Repeatedly subsetting your dataframe is expensive. Repeatedly copying your dataframe may also be expensive. It's also expensive to pre-calculate a large number of Boolean masks. The tricky part is finding a way to apply the masks lazily in a for loop.
While the below functional solution may seem ugly, it does address the above concerns. The idea is to combine a Boolean mask iteratively with an aggregate mask. Check in your loop whether your mask has all False values, not whether a dataframe is empty. Apply the aggregate mask once at the end of your logic:
import numpy as np
from operator import methodcaller

def filter_df(df):
    masks = [('somecolumn', 'gt', 2),
             ('someother', 'eq', 2),
             ('all', 'le', 10)]
    agg_mask = np.ones(len(df.index)).astype(bool)  # "all True" mask
    for col, op, val in masks:
        mask = methodcaller(op, val)(df[col])       # e.g. df['somecolumn'].gt(2)
        agg_mask = agg_mask & mask
        if not agg_mask.any():                      # nothing left; stop applying masks
            return df[agg_mask]
    return df[agg_mask]
Note that for this solution, Series comparison operators such as >, ==, <= have functional equivalents: pd.Series.gt, pd.Series.eq, pd.Series.le.
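A quick usage sketch with made-up toy data (my own example, not from the original answer):
import pandas as pd

df = pd.DataFrame({'somecolumn': [3, 1, 5],
                   'someother': [2, 2, 7],
                   'all': [4, 12, 9]})
filter_df(df)   # only rows passing all three masks survive; here, just the first row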
You can wrap the check in a function and call it after every filter, for example:
def check_empty(df):
    return df.empty

def filter_df(df):
    df = df[df.somecolumn > 2].copy()
    if check_empty(df):
        return df
    df = df[df.someother == 2].copy()
    if check_empty(df):
        return df
    df = df[df.all <= 10].copy()
    return df

Filling each row of one column of a DataFrame with different values (a random distribution)

I have a DataFrame with approx. 4 columns and 200 rows. I created a 5th column with null values:
df['minutes'] = np.nan
Then, I want to fill each row of this new column with random inverse log-normal values. The code to generate one such value:
Note: if the code below is run multiple times it will generate a new result each time, because of the value inside ppf(): random.random()
df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
What's happening when I do that is that all 200 rows of df['minutes'] are filled with the same number, instead of random.random() being triggered for each row as I expected.
What do I have to do? I tried using a for loop, but apparently I'm not getting it right (it gives the same result):
for i in range(1, len(df)):
    df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
What am I doing wrong?
Also, I'll add that later I'll need to change some parameters of the inverse log-normal above depending on whether the value of another column is 0 or 1, as in:
if df['type'] == 0:
    df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
elif df['type'] == 1:
    df['minutes'] = df['minutes'].fillna(stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int))
thanks in advance.
The problem with your use of fillna here is that this function takes a value as its argument and applies it to every element along the specified axis. So your stat value is calculated once and then distributed into every row.
What you need is your function called for every element on the axis, so your argument must be the function itself and not a value. That's a job for apply, which takes a function and applies it to the elements along an axis.
I'm jumping straight to your final requirements:
You could use apply just on the minutes column (as a pandas.Series method) with a lambda function, and then assign the respective results to the rows of minutes selected by filtering on the type column:
import numpy as np
import pandas as pd
import scipy.stats as stats
import random

# setup
df = pd.DataFrame(np.random.randint(0, 2, size=(8, 4)),
                  columns=list('ABC') + ['type'])
df['minutes'] = np.nan

df.loc[df.type == 0, 'minutes'] = df['minutes'].apply(
    lambda _: stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int),
    convert_dtype=False)
df.loc[df.type == 1, 'minutes'] = df['minutes'].apply(
    lambda _: stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int),
    convert_dtype=False)
... or you use apply as a DataFrame method with a function that wraps your logic to distinguish between the values of the type column, and assign the result back to the minutes column:
def calc_minutes(row):
    if row['type'] == 0:
        return stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int)
    elif row['type'] == 1:
        return stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int)

df['minutes'] = df.apply(calc_minutes, axis=1)
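As a side note (my own sketch, not part of the original answer): since each ppf(random.random()) call is just an independent log-normal draw, you can also skip the per-row apply entirely and sample everything at once with the frozen distribution's rvs method:
mask0 = df['type'] == 0
mask1 = df['type'] == 1
# one draw per matching row, cast to int as in the original code
df.loc[mask0, 'minutes'] = stats.lognorm(0.5, scale=np.exp(1.8)).rvs(size=mask0.sum()).astype(int)
df.loc[mask1, 'minutes'] = stats.lognorm(1.2, scale=np.exp(2.7)).rvs(size=mask1.sum()).astype(int)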
Managed to do it in a few steps with a different mindset: I created two lists, each with its own parameters, and appended to them in a loop so that each row gets a different random number:
lognormal_tone = []
lognormal_ttwo = []
for i in range(len(s)):
    lognormal_tone.append(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
    lognormal_ttwo.append(stats.lognorm(0.4, scale=np.exp(2.7)).ppf(random.random()).astype(int))
Then, included them in the DataFrame with another previously created list:
df = pd.DataFrame({'arrival': arrival, 'minTypeOne': lognormal_tone, 'minTypeTwo': lognormal_ttwo})

Replace values in pandas DataFrame

I have the following python code:
consumos = df.iloc[:, 0]
df['media_movel'] = consumos.rolling(window=30, center=True).median().fillna(method='bfill').fillna(method='ffill')
desv_padrao = df.stack().std()
threshold = 1000
difference = np.abs(consumos - df['media_movel'])
corr = np.abs(df['media_movel'] - desv_padrao)
df['corr'] = pd.DataFrame(corr)
outlier = difference > threshold
df.mask(outlier, df['corr'], axis=1)
So, I have a DataFrame containing a time series, and my aim is to correct the outliers (taking a point to be an outlier when the difference between the reference data and the rolling median is greater than 1000, the threshold).
For that, I've created the boolean variable outlier (which is True wherever there is an outlier, per the explanation above) and I am trying to replace those outliers with (rolling median column - standard deviation) via a mask, but the result is the time series with NaNs. I don't know why those NaNs appear, but I need to obtain the corrected data.
I think the replacement of the masked values may be failing due to a shape mismatch. Try replacing your last line with this:
df.mask(outlier, df['corr'].values.reshape(-1, 1), axis=1)
If that fails, try this:
df.iloc[:,0].mask(outlier, df['corr'].values.reshape(-1, 1), axis=1)
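If neither of those works, a simpler alternative (my own sketch, not from the original answers) is to operate on the Series directly and sidestep the axis/shape questions altogether, since outlier and the replacement values are both Series aligned on the same index:
# replace flagged outliers in the consumption series with the corrected values
df.iloc[:, 0] = consumos.mask(outlier, df['corr'])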

How can I speed up an iterative function on my large pandas dataframe?

I am quite new to pandas and I have a pandas DataFrame of about 500,000 rows filled with numbers. I am using Python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it is running extremely slowly, about 5 rows are output per second, and I want to find a way to accomplish the same result more quickly.
def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size
    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan
    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this, or just make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should give my program that much trouble.
Something like this should work as you described:
df['predicted_series'] = np.where(A_series.shift() == A_series, B_series, df['predicted_series'])
or, working directly on the dataframe columns:
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to the previous A.
Edit:
Per your comment, change your initialization of predicted_series to be all NaN and then forward fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.fillna(method='ffill')
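To see it in action, here's a quick check on a small made-up frame (my own example, not from the answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3],
                   'B': [10, 20, 30, 40, 50, 60]})
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.fillna(method='ffill')
# rows 1, 3 and 4 (where A repeats its previous value) get B directly;
# rows 2 and 5 are forward filled, and row 0 stays NaN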
For the fastest speed, modifying ayhan's answer a bit will perform best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward-filled values and run faster than my original recommendation.
Solution:
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
