I have a situation where I need to move the dataframe forward in code only if it is not empty. Illustrated below:
----- Filter 1 -------
Check if df.empty then return emptydf
else
----- Filter 2 ------
Check if df.empty then return emptydf
else
----- Filter 3 ------
return df
The code for the above is written as below (just a part of the code):
def filter_df(df):
    df = df[df.somecolumn > 2].copy()
    if df.empty:
        return df
    df = df[df.someother == 2].copy()
    if df.empty:
        return df
    df = df[df.all <= 10].copy()
    return df
If I have many such filters which expect the dataframe not to be empty, I need to check for emptiness after each filter. Is there a better way of checking whether the dataframe is empty than checking at each level?
Repeatedly subsetting your dataframe is expensive. Repeatedly copying your dataframe may also be expensive. It's also expensive to pre-calculate a large number of Boolean masks. The tricky part is finding a way to apply the masks lazily in a for loop.
While the below functional solution may seem ugly, it does address the above concerns. The idea is to combine a Boolean mask iteratively with an aggregate mask. Check in your loop whether your mask has all False values, not whether a dataframe is empty. Apply the aggregate mask once at the end of your logic:
import numpy as np
from operator import methodcaller

def filter_df(df):
    masks = [('somecolumn', 'gt', 2),
             ('someother', 'eq', 2),
             ('all', 'le', 10)]
    agg_mask = np.ones(len(df.index)).astype(bool)  # "all True" mask
    for col, op, val in masks:
        mask = methodcaller(op, val)(df[col])
        agg_mask = agg_mask & mask
        if not agg_mask.any():
            return df[agg_mask]
    return df[agg_mask]
Note that for this solution, series comparison operators such as >, ==, <= have functional equivalents pd.Series.gt, pd.Series.eq, pd.Series.le.
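If the methodcaller indirection looks opaque, here is a tiny sanity check of that equivalence on a throwaway series (the values are made up for illustration):
from operator import methodcaller
import pandas as pd

s = pd.Series([1, 3, 5])
# methodcaller('gt', 2)(s) calls s.gt(2), which is the functional form of s > 2
print(methodcaller('gt', 2)(s).tolist())  # [False, True, True]
print((s > 2).tolist())                   # [False, True, True]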
You can use a function and call that after every filter:
def check_empty(df):
    return df.empty

def filter_df(df):
    df = df[df.somecolumn > 2].copy()
    if check_empty(df):
        return df
    df = df[df.someother == 2].copy()
    if check_empty(df):
        return df
    df = df[df.all <= 10].copy()
    return df
I have a pandas DataFrame corr which collects correlations between 2k variables. Since I didn't create it, I would like to check whether it satisfies the usual consistency properties of a correlation matrix (symmetry, all numeric values in [-1, 1], no missing values, ...). How can I check such conditions efficiently, since my actual code involves two nested loops?
For the sake of completeness I generate a df below with an example of my actual checks.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(10000, 2000)), columns=["var" + str(i) for i in range(0, 2000)])
corr = df.corr()
inconsistent_cells = []
for row in corr.index:
    for col in corr.columns:
        value = corr.loc[row, col]
        if not isinstance(value, float) or (value < -1 or value > 1):
            inconsistent_cells.append((value, (row, col)))
I think one possible solution would be to use itertuples(), but then I would lose info about the cell coordinates. The same is true for apply().
Any suggestion is appreciated, thanks.
Write a custom function to check:
def check_df(df):
    # symmetry
    if not df.eq(df.T).all().all():
        return False
    # between -1 and 1
    if not df.apply(lambda x: x.between(-1, 1, inclusive="both").all()).all():
        return False
    # null values
    if df.isnull().any().any():
        return False
    return True
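As a quick usage sketch (assuming corr was built as in the question above), the whole check then runs in one call:
# should print True for a well-formed correlation matrix, False otherwise
print(check_df(corr))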
Below is my pandas snippet. It works. Given a df, I wish to know if there exists any row that satisfies c1 > 10 and c2 and c3 are True. The code below works; I wish to know if there is a better way to do the same.
import pandas as pd
inp = [{'c1':10, 'c2':True, 'c3': False}, {'c1':9, 'c2':True, 'c3': True}, {'c1':11, 'c2':True, 'c3': True}]
df = pd.DataFrame(inp)
def check(df):
    for index, row in df.iterrows():
        if (row['c1'] > 10) & (row['c2'] == True) & (row['c3'] == True):
            return True
        else:
            continue

t = check(df)
When using pandas you rarely need to iterate over rows and apply an operation to each row separately. In many cases, applying the same operation to the whole dataframe or column gives the same or a similar result, with faster and more readable code. In your case:
(df['c1'] > 10) & df['c2'] & df['c3']
# will lead to a Series:
# 0 False
# 1 False
# 2 True
# dtype: bool
(note that I am calling the operation on the whole df rather than a single row), which signifies for which rows the condition holds. If you just need to know whether any row satisfies the condition, you can call any:
((df['c1'] > 10) & df['c2'] & df['c3']).any()
# True
So your whole check function would be:
def check(df):
    return ((df['c1'] > 10) & df['c2'] & df['c3']).any()
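With the sample data from the question, this vectorized check returns the same result as the loop version:
inp = [{'c1': 10, 'c2': True, 'c3': False},
       {'c1': 9, 'c2': True, 'c3': True},
       {'c1': 11, 'c2': True, 'c3': True}]
df = pd.DataFrame(inp)

print(check(df))  # True, because the third row has c1 > 10 with c2 and c3 both True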
It is not clear what you want to change or improve about your solution, but you can achieve the same without a separate function and loops as well -
df[(df['c1'] > 10) & (df['c2']) & (df['c3'])].index.size > 0
The condition in question is (df.c1 > 10) & df.c2 & df.c3
You can either check if there are any rows in the dataframe df that satisfy this condition:
>>> print(((df.c1 > 10) & df.c2 & df.c3).any())
True
or you can check the length of the dataframe returned when this condition is applied to the original dataframe (which will be df[condition]):
>>> print(len(df[(df.c1 > 10) & df.c2 & df.c3]) > 0)
True
I have 3 dataframes (df1, df2, df3) which are identically structured (same number and labels of rows/columns), but populated with different values.
I want to populate df3 based on the values in the associated columns/rows in df1 and df2. I'm doing this with a for loop and a custom function:
for x in range(len(df3.columns)):
    df3.iloc[:, x] = customFunction(x)
I want to populate df3 using this custom IF/ELSE function:
def customFunction(y):
    if df1.iloc[:, y] != 1 and df2.iloc[:, y] == 0:
        return "NEW"
    elif df2.iloc[:, y] == 2:
        return "OLD"
    else:
        return "NEITHER"
I understand why I get an error message when I run this, but I can't figure out how to apply this function to a series. I could do it row by row with more complex code, but I'm hoping there's a more efficient solution. I fear my approach is flawed.
v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where(
    (v1 != 1) & (v2 == 0), 'NEW',
    np.where(v2 == 2, 'OLD', 'NEITHER'))
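For illustration, here is a minimal, self-contained sketch of how this behaves; the values are made up, but the three frames share the identical structure the question describes:
import numpy as np
import pandas as pd

# hypothetical inputs with identical structure, values invented for the example
df1 = pd.DataFrame({'a': [1, 2], 'b': [1, 3]})
df2 = pd.DataFrame({'a': [0, 2], 'b': [5, 0]})
df3 = pd.DataFrame({'a': ['', ''], 'b': ['', '']})

v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where(
    (v1 != 1) & (v2 == 0), 'NEW',
    np.where(v2 == 2, 'OLD', 'NEITHER'))

print(df3)
#          a        b
# 0  NEITHER  NEITHER
# 1      OLD      NEW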
Yeah, try to avoid loops in pandas; they're inefficient, and pandas is built to be used with the underlying numpy vectorization.
You want to use the apply function.
Something like:
# apply customFunction across rows (axis=1); customFunction would then receive a row Series
df3['new_col'] = df3.apply(customFunction, axis=1)
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
I have two large dataframes I want to compare. I want a comparison result capable of a column and / or row wise comparison of similarities by percent. This part is simple. However, I want to be able to make the comparison ignore differences based upon value criteria. A small example is below.
d1 = {'Sample': pd.Series([101, 102, 103]),
      'Col1': pd.Series(['AA', '--', 'BB']),
      'Col2': pd.Series(['AB', 'AA', 'BB'])}
d2 = {'Sample': pd.Series([101, 102, 103]),
      'Col1': pd.Series(['BB', 'AB', '--']),
      'Col2': pd.Series(['AB', 'AA', 'AB'])}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1 = df1.set_index('Sample')
df2 = df2.set_index('Sample')
comparison = df1.eq(df2)
# for column stats
comparison.sum(axis=0) / float(len(df1.index))
# for row stats
comparison.sum(axis=1) / float(len(df1.columns))
My problem is that when value1 = 'AA' and value2 = '--' I want them to be viewed as equal (so when one value is '--' the comparison should basically always be true), but otherwise perform a normal Boolean comparison. I need an efficient way to do this that doesn't involve excessive looping, as the datasets are quite large.
Below, I'm interpreting "when one is '--' basically always be true" to mean that any comparison against '--' (no matter what the other value is) should return True. In that case, you could use
mask = (df1=='--') | (df2=='--')
to find every location where either df1 or df2 is equal to '--' and then use
comparison |= mask
to update comparison. For example,
import itertools as IT
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 10000
df1, df2 = [pd.DataFrame(
    np.random.choice(list(map(''.join, IT.product(list('ABC'), repeat=2))) + ['--'],
                     size=(N, 2)),
    columns=['Col1', 'Col2']) for i in range(2)]
comparison = df1.eq(df2)
mask = (df1=='--') | (df2=='--')
comparison |= mask
# for column stats
column_stats = comparison.sum(axis=0) / float(len(df1.index))
# for row stats
row_stats = comparison.sum(axis=1) / float(len(df1.columns))
I think a list comprehension should be quite fast:
new_columns = []
for col in df1.columns:
    new_columns.append([x == y or x == '--' or y == '--' for x, y in zip(df1[col], df2[col])])

results = pd.DataFrame(new_columns).T
results.index = df1.index
This outputs the full true/false df.
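From there, the same column and row percentages as in the question can be taken from results, for example:
# for column stats
column_stats = results.sum(axis=0) / float(len(df1.index))
# for row stats
row_stats = results.sum(axis=1) / float(len(df1.columns))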
I have a Pandas DataFrame with numeric data. For each non-binary column, I want to identify the values larger than its 99th percentile and create a boolean mask that I will later use to remove the rows with outliers.
I am trying to create this boolean mask using the apply method, where df is a DataFrame with numeric data of size a*b, as follows.
def make_mask(s):
    if s.unique().shape[0] == 2:  # If binary, return all-false mask
        return pd.Series(np.zeros(s.shape[0]), dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)

s_bool = df.apply(make_mask, axis=1)
Unfortunately, s_bool is output as a DataFrame with twice as many columns (i.e., size a*(b*2)). The first b columns are named 1, 2, 3, etc. and are full of null values. The second b columns seem to be the intended mask.
Why is the apply method doubling the size of the DataFrame? Unfortunately, the Pandas apply documentation does not offer helpful clues.
I am not clear on why, but it seems the problem is that you are returning a series. This seems to work in your given example:
def make_mask(s):
    if s.unique().shape[0] == 2:  # If binary, return all-false mask
        return np.zeros(s.shape[0], dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)
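With that change, the mask can be built with the same call as in the question:
s_bool = df.apply(make_mask, axis=1)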
You can further simplify the code like so, and use raw=True:
def make_mask(s):
    if np.unique(s).size == 2:  # If binary, return all-false mask
        return np.zeros_like(s, dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)
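The function can then be called with raw=True so each row is handed in as a plain NumPy array:
# raw=True passes each row as an ndarray, which matches the np.unique / np.zeros_like usage above
s_bool = df.apply(make_mask, axis=1, raw=True)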