Make a calculation or function skip rows based on NaN values - python

I have two data frames to which I want to apply two separate functions that perform validation checks on each data frame independently; any differences that arise then get concatenated into one transformed list.
The issue I am facing is that the first validation check should run ONLY on rows where numeric values exist in ALL of the numeric columns of whichever data frame is being analyzed. If a row contains ANY NaN values, the first validation check should skip that row.
The second validation check does not need that specification.
Here are the data frames, functions, and transformations:
import pandas as pd
import numpy as np
df1 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
'Price': [2,1.5,np.nan,2.5,3,4,np.nan,3.5,1.5,2],'Amount':[40,19,np.nan,np.nan,60,70,80,np.nan,45,102],
'Quantity Frozen':[3,4,np.nan,15,np.nan,9,12,8,np.nan,80],
'Quantity Fresh':[37,12,np.nan,45,np.nan,61,np.nan,24,14,20],
'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,21,40]}
df1 = pd.DataFrame(df1, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
df2 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
'Price': [2,1.5,np.nan,2.6,3,4,np.nan,3.5,1.5,2],'Amount':[40,16,np.nan,np.nan,60,72,80,np.nan,45,100],
'Quantity Frozen':[3,4,np.nan,np.nan,np.nan,9,12,8,np.nan,80],
'Quantity Fresh':[np.nan,12,np.nan,45,np.nan,61,np.nan,24,15,20],
'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,20,40]}
df2 = pd.DataFrame(df2, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
#Validation Check 1:
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    dataset['dif_Stock on Hand'] = dataset['Quantity Fresh'] + dataset['Quantity Frozen']
    for varname, var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen': 'dif_Stock on Hand'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
#Validation Check 2:
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    dataset['dif_Multiple'] = dataset['Price'] * dataset['Quantity Fresh']
    for varname, var in {'Multiple vs. Price x Quantity Fresh': 'dif_Multiple'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
# #Wrangling internal inconsistency data frames to be in correct format
inconsistency_vars = ['dif_Stock on Hand','dif_Multiple']
inconsistency_var_betternames = {'dif_Stock on Hand':'Stock on Hand = Quantity Fresh + Quantity Frozen','dif_Multiple':'Multiple = Price x Quantity on Hand'}
# #Rollup1
idvars1=['Fruits']
df1 = df1[idvars1 + inconsistency_vars]
df2 = df2[idvars1 + inconsistency_vars]
df1 = df1.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df2 = df2.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df1['dataset'] = 'Fruit Dataset1'
df2['dataset'] = 'Fruit Dataset2'
# #First table in Internal Inconsistencies Sheet (Table 5)
inconsistent = pd.concat([df1,df2])
inconsistent = inconsistent[['variable','Difference Magnitude','dataset','Fruits']]
inconsistent['variable'] = inconsistent['variable'].map(inconsistency_var_betternames)
inconsistent = inconsistent[inconsistent['Difference Magnitude'] != 0]
Here is the desired output, which for the first validation check skips rows in either data frame that have ANY NaN values in the numeric columns (every column but 'Fruits'):
#Desired output
inconsistent_true = {'variable': ["Stock on Hand = Quantity Fresh + Quantity Frozen","Stock on Hand = Quantity Fresh + Quantity Frozen","Multiple = Price x Quantity on Hand",
"Multiple = Price x Quantity on Hand","Multiple = Price x Quantity on Hand"],
'Difference Magnitude': [1,2,1,4.5,2.5],
'dataset':["Fruit Dataset1","Fruit Dataset1","Fruit Dataset2","Fruit Dataset2","Fruit Datset2"],
'Fruits':["Blueberry","Coconut","Blueberry","Cherry","Pear"]}
inconsistent_true = pd.DataFrame(inconsistent_true, columns = ['variable', 'Difference Magnitude','dataset','Fruits'])

A pandas function that may come in handy here is pd.isnull(), which returns True for an np.nan value.
For example, take df1:
pd.isnull(df1['Amount'][2])
True
This can be applied as a check across all of your numeric columns, and then only the rows where the 'numeric_check' column equals 1 are used:
df1['numeric_check'] = df1.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
                                                 pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or
                                                 pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis=1)
Refer to the modified Validation Check 1:
#Validation Check 1:
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    if '1' in name:  # check to implement the condition for df1 only
        # Add the 'numeric_check' column to the dataset
        dataset['numeric_check'] = dataset.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
                                                                 pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or
                                                                 pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis=1)
        # filter out NaN rows; they will not be considered for this check
        dataset = dataset.loc[dataset['numeric_check'] == 1]
    dataset['dif_Stock on Hand'] = dataset['Quantity Fresh'] + dataset['Quantity Frozen']
    for varname, var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen': 'dif_Stock on Hand'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')

I hope I understood your intention correctly.
# make boolean mask, True if all numeric values are not NaN
mask = df1.select_dtypes('number').notna().all(axis=1)
print(df1[mask])
Fruits Price Amount Quantity Frozen Quantity Fresh Multiple
0 Banana 2.0 40.0 3.0 37.0 74.0
1 Blueberry 1.5 19.0 4.0 12.0 17.0
5 Pineapple 4.0 70.0 9.0 61.0 244.0
9 Coconut 2.0 102.0 80.0 20.0 40.0
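
If it helps, here is a minimal sketch of how that mask could be folded into Validation Check 1 so that rows with any NaN in the numeric columns are skipped in both data frames; it keeps the same formula as in your question and is an illustration rather than a drop-in replacement for your whole pipeline:
#Validation Check 1, skipping incomplete rows in both data frames
num_cols = ['Price', 'Amount', 'Quantity Frozen', 'Quantity Fresh', 'Multiple']
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    # keep only rows where every numeric column has a value
    complete = dataset[dataset[num_cols].notna().all(axis=1)].copy()
    complete['dif_Stock on Hand'] = complete['Quantity Fresh'] + complete['Quantity Frozen']
    for varname, var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen': 'dif_Stock on Hand'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(complete[var].value_counts())
        print('\n')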

Related

Connecting lines over NaNs without dropping entire rows from the dataframe

I have my code set up as follows
df = t_sepsis_col_adder(filename)
if df['SepsisLabel'].sum() > 0:
    cols = list(df.columns)
    cols_to_remove = ['Age', 'Gender', 'Unit1', 'Unit2', 'T_Sepsis', 'SepsisLabel', 'HospAdmTime']
    for col in cols:
        if df[col].isnull().all():
            cols_to_remove.append(col)
    for col in cols_to_remove:
        cols.remove(col)
    col_l, col_r = cols[:len(cols) // 2], cols[len(cols) // 2:]
    chart_l = alt.Chart(df).mark_line(point=True).encode(
        alt.X(alt.repeat("column"), type='quantitative',
              sort="descending"),
        alt.Y(alt.repeat("row"), type='quantitative',
              scale=alt.Scale(zero=False)),
        order="T_Sepsis"
    ).properties(
        width=600,
        height=100
    ).repeat(
        row=col_l,
        column=['T_Sepsis']
    )
    chart_l.save("chart_l.png")
    chart_r = alt.Chart(df).mark_line(point=True).encode(
        alt.X(alt.repeat("column"), type='quantitative',
              sort="descending"),
        alt.Y(alt.repeat("row"), type='quantitative',
              scale=alt.Scale(zero=False)),
        order="T_Sepsis"
    ).properties(
        width=600,
        height=100
    ).repeat(
        row=col_r,
        column=['T_Sepsis']
    )
    chart_concat = alt.hconcat(chart_l, chart_r)
I essentially have a lot of features to plot so I decided to split the different plots into two columns. The issue is that the line plots don't actually connect to the points most of the time. I will attach a screenshot below. Any ideas on how to go about fixing this issue? By the way, the issue still persists if I stick to a single column of plots so I don't think repeat is causing the issue here. It's also worth noting that my data has a lot of NaN values (which I am choosing not to pre-process and take care of since I need to plot the raw data). Thanks!
Part of the main chart
Edit: In addition to the above code, here is the function t_sepsis_col_adder.
def t_sepsis_col_adder(filename: str) -> pd.DataFrame:
    """Adds a column that gives the time till t_sepsis at each time step.

    Args:
        filename: name of the file being processed

    Returns:
        Returns a dataframe with the new column
    """
    df = df_instantiator_unaugmented(filename)
    number_of_sepsis_hours = df['SepsisLabel'].sum()
    if number_of_sepsis_hours > 0:
        if number_of_sepsis_hours >= 7:
            first_occurrence = df['SepsisLabel'].idxmax()
            t_sepsis = first_occurrence + 6
        else:
            t_sepsis = len(df.index) - 1
        temp_list = []
        for i in range(len(df.index)):
            temp_list.append(t_sepsis - i)
        df['T_Sepsis'] = temp_list
    else:
        t_end_recording = len(df.index)
        temp_list = []
        for i in range(len(df.index)):
            temp_list.append(t_end_recording - i - 1)
        df['T_EndRecording'] = temp_list
    return df
I can't confirm without a reproducible example, but this is likely due to there being NaNs in the data. NaN values are not connected by Altair/Vega-Lite by default, whereas rows that are missing entirely would be connected (there is nothing left to break the line). You can fix this by using dropna(subset=['column_name']) in pandas or .transform_filter('isValid(datum.column_name)') in Altair. Please see this answer for examples.
What I reference above also works for faceting; here is a reproducible example:
import pandas as pd
import altair as alt
import numpy as np
df = pd.DataFrame({'date': ['2020-04-03', '2020-04-04', '2020-04-05', '2020-04-06',
'2020-04-03', '2020-04-04','2020-04-05','2020-04-06'],
'ID': ['a','a','a','a','b','b','b','b'],
'line': [8,np.nan,10,8, 4, 5,6,7] })
df
## out
date ID line
0 2020-04-03 a 8.0
1 2020-04-04 a NaN
2 2020-04-05 a 10.0
3 2020-04-06 a 8.0
4 2020-04-03 b 4.0
5 2020-04-04 b 5.0
6 2020-04-05 b 6.0
7 2020-04-06 b 7.0
transform_filter only excludes points from individual facets, not entire rows.
(alt.Chart(df).mark_line(point=True).encode(
alt.X('monthdate(date):O'), y='line:Q')
.transform_filter('isValid(datum.line)')
.facet('ID'))
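For reference, the pandas-side alternative mentioned above would look roughly like this (a sketch, assuming the same df as in the example); dropna(subset=['line']) removes only the rows whose plotted value is NaN, so the remaining points within each facet stay connected:
# drop only the rows where 'line' is NaN before handing the frame to Altair
(alt.Chart(df.dropna(subset=['line']))
 .mark_line(point=True)
 .encode(alt.X('monthdate(date):O'), y='line:Q')
 .facet('ID'))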

Identifying Column Values and then Assigning New Values to Each

I have a pandas DataFrame with cities and a separate list with multipliers for each city. I want to update the TaxAmount in the first df with the corresponding multiplier for each city, from the list.
My current code functions and runs fine but it sets the multiplier to being the same for all cities instead of updating to a new multiplier. So basically all the city's tax rates are the same when they should be different. Any suggestions on how to get this to work?
import pandas as pd
df = pd.DataFrame({
'City': ['BELLEAIR BEACH', 'BELLEAIR BEACH', 'CLEARWATER', 'CLEARWATER'],
'TaxAnnualAmount': [5672, 4781, 2193.34, 2199.14]
})
flag = True
flag = (df['City'] == 'Belleair Bluffs')
if (flag.any() == True):
    df.loc['TaxAnnualAmount'] = ((df['CurrentPrice'] / 1000) * 19.9818)
flag = True
flag = (df['City'] == 'Belleair')
if (flag.any() == True):
    df.loc['TaxAnnualAmount'] = ((df['CurrentPrice'] / 1000) * 21.1318)
flag = True
flag = (df['City'] == 'Belleair Shore')
if (flag.any() == True):
    df.loc['TaxAnnualAmount'] = ((df['CurrentPrice'] / 1000) * 14.4641)
As per your comment, whenever you need to update all rows (or most of them) with a different factor, you can create a second dataframe with those values and merge it with your original.
# sample data
df = pd.DataFrame({
'City': ['BELLEAIR BEACH', 'BELLEAIR BEACH', 'CLEARWATER', 'Belleair'],
'TaxAnnualAmount': [5672, 4781, 2193.34, 500]
})
mults = pd.DataFrame([
['Belleair Bluffs', 19.9818],
['Belleair', 21.1318],
['Belleair Shore', 14.4641]
], columns=['City', 'factor'])
df = df.merge(mults, on='City', how='left')
df['NewTaxAmount'] = df['TaxAnnualAmount'].div(1000).mul(df['factor'])
print(df)
Output
City TaxAnnualAmount factor NewTaxAmount
0 BELLEAIR BEACH 5672.00 NaN NaN
1 BELLEAIR BEACH 4781.00 NaN NaN
2 CLEARWATER 2193.34 NaN NaN
3 Belleair 500.00 21.1318 10.5659
Notice two things:
The how='left' parameter tells pandas to include all rows from the main dataframe and fill nan on the rows that don't have a match.
You must be careful whenever overwriting columns on a dataframe; make sure you don't have lines like this inside a loop (as you would have with your previous method).
For more on merging you can look at the documentation and this excellent answer by cs95.
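If you would rather update TaxAnnualAmount in place instead of adding a new column, a small sketch (assuming the same sample df and mults defined above, before the merge) could use Series.map to look up each city's factor:
# map each city to its factor; rows without a match get NaN and are left untouched
factor = df['City'].map(mults.set_index('City')['factor'])
df.loc[factor.notna(), 'TaxAnnualAmount'] = df['TaxAnnualAmount'].div(1000).mul(factor)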

Pandas Rename one column Values when not equal to multiple conditions

I am new to coding and Pandas any help would be much appreciated.
I have a Location column whose values I wish to rename: A-00-UD, A-01-UD, A-02-UD would become Audit; T-00-UD, T-02-UD, T-03-UD would become Transit; and all other values would become Stock. The problem I have is naming all the other values as Stock, since in the full data frame the column is 15,000 lines long and has hundreds of different locations I wish to name Stock.
Location
A-00-UD
A-01-UD
A-02-UD
A-03-UD
T-00-UD
T-01-UD
T-02-UD
T-03-UD
A-45-TR
S-30-RT
D-20-ED
V-00-LM
You can use a dictionary to map the first character of Location:
mapper = {'A': 'Audit', 'T': 'Transit'}
df['Location'] = df['Location'].str[0].map(mapper).fillna('Stock')
Alternatively, using numpy.select, you can specify conditions, values for each condition and a default value:
df = pd.DataFrame({'Location': ['A-00-UD', 'T-01-UD', 'S-30-RT']})
conditions = [df['Location'].str[0] == 'A', df['Location'].str[0] == 'T']
values = ['Audit', 'Transit']
df['Location'] = np.select(conditions, values, 'Stock')
print(df)
Location
0 Audit
1 Transit
2 Stock
Use numpy.select with isin for exact match:
m1 = df['Location'].isin(['A-00-UD', 'A-01-UD', 'A-02-UD'])
m2 = df['Location'].isin(['T-00-UD', 'T-02-UD', 'T-03-UD'])
Or with startswith to check the first character:
m1 = df['Location'].str.startswith('A')
m2 = df['Location'].str.startswith('T')
df['new'] = np.select([m1, m2], ['Audit', 'Transit'], default='Stock')
print (df)
Location new
0 A-00-UD Audit
1 A-01-UD Audit
2 A-02-UD Audit
3 A-03-UD Audit
4 T-00-UD Transit
5 T-01-UD Transit
6 T-02-UD Transit
7 T-03-UD Transit
8 A-45-TR Audit
9 S-30-RT Stock
10 D-20-ED Stock
11 V-00-LM Stock

How to flatten individual pandas dataframes and stack them to achieve a new one?

I have a function which takes in data for a particular year and returns a dataframe.
For example:
df
year fruit license grade
1946 apple XYZ 1
1946 orange XYZ 1
1946 apple PQR 3
1946 orange PQR 1
1946 grape XYZ 2
1946 grape PQR 1
..
2014 grape LMN 1
Note:
1) a specific license value will exist only for a particular year and only once for a particular fruit (eg. XYZ only for 1946 and only once for apple, orange and grape).
2) Grade values are categorical.
I realize the below function isn't very efficient to achieve its intended goals,
but this is what I am currently working with.
from itertools import combinations  # needed for step 6 below

def func(df, year):
    #1. Filter out only the data for the year needed
    df_year = df[df['year'] == year]
    '''
    2. Transform DataFrame to the form:
             XYZ  PQR  ..  LMN
    apple     1    3        1
    orange    1    1        3
    grape     2    1        1
    Note that 'LMN' is just used for representation purposes.
    It won't logically appear here because it can only appear for the year 2014.
    '''
    df_year = df_year.pivot(index='fruit', columns='license', values='grade')
    #3. Remove all fruits that have ANY NaN values
    df_year = df_year.dropna(axis=1, how="any")
    #4. Some additional filtering
    #5. Function to calculate similarity between fruits
    def similarity_score(fruit1, fruit2):
        agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                            ((fruit1 == 3) & (fruit2 == 3)))
        disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                               ((fruit1 == 3) & (fruit2 == 1)))
        return (((agreements - disagreements) / float(len(fruit1))) + 1) / 2
    #6. Create Network dataframe
    network_df = pd.DataFrame(columns=['Source', 'Target', 'Weight'])
    for i, c in enumerate(combinations(df_year, 2)):
        c1 = df[[c[0]]].values.tolist()
        c2 = df[[c[1]]].values.tolist()
        c1 = [item for sublist in c1 for item in sublist]
        c2 = [item for sublist in c2 for item in sublist]
        network_df.loc[i] = [c[0], c[1], similarity_score(c1, c2)]
    return network_df
Running the above gives:
df_1946=func(df,1946)
df_1946.head()
Source Target Weight
Apple Orange 0.6
Apple Grape 0.3
Orange Grape 0.7
I want to flatten the above to a single row:
(Apple,Orange) (Apple,Grape) (Orange,Grape)
1946 0.6 0.3 0.7
Note the above will not have 3 columns, but in fact around 5000 columns.
Eventually, I want to stack the transformed dataframe rows to get something like:
df_all_years
(Apple,Orange) (Apple,Grape) (Orange,Grape)
1946 0.6 0.3 0.7
1947 0.7 0.25 0.8
..
2015 0.75 0.3 0.65
What is the best way to do this?
I would rearrange the computation a bit differently.
Instead of looping over the years:
for year in range(1946, 2015):
    partial_result = func(df, year)
and then concatenating the partial results, you can get
better performance by doing as much work as possible on the whole DataFrame, df,
before calling df.groupby(...). Also, if you can express the computation in terms of builtin aggregators such as sum and count, the computation can be done more quickly than if you use custom functions with groupby/apply.
import itertools as IT
import numpy as np
import pandas as pd
np.random.seed(2017)

def make_df():
    N = 10000
    df = pd.DataFrame({'fruit': np.random.choice(['Apple', 'Orange', 'Grape'], size=N),
                       'grade': np.random.choice([1, 2, 3], p=[0.7, 0.1, 0.2], size=N),
                       'year': np.random.choice(range(1946, 1950), size=N)})
    df['manufacturer'] = (df['year'].astype(str) + '-'
                          + df.groupby(['year', 'fruit'])['fruit'].cumcount().astype(str))
    df = df.sort_values(by=['year'])
    return df

def similarity_score(df):
    """
    Compute the score between each pair of columns in df
    """
    agreements = {}
    disagreements = {}
    for col in IT.combinations(df, 2):
        fruit1 = df[col[0]].values
        fruit2 = df[col[1]].values
        agreements[col] = (((fruit1 == 1) & (fruit2 == 1))
                           | ((fruit1 == 3) & (fruit2 == 3)))
        disagreements[col] = (((fruit1 == 1) & (fruit2 == 3))
                              | ((fruit1 == 3) & (fruit2 == 1)))
    agreements = pd.DataFrame(agreements, index=df.index)
    disagreements = pd.DataFrame(disagreements, index=df.index)
    numerator = agreements.astype(int) - disagreements.astype(int)
    grouped = numerator.groupby(level='year')
    total = grouped.sum()
    count = grouped.count()
    score = ((total / count) + 1) / 2
    return score

df = make_df()
df2 = df.set_index(['year', 'fruit', 'manufacturer'])['grade'].unstack(['fruit'])
df2 = df2.dropna(axis=0, how="any")
print(similarity_score(df2))
yields
Grape Orange
Apple Apple Grape
year
1946 0.629111 0.650426 0.641900
1947 0.644388 0.639344 0.633039
1948 0.613117 0.630566 0.616727
1949 0.634176 0.635379 0.637786
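If you want column labels in the '(Apple,Orange)' style from your question, a short follow-up sketch (assuming score is the frame returned by similarity_score above) can flatten the pair columns:
score = similarity_score(df2)
# collapse the (fruit1, fruit2) column pairs into single '(fruit1,fruit2)' labels
score.columns = ['({},{})'.format(a, b) for a, b in score.columns]
print(score)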
Here's one way of doing a pandas routine to pivot the table in the way you describe. It handles roughly 5,000 columns (the combinatorial result of two initially separate categorical classes) quickly enough; the bottleneck step took about 20 s on my quad-core MacBook, though for much larger scaling there are definitely faster strategies. The data in this example is pretty sparse (5K columns, with 5K random samples spread over 70 rows of years, 1947-2016), so execution time might be a few seconds longer with a fuller dataframe.
from itertools import chain
import pandas as pd
import numpy as np
import random # using python3 .choices()
import re
# Make bivariate data w/ 5000 total combinations (1000x5 categories)
# Also choose 5,000 randomly; some combinations may have >1 values or NaN
random_sample_data = np.array(
    [random.choices(['Apple', 'Orange', 'Lemon', 'Lime'] +
                    ['of Fruit' + str(i) for i in range(1000)],
                    k=5000),
     random.choices(['Grapes', 'Are Purple', 'And Make Wine',
                     'From the Yeast', 'That Love Sugar'],
                    k=5000),
     [random.random() for _ in range(5000)]]
).T
df = pd.DataFrame(random_sample_data, columns=[
    "Source", "Target", "Weight"])
df['Year'] = random.choices(range(1947, 2017), k=df.shape[0])
# Three views of resulting df in jupyter notebook:
df
df[df.Year == 1947]
df.groupby(["Source", "Target"]).count().unstack()
To flatten the grouped-by-year data, since groupby requires a function to be applied, you can use a temporary df intermediary to:
push all data.groupby("Year") into individual rows but with separate dataframes per the two columns "Target" + "Source" (to later expand by) plus "Weight".
Use zip and pd.core.reshape.util.cartesian_product to create an empty properly shaped pivot df which will be the final table, arising from temp_df.
e.g.,
df_temp = df.groupby("Year").apply(
lambda s: pd.DataFrame([(s.Target, s.Source, s.Weight)],
columns=["Target", "Source", "Weight"])
).sort_index()
df_temp.index = df_temp.index.droplevel(1) # reduce MultiIndex to 1-d
# Predetermine all possible pairwise column category combinations
product_ts = [*zip(*(pd.core.reshape.util.cartesian_product(
[df.Target.unique(), df.Source.unique()])
))]
ts_combinations = [str(x + ' ' + y) for (x, y) in product_ts]
ts_combinations
Finally, use simple nested for iteration (again, not the fastest approach, though pd.DataFrame.iterrows might help speed things up, as shown). Because of the random sampling with replacement I had to handle multiple values, so you would probably want to remove the conditional below the second for loop. This is the step where, for each year, the three separate dataframes are zipped into a single row of cells via the pivoted ("Weight") x ("Target"-"Source") relation.
df_pivot = pd.DataFrame(np.zeros((70, 5000)),
                        columns=ts_combinations)
df_pivot.index = df_temp.index
for year, values in df_temp.iterrows():
    for (target, source, weight) in zip(*values):
        bivar_pair = str(target + ' ' + source)
        curr_weight = df_pivot.loc[year, bivar_pair]
        if curr_weight == 0.0:
            df_pivot.loc[year, bivar_pair] = [weight]
        # append additional values if encountered
        elif type(curr_weight) == list:
            df_pivot.loc[year, bivar_pair] = str(curr_weight + [weight])
# Spotcheck:
# Verifies matching data in pivoted table vs. original for Target+Source
# combination "And Make Wine of Fruit614" across all 70 years 1947-2016
df
df_pivot['And Make Wine of Fruit614']
df[(df.Year == 1947) & (df.Target == 'And Make Wine') & (df.Source == 'of Fruit614')]
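As a simpler hedged alternative that stays closer to the original question, assuming df and func(df, year) are the ones from your post (so each (Source, Target) pair appears at most once per year), you could concatenate the per-year network frames and pivot:
# build one long frame with a 'year' column, then one row per year and a column per pair
frames = []
for year in range(1946, 2015):
    part = func(df, year)
    part['year'] = year
    frames.append(part)
long_df = pd.concat(frames, ignore_index=True)
long_df['pair'] = '(' + long_df['Source'] + ',' + long_df['Target'] + ')'
df_all_years = long_df.pivot(index='year', columns='pair', values='Weight')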

In pandas, how do I perform filtering on a DataframeGroupBy object?

Given I have the following csv data.csv:
id,category,price,source_id
1,food,1.00,4
2,drink,1.00,4
3,food,5.00,10
4,food,6.00,10
5,other,2.00,7
6,other,1.00,4
I want to group the data by (price, source_id) and I am doing it with the following code
import pandas as pd
df = pd.read_csv('data.csv', names=['id', 'category', 'price', 'source_id'])
grouped = df.groupby(['price', 'source_id'])
valid_categories = ['food', 'drink']
for price_source, group in grouped:
    if group.category.size < 2:
        continue
    categories = group.category.tolist()
    if 'other' in categories and len(set(categories).intersection(valid_categories)) > 0:
        pass
        """
        Valid data in this case is:
        1,food,1.00,4
        2,drink,1.00,4
        6,other,1.00,4
        I will need all of the above data including the id for other purposes
        """
Is there an alternate way to perform the above filtering in pandas before the for loop and if it's possible, will it be any faster than the above?
The criteria for filtering are:
size of the group is greater than 1
the group by data should contain category other and at least one of either food or drink
You could directly apply a custom filter to the GroupBy object, something like
crit = lambda x: all((x.size > 1,
'other' in x.category.values,
set(x.category) & {'food', 'drink'}))
df.groupby(['price', 'source_id']).filter(crit)
Outputs
category id price source_id
0 food 1 1.0 4
1 drink 2 1.0 4
5 other 6 1.0 4
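If the callable passed to filter ever becomes a bottleneck on a larger frame, a hedged alternative sketch is to express the same criteria as boolean masks with transform, which keeps them aligned to the original rows so you can combine or inspect them individually:
g = df.groupby(['price', 'source_id'])['category']
size_ok = g.transform('size') > 1                                      # group has more than one row
has_other = g.transform(lambda s: 'other' in s.values).astype(bool)    # group contains 'other'
has_valid = g.transform(lambda s: bool(set(s) & {'food', 'drink'})).astype(bool)  # and food or drink
print(df[size_ok & has_other & has_valid])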
