Boolean comparison in Pandas Groupby extremely slow with mysteriously appearing MultiIndex?

I am making some boolean comparisons within a pandas groupby and am experiencing over a 3x slowdown. It is a lot of data, but I don't think these boolean checks should be this slow. After I run the code, my DataFrame mysteriously has a new MultiIndex it did not have before, so I am thinking this is where the slowdown is coming from.
def my_func(group):
    uncanceled_values = group[['t1_values', 'final_values']].max(axis=1)
    group['t1_values_total'] = group['t1_values'].sum()
    group['final_values_total'] = uncanceled_values.sum()
    group['t1_values_share'] = group['t1_values'] / group['t1_values_total']
    group['t1_to_final_values_share'] = np.where(uncanceled_values.sum() > group['t1_values'].sum(),
                                                 ((group['final_values'] - group['t1_values']) /
                                                  (group['final_values_total'] - group['t1_values_total'])),
                                                 0)
    group['t1_values_rank'] = group['t1_values'].rank(ascending=False)
    group['t1_to_final_values_rank'] = group['t1_to_final_values_share'].rank(ascending=False)
    if (group['final_values_total'] < 2000.0).any() or (group['t1_values_share'] == 0.0).any():
        return
    else:
        return group
I apply this function over the groupby like so:
df = df.groupby('id').apply(my_func)
But then I need to drop the MultiIndex it produces:
df = df.droplevel('id')
I am wondering whether the fact that I don't return the group under those two conditions makes it hard for pandas to piece the DataFrame back together, so it falls back to a MultiIndex.
Any ideas? Thanks a lot
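For reference, the extra index level can be reproduced on a toy frame (made-up data, with a simplified stand-in for my_func); as far as I can tell, the group_keys flag on groupby is what controls whether apply attaches the group label as an extra index level:
import pandas as pd

# made-up data, just to show where the extra index level comes from
toy = pd.DataFrame({'id': ['a', 'a', 'b'], 't1_values': [1.0, 2.0, 50.0]})

def keep_big_groups(group):
    # like my_func above: return None for some groups, the group itself for others
    if group['t1_values'].sum() < 10:
        return None
    return group

out = toy.groupby('id').apply(keep_big_groups)
print(out.index)   # on recent pandas: MultiIndex of ('id', original row label)

# asking groupby not to attach the key avoids the droplevel step
out2 = toy.groupby('id', group_keys=False).apply(keep_big_groups)
print(out2.index)  # original row labels only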

Related

How to drop row in pandas if column1 = certain value and column 2 = NaN?

I'm trying to do the following: "# drop all rows where tag == train_loop and start is NaN".
Here's my current attempt (thanks Copilot):
# drop all rows where tag == train_loop and start is NaN
# apply filter function to each row
# return True if row should be dropped
def filter_fn(row):
    return row["tag"] == "train_loop" and pd.isna(row["start"])

old_len = len(df)
df = df[~df.apply(filter_fn, axis=1)]
It works well, but I'm wondering if there is a less verbose way.
Using apply is actually a really bad way to do this, since it loops over every row, calling the function you defined in Python each time. Instead, use vectorized operations on the entire DataFrame, which dispatch to optimized implementations written in C under the hood:
df = df[~((df["tag"] == "train_loop") & df["start"].isnull())]
If your data is large (more than ~100k rows), an even faster option is the pandas query method, where you can write both conditions in one expression:
df = df.query(
    '~((tag == "train_loop") and (start != start))'
)
This makes use of the fact that NaNs never equal anything, including themselves, so we can use simple comparison operators to find NaNs (.isnull() isn't available in the query mini-language). For the query method to be faster, you need to have numexpr installed, which will compile your queries on the fly before they're evaluated on the data.
See the docs on enhancing performance for more info and examples.
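As a quick sanity check of the start != start trick, a toy frame (made-up values) behaves as expected:
import numpy as np
import pandas as pd

# made-up miniature frame to check the NaN-never-equals-itself trick
demo = pd.DataFrame({'tag': ['train_loop', 'train_loop', 'eval'],
                     'start': [np.nan, 1.0, np.nan]})

# 'start != start' is True exactly where start is NaN, so only the first row
# (train_loop with a missing start) gets dropped
kept = demo.query('~((tag == "train_loop") and (start != start))')
print(kept)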
You can do
df = df.loc[~(df['tag'].eq('train_loop') & df['start'].isna())]

How to loop through pandas dataframe, and conditionally assign values to a row of a variable?

I'm trying to loop through the 'vol' dataframe, and conditionally check if the sample_date is between certain dates. If it is, assign a value to another column.
Here's the following code I have:
vol = pd.DataFrame(data=pd.date_range(start='11/3/2015', end='1/29/2019'))
vol.columns = ['sample_date']
vol['hydraulic_vol'] = np.nan
for i in vol.iterrows():
    if pd.Timestamp('2015-11-03') <= vol.loc[i, 'sample_date'] <= pd.Timestamp('2018-06-07'):
        vol.loc[i, 'hydraulic_vol'] = 319779
Here's the error I received:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
This is how you would do it properly:
cond = ((pd.Timestamp('2015-11-03') <= vol.sample_date) &
        (vol.sample_date <= pd.Timestamp('2018-06-07')))
vol.loc[cond, 'hydraulic_vol'] = 319779
Another way to do this would be to use the np.where function from the numpy module, in combination with the .between method.
This method works like this:
np.where(condition, value if true, value if false)
Code example
cond = vol.sample_date.between('2015-11-03', '2018-06-07')
vol['hydraulic_vol'] = np.where(cond, 319779, np.nan)
Or you can combine them in one single line of code:
vol['hydraulic_vol'] = np.where(vol.sample_date.between('2015-11-03', '2018-06-07'), 319779, np.nan)
Edit
I see that you're new here, so here's something I had to learn as well coming to python/pandas.
Looping over a dataframe should be your last resort; try to use vectorized solutions instead, in this case .loc or np.where, which will perform far better than looping.
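For completeness, the TypeError in the question comes from for i in vol.iterrows(): iterrows() yields (index, row) tuples, so vol.loc[i, ...] ends up being indexed with a tuple containing a Series. If you did want to keep a loop (still far slower than the vectorized versions above), it would look roughly like this:
import pandas as pd

# shown only to explain the original error; prefer .loc or np.where as above
for idx, row in vol.iterrows():
    if pd.Timestamp('2015-11-03') <= row['sample_date'] <= pd.Timestamp('2018-06-07'):
        vol.loc[idx, 'hydraulic_vol'] = 319779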

Perform logical operations and add new columns to a dataframe at the same time?

Since I'm new and still learning Python, I want to select specific data in a dataframe using a logical condition and add a label to it; however, this seems to require several lines of code.
For example:
df = df[(df['this_col'] >= 10) & (df['anth_col'] < 100)]
result_df = df.copy()
result_df['label'] = 'medium'
I really wonder if there's a way to do this in one line of code without applying a function. If it can't be done in one line, why not?
Cheers!
query returns a copy, always.
result_df = df.query("this_col >= 10 and anth_col < 100").assign(label='medium')
Assuming your columns can pass off as valid identifier names in python, this will do.
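On a toy frame (made-up numbers), the chained call looks like this:
import pandas as pd

# made-up numbers, just to show the chained query/assign call
df = pd.DataFrame({'this_col': [5, 12, 30], 'anth_col': [50, 150, 60]})

result_df = df.query("this_col >= 10 and anth_col < 100").assign(label='medium')
print(result_df)
#    this_col  anth_col   label
# 2        30        60  medium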

Python pandas - using apply function and creating new columns in dataframe

I have a dataframe with 40 million records and I need to create 2 new columns (net_amt and share_amt) from the existing amt and sharing_pct columns. I created two functions which calculate these amounts and then used apply to populate them back into the dataframe. Since my dataframe is large, this is taking a long time to complete. Can we calculate both amounts in one shot, or is there a better way of doing it altogether?
def fn_net(row):
    if row['sharing'] == 1:
        return row['amt'] * row['sharing_pct']
    else:
        return row['amt']

def fn_share(row):
    if row['sharing'] == 1:
        return row['amt'] * (1 - row['sharing_pct'])
    else:
        return 0

df_load['net_amt'] = df_load.apply(lambda row: fn_net(row), axis=1)
df_load['share_amt'] = df_load.apply(lambda row: fn_share(row), axis=1)
I think numpy where() will be the best choice here (after import numpy as np):
df['net_amt'] = np.where(df['sharing'] == 1,             # test/condition
                         df['amt'] * df['sharing_pct'],  # value if True
                         df['amt'])                      # value if False
You can, of course, use this same method for 'share_amt' also. I don't think there is any faster way to do this, and I don't think you can do it in "one shot", depending on how you define it. Bottom line: doing it with np.where is way faster than applying a function.
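For share_amt, that would look roughly like this (same column names and logic as the question's fn_share):
df['share_amt'] = np.where(df['sharing'] == 1,                  # test/condition
                           df['amt'] * (1 - df['sharing_pct']), # value if True
                           0)                                   # value if False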
More specifically, I tested on the sample dataset below (10,000 rows) and it's about 700x faster than the function/apply method in that case.
df = pd.DataFrame({'sharing': [0, 1] * 5000,
                   'sharing_pct': np.linspace(.01, 1., 10000),
                   'amt': np.random.randn(10000)})
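If you want to reproduce the comparison yourself, something along these lines works (exact numbers will vary by machine and pandas version; fn_net is the function from the question):
import timeit

apply_time = timeit.timeit(lambda: df.apply(fn_net, axis=1), number=10)
where_time = timeit.timeit(lambda: np.where(df['sharing'] == 1,
                                            df['amt'] * df['sharing_pct'],
                                            df['amt']),
                           number=10)
print(apply_time / where_time)  # roughly how many times faster np.where is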

Operating on Pandas groups

I am attempting to perform multiple operations on a large dataframe (~3 million rows).
Using a small test-set representative of my data, I've come up with a solution. However the script runs extremely slowly when using the large dataset as input.
Here is the main loop of the application:
def run():
    df = pd.DataFrame(columns=['CARD_NO', 'CUSTOMER_ID', 'MODIFIED_DATE', 'STATUS', 'LOYALTY_CARD_ENROLLED'])
    foo = input.groupby('CARD_NO', as_index=False, sort=False)
    for name, group in foo:
        if len(group) == 1:
            df = df.append(group)
        else:
            dates = group['MODIFIED_DATE'].values
            if all_same(dates):
                df = df.append(group[group.STATUS == '1'])
            else:
                df = df.append(group[group.MODIFIED_DATE == most_recent(dates)])
    path = ''
    df.to_csv(path, sep=',', index=False)
The logic is as follows:
For each CARD_NO:
- if there is only 1 row with that CARD_NO, add the row to the new dataframe
- if there are > 1 rows with the same CARD_NO, check MODIFIED_DATE:
  - if the MODIFIED_DATEs are different, take the row with the most recent date
  - if all MODIFIED_DATEs are equal, take whichever row has STATUS = 1
The slow-down occurs at each iteration of the loop over
input.groupby('CARD_NO', as_index=False, sort=False)
I am currently trying to parallelize the loop by splitting the groups returned by the above statement, but I'm not sure if this is the correct approach...
Am I overlooking a core functionality of Pandas?
Is there a better, more Pandas-esque way of solving this problem?
Any help is greatly appreciated.
Thank you.
Two general tips:
For looping over a groupby object, you can try apply. For example,
grouped = input.groupby('CARD_NO', as_index=False, sort=False)
grouped.apply(example_function)
Here, example_function is called for each group in your groupby object. You could write example_function to append to a data structure yourself, or if it has a return value, pandas will try to concatenate the return values into one dataframe.
Appending rows to dataframes is slow. You might be better off building some other data structure with each iteration of the loop, and then building your dataframe at the end. For example, you could make a list of dicts.
data = []
grouped = input.groupby('CARD_NO', as_index=False, sort=False)

def example_function(row, data_list):
    row_dict = {}
    row_dict['length'] = len(row)
    row_dict['has_property_x'] = pandas.notnull(row['property_x'])
    data_list.append(row_dict)

grouped.apply(example_function, data_list=data)
pandas.DataFrame(data)
I ended up with a significant improvement in running time. Having the applied function return the rows to keep (rather than appending to a separate dataframe) brought the running time down to ~30 minutes for 3 million rows and avoided the need for any secondary data structure.
def foo(df_of_grouped_data):
    group_length = len(df_of_grouped_data)
    if group_length == 1:
        return df_of_grouped_data
    else:
        dates = df_of_grouped_data['MODIFIED_DATE'].values
        if all_same(dates):
            return df_of_grouped_data[df_of_grouped_data.STATUS == '1']
        else:
            return df_of_grouped_data[df_of_grouped_data.MODIFIED_DATE == most_recent(dates)]

result = card_groups.apply(foo)
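The snippets above refer to all_same() and most_recent() without showing them; plausible definitions (assuming MODIFIED_DATE holds sortable, datetime-like values) would be:
def all_same(dates):
    # True if every date in the group is identical
    return len(set(dates)) == 1

def most_recent(dates):
    # latest date in the group
    return max(dates)

# card_groups above is presumably the same grouping as before, e.g.
# card_groups = input.groupby('CARD_NO', sort=False)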
