Operating on Pandas groups - python

I am attempting to perform multiple operations on a large dataframe (~3 million rows).
Using a small test set representative of my data, I've come up with a solution. However, the script runs extremely slowly when the large dataset is used as input.
Here is the main loop of the application:
def run():
    df = pd.DataFrame(columns=['CARD_NO','CUSTOMER_ID','MODIFIED_DATE','STATUS','LOYALTY_CARD_ENROLLED'])
    foo = input.groupby('CARD_NO', as_index=False, sort=False)
    for name, group in foo:
        if len(group) == 1:
            df = df.append(group)
        else:
            dates = group['MODIFIED_DATE'].values
            if all_same(dates):
                df = df.append(group[group.STATUS == '1'])
            else:
                df = df.append(group[group.MODIFIED_DATE == most_recent(dates)])
    path = ''
    df.to_csv(path, sep=',', index=False)
The logic is as follows:
For each CARD_NO:
- if there is only one row for that CARD_NO, add the row to the new dataframe
- if there are multiple rows with the same CARD_NO, check MODIFIED_DATE:
  - if the MODIFIED_DATEs differ, take the row with the most recent date
  - if all MODIFIED_DATEs are equal, take whichever row has STATUS = 1
The slow-down occurs at each iteration over the groups returned by:
input.groupby('CARD_NO', as_index=False, sort=False)
I am currently trying to parallelize the loop by splitting the groups returned by the above statement, but I'm not sure if this is the correct approach...
Am I overlooking a core functionality of Pandas?
Is there a better, more Pandas-esque way of solving this problem?
Any help is greatly appreciated.
Thank you.

Two general tips:
Instead of looping over a groupby object, you can try apply. For example,
grouped = input.groupby('CARD_NO', as_index=False, sort=False)
grouped.apply(example_function)
Here, example_function is called once for each group in your groupby object. You could write example_function to append to a data structure yourself, or, if it has a return value, pandas will try to concatenate the return values into one dataframe.
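For instance, on a tiny made-up frame (not the question's data), returning a filtered sub-frame from the applied function comes back as one concatenated DataFrame:
import pandas as pd

toy = pd.DataFrame({'CARD_NO': ['A', 'A', 'B'], 'STATUS': ['1', '0', '1']})

# Each group's return value (a filtered sub-frame here) is concatenated by pandas.
kept = toy.groupby('CARD_NO', sort=False).apply(lambda g: g[g.STATUS == '1'])
print(kept)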
Appending rows to dataframes is slow. You might be better off building some other data structure with each iteration of the loop, and then building your dataframe at the end. For example, you could make a list of dicts.
data = []
grouped = input.groupby('CARD_NO', as_index=False, sort=False)

def example_function(row, data_list):
    row_dict = {}
    row_dict['length'] = len(row)
    row_dict['has_property_x'] = pandas.notnull(row['property_x'])
    data_list.append(row_dict)

grouped.apply(example_function, data_list=data)
pandas.DataFrame(data)

I ended up with a significant improvement in running time (~30 minutes for 3 million rows). Returning the selected rows from the applied function lets pandas assemble the result itself, which avoids both the repeated appends and the need for secondary data structures.
def foo(df_of_grouped_data):
    group_length = len(df_of_grouped_data)
    if group_length == 1:
        return df_of_grouped_data
    else:
        dates = df_of_grouped_data['MODIFIED_DATE'].values
        if all_same(dates):
            return df_of_grouped_data[df_of_grouped_data.STATUS == '1']
        else:
            return df_of_grouped_data[df_of_grouped_data.MODIFIED_DATE == most_recent(dates)]

result = card_groups.apply(foo)
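If the per-group Python call overhead is still a concern, the same selection rule can often be expressed without apply at all by sorting and dropping duplicates. A minimal sketch on made-up data (note it always keeps exactly one row per CARD_NO, which differs slightly from the code above when several rows tie):
import pandas as pd

# Hypothetical miniature of the input described in the question.
input_df = pd.DataFrame({
    'CARD_NO': ['A', 'B', 'B', 'C', 'C'],
    'STATUS': ['1', '1', '0', '0', '1'],
    'MODIFIED_DATE': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-01',
                                     '2020-01-01', '2020-02-01']),
})

# Sort so that, within each CARD_NO, the most recent MODIFIED_DATE comes first
# and STATUS == '1' wins ties, then keep the first row per CARD_NO.
result = (input_df
          .sort_values(['CARD_NO', 'MODIFIED_DATE', 'STATUS'],
                       ascending=[True, False, False])
          .drop_duplicates(subset='CARD_NO', keep='first'))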

Related

How to loop through a pandas data frame using a column's values as the order of the loop?

I have two CSV files which I'm using in a loop. In one of the files there is a column called "Availability Score"; is there a way to make the loop iterate through the records in descending order of this column? I thought I could use Ob.sort_values(by=['AvailabilityScore'],ascending=False) to change the order of the dataframe first, so that when the loop starts it will already be in the right order. I've tried this and it doesn't seem to make a difference.
# Import the data
CF = pd.read_csv(r'CustomerFloat.csv')
Ob = pd.read_csv(r'Orderbook.csv')

# Convert to dataframes
CF = pd.DataFrame(CF)
Ob = pd.DataFrame(Ob)

# Remove SubAssemblies
Ob.drop(Ob[Ob['SubAssembly'] != 0].index, inplace=True)

# Sort the data by their IDs
Ob.sort_values(by=['CustomerFloatID'])
CF.sort_values(by=['FloatID'])

# Sort the orderbook by its availability score
Ob.sort_values(by=['AvailabilityScore'], ascending=False)

# Loop for urgent values
for i, rowi in CF.iterrows():
    count = 0
    urgent_value = 1
    for j, rowj in Ob.iterrows():
        if rowi['FloatID'] == rowj['CustomerFloatID'] and count < rowi['Urgent Deficit']:
            Ob.at[j, 'CustomerFloatPriority'] = urgent_value
            count += rowj['Qty']
You need to add inplace=True, like this:
Ob.sort_values(by=['AvailabilityScore'],ascending=False, inplace=True)
sort_values() (like most Pandas functions nowadays) is not in-place by default. You should assign the result back to the variable that holds the DataFrame:
Ob = Ob.sort_values(by=['CustomerFloatID'], ascending=False)
# ...
BTW, while you can pass inplace=True as an argument to sort_values(), I do not recommend it. Generally speaking, inplace=True is often considered bad practice.
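Putting that together for the loop in the question, a minimal sketch (same file and column names as above; each sort is assigned back rather than done in place):
import pandas as pd

# File names as given in the question.
CF = pd.read_csv(r'CustomerFloat.csv')
Ob = pd.read_csv(r'Orderbook.csv')

# Remove sub-assemblies, then sort; assign each result back.
Ob = Ob[Ob['SubAssembly'] == 0]
Ob = Ob.sort_values(by=['AvailabilityScore'], ascending=False)

# iterrows() now yields rows in descending AvailabilityScore order.
for j, rowj in Ob.iterrows():
    ...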

Boolean comparison in Pandas Groupby extremely slow with mysteriously appearing MultiIndex?

I am making some boolean comparisons within a pandas groupby and am experiencing over a 3x slowdown. It is a lot of data, but I don't think these boolean checks should be this slow. After I run the code, my DataFrame mysteriously has a new MultiIndex it did not have before, so I am thinking this is where the slowdown is coming from.
def my_func(group):
    uncanceled_values = group[['t1_values', 'final_values']].max(axis=1)
    group['t1_values_total'] = group['t1_values'].sum()
    group['final_values_total'] = uncanceled_values.sum()
    group['t1_values_share'] = group['t1_values'] / group['t1_values_total']
    group['t1_to_final_values_share'] = np.where(uncanceled_values.sum() > group['t1_values'].sum(),
                                                 ((group['final_values'] - group['t1_values']) /
                                                  (group['final_values_total'] - group['t1_values_total'])),
                                                 0)
    group['t1_values_rank'] = group['t1_values'].rank(ascending=False)
    group['t1_to_final_values_rank'] = group['t1_to_final_values_share'].rank(ascending=False)
    if (group['final_values_total'] < 2000.0).any() or (group['t1_values_share'] == 0.0).any():
        return
    else:
        return group
I apply this function over the groupby like so:
df= df.groupby('id').apply(my_func)
But then I need to drop the MultiIndex it produces:
df= df.droplevel('id')
I am wondering if the fact that I don't return the group under those 2 conditions makes it hard for pandas to piece the df back together, so it falls back to a MultiIndex.
Any ideas? Thanks a lot.
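One thing worth trying (a hedged suggestion, not verified against this data): pass group_keys=False so that apply does not prepend the 'id' group key as an extra index level, which removes the need for droplevel:
# group_keys=False keeps the original index on the pieces returned by my_func,
# so no 'id' level is added and droplevel is unnecessary.
df = df.groupby('id', group_keys=False).apply(my_func)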

Using pandas .shift on multiple columns with different shift lengths

I have created a function that parses through each column of a dataframe, shifts up the data in that respective column to the first observation (shifting past '-'), and stores that column in a dictionary. I then convert the dictionary back to a dataframe to have the appropriately shifted columns. The function is operational and takes about 10 seconds on a 12x3000 dataframe. However, when applying it to 12x25000 it is extremely extremely slow. I feel like there is a much better way to approach this to increase the speed - perhaps even an argument of the shift function that I am missing. Appreciate any help.
def create_seasoned_df(df_orig):
    """
    Creates a seasoned dataframe with only the first 12 periods of a loan
    """
    df_seasoned = df_orig.reset_index().copy()
    temp_dic = {}
    for col in cols:
        to_shift = -len(df_seasoned[df_seasoned[col] == '-'])
        temp_dic[col] = df_seasoned[col].shift(periods=to_shift)
    df_seasoned = pd.DataFrame.from_dict(temp_dic, orient='index').T[:12]
    return df_seasoned
Try using this code with apply instead:
def create_seasoned_df(df_orig):
    # Shift each column up by its own count of '-' placeholders.
    df_seasoned = df_orig.reset_index().apply(lambda x: x.shift(-x.eq('-').sum()), axis=0)
    return df_seasoned
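A quick sanity check of this version on a toy frame (made-up data, with '-' marking the pre-observation periods):
import pandas as pd

df = pd.DataFrame({
    'loan_a': ['-', '-', 10, 11, 12],
    'loan_b': ['-', 5, 6, 7, 8],
})

# Each column is shifted up by its count of '-' placeholders, so the first
# real observation moves to row 0 and NaNs pad the bottom.
seasoned = df.apply(lambda x: x.shift(-x.eq('-').sum()), axis=0)
print(seasoned)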

Looping over rows in Pandas dataframe taking too long

I have been running the code for about 45 minutes now and it is still going. Can someone please suggest how I can make it faster?
df4 is a pandas dataframe. A representative mock-up of it looks like this:
df4 = pd.DataFrame({
    'hashtag': np.random.randn(3000000),
    'sentiment_score': np.random.choice([0, 1], 3000000),
    'user_id': np.random.choice(['11', '12', '13'], 3000000),
})
What I am aiming to have is a new column called rating.
len(df4.index) is 3,037,321.
ratings = []
for index in df4.index:
    rowUserID = df4['user_id'][index]
    rowTrackID = df4['track_id'][index]
    rowSentimentScore = df4['sentiment_score'][index]
    condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
    allRows = df4[condition]
    totalSongListendForContext = len(allRows.index)
    rows = df4[(condition & (df4['track_id'] == rowTrackID))]
    songListendForContext = len(rows.index)
    rating = songListendForContext / totalSongListendForContext
    ratings.append(rating)
Either way, you'll need groupby. You can either:
use two groupby calls with transform to get the size of what you called condition and the size of condition & (df4['track_id'] == rowTrackID), then divide the second by the first:
df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
                  / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
Or use groupby with value_counts with the parameter normalize=True and merge the result with df4:
df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                   .value_counts(normalize=True)
                   .rename('ratings').reset_index(),
                how='left')
In both cases, you will get the same result as your list ratings (which I assume you wanted as a column). I would say the second option is faster, but it depends on the number of groups you have in your real data.
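For a concrete feel of what the first option produces, here is a toy run (made-up data including a track_id column, since the loop in the question references one):
import pandas as pd

df4 = pd.DataFrame({
    'user_id': ['11', '11', '11', '12'],
    'sentiment_score': [1, 1, 1, 0],
    'track_id': ['a', 'a', 'b', 'a'],
})

# Listens per (user_id, sentiment_score, track_id) divided by listens per
# (user_id, sentiment_score): user '11' gets 2/3, 2/3, 1/3; user '12' gets 1.
df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
                  / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
print(df4)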

How to speed up this Python function?

I hope it is OK to ask questions of this type.
I have a get_lags function that takes a data frame, and for each column, shifts the column by each n in the list n_lags. So, if n_lags = [1, 2], the function shifts each column once by 1 and once by 2 positions, creating new lagged columns in this way.
def get_lags(df, n_lags):
    data = df.copy()
    data_with_lags = pd.DataFrame()
    for column in data.columns:
        for i in range(n_lags[0], n_lags[-1] + 1):
            new_column_name = str(column) + '_Lag' + str(i)
            data_with_lags[new_column_name] = data[column].shift(-i)
    data_with_lags.fillna(method='ffill', limit=max(n_lags), inplace=True)
    return data_with_lags
So, if:
df.columns
ColumnA
ColumnB
Then, get_lags(df, [1 , 2]).columns will be:
ColumnA_Lag1
ColumnA_Lag2
ColumnB_Lag1
ColumnB_Lag2
Issue: working with data frames that have about 100,000 rows and 20,000 columns, this takes forever to run. On a 16 GB RAM, Core i7 Windows machine, I once waited 15 minutes for the code to run before I stopped it. Is there any way I can tweak this function to make it faster?
You'll need shift + concat. Here's the concise version -
def get_lags(df, n_lags):
    return pd.concat(
        [df] + [df.shift(i).add_suffix('_Lag{}'.format(i)) for i in n_lags],
        axis=1
    )
And here's a more memory-friendly version, using a for loop -
def get_lags(df, n_lags):
    df_list = [df]
    for i in n_lags:
        v = df.shift(i)
        v.columns = v.columns + '_Lag{}'.format(i)
        df_list.append(v)
    return pd.concat(df_list, axis=1)
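A quick check of either version on a tiny frame (toy data; note that these use shift(i), i.e. they lag downward, whereas the function in the question shifted with -i, so flip the sign if you want that direction):
import pandas as pd

def get_lags(df, n_lags):
    # Concat-based version from above.
    return pd.concat(
        [df] + [df.shift(i).add_suffix('_Lag{}'.format(i)) for i in n_lags],
        axis=1
    )

df = pd.DataFrame({'ColumnA': [1, 2, 3, 4], 'ColumnB': [10, 20, 30, 40]})
print(get_lags(df, [1, 2]).columns.tolist())
# ['ColumnA', 'ColumnB', 'ColumnA_Lag1', 'ColumnB_Lag1', 'ColumnA_Lag2', 'ColumnB_Lag2']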
This may not apply to your case (I hope I understand what you're trying to do correctly), but you can speed it up massively by not doing it in the first place. Can you treat your columns like a ring buffer?
Instead of changing the columns afterwards, keep track of:
- how many columns you can use (how many lag items for each entry)
- which lag column was written last
- (optionally) how many times you "rotated"
So instead of moving the data, you do something like:
current_column = (current_column + 1) % total_columns
and write to that column next.
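A rough sketch of that idea (hypothetical names, not tied to the data above): keep a fixed number of lag slots and overwrite the oldest one on each step instead of shifting everything:
import numpy as np

n_rows, total_columns = 5, 3                       # 3 lag slots per entry
buffer = np.full((n_rows, total_columns), np.nan)  # the "ring" of lag columns
current_column = 0                                 # next lag column to write

for new_values in np.random.randn(7, n_rows):
    buffer[:, current_column] = new_values         # overwrite the oldest slot
    current_column = (current_column + 1) % total_columns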
