Pandas: speeding up groupby? - python

I am wondering whether it is possible to speed up pandas dataframe.groupby with the following application:
Basic data structure:
HDFStore with 9 columns
4 columns are columns with data (colF ... colI)
the combination of the remaining 5 columns (colA ... colE) gives a unique index
colE is a "last modified" column
The basic idea is to implement a data base with a "transactional memory". Assuming an entry changes, I don't delete it but write a new row with a new value in the "last modified" column. This allows me to retroactively look at how entries have changed over time.
However, in situations where I only want the currently valid "state" of the data, it requires selecting only those rows with the most recent "last modified" column:
idx = df.groupby(['colA', 'colB', 'colC', 'colD'],
as_index=False, sort=False)['colE'].max()
df_current_state = df.merge(idx, 'inner', on=['colA', 'colB', 'colC', 'colD', 'colE'])
This groupby method eats up about 70% of my run time.
Note: For the majority of rows, there exists only a single entry with respect to the "last modified" column. Only for very few, multiple versions of the row with different "last modified" values exist.
Is there a way to speed up this process other than changing the program logic as follows?
Alternative Solution without need for groupby:
Add an additional "boolean" column activeState which stores whether the row is part of the "active state".
When rows change, mark their activeState field as False and insert a new row with activeState=True.
One can then query the table with activeState==True rather than use groupby.
My issue with this solution is that it has the potential for mistakes where the activeState field is not set appropriately. Of course this is recoverable from using the "last modified" column, but if the groupby could be sped up, it would be foolproof...

What about using a sort followed by drop_duplicates? I'm using it on a large database with four levels of grouping with good speed. I'm taking the first row, so I don't know whether taking the first versus the last affects the speed, but you can always reverse the sort instead.
df_current_state = df.sort_values('colE')
df_current_state = df_current_state.drop_duplicates(subset=['colA','colB','colC','colD'], keep='last')
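As a rough check that the sort-then-drop_duplicates idea selects the same rows as the groupby/merge from the question, here is a minimal sketch on made-up data. The values and the extra payload column colF are assumptions for illustration only; it uses sort_values and keep='last', the current pandas spellings.

```python
import pandas as pd

# Hypothetical toy data: key columns colA..colD, "last modified" colE, payload colF.
df = pd.DataFrame({
    "colA": [1, 1, 2, 2],
    "colB": ["x", "x", "y", "z"],
    "colC": [0, 0, 0, 0],
    "colD": [0, 0, 0, 0],
    "colE": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-15", "2020-01-20"]),
    "colF": [10, 11, 20, 30],
})
keys = ["colA", "colB", "colC", "colD"]

# Approach from the question: newest colE per key, then inner merge.
idx = df.groupby(keys, as_index=False, sort=False)["colE"].max()
via_groupby = df.merge(idx, how="inner", on=keys + ["colE"])

# Approach from the answer: sort by colE, keep the last row per key.
via_sort = df.sort_values("colE").drop_duplicates(subset=keys, keep="last")

# Normalise row order and index before comparing the two results.
a = via_groupby.sort_values(keys).reset_index(drop=True)
b = via_sort.sort_values(keys).reset_index(drop=True)
print(a.equals(b))
```

One difference to keep in mind: if two rows of the same key share the same colE value, the merge keeps both, while drop_duplicates keeps only one of them.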

Related

I am not able to correctly assign a value to a df row based on 3 conditions (checking values in 3 other columns)

I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month in relation to the sum of sales of family-month. For example, the family fish has sold 100,000 in month 1, so in this specific case it would be calculated 10,000/100,000 (productid-month-sales/family-month-sales)
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==family)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new one.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (and even NumPy), unlike general-purpose Python, analysts should avoid for loops, as there are many vectorized options for running conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs put it, results broadcast to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a data frame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping, since transform works nicely for assigning new data frame columns. Also, div below is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
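To make the transform approach concrete, here is a small sketch on invented numbers. The rows below are hypothetical, and the sales column is called Qty as in the code above. Each row's SKU-month total is divided by its family-month total, with no explicit loop.

```python
import pandas as pd

# Hypothetical sample: FISH sold 100,000 in month 1, of which SKU 1234 sold 10,000.
testingAgain = pd.DataFrame({
    "Month": [1, 1, 1, 2],
    "SKU": [1234, 5678, 9999, 1234],
    "Family": ["FISH", "FISH", "MEAT", "FISH"],
    "Qty": [10000.0, 90000.0, 500.0, 2000.0],
})

# transform('sum') returns one value per original row, aligned on the index.
sku_total = testingAgain.groupby(["SKU", "Family", "Month"])["Qty"].transform("sum")
family_total = testingAgain.groupby(["Family", "Month"])["Qty"].transform("sum")

testingAgain["ProporcionVenta"] = sku_total.div(family_total)
print(testingAgain)
```

Because both transformed Series keep the original index, the division lines up row by row and the result can be assigned directly as a new column.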

Remove duplicated cell content using python?

I filtered the rows with duplicated Order IDs and joined the products of each order with commas using the code below, but I don't understand why the items in the Join_Dup column are repeated.
dd = sales_all[sales_all['Order ID'].duplicated(keep=False)]
dd['Join_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
print(dd.head())
dd = dd[['Order ID','Join_Dup']].drop_duplicates()
dd
Order ID Join_Dup
0 176558 USB-C Charging Cable,USB-C Charging Cable,USB-...
2 176559 Bose SoundSport Headphones,Bose SoundSport Hea...
3 176560 Google Phone,Wired Headphones,Google Phone,Wir...
5 176561 Wired Headphones,Wired Headphones,Wired Headph...
... ... ...
186846 259354 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186847 259355 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186848 259356 34in Ultrawide Monitor,34in Ultrawide Monitor,...
186849 259357 USB-C Charging Cable,USB-C Charging Cable,USB-...
[178437 rows x 2 columns]
I need to remove the duplicates from the cell in each row. Can someone please help?
IIUC, let's try to prevent the duplicates in the groupby transform statement:
dd['Join_No_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(set(x)))
Edit: disregard the second part of the answer. I will delete that portion if it ends up not being useful.
So, per your comment, you want unique product strings for each Order ID. You can get that in a single step:
dd = (
    sales_all.groupby(['Order ID', 'Product'])['some_other_column']
    .size().rename('quantity').reset_index()
)
Now you have unique rows of OrderID/Product with the count of repeated products (or quantity, as in a regular invoice). You can work with that or you can groupby to form a list of products:
orders = dd.groupby('Order ID').Product.apply(list)
---apply vs transform---
Please note that if you use .transform as in your question you will invariably get a result with the same shape as the dataframe/series being grouped (i.e. grouping will be reversed and you will end up with the same number of rows, thus creating duplicates). The function .apply will pass the groups of your groupby to the same function, any function, but will not broadcast back to the original shape (it will return only one row per group).
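The difference described above can be seen on a tiny made-up table. This is only a sketch: the order data is invented, and join_unique is a hypothetical helper that uses dict.fromkeys to drop repeated products while keeping their first-seen order (a set would not guarantee order).

```python
import pandas as pd

# Hypothetical orders: order 1 contains a repeated product.
sales_all = pd.DataFrame({
    "Order ID": [1, 1, 1, 2],
    "Product": ["Cable", "Cable", "Phone", "Monitor"],
})

def join_unique(products):
    # dict.fromkeys removes repeats and preserves first-seen order.
    return ",".join(dict.fromkeys(products))

grouped = sales_all.groupby("Order ID")["Product"]

# transform: result has one entry per original row (the string is repeated).
broadcast = grouped.transform(join_unique)

# apply: result has one entry per group, indexed by Order ID.
collapsed = grouped.apply(join_unique)

print(broadcast)
print(collapsed)
```

With transform the joined string appears once per original row of the order, which is why a later drop_duplicates is needed; with apply there is already exactly one row per Order ID.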
Old Answer
So you are removing ALL Order IDs that appear in multiple rows (if ID 14 appears in two rows you discard both rows). This makes the groupby in the next line redundant, as every grouped ID will have just one line.
Ok, now that's out of the way. Then presumably each row in Product contains a list which you are joining with a lambda. This step would be a little faster with a pandas native function.
dd['Join_Dup'] = dd.Product.str.join(', ')
# perhaps choose a better name for the column, once you remove duplicates it will not mean much (does 'Join_Products' work?)
Now to handle duplicates. You didn't actually need to join in the last step if all you wanted was to remove dups. Pandas can handle lists as well. But the part you were missing is the subset parameter.
dd = dd[['Order ID', 'Join_Dup']].drop_duplicates(subset='Join_Dup')

Python - Filtering dataframe based on 3 columns potentially containing a sought after value

I'm trying to take a query of recent customer transactions and match potential primary phone, cellphone and work phone matches against a particular list of customers I have.
Essentially, I am comparing one dataframe column (the list of customers I want to check for recent transactions) against the overall universe of all recent transactions (the dataframe transaction_data), and I want to remove any row that does not have a match in the primary phone, cellphone or work phone column.
Here is what I am currently trying, but it only returns False for each column header and does not filter the dataframe by rows as I had hoped:
transaction_data[(transaction_data['phone'].isin(df['phone'])) | (transaction_data['cell'].isin(df['phone'])) | (transaction_data['workphone'].isin(df['phone']))].any()
I'm trying to return a dataframe containing rows of transactional records where there is a match on either primary phone, cellphone or workphone.
Is there a better way to do this perhaps? Or do I need a minor tweak on my code?
The thing here is that applying the .isin() method of a Series to another Series will return a boolean Series.
In your example, transaction_data['phone'] is a Series, and so is df['phone']. The method returns a boolean Series containing True in each row where the value in transaction_data['phone'] appears in df['phone'], and False otherwise. The same holds for every use of the isin() method in your example.
And, good news! This boolean Series is exactly what is needed for slicing the dataframe. Therefore your code needs just a small tweak: delete the .any() at the end of the line.
transaction_data[(transaction_data['phone'].isin(df['phone'])) | (transaction_data['cell'].isin(df['phone'])) | (transaction_data['workphone'].isin(df['phone']))]
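A compact variant of the same filter, sketched on invented phone numbers, builds the mask for all three columns at once. Note the assumption here: DataFrame.isin is given a plain list, because passing a Series to DataFrame.isin matches on the index as well as the values. In this form .any(axis=1) does belong, but on the boolean frame, not on the filtered result.

```python
import pandas as pd

# Hypothetical transactions and customer list.
transaction_data = pd.DataFrame({
    "phone": ["111", "222", "333"],
    "cell": ["444", "555", "666"],
    "workphone": ["777", "888", "999"],
})
df = pd.DataFrame({"phone": ["111", "555"]})

phone_cols = ["phone", "cell", "workphone"]

# True for a row when any of its three phone columns is in the customer list.
mask = transaction_data[phone_cols].isin(df["phone"].tolist()).any(axis=1)
matches = transaction_data[mask]

# Same rows as chaining the three Series.isin masks with |.
explicit = transaction_data[
    transaction_data["phone"].isin(df["phone"])
    | transaction_data["cell"].isin(df["phone"])
    | transaction_data["workphone"].isin(df["phone"])
]
print(matches.equals(explicit))
```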

Python pandas loop efficient through two dataframes with different lengths

I have two dataframes with different lengths (df, df1). They share one common label, "collo_number". I want to search the second dataframe for every collo_number in the first dataframe. The problem is that the second dataframe contains multiple rows, for different dates, for every collo_number. So I want to sum over these dates and add the result as a new column in the first dataframe.
I currently use a loop, but it is rather slow and has to perform this operation for all 7 days of the week. Is there a way to get better performance? I tried multiple solutions but keep getting an error saying that I cannot use the equals sign on two dataframes with different lengths. Help would really be appreciated! Here is an example of what works, but with rather bad performance.
df5=[df1.loc[(df1.index == nasa) & (df1.afleverdag == x1) & (df1.ind_init_actie=="N"), "aantal_colli"].sum() for nasa in df.collonr]
Your description is a bit vague (hence my comment). First, what you could do is select the rows of the dataframe that you want to search:
dftmp = df1[(df1.afleverdag==x1) & (df1.ind_init_actie=='N')]
so that you don't do this for every item in the loop.
Second, use .groupby.
newseries = dftmp['aantal_colli'].groupby(dftmp.index).sum()
newseries = newseries.reindex(df.collonr.unique())
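Put together, the filter-then-groupby idea might look like the sketch below. The data is invented, and it assumes (as in the question's loop) that df1 is indexed by the collo number. The final map/fillna step, which writes the totals back onto df, is an addition for illustration and not part of the answer above.

```python
import pandas as pd

# Hypothetical data; df1 is indexed by collo number, as in the question's loop.
df = pd.DataFrame({"collonr": [101, 102, 103]})
df1 = pd.DataFrame(
    {
        "afleverdag": [1, 1, 2, 1],
        "ind_init_actie": ["N", "N", "N", "J"],
        "aantal_colli": [5, 7, 3, 4],
    },
    index=pd.Index([101, 101, 101, 102], name="collo_number"),
)
x1 = 1  # delivery day being processed

# Filter once, outside any loop.
dftmp = df1[(df1.afleverdag == x1) & (df1.ind_init_actie == "N")]

# One sum per collo number.
totals = dftmp["aantal_colli"].groupby(dftmp.index).sum()

# Look up each collonr; numbers without matching rows get 0, like the loop's sum().
df["aantal_colli_sum"] = df["collonr"].map(totals).fillna(0)
print(df)
```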

Python Pandas - Get Location from 2nd dataframe using 1st data

Very basic user of Pandas but I am coming against a brick wall here.
So I have one dataframe called dg, which has a column called 'user_id' and two other columns that aren't needed at the moment. I also have two more dataframes (data_conv and data_retargeting) which include the same column name and a column called 'timestamp'; however, there are multiple timestamps for each 'user_id'.
What I need is to create new columns in dg for the minimum and maximum 'timestamp' found.
I am currently able to do this through a very long-winded method using iterrows; however, for a dataframe of ~16000 rows it took 45 minutes, and I would like to cut that down because I have larger dataframes to run this on.
for index,row in dg.iterrows():
    user_id=row['pdp_id']
    n_audft=data_retargeting[data_retargeting.pdp_id == user_id].index.min()
    n_audlt=data_retargeting[data_retargeting.pdp_id == user_id].index.max()
    n_convft=data_conv[data_conv.pdp_id == user_id].index.min()
    n_convlt=data_conv[data_conv.pdp_id == user_id].index.max()
    dg.loc[index,'first_retargeting']=data_retargeting.loc[n_audft, 'raw_time']
    dg.loc[index,'last_retargeting']=data_retargeting.loc[n_audlt, 'raw_time']
    dg.loc[index,'first_conversion']=data_conv.loc[n_convft, 'raw_time']
    dg.loc[index,'last_conversion']=data_conv.loc[n_convlt, 'raw_time']
Without going into specific code: is every user_id in dg found in data_conv and data_retargeting? If so, you can merge (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.merge.html) them into a new dataframe first, then compute the max/min and extract the desired columns. I suspect that might run a little bit faster.
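One possible shape for this, sketched on invented data, aggregates first and merges afterwards, which keeps the merged frame small. The column names pdp_id and raw_time are taken from the question's loop. The sketch assumes that the earliest and latest raw_time per user are what is wanted, which matches the loop only if rows are stored in time order. data_retargeting could presumably be handled the same way.

```python
import pandas as pd

# Hypothetical data; user 3 has no conversions.
dg = pd.DataFrame({"pdp_id": [1, 2, 3]})
data_conv = pd.DataFrame({
    "pdp_id": [1, 1, 2],
    "raw_time": pd.to_datetime(["2021-01-03", "2021-01-01", "2021-02-01"]),
})

# One row per user with the earliest and latest raw_time (named aggregation).
conv_span = (
    data_conv.groupby("pdp_id")["raw_time"]
    .agg(first_conversion="min", last_conversion="max")
    .reset_index()
)

# Left merge keeps every row of dg; users without conversions get NaT.
result = dg.merge(conv_span, on="pdp_id", how="left")
print(result)
```

A left merge is used so that users missing from data_conv stay in the result instead of being dropped.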
