Fastest way to search dataframe with conditions - python

I am looking for the most efficient way to search a large dataframe based on specific conditions. I have tried .loc, .iloc, and numpy, but all of them are too slow. The fastest thus far is with numpy, where my code looks something like this:
ParsedTimestamp = []
for index, row in df_primary.iterrows():
    d_index = list(np.where(
        (df_data['filePath'] == row['FilePath'])
        & (df_data['session id'] == row['ChannelName'])
        & (df_data['message'] == row['Text'])
        & (df_data['d_temp'] == row['MessageTimestamp'])
    )[0])[0]
    ParsedTimestamp.append(df_data.loc[d_index]['Datetime UTC'])
As you may be able to tell, I have one dataframe (df_primary) whose rows I need to match on 4 values against another dataframe (df_data) to find a more accurate timestamp. The issue is that each search for the index in df_data that matches a row in df_primary takes over 1 second, which is much too long. The df_data dataframe is about 2.5 million rows.
I am open to converting the dataframes to dictionaries or any other forms, but from my research I have been told that dictionaries are less efficient at this size. Does anybody have any suggestions?

Why don't you just merge?
ParsedTimestamp = pd.merge(
    df_data, df_primary,
    left_on=['filePath', 'session id', 'message', 'd_temp'],
    right_on=['FilePath', 'ChannelName', 'Text', 'MessageTimestamp']
)['Datetime UTC']
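If you need the result aligned to df_primary (one timestamp per row, NaT where nothing matches), here is a minimal sketch of the same idea as a left merge, using the column names from the question and assuming each df_primary row matches at most one df_data row:
matched = df_primary.merge(
    df_data[['filePath', 'session id', 'message', 'd_temp', 'Datetime UTC']],
    how='left',
    left_on=['FilePath', 'ChannelName', 'Text', 'MessageTimestamp'],
    right_on=['filePath', 'session id', 'message', 'd_temp'],
)
ParsedTimestamp = matched['Datetime UTC']  # aligned with df_primary's rows
A single merge like this avoids scanning all 2.5 million rows of df_data once per df_primary row.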

Related

Faster Way of Getting Distinct Rows

Suppose we have a PySpark dataframe with ~10M rows. Is there a faster way of getting distinct rows compared to df.distinct()? Maybe use df.groupBy()?
If you select only the columns of interest before you do the operation, it will be faster (smaller dataset). Something like:
columns_to_select = ["col1", "col2"]
df.select(columns_to_select).distinct()
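A small self-contained sketch of the idea (toy data and made-up column names, just to illustrate that distinct over the selected columns is the cheaper operation):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a", 10.0), (1, "a", 20.0), (2, "b", 30.0)],
    ["col1", "col2", "col3"],
)
# Distinct over all three columns keeps all rows; distinct over the two
# columns of interest collapses the first two rows into one.
df.select("col1", "col2").distinct().show()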

pyspark Drop rows in dataframe to only have X distinct values in one column

So I have a dataframe with a column "Category" that has over 12k distinct values; for sampling purposes, I would like to get a small sample where there are only 1000 different values of this category column.
Before I was doing:
small_distinct = df.select("category").distinct().limit(1000).rdd.flatMap(lambda x: x).collect()
df = df.where(col("category").isin(small_distinct))
I know this is extremely inefficient as I'm doing a distinct of the category column and then casting it into a normal python list so I can use isin() filter.
Is there any "spark" way of doing this? I thought maybe something with rolling/window functions could do the job, but I can't get it to work.
Thanks!
You can improve your code using a left_semi join:
small_distinct = df.select("category").distinct().limit(1000)
df = df.join(small_distinct, "category", "left_semi")
Using left_semi is a good way to filter a table using another table in an efficient way, while keeping the same schema.
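A minimal, self-contained sketch of the approach on toy data (the "category" column name is kept from the question; everything else is made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (2, "a"), (3, "b"), (4, "c")],
    ["id", "category"],
)
# Keep only rows whose category appears in a limited distinct sample,
# without collecting anything back to the driver.
small_distinct = df.select("category").distinct().limit(2)
sampled = df.join(small_distinct, "category", "left_semi")
sampled.show()  # same schema as df, at most 2 distinct categories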

Fastest way to update pandas columns based on matching column from other pandas dataframe

I have two pandas dataframes, and one has updated values for a subset of the values in the primary dataframe. The main one is ~2m rows and the dataframe with updated values is ~20k rows. This operation is running extremely slowly as I have it below, which is O(m*n) as far as I can tell. Is there a good way to vectorize it or otherwise increase the speed? I don't see what other optimizations could apply to this case. I have also tried making the 'object_id' column the index, but that didn't lead to a meaningful increase in speed.
# df_primary: this is 2m rows
# df_updated: this is 20k rows
for idx, row in df_updated.iterrows():
    df_primary.loc[df_primary.object_id == row.object_id, ['status', 'category']] = [row.status, row.category]
Let's try DataFrame.update to update df_primary in place using values from df_updated:
df_primary = df_primary.set_index('object_id')
df_primary.update(df_updated.set_index('object_id')[['status', 'category']])
df_primary = df_primary.reset_index()
Use join/merge methods based on your requirements, like left/right/inner joins. It will be much faster than iterating row by row.
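For example, a merge-based sketch of the same update (column names taken from the question; the suffix handling shown here is just one way to prefer the updated values where they exist):
merged = df_primary.merge(
    df_updated[['object_id', 'status', 'category']],
    on='object_id', how='left', suffixes=('', '_new'),
)
for col in ['status', 'category']:
    # Take the updated value where the merge found a match, else keep the original.
    merged[col] = merged[col + '_new'].fillna(merged[col])
df_primary = merged.drop(columns=['status_new', 'category_new'])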

How to find average of values in columns within iterrows in python

I have a dataframe with 100+ columns where all columns after col10 are of type float. What I would like to do is find, for each row, the average of a certain range of columns within a loop. Here is what I tried so far:
for index, row in df.iterrows():
    a = row.iloc[col30:col35].mean(axis=0)
This unfortunately returns unexpected values, and I'm not able to get the average of col30, col31, col32, col33, col34, col35 for every row. Can someone please help?
Try:
df.iloc[:, 30:35].mean(axis=1)
You may need to adjust 30:35 to 29:35 (you can remove the .mean and play around to get an idea of how .iloc works). Generally in pandas you want to avoid loops as much as possible. The .iloc indexer lets you select rows and columns by positional index. Then you can use .mean() with axis=1 to average across the columns for each row.
You really should include a small reproducible example; see below, where the solution mentioned in the comments works.
import pandas as pd
df = pd.DataFrame({i: val for i, val in enumerate(range(100))}, index=list(range(100)))
for i, row in df.iterrows():
    a = row.iloc[29:35].mean()  # a should be 31.5 for each row
    print(a)
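For reference, the vectorized equivalent of the loop above, under the same positional assumption (columns 29 through 34); 'range_mean' is just an illustrative name:
df['range_mean'] = df.iloc[:, 29:35].mean(axis=1)  # 31.5 for every row in this toy frame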

Comparing Pandas Dataframe Rows & Dropping rows with overlapping dates

I have a dataframe filled with trades taken from a trading strategy. The logic in the trading strategy needs to be updated to ensure that a trade isn't taken if the strategy is already in a trade, but that's a different problem. The trade data for many previous trades is read into a dataframe from a csv file.
Here's my problem for the data I have:
I need to do a row-by-row comparison of the dataframe to determine whether the EntryDate of row X is less than the ExitDate of row X-1.
A sample of my data:
       EntryDate    ExitDate
Row 1  2012-07-25   2012-07-27
Row 2  2012-07-26   2012-07-29
Row 2 needs to be deleted because it is a trade that should not have occurred.
I'm having trouble identifying which rows are duplicates and then dropping them. I tried the approach in answer 3 of this question with some luck but it isn't ideal because I have to manually iterate through the dataframe and read each row's data. My current approach is below and is ugly as can be. I check the dates, and then add them to a new dataframe. Additionally, this approach gives me multiple duplicates in the final dataframe.
for i in range(0, len(df)):
    if i + 1 == len(df): break  # to keep from going past the last row
    EntryDate = df['EntryDate'].iloc[i]
    ExitDate = df['ExitDate'].iloc[i]
    EntryNextTrade = df['EntryDate'].iloc[i + 1]
    if EntryNextTrade > ExitDate:
        line = {'EntryDate': EntryDate, 'ExitDate': ExitDate}
        df_trades = df_trades.append(line, ignore_index=True)
Any thoughts or ideas on how to more efficiently accomplish this?
You can click here to see a sampling of my data if you want to try to reproduce my actual dataframe.
You should use a boolean mask for this kind of operation.
One way is to create a dummy column holding the previous trade's exit date:
df['PrevExitDate'] = df['ExitDate'].shift()
Use it to build the mask of overlapping trades (rows whose EntryDate falls before the previous row's ExitDate):
msk = df['EntryDate'] < df['PrevExitDate']
And use loc with the inverted mask to look at the sub-DataFrame of non-overlapping trades, and only the specified columns:
df.loc[~msk, ['EntryDate', 'ExitDate']]
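As a quick sanity check, here is the same mask applied to the two sample rows from the question; the second, overlapping trade is the one that gets dropped:
import pandas as pd

df = pd.DataFrame({
    'EntryDate': pd.to_datetime(['2012-07-25', '2012-07-26']),
    'ExitDate': pd.to_datetime(['2012-07-27', '2012-07-29']),
})
df['PrevExitDate'] = df['ExitDate'].shift()
msk = df['EntryDate'] < df['PrevExitDate']
print(df.loc[~msk, ['EntryDate', 'ExitDate']])  # only the 2012-07-25 trade remains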
