Faster way to index pandas dataframe multiple times - python

For every row in df_a, I am looking to find rows in df_b where the ids match and the df_a row's location falls between the df_b row's start and end locations.
df_a looks like:
| Name | id | location |
|------|----|----------|
| a    | 1  | 202013   |
df_b looks like:
| Name | id | location_start | location_end |
|------|----|----------------|--------------|
| x    | 1  | 202010         | 2020199      |
Unfortunately, df_a and df_b are both nearly a million rows, and this code takes about 10 hours to run on my local machine. Currently I'm running the following:
for index, row in df_a.iterrows():
    matched = df_b[(df_b['location_start'] < row['location'])
                   & (df_b['location_end'] > row['location'])
                   & (df_b['id'] == row['id'])]
Is there any obvious way to speed this up?

You can do this:
Consider my sample dataframes below:
In [90]: df_a = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location':[202013, 102013]})
In [91]: df_b = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location_start':[202010, 1020199],'location_end':[2020199, 1020299] })
In [92]: df_a
Out[92]:
Name id location
0 a 1 202013
1 b 2 102013
In [93]: df_b
Out[93]:
Name id location_start location_end
0 a 1 202010 2020199
1 b 2 1020199 1020299
In [95]: d = pd.merge(df_a, df_b, on='id')
In [106]: indexes = d[d['location'].between(d['location_start'], d['location_end'])].index.tolist()
In [107]: df_b.iloc[indexes, :]
Out[107]:
Name id location_start location_end
0 a 1 202010 2020199
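For a fully self-contained variant, the sketch below (not the answerer's exact code) carries df_b's original index through the merge instead of relying on the merged frame's positional index, which makes mapping the matches back to df_b explicit on large frames:

import pandas as pd

# The sample frames from the answer above.
df_a = pd.DataFrame({'Name': ['a', 'b'], 'id': [1, 2],
                     'location': [202013, 102013]})
df_b = pd.DataFrame({'Name': ['a', 'b'], 'id': [1, 2],
                     'location_start': [202010, 1020199],
                     'location_end': [2020199, 1020299]})

# Keep df_b's row labels so each match can be traced back to df_b directly.
d = df_a.merge(df_b.reset_index(), on='id', suffixes=('_a', '_b'))

# Rows of df_b whose [location_start, location_end] interval contains a matching df_a location.
mask = d['location'].between(d['location_start'], d['location_end'])
matched = df_b.loc[d.loc[mask, 'index'].unique()]
print(matched)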

Related

How to aggregate in pandas with some conditions?

I want to aggregate my data in this way:
df.groupby('date').agg({'user_id': 'nunique',
                        'user_id': 'nunique' ONLY WHERE purchase_flag == 1})  # pseudocode
date     | user_id | purchase_flag
4-1-2020 | 1       | 1
4-1-2020 | 1       | 1   (purchased a second time, but still the same unique user that day)
4-1-2020 | 2       | 0
In this case I want the output to look like:
date     | total_users | total_users_who_purchased
4-1-2020 | 2           | 1
How can I best achieve this?
Try this: create a helper column in your dataframe to mark the users who purchased, then group by date and aggregate on that helper column:
df["user_id_purchased"] = df["user_id"].where(df["purchase_flag"].astype(bool))
df_output = df.groupby("date", as_index=False).agg(
    total_users=("user_id", "nunique"),
    total_users_who_purchased=("user_id_purchased", "nunique"),
)
Output:
date total_users total_users_who_purchased
0 4-1-2020 2 1
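For reference, a self-contained version of the snippet above; the construction of the sample frame from the question's table is assumed:

import pandas as pd

# Sample data matching the question's table (construction assumed).
df = pd.DataFrame({
    'date': ['4-1-2020', '4-1-2020', '4-1-2020'],
    'user_id': [1, 1, 2],
    'purchase_flag': [1, 1, 0],
})

# user_id where the flag is set, NaN otherwise; nunique() ignores NaN.
df['user_id_purchased'] = df['user_id'].where(df['purchase_flag'].astype(bool))

df_output = df.groupby('date', as_index=False).agg(
    total_users=('user_id', 'nunique'),
    total_users_who_purchased=('user_id_purchased', 'nunique'),
)
print(df_output)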
I think that one way to achieve this goal is using .loc:
df.loc[df["purchase_flag"] == 1].user_id.nunique()
Implementation to get your output:
details = {'date': ['4-1-2020'],
           'total_users': df.user_id.nunique(),
           'total_users_who_purchased':
               df.loc[df["purchase_flag"] == 1].user_id.nunique()}
df2 = pd.DataFrame(details)
df2

Find out if values in dataframe are between values in other dataframe

I'm new to pandas and I'm trying to understand if there is a method to find out whether two values from one row in df2 fall between two values from one row in df1.
Basically my df1 looks like this:
start | value | end
1 | TEST | 5
2 | TEST | 3
...
and my df2 looks like this:
start | value | end
2 | TEST2 | 10
3 | TEST2 | 4
...
Right now I've got it working with two loops:
for row in df1.iterrows():
    for row2 in df2.iterrows():
        if row2[1]["start"] >= row[1]["start"] and row2[1]["end"] <= row[1]["end"]:
            print(row2)
but this doesn't feel like it's the pandas way to me.
What I'm expecting is that row number 2 from df2 gets printed, because 3 > 1 and 4 < 5, i.e.:
3 | TEST2 | 4
Is there a more pandas-like way of doing this?
You could use a cross merge to get all combinations of df1 and df2 rows, and filter using classical comparisons. Finally, get the indices and slice:
idx = (df1.merge(df2.reset_index(), suffixes=('1', '2'), how='cross')
          .query('(start2 > start1) & (end2 < end1)')
          ['index'].unique()
       )
df2.loc[idx]
NB: I am using unique here to ensure that a row is selected only once, even if there are several matches.
output:
start value end
1 3 TEST2 4
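For completeness, a self-contained run of the cross merge on the question's sample data (frame construction assumed); note that how='cross' requires pandas 1.2 or newer:

import pandas as pd

# Sample frames as described in the question (construction assumed).
df1 = pd.DataFrame({'start': [1, 2], 'value': ['TEST', 'TEST'], 'end': [5, 3]})
df2 = pd.DataFrame({'start': [2, 3], 'value': ['TEST2', 'TEST2'], 'end': [10, 4]})

# reset_index keeps df2's row labels so matches can be sliced back out of df2.
idx = (df1.merge(df2.reset_index(), suffixes=('1', '2'), how='cross')
          .query('(start2 > start1) & (end2 < end1)')
          ['index'].unique())
print(df2.loc[idx])   # prints the row: start 3, value TEST2, end 4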

Find maximum in Dataframe based on variable values

I have a dataframe of the form:
A | B | C | D
a | x | r | 1
a | x | s | 2
a | y | r | 1
b | w | t | 4
b | z | v | 2
I'd like to be able to return something like this (showing unique values and frequency):
A | freq of most common value in column B | maximum of column D based on the most common value in column B | most common value in column B
a | 2                                     | 2                                                               | x
b | 1                                     | 4                                                               | w
At the moment I can calculate everything except the "maximum of column D" column of the result dataframe fairly quickly via
df = (df.groupby('A', sort=False)['B']
        .apply(lambda x: x.value_counts().head(1))
        .reset_index())
but to calculate the "maximum of column D based on the most common value in column B" column I have written a for loop, which is slow for a lot of data.
Is there a fast way?
The question is linked to: Count values in dataframe based on entry
Use merge, then get the rows with the maximum D per group via DataFrameGroupBy.idxmax:
df1 = (df.groupby('A', sort=False)['B']
         .apply(lambda x: x.value_counts().head(1))
         .reset_index()
         .rename(columns={'level_1':'E'}))
#print (df1)
df = df1.merge(df, left_on=['A','E'], right_on=['A','B'], suffixes=('','_'))
df = df.loc[df.groupby('A')['D'].idxmax(), ['A','B','D','E']]
print (df)
A B D E
1 a 2 2 x
2 b 1 4 w
Consider doing this in 3 steps:
1. Find the most common B (as in your code):
df2 = (df.groupby('A', sort=False)['B']).apply(lambda x: x.value_counts().head(1)).reset_index()
2. Build a DataFrame with the max D for each combination of A and B:
df3 = df.groupby(['A','B']).agg({'D': max}).reset_index()
3. Merge the two DataFrames to find the max Ds matching the A-B pairs selected earlier:
df2.merge(df3, left_on=['A','level_1'], right_on=['A','B'])
The column D in the resulting DataFrame will be what you need:
A level_1 B_x B_y D
0 a x 2 x 2
1 b w 1 w 4
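Both answers above rely on the level_1 column produced by reset_index(), whose name can differ across pandas versions. Below is a version-agnostic variation of the same idea (not the answerers' exact code; result column names such as freq_of_top_B are made up for illustration, and the sample frame construction is assumed):

import pandas as pd

# Question's sample data (construction assumed).
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b'],
                   'B': ['x', 'x', 'y', 'w', 'z'],
                   'C': ['r', 's', 'r', 't', 'v'],
                   'D': [1, 2, 1, 4, 2]})

# Most common B per A, and how often it occurs.
top_b = df.groupby('A', sort=False)['B'].agg(lambda s: s.value_counts().idxmax())
freq = df.groupby('A', sort=False)['B'].agg(lambda s: s.value_counts().max())

# Max D per (A, B) pair, then look up the pair (A, most common B).
max_d = df.groupby(['A', 'B'])['D'].max()

result = top_b.rename('top_B').to_frame()
result['freq_of_top_B'] = freq
result['max_D_for_top_B'] = [max_d[(a, b)] for a, b in top_b.items()]
print(result.reset_index())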

Matrix Multiplication of 2 Pandas DF's

I have two pandas dataframes with shapes (156915, 22) and (22, 2). DF1, of shape (156915, 22), has column names that match the values in DF2's first column. I want to do matrix multiplication where DF1.columns = DF2['col1']. Here's a quick view of what the dfs may look like. I would like to return a pandas dataframe of the same shape as DF1. Thank you in advance!
DF1:
A | B | C
1 | 15 | 8
5 | 3 | 2
DF2:
col1 | col2
A | 5
B | 1
C | 0
One method:
df3 = df2.set_index('col1')
df1[df3.index].apply(lambda x: x * df3['col2'], axis=1)
If your columns in DF1 are in the same order as your col1 in DF2, then using np.dot should work:
np.dot(DF1, DF2['col2'])
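Note that np.dot collapses the result to a single vector of length 156915; if the goal is a result with the same shape as DF1 (each DF1 column scaled by its matching col2 weight, as the first method does), DataFrame.mul aligned on the columns is a vectorized alternative. A minimal sketch on the question's sample data (construction assumed):

import pandas as pd

# Sample frames from the question (construction assumed).
df1 = pd.DataFrame({'A': [1, 5], 'B': [15, 3], 'C': [8, 2]})
df2 = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [5, 1, 0]})

# Align df2's weights on df1's columns and scale each column in one shot;
# the result keeps df1's shape.
weights = df2.set_index('col1')['col2']
scaled = df1.mul(weights, axis=1)
print(scaled)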

Python data subset - select value ... from DF1 ... where value exists in DF2

I am trying to create a dataframe partly by seeing if values exist in another dataframe. Here is the SQL version of what I am trying to do:
SELECT *
FROM DF1
WHERE
Patient_alive='still_alive'
AND Patient_ID in (SELECT Pat_ID from DF2)
Here is the code I am struggling with; the last line is what I can't figure out. I have two versions of pseudocode concerning PT_ID:
DF3 = DF1[
    (DF1['Patient_alive'].str.contains('still_alive', case=False)) &
    #(DF1['PT_ID'].isin(DF2))
    (DF1['PT_ID'].contains(DF2, case=False))
]
Update1:
Input Data of df1:
Patient_ID | Patient_Alive | Patient_Name
12345      | StillAlive    | Knowles, Archibald
23456      | NotAlive      | Hauzer, Bruno
911235     | StillAlive    | Samarkand, Samsonite VII
Input Data of df2:
PT_ID
12345
22222
55555
99999
Df3 desired output:
Patient_ID | Patient_Alive | Patient_Name
12345 | StillAlive | Knowles, Archibald
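A minimal sketch of the isin() approach from the commented-out pseudocode line, using the Update1 sample data (frame construction is assumed, and the df1 column names follow the Update rather than the pseudocode):

import pandas as pd

# Sample data from Update1 (construction assumed).
df1 = pd.DataFrame({
    'Patient_ID': [12345, 23456, 911235],
    'Patient_Alive': ['StillAlive', 'NotAlive', 'StillAlive'],
    'Patient_Name': ['Knowles, Archibald', 'Hauzer, Bruno', 'Samarkand, Samsonite VII'],
})
df2 = pd.DataFrame({'PT_ID': [12345, 22222, 55555, 99999]})

# isin() should be given the PT_ID column, not the whole df2 frame.
df3 = df1[
    df1['Patient_Alive'].str.contains('StillAlive', case=False)
    & df1['Patient_ID'].isin(df2['PT_ID'])
]
print(df3)   # only the 12345 / StillAlive / Knowles, Archibald row remains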
