I have two dataframes as shown below. I have already referred to the posts here, here, here and here, so please don't mark this as a duplicate.
id,id2,app_date
1,'A',20/3/2017
1,'A',28/8/2017
3,'B',18/10/2017
4,'C',15/2/2017
tf = pd.read_clipboard(sep=',')
tf['app_date'] = pd.to_datetime(tf['app_date'],dayfirst=True)
id,valid_from,valid_to,s_flag
1,20/1/2017,30/4/2017,0
1,28/11/2017,15/2/2018,1
1,18/12/2017,24/2/2018,0
2,15/7/2017,15/11/2017,1
2,2/2/2017,2/6/2017,0
2,11/5/2016,11/6/2016,1
df = pd.read_clipboard(sep=',')
df['valid_from'] = pd.to_datetime(df['valid_from'],dayfirst=True)
df['valid_to'] = pd.to_datetime(df['valid_to'],dayfirst=True)
I would like to do the following:
a) check whether tf['app_date'] falls between df['valid_from'] and df['valid_to'] for the matching id
b) if yes, copy the s_flag column to the tf dataframe for the matching id
I tried the below, but am not sure whether it is efficient for dataframes with over a million records:
t1 = tf.merge(df, how = 'left',on=['id'])
t1 = t1.loc[(t1.app_date >= t1.valid_from) & (t1.app_date <= t1.valid_to),['id','s_flag','app_date']]
tf.merge(t1, how = 'inner',on=['id','app_date'])
While the above works on the sample data, in the real data I encounter issues for some records: for example, an approval date of 9/1/2017 that doesn't meet the condition for the 2nd and 3rd rows is still returned in the output, which is incorrect.
I expect my output to be as shown below:
id app_date s_flag
0 1 2017-03-20 0.0
2 3 2017-10-18 NaN
3 4 2017-02-15 NaN
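For reference, here is a minimal vectorized sketch of (a) and (b), assuming the tf and df frames built above; it is not from the original post, keeps one row per (id, app_date), and leaves NaN where no range matches:
m = tf.merge(df, on='id', how='left')
in_range = m['app_date'].between(m['valid_from'], m['valid_to'])
m['s_flag'] = m['s_flag'].where(in_range)  # NaN wherever the date falls outside the range
out = (m.groupby(['id', 'app_date'], as_index=False)['s_flag']
         .max())  # back to one row per tf row; if several ranges match, the larger flag wins
Since this is a single merge plus one grouped aggregation, it should scale much better than row-wise checks on millions of records.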
I have 2 csv files with some random numbers, as follows:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are present in csv2, but I don't know exactly which ones, and I would like to extract those unique values. So my idea was to start by merging the 2 dataframes together and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything is fine and the length of df is the sum of the two lengths, but this is where I got stuck.
The concat did its job and concatenated based on the index, so now I have tons of NaN.
When I use the code:
final_result = df.drop_duplicates(keep=False)
The data frame does not drop any values, because the df structure now looks like this:
IdentityNumber    ID
5555              NaN
So I guess drop_duplicates is looking for exactly identical rows, and since none exist, it just keeps everything.
What I would like to do is loop over both dataframes, and if a value in csv1 exists in csv2, drop it.
Can anyone help with this, please?
And if you need more info, just let me know.
UPDATE:
I think I found the reason why it's not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is reading the ids as NaN.
In fact, I tried adding a new column to csv2 and passing the ids from csv1 as its values, and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then affects everything else.
Can anyone help me understand how to solve this issue?
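A likely explanation, judging from the sample file shown, is the trailing comma at the end of each data line: pandas sees two fields per row but only one header name (id), so it uses the numbers as the index and fills the id column with the empty trailing field, which becomes NaN. read_csv's index_col=False option is documented for exactly this kind of file with a delimiter at the end of each line; a minimal sketch:
import pandas as pd

# index_col=False stops pandas from treating the first field as the index,
# so the numbers land in the 'id' column; dtype=str preserves leading zeros like 007559
unique_users = pd.read_csv('./csv1.csv', index_col=False, dtype=str)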
You can achieve this using df.merge():
import pandas as pd

# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# approach 1: merge to find the values common to both, then filter them out with isin()
unique_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(unique_vals)]
# approach 2: a single left merge; rows with no match in df2 have NaN in col_b
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will remove the duplicates between your two dataframes and keep all the records in one dataframe df.
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because when you concatenate, Pandas doesn't know what you want to do with the different column names of your two dataframes. One of your dataframes has an IdentityNumber column and the other has an ID column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["ID"]]).drop_duplicates().reset_index(drop=True)
The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.
I'm trying to make a histogram showing the amount of books read per user in this dataset. In other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
# Sort by User
ratings2 = ratings2.sort_values(by=['User-ID'])
usersList = []
booksRead = []
for i in range(2000):
    numBooksRead = ratings2.isin([i]).sum()['User-ID']
    if numBooksRead != 0:
        usersList.append(i)
        booksRead.append(numBooksRead)
new_dict = {'User_ID':usersList,'booksRated':booksRead}
usersBooks = pd.DataFrame(new_dict)
usersBooks
The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values. So if it took me 5 minutes to sort through only the first 2000 numbers, I don't think it's feasible to go through all of the data.
I should also admit that I'm sure there's a better way to build a DataFrame than creating two lists, turning them into a dict, and then making that a DataFrame.
Mostly I just want to know how to go through all this data in a way that won't take all day.
Thanks in advance!!
It seems you want a list of user IDs with a count of how often each ID appears in the dataframe. Use value_counts() for that:
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
In [74]: ratings2['User-ID'].value_counts()
Out[74]:
11676 6836
98391 4650
153662 1630
189835 1524
23902 1123
...
258717 1
242214 1
55947 1
256110 1
252621 1
Name: User-ID, Length: 21553, dtype: int64
The result is a Series with the User-ID as the index, and the value is the number of books read (or rather, the number of books rated by that user).
Note: be aware that the result is heavily skewed: there are a few very active readers, but most will have rated very few books. As a result, your histogram will likely just show one bin.
Taking the log (or plotting with the x-axis on a log scale) may show a clearer histogram, with s being the value_counts() Series from above:
s = ratings2['User-ID'].value_counts()
np.log(s).hist()
First filter the Book-Rating column to remove 0 values, then count the values with Series.value_counts and convert to a DataFrame; a loop is not necessary here:
ratings = pd.read_csv('ratings_train.csv')
ratings2 = ratings[ratings['Book-Rating'] != 0]
usersBooks = (ratings2['User-ID'].value_counts()
                  .sort_index()
                  .rename_axis('User_ID')
                  .reset_index(name='booksRated'))
print (usersBooks)
User_ID booksRated
0 8 6
1 17 4
2 44 1
3 53 3
4 69 2
... ...
21548 278773 3
21549 278782 2
21550 278843 17
21551 278851 10
21552 278854 4
[21553 rows x 2 columns]
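To get the histogram the question originally asked for, a small sketch using the usersBooks frame above (log-spaced bins, since the counts are heavily skewed):
import numpy as np
import matplotlib.pyplot as plt

# bins from 1 to 10,000 on a log scale; the most active reader has ~6,800 ratings
ax = usersBooks['booksRated'].plot.hist(bins=np.logspace(0, 4, 40))
ax.set_xscale('log')
ax.set_xlabel('books rated per user')
plt.show()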
I have patient data structured as follows. Basically, it's a list of patients who received a pair of blood tests (A, B) on a certain date, BUT they could have been given one of those blood tests separately on another date (and many of them were), and those records are all mixed in together, so the data I have look like:
test_date patient# test_type result
20191001 1 A 77
20191001 2 A 34
20191001 2 B 66
... ... ... ...
20191011 15 A 111
20191011 15 B 222
20191011 1 A 32
20191011 1 B 99
I have been working in Python (pandas, numpy) to clean the data up to this point, and now I'm trying to remove the non-duplicate patient# records by date (i.e. remove rows for patients who only received one test on a given date), because I want to compare the test results (A, B) ONLY for patients who received BOTH tests on the SAME date.
The big caveat here is that, for example, patient #1 could've received ONLY Test A on 2019-10-01 but did receive BOTH Tests A & B on 2019-10-02 and/or some other dates (1 patient could've received both tests on multiple dates). So in that example I would want to discard patient #1's 2019-10-01 test record but preserve the 2019-10-02 one (and any later pairs).
Ideally, my finalized data would look something like this:
(screenshot: cleaned data)
I have tried using duplicated() and drop_duplicates() on patient numbers to filter out data but that doesn't work in this situation since all patients have received both tests on at least one given date.
This can be done using two group-bys and a merge. The comments in the code should explain what is being done (the code below assumes PATIENT, DATE, and TEST as the column names).
# get count of tests for each patient-date combination
grp_df = df.groupby(['PATIENT','DATE'], as_index=False)\
           .agg({'TEST':'count'})\
           .rename(columns = {'TEST':'TEST_CT'})\
           .sort_values(['PATIENT','DATE'])

# filter to the days when patients got both tests only,
# then get the latest such date for each patient
filt_df = grp_df[grp_df['TEST_CT'] == 2]\
          .groupby(['PATIENT'], as_index=False)\
          .agg({'DATE':'max'})

# filter the original data to only the selected patient-date combinations
op_df = pd.merge(df, filt_df, on = ['PATIENT','DATE'])
op_df
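If you instead want to keep every date on which a patient had both tests (rather than only the latest one), a small sketch using the same assumed column names:
# count distinct test types per patient-date without collapsing the rows
test_ct = df.groupby(['PATIENT', 'DATE'])['TEST'].transform('nunique')
# keep only the rows belonging to patient-date pairs that include both tests
both_tests_df = df[test_ct == 2]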
I am trying to assign values from a column in df2['values'] to a column df1['values']. However values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should avoid using for-loops when working with dataframes. List comprehensions are great, but since I do not have much experience yet, I struggle to formulate more complicated code.
How can I approach this problem more efficiently? What key aspects should I keep in mind when iterating over dataframes with conditions?
The code above also tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
date category
0 2015-01-07 f2
1 2015-01-26 f2
2 2015-01-26 f2
3 2015-04-08 f2
4 2015-04-10 f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2's other two columns:
df2[['values','category']].head()
values category
0 01 f1
1 02 f1
2 2.1 f1
3 2.2 f1
4 03 f1
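For reference, since each df2.date_range is a DatetimeIndex, a small sketch of a vectorized alternative (assuming the column names shown above and that df1['date'] is already datetime) is to flatten the ranges into one row per day and do a single left merge:
# one row per (category, day) covered by a range; overlapping ranges would
# duplicate df1 rows, hence the drop_duplicates
lookup = (df2.assign(date=df2['date_range'].apply(list))
             .explode('date')[['category', 'date', 'values']]
             .drop_duplicates(['category', 'date']))
lookup['date'] = pd.to_datetime(lookup['date'])  # explode leaves object dtype behind

# a single left merge replaces the nested loops; unmatched rows get NaN
# (if df1 already has a 'values' column, drop it first or pass suffixes)
df1 = df1.merge(lookup, on=['category', 'date'], how='left')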
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import pandas as pd
import numpy as np
df3 = df1.merge(df2, on = "category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns (per OP's comment), we instead use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we go back to df1 and merge on the common dates while keeping all the values from df1. This creates NaN in the subset values wherever they didn't match df1 in the earlier merge.
We then set df1["values"] wherever the merged entries are not NaN, and leave the rest as is.
common_dates = df1.merge(subset, on = "date", how= "left") # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
common_dates["values_y"], df1["values"])
N.B.: If more than one df1["date"] matches a date range, you'll have to drop some values; otherwise the duplicated rows will break the alignment above.
You could accomplish the first point:
1. df2['category'] is equal to df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1['date'] in the merged dataframe that are not covered by df2['date_range']. Unfortunately, I would need more information about the contents of df1['date'] and df2['date_range'] to write code here that does exactly that.
I have a list of person IDs, and for each ID, I want to extract all available information from two different dataframes. In addition, the types of information also have IDs, and I only want specific information IDs for each person ID. Here's how I'm currently doing this:
new_table = []
for i in range(ranges):
    slice = pd.concat([df1[sp.logical_and(df1.person.values == persons[i],
                                          df1['info_id'].isin(info_ids))],
                       df2[sp.logical_and(df2.person.values == persons[i],
                                          df2['info_id'].isin(info_ids))]], ignore_index=True)
    if len(list(set(slice['info_id']))) < amount_of_info_needed:
        continue
    else:
        full_time_range = max(slice['age_days']) - min(slice['age_days'])
        if full_time_range <= 1460:
            new_table.append(slice)
        else:
            window_end = min(slice['age_days']) + 1460
            slice = slice[slice.age_days < window_end+1]
            if len(list(set(slice['info_id']))) < amount_of_info_needed:
                continue
            else:
                new_table.append(slice)

#return new_table
new_table = pd.concat(new_table, axis=0)
new_table = new_table.groupby(['person', 'info_id']).agg(np.mean).reset_index()
new_table.to_sql('person_info_within4yrs', engine, if_exists='append', index=False,
                 dtype={'person': types.NVARCHAR(32), 'value': types.NVARCHAR(4000)})
I read about not using pd.concat in a loop because of its quadratic time, but when I tried converting the dataframes to arrays and slicing and concatenating those instead, it went even slower than pd.concat. After profiling each line with %lprun, all of the time is consumed by the pd.concat/logical_and operation in the loop. This code is also faster than using .loc on both dataframes and concatenating the two slices together. After the if-else blocks, I append to a list and, at the end, turn the list into a dataframe.
Edit: Here is an example of what I'm doing. The goal is to slice from both dataframes by person_id and info_id, combine the slices, and append the combined slice to a list, which I will then turn back into a dataframe and export to a SQL table. The if-else blocks are relevant too, but from my profiling they take barely any time at all so I'm not going to describe them in detail.
df1.head()
person info_id value age_days
0 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28801
1 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28803
2 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28804
3 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28805
4 000012eae6ea403ca564e87b8d44d0bb 0 100.0 28806
df2.head()
person info_id value age_days
0 00000554787a3cb38131c3c38578cacf 4v 97.0 12726
1 00000554787a3cb38131c3c38578cacf 14v 180.3 12726
2 00000554787a3cb38131c3c38578cacf 9v 2.0 12726
3 00000554787a3cb38131c3c38578cacf 3v 20.0 12726
4 00000554787a3cb38131c3c38578cacf 0v 71.0 12726
I took Parfait's advice and first concatenated both dataframes into one, then a coworker gave me a solution to iterate through the dataframe. The dataframe consisted of ~117M rows with ~246K person IDs. My coworker's solution was to create a dictionary where each key is a person ID, and the value for each key is a list of row indices for that person ID in the dataframe. You then use .iloc to slice the dataframe by referencing the values in the dictionary. Finished running in about one hour.
idx = df1['person'].reset_index().groupby('person')['index'].apply(list).to_dict()
row_indices = list(idx.values())
for i in range(ranges):
    mrn_slice = df1.iloc[row_indices[i]]
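For reference, a small sketch of the same per-person iteration using a single concat and groupby, assuming df1, df2, persons, and info_ids as described in the question:
# concatenate once, filter once, then let groupby hand out the per-person slices
combined = pd.concat([df1, df2], ignore_index=True)
combined = combined[combined['person'].isin(persons) & combined['info_id'].isin(info_ids)]
for person, person_slice in combined.groupby('person'):
    # person_slice plays the role of `slice` in the original loop;
    # the same info_id / age_days window checks can be applied here
    pass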