find common data between two dataframes on a specific date range - python

I have two dataframes df1 and df2 based, respectively, on these dictionaries:
data1 = {'date': ['5/09/22', '7/09/22', '7/09/22', '10/09/22'],
         'second_column': ['first_value', 'second_value', 'third_value', 'fourth_value'],
         'id_number': ['AA576bdk89', 'GG6jabkhd589', 'BXV6jabd589', 'BXzadzd589'],
         'fourth_column': ['first_value', 'second_value', 'third_value', 'fourth_value']}
data2 = {'date': ['5/09/22', '7/09/22', '7/09/22', '7/09/22', '7/09/22', '11/09/22'],
         'second_column': ['first_value', 'second_value', 'third_value', 'fourth_value', 'fifth_value', 'sixth_value'],
         'id_number': ['AA576bdk89', 'GG6jabkhd589', 'BXV6jabd589', 'BXV6mkjdd589', 'GGdbkz589', 'BXhshhsd589'],
         'fourth_column': ['first_value', 'second_value', 'third_value', 'fourth_value', 'fifth_value', 'sixth_value']}
I want to compare df2 with df1 in order to show the "id_number" values of df2 that are also in df1.
I also want to compare the two dataframes over the same date range.
For example, the shared date range between df1 and df2 should be from 5/09/22 to 10/09/22 (and not beyond).
How can I do this?

You can define a helper function that builds a dataframe from each dictionary and slices it to a given date range:
import pandas as pd

def format(dictionary, start, end):
    """Helper function.

    Args:
        dictionary: dictionary to format.
        start: start date (DD/MM/YY).
        end: end date (DD/MM/YY).

    Returns:
        Dataframe.
    """
    return (
        pd.DataFrame(dictionary)
        .pipe(lambda df_: df_.assign(date=pd.to_datetime(df_["date"], format="%d/%m/%y")))
        .pipe(
            lambda df_: df_.loc[
                (df_["date"] >= pd.to_datetime(start, format="%d/%m/%y"))
                & (df_["date"] <= pd.to_datetime(end, format="%d/%m/%y")),
                :,
            ]
        )
        .reset_index(drop=True)
    )
Then, with the dictionaries you provided, here is how you can show the "id_number" values of df2 that are in df1 for the desired date range:
df1 = format(data1, "05/09/22", "10/09/22")
df2 = format(data2, "05/09/22", "10/09/22")
print(df2[df2["id_number"].isin(df1["id_number"])]["id_number"])
# Output
0 AA576bdk89
1 GG6jabkhd589
2 BXV6jabd589
Name: id_number, dtype: object
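If you also need the full matching rows of df2 rather than only the id_number column, the same mask can be reused (a small follow-up sketch, assuming df1 and df2 were built with the helper above):
mask = df2["id_number"].isin(df1["id_number"])
print(df2[mask])  # every column of the rows whose id_number also appears in df1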

Related

Join Pandas DataFrames on Fuzzy/Approximate Matches for Multiple Columns

I have two Pandas DataFrames that look like this. Trying to join the two data sets on 'Name','Longitude', and 'Latitude' but using a fuzzy/approximate match. Is there a way to join these together using a combination of the 'Name' strings being, for example, at least an 80% match and the 'Latitude' and 'Longitude' columns being the nearest value or within like 0.001 of each other? I tried using pd.merge_asof but couldn't figure out how to make it work. Thank you for the help!
import pandas as pd
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
merge_asof won't work here since it can only merge on a single numeric column, such as datetimelike, integer, or float (see doc).
Here you can compute the (Euclidean) distance between the coordinates of df1 and df2 and pick up the best match:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
# Replacing 'Latitude' and 'Longitude' columns with a 'Coord' Tuple
df1['Coord'] = df1[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df1.drop(columns=['Latitude', 'Longitude'], inplace=True)
df2['Coord'] = df2[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df2.drop(columns=['Latitude', 'Longitude'], inplace=True)
# Creating a distance matrix between df1['Coord'] and df2['Coord']
distances_df1_df2 = cdist(df1['Coord'].to_list(), df2['Coord'].to_list())
# Creating df1['Price'] column from df2 and the distance matrix
for i in df1.index:
    # you can replace the following lines with a loop over distances_df1_df2[i]
    # and reject names that are too far from each other
    min_dist = np.amin(distances_df1_df2[i])
    if min_dist > 0.001:
        continue
    closest_match = np.argmin(distances_df1_df2[i])
    # df1.loc[i, 'df2_Name'] = df2.loc[closest_match, 'Name']  # keep track of the merged row
    df1.loc[i, 'Price'] = df2.loc[closest_match, 'Price']
print(df1)
Output:
Name Rating Coord Price
0 Game Time Bar 4.5 (42.3734, -71.1204) $$
1 Sports Grill 4.6 (42.3739, -71.1214) $
2 Sports Grill 2 4.3 (42.3839, -71.1315) $$
Edit: your requirement on 'Name' ("at least an 80% match") isn't something a plain merge can express. Take a look at fuzzywuzzy to get a sense of how string similarity can be measured.
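For instance, with fuzzywuzzy (or its maintained fork rapidfuzz) you can score the name similarity of the closest coordinate match and reject pairs below a threshold; a minimal sketch extending the loop above, where the 80 cut-off is just the figure from the question:
from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

for i in df1.index:
    min_dist = np.amin(distances_df1_df2[i])
    closest_match = np.argmin(distances_df1_df2[i])
    name_score = fuzz.ratio(df1.loc[i, 'Name'], df2.loc[closest_match, 'Name'])  # 0-100 similarity
    if min_dist > 0.001 or name_score < 80:
        continue
    df1.loc[i, 'Price'] = df2.loc[closest_match, 'Price']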

Summing certain columns by similar part of its name

How can I sum columns by an already-fetched list of unique partial column names?
list = ['13-14', '15-16']
DataFrame:
X.13-14 Y.13-14 Z.13-14 X.15-16 ...
id
182761 10274.00 6097173.00 5758902.00 3345841.00
I.e. I want to create '13-14' and '15-16' columns with corresponding sum of (X.13-14,Y.13-14,Z.13-14), then (X.15-16,Y.15-16,Z.15-16)
If you want to sum columns by the part of the column name after the ., use a lambda function in DataFrame.groupby with axis=1:
df1 = df.groupby(lambda x: x.split('.')[1], axis=1).sum()
print (df1)
13-14 15-16
id
182761 11866349.0 3345841.0
Or, if you only need the columns from the list:
L = ['13-14', '15-16']
df.columns = df.columns.str.extract(f'({"|".join(L)})', expand=False)
df1 = df.sum(level=0, axis=1)[L]
print (df1)
13-14 15-16
id
182761 11866349.0 3345841.0
If you need to add the result to the original:
df = df.join(df1)
print (df)
X.13-14 Y.13-14 Z.13-14 X.15-16 13-14 15-16
id
182761 10274.0 6097173.0 5758902.0 3345841.0 11866349.0 3345841.0
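Note that DataFrame.groupby(..., axis=1) and sum(level=...) are deprecated or removed in recent pandas (2.x); a roughly equivalent sketch that transposes instead, assuming all columns are numeric:
# transpose, group the index by the part after '.', sum, and transpose back
df1 = df.T.groupby(lambda x: x.split('.')[1]).sum().T
print(df1)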

pandas str contains with maximum value

I have two dataframes: one of them contains strings, and the other contains a timestamp and a string.
df2= pd.DataFrame({'Name':['Tim', 'Timothy', 'Kistian', 'Kris cole','Ian'],
'Age':['1-2-1997', '21-3-1998', '19-6-2000', '18-4-1996','12-12-2001']})
df1= pd.DataFrame({'string':['Ti', 'Kri' ,'ian' ],
'MaxDate':[None, None, None]})
I want to assign to the MaxDate column the maximum date from a str.contains(df1['string'][0]) operation on df2.
For example, df2[df2.Name.str.contains(df1['string'][0])] gives me 2 records.
I want to assign the maximum of these dates to the MaxDate entry corresponding to 'Ti',
i.e. the output after the first iteration will be:
df1 = pd.DataFrame({'string':['Ti', 'Kri', 'ian'],
                    'MaxDate':['1-2-1997', None, None]})
How can I do this for all entries of df1 using a loop?
If you need a loop solution, build a list of dictionaries holding each max and pass it to the DataFrame constructor:
df2['Age'] = pd.to_datetime(df2['Age'], dayfirst=True)
out = []
for x in df1['string']:
    m = df2.loc[df2.Name.str.contains(x), 'Age'].max()
    out.append({'string': x, 'MaxDate': m})
df = pd.DataFrame(out)
print (df)
string MaxDate
0 Ti 1998-03-21
1 Kri 1996-04-18
2 ian 2000-06-19
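If the goal is simply to fill the MaxDate column of df1 in place rather than build a new frame, the same logic fits in a list comprehension (a sketch equivalent to the loop above):
df1['MaxDate'] = [
    df2.loc[df2.Name.str.contains(x), 'Age'].max()  # latest matching date, NaT if nothing matches
    for x in df1['string']
]
print(df1)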

How to generalize a function written for a specific column in a dataframe to be usable on any similar column?

How can I adjust this code so that it is usable for any column in the dataframe? Currently it only works on the column called "Gaps", but I have 10 other columns to which I need to apply this same function.
def get_averages(df: pd.DataFrame, column: str) -> pd.DataFrame:
    '''
    Add a column in place, with the averages
    of each `Num` cyclical item for each row
    '''
    # work with a new dataframe
    df2 = (
        df[['FileName', 'Num', column]]
        .explode('Gaps', ignore_index=True)
    )
    df2.Gaps = df2.Gaps.astype(float)
    df2['tag'] = (  # add cyclic tags to each row, within each FileName
        df2.groupby('FileName')[column]
        .transform('cumcount')  # similar to range(len(group))
        % df2.Num  # get the modulo of the row number within the group
    )
    # get averages and collect into lists
    df2 = df2.groupby(['FileName', 'tag'])[column].mean()  # get average
    df2.rename(f'{column}_avgs', inplace=True)
    # collect in a list by Filename and merge with original df
    df2 = df2.groupby('FileName').agg(list)
    df = df.merge(df2, on='FileName')
    return df
df = get_averages(df, 'Gaps')
Use the parameter variable instead of hard-coding the column name:
df2 = (
    df[['FileName', 'Num', column]]
    .explode(column, ignore_index=True)
)
df2[column] = df2[column].astype(float)
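With those two lines using the parameter, the same function can be reused for the other columns as well; for example (the column names besides 'Gaps' are placeholders for your other ten columns):
for col in ['Gaps', 'Lengths', 'Widths']:  # 'Lengths' and 'Widths' are hypothetical names
    df = get_averages(df, col)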

Pyspark dataframe join based on key, group by and max

I have two parquet files, which I load with spark.read. These two dataframes share a column named key, so I join them with:
df = df.join(df2, on=['key'], how='inner')
df columns are: ["key", "Duration", "Distance"] and df2's are: ["key", "department id"]. At the end I want to print Duration, max(Distance), department id, grouped by department id. What I have done so far is:
df.join(df.groupBy('departmentid').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
but I think it is too slow. Is there a faster way to achieve my goal?
Thanks in advance.
EDIT: sample (first 2 lines of each file)
df:
369367789289,2015-03-27 18:29:39,2015-03-27 19:08:28,-73.975051879882813,40.760562896728516,-73.847900390625,40.732685089111328,34.8
369367789290,2015-03-27 18:29:40,2015-03-27 18:38:35,-73.988876342773438,40.77423095703125,-73.985160827636719,40.763439178466797,11.16
df1:
369367789289,1
369367789290,2
Each column is separated by ",". The first column in both files is my key; then I have timestamps, longitudes and latitudes. In the second file I have only the key and the department id.
To create Distance I am using a function called formater. This is how I get my distance and duration:
df = df.filter("_c3!=0 and _c4!=0 and _c5!=0 and _c6!=0")
df = df.withColumn("_c0", df["_c0"].cast(LongType()))
df = df.withColumn("_c1", df["_c1"].cast(TimestampType()))
df = df.withColumn("_c2", df["_c2"].cast(TimestampType()))
df = df.withColumn("_c3", df["_c3"].cast(DoubleType()))
df = df.withColumn("_c4", df["_c4"].cast(DoubleType()))
df = df.withColumn("_c5", df["_c5"].cast(DoubleType()))
df = df.withColumn("_c6", df["_c6"].cast(DoubleType()))
df = df.withColumn('Distance', formater(df._c3,df._c5,df._c4,df._c6))
df = df.withColumn('Duration', F.unix_timestamp(df._c2) -F.unix_timestamp(df._c1))
And then, as I showed above:
df = df.join(vendors, on=['key'], how='inner')
df.registerTempTable("taxi")
df.join(df.groupBy('vendor').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
The output must be:
Distance Duration department id
grouped by id, keeping only the row with max(Distance).
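One common pattern for keeping only the row with the maximum Distance per department is a window function instead of a self-join; a minimal sketch, assuming the joined dataframe has the Duration, Distance and departmentid columns used above:
from pyspark.sql import Window
import pyspark.sql.functions as F

# rank rows inside each department by Distance, largest first
w = Window.partitionBy('departmentid').orderBy(F.col('Distance').desc())

result = (
    df.withColumn('rn', F.row_number().over(w))  # rn == 1 marks the largest Distance in the group
      .filter(F.col('rn') == 1)
      .select('Duration', 'Distance', 'departmentid')
)
result.show()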
