Problems with pd.merge - python

Hope you all are having an excellent week.
So, I was finishing a script that worked really well for a specific use case. The base is as follows:
Function cosine_similarity_join:
def cosine_similarity_join(a: pd.DataFrame, b: pd.DataFrame, col_name):
    a_len = len(a[col_name])
    # all of the "documents" in a 1D array
    corpus = np.concatenate([a[col_name].to_numpy(), b[col_name].to_numpy()])
    # vectorize the array
    tfidf, vectorizer = fit_vectorizer(corpus, 3)
    # in this matrix each row represents a str from a, each col a str from b, and the value is their cosine similarity
    res = cosine_similarity(tfidf[:a_len], tfidf[a_len:])
    res_series = pd.DataFrame(res).stack().rename("score")
    res_series.index.set_names(['a', 'b'], inplace=True)
    # join scores to b
    b_scored = pd.merge(left=b, right=res_series, left_index=True, right_on='b').droplevel('b')
    # find the indices on which to match (highest score in each row)
    best_match = np.argmax(res, axis=1)
    # join the remaining columns from a
    res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))
    print(res)
    df = res.reset_index()
    df = df.iloc[df.groupby(by="RefCol")["score"].idxmax()].reset_index(drop=True)
    return df
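For context, fit_vectorizer is not shown in the question; a minimal sketch of what such a helper might look like, assuming a character n-gram TF-IDF vectorizer (a common choice for fuzzy string joins; the analyzer setting here is my assumption, not from the post):
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_vectorizer(corpus, n):
    # hypothetical helper: character n-grams are robust for fuzzy string matching
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(n, n))
    tfidf = vectorizer.fit_transform(corpus)
    return tfidf, vectorizer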
This works like a charm when I do something like:
resulting_df = cosine_similarity_join(df1,df2,'My_col')
But in my case, I need something along the lines of:
big_df = pd.read_csv('some_really_big_df.csv')
some_other_df = pd.read_csv('some_other_small_df.csv')
counter = 0
size = 10000
total_size = len(big_df)
while counter <= total_size:
    small_df = big_df[counter:counter+size]
    resulting_df = cosine_similarity_join(small_df, some_other_df, 'My_col')
    counter += size
I have already traced the problem to one specific line in the function:
res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))
Basically, this res dataframe comes out empty and I just cannot understand why (when I replicate the values outside of the loop it works just fine)...
I have looked at the problem for hours now and would gladly accept some new light on the question.
Thank you all in advance!

Found the problem!
I just needed to reset the indexes before joining - once I create a new small df from the big df, the indexes stay as they were in the big one, which breaks the join with the other df!
So basically all I needed to do was:
while counter <= total_size:
    small_df = big_df[counter:counter+size]
    small_df = small_df.reset_index()
    resulting_df = cosine_similarity_join(small_df, some_other_df, 'My_col')
    counter += size
I'll leave it here in case it helps someone :)
Cheers!
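As a minimal illustration of why the slice breaks the index-on-index merge (hypothetical data; only the index labels matter here):
import pandas as pd

big = pd.DataFrame({'My_col': list('abcdefgh')})
chunk = big[4:8]                                        # slicing keeps labels 4..7 from big
scores = pd.DataFrame({'score': [0.1, 0.2, 0.3, 0.4]})  # fresh frame with labels 0..3

# no common labels -> empty result
print(pd.merge(chunk, scores, left_index=True, right_index=True))
# after resetting the index the labels line up again -> 4 rows
print(pd.merge(chunk.reset_index(drop=True), scores, left_index=True, right_index=True))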

Related

python - "merge based on a partial match" - Improving performance of function

I have the below script - which aims to create a "merge based on a partial match" functionality, since this is not possible with the normal .merge() function, to the best of my knowledge.
The below works / returns the desired result, but unfortunately, it's incredibly slow to the point that it's almost unusable where I need it.
Been looking around at other Stack Overflow posts that contain similar problems, but haven't yet been able to find a faster solution.
Any thoughts on how this could be accomplished would be appreciated!
import pandas as pd

df1 = pd.DataFrame(['https://wwww.example.com/hi', 'https://wwww.example.com/tri',
                    'https://wwww.example.com/bi', 'https://wwww.example.com/hihibi'],
                   columns=['pages'])
df2 = pd.DataFrame(['hi', 'bi', 'geo'], columns=['ngrams'])

def join_on_partial_match(full_values=None, matching_criteria=None):
    # Renaming the first column of each frame by index number
    full_values.columns.values[0] = "full"
    matching_criteria.columns.values[0] = "ngram_match"
    # Creating a constant 'join' column so all rows match on the cross join
    full_values['join'] = 1
    matching_criteria['join'] = 1
    dfFull = full_values.merge(matching_criteria, on='join').drop('join', axis=1)
    # Dropping the 'join' column we created to join the 2 tables
    matching_criteria = matching_criteria.drop('join', axis=1)
    # Identifying matches and returning bool values based on whether a match exists
    dfFull['match'] = dfFull.apply(lambda x: x.full.find(x.ngram_match), axis=1).ge(0)
    # Filtering the dataset to only 'True' rows
    final = dfFull[dfFull['match'] == True]
    final = final.drop('match', axis=1)
    return final

join = join_on_partial_match(full_values=df1, matching_criteria=df2)
print(join)
>>                              full ngram_match
0        https://wwww.example.com/hi          hi
7        https://wwww.example.com/bi          bi
9    https://wwww.example.com/hihibi          hi
10   https://wwww.example.com/hihibi          bi
For anyone who is interested - I ended up figuring out 2 ways to do this:
The first returns all matches (i.e., it duplicates the input value and matches it with all partial matches).
The second returns only the first match.
Both are extremely fast; I just ended up using a pretty simple masking script.
import time

def partial_match_join_all_matches_returned(full_values=None, matching_criteria=None):
    """The partial_match_join_all_matches_returned() function takes two series objects and returns a dataframe with all matching values (duplicating the full value).
    Args:
        full_values = None: This is the series that contains the full values for the matching pair.
        matching_criteria = None: This is the series that contains the partial values for the matching pair.
    Returns:
        A dataframe with 2 columns - 'full' and 'match'.
    """
    start_join1 = time.time()
    matching_criteria = matching_criteria.to_frame("match")
    full_values = full_values.to_frame("full")
    full_values = full_values.drop_duplicates()
    output = []
    for n in matching_criteria['match']:
        mask = full_values['full'].str.contains(n, case=False, na=False)
        df = full_values[mask]
        df_copy = df.copy()
        df_copy['match'] = n
        output.append(df_copy)
    final = pd.concat(output)
    end_join1 = time.time() - start_join1
    end_join1 = str(round(end_join1, 2))
    len_join1 = len(final)
    return final
def partial_match_join_first_match_returned(full_values=None, matching_criteria=None):
    """The partial_match_join_first_match_returned() function takes two series objects and returns a dataframe with the first matching value.
    Args:
        full_values = None: This is the series that contains the full values for the matching pair.
        matching_criteria = None: This is the series that contains the partial values for the matching pair.
    Returns:
        A dataframe with 2 columns - 'full' and 'match'.
    """
    start_singlejoin = time.time()
    matching_criteria = matching_criteria.to_frame("match")
    full_values = full_values.to_frame("full").drop_duplicates()
    output = []
    for n in matching_criteria['match']:
        mask = full_values['full'].str.contains(n, case=False, na=False)
        df = full_values[mask]
        df_copy = df.copy()
        df_copy['match'] = n
        output.append(df_copy)
    # leaves us with only the 1st match of each URL (drop_duplicates returns a copy,
    # so the result must be assigned rather than discarded)
    final = pd.concat(output).drop_duplicates(subset=['full'])
    end_singlejoin = time.time() - start_singlejoin
    end_singlejoin = str(round(end_singlejoin, 2))
    len_singlejoin = len(final)
    return final
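A usage sketch, assuming the df1/df2 frames from the question (the functions expect Series, per the docstrings):
all_matches = partial_match_join_all_matches_returned(df1['pages'], df2['ngrams'])
first_only = partial_match_join_first_match_returned(df1['pages'], df2['ngrams'])
print(all_matches)
print(first_only)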

Is there a way of dynamically finding partial matching numbers between columns in pandas dataframes?

I'm looking for a way of comparing partial numeric values between columns from different dataframes. These columns are filled with something like social security numbers (they can't and won't repeat), so something like a dynamic isin() would be ideal.
These are representations of very large dataframes that I import from csv files.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})
print(df1)
print(df2)

df2['Id_number_length'] = df2['Id_number'].str.len()
df2.groupby('Id_number_length').count()
count_list = df2.groupby('Id_number_length')[['Id_number_length']].count()
print('count_list:\n', count_list)

df1['S_number'] = pd.to_numeric(df1['S_number'], downcast='integer')
df2['Id_number'] = pd.to_numeric(df2['Id_number'], downcast='integer')

inner_join = pd.merge(df1, df2, left_on=['S_number'], right_on=['Id_number'], how='inner')
print('MATCH!:\n', inner_join)
outer_join = pd.merge(df1, df2, left_on=['S_number'], right_on=['Id_number'], how='outer', indicator=True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis=1)
print('UNMATCHED:\n', anti_join)
What I need to get is something like the following as the result of the inner join (or whatever method):
df3 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"],
                    "Id_number": ["027160", "60078", "342964", "763", "1544", "5303", "973637", "14452", "9930", "4205"]})
print('MATCH!:\n', df3)
I thought that something like this (very crude) pseudocode would work, using count_list to strip parts of the numbers in df1 so they fully match df2 instead of partially matching (notice that in df2 the missing or added digits are always at the beginning or the end):
for i in count_list:
    if i == 6:
        try inner join
        except empty output
    elif i == 5:
        try:
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[1:]
            inner join with df2
        except empty output
        try:
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[:-1]
            inner join with df2
    elif i == 4:
        same as above...
But the lengths in count_list are variable, so this loop is an inefficient way to do it.
Any help with this will be much appreciated; I've been stuck on this for days. Thanks in advance.
You can 'explode' each line of df1 into up to 45 lines. For example, SSN 123456789 can be mapped to [1,2,3...9,12,23,34,45..89,...12345678,23456789,123456789]. While this looks bad, from an algorithmic standpoint it is O(1) for each row and therefore O(N) in total.
Using this new column as the key, a simple merge can combine the 2 DFs easily - which is usually O(N log N).
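A minimal sketch of that idea, assuming the df1/df2 from the question (the helper name substrings is mine); as written it catches the cases where Id_number is contained in S_number, and the symmetric explode of df2 would be needed for the reverse cases:
import pandas as pd

df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261"]})
df2 = pd.DataFrame({"Id_number": ["0271600", "60078", "342964", "763"]})

def substrings(s):
    # every contiguous substring of s
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

# explode df1 into one row per substring, then merge on that key
exploded = df1.assign(key=df1["S_number"].map(substrings)).explode("key")
matched = exploded.merge(df2, left_on="key", right_on="Id_number")
print(matched[["S_number", "Id_number"]])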
Here is an example of what I would do. I hope I've understood the problem. Feel free to ask if it's not clear.
import pandas as pd
import joblib
from joblib import Parallel, delayed

# Building the base
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})

# Initiate an empty list for the indexes
IDX = []

# Using a function so it can be parallelized if the database is big
def func(x, y):
    if all(c in df2.Id_number[y] for c in df1.S_number[x]):
        return (x, y)

# Using the maximum number of processors
number_of_cpu = joblib.cpu_count()
# Preparing a delayed function to be parallelized
delayed_funcs = (delayed(func)(x, y) for x in range(len(df1)) for y in range(len(df2)))
# Fitting it with processes and not threads
parallel_pool = Parallel(n_jobs=number_of_cpu, prefer="processes")
# Filling the IDX list
IDX.append(parallel_pool(delayed_funcs))
# Dropping the None entries
IDX = list(filter(None, IDX[0]))
# Making df3 with the tuples of indexes
df3 = pd.DataFrame(IDX)
# Making it readable
df3['df1'] = df1.S_number[df3[0]].to_list()
df3['df2'] = df2.Id_number[df3[1]].to_list()
df3
OUTPUT: (the resulting df3, with the matched index pairs and their values, was shown as a screenshot)

Loop through cell range (Every 3 cells) and add ranking to it

The problem is that I am trying to make a ranking for every 3 cells in a column using pandas.
For example, this is the outcome I want (shown as a screenshot in the original post).
I have no idea how to make it.
I tried something like this:
for i in range(df.iloc[1:], df.iloc[,:], 3):
    counter = 0
    i['item'] += counter + 1
The code is completely wrong, but I need help with the range and with what to put in the df.iloc brackets in pandas.
Does this match the requirements?
import pandas as pd

df = pd.DataFrame()
df['Item'] = ['shoes', 'shoes', 'shoes', 'shirts', 'shirts', 'shirts']

df2 = pd.DataFrame()
for i, item in enumerate(df['Item'].unique(), 1):
    df2.loc[i-1, 'rank'] = i
    df2.loc[i-1, 'Item'] = item
df2['rank'] = df2['rank'].astype('int')

print(df)
print("\n")
print(df2)

df = df.merge(df2, on='Item', how='inner')
print("\n")
print(df)
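If the goal is literally a rank that increments every 3 rows (rather than one rank per unique item), a small sketch of an alternative using integer division on the row position (my own variant, not part of the answer above):
import pandas as pd

df = pd.DataFrame({'Item': ['shoes', 'shoes', 'shoes', 'shirts', 'shirts', 'shirts']})
# with a default RangeIndex, integer-dividing the position by 3 gives 1,1,1,2,2,2,...
df['rank'] = df.index // 3 + 1
print(df)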

Pandas Columns Operations with List

I have a pandas dataframe with two columns: the first with just a single date ('action_date') and the second with a list of dates ('verification_date'). I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill new columns in the df with the number of dates in verification_date that differ by either over or under 360 days.
Here is my code:
import pandas as pd

df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
# pd.Grouper replaces the old pd.TimeGrouper, which has since been removed
df = df.groupby(pd.Grouper(freq='2D'))['verification_date'].apply(list).reset_index()

def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation; you can do this by replacing the lines above with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
This sets the value at row i in the over_360 or under_360 column.
You can learn more about it in the pandas documentation.
If you don't like using set_value you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
You can check the DataFrame.ix documentation for details.
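As a side note for current pandas versions, where set_value and .ix have been removed, the same per-row assignment can be written with .at (my adaptation, not part of the original answer):
# modern equivalent of set_value / .ix for scalar assignment by row label and column name
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)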
You might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.

Python Pandas Panel counting value occurrence

I have a large dataset stored as a pandas panel. I would like to count the occurrence of values < 1.0 on the minor_axis for each item in the panel. What I have so far:
import numpy as np
import pandas as pd

#%% Creating the first DataFrame
dates1 = pd.date_range('2014-10-19', '2014-10-20', freq='H')
df1 = pd.DataFrame(index=dates1)
n1 = len(dates1)
df1.loc[:, 'a'] = np.random.uniform(3, 10, n1)
df1.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n1)

#%% Creating the second DataFrame
dates2 = pd.date_range('2014-10-18', '2014-10-20', freq='H')
df2 = pd.DataFrame(index=dates2)
n2 = len(dates2)
df2.loc[:, 'a'] = np.random.uniform(3, 10, n2)
df2.loc[:, 'b'] = np.random.uniform(0.9, 1.2, n2)

#%% Creating the panel from both DataFrames
dictionary = {}
dictionary['First_dataset'] = df1
dictionary['Second dataset'] = df2
P = pd.Panel.from_dict(dictionary)

#%% I want to count the number of values < 1.0 for all datasets in the panel,
## only for minor axis b (not minor axis a), stored separately for each dataset
for dataset in P:
    P.loc[dataset, :, 'b']  # I need to count the number of values < 1.0 in this pandas Series
To count all the "b" values < 1.0, I would first isolate b in its own DataFrame by swapping the minor axis and the items.
In [43]: b = P.swapaxes("minor","items").b
In [44]: b.where(b<1.0).stack().count()
Out[44]: 30
Thanks for thinking with me guys, but I managed to figure out a surprisingly easy solution after many hours of attempting. I thought I should share it in case someone else is looking for a similar solution.
for dataset in P:
    abc = P.loc[dataset, :, 'b']
    abc_low = sum(i < 1.0 for i in abc)
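A slightly more idiomatic variant of that same per-dataset count, sketched against the panel P built above (the comparison returns a boolean Series, and summing it counts the True values):
for dataset in P:
    abc_low = (P.loc[dataset, :, 'b'] < 1.0).sum()
    print(dataset, abc_low)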
