How to speed up the matching of columns in pandas dataframe [duplicate] - python

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I'm trying to find matching values in a pandas dataframe. Once a match is found I want to perform some operations on the row of the dataframe.
Currently I'm using this code:
import pandas as pd

d = {'child_id': [1, 2, 5, 4], 'parent_id': [3, 4, 2, 3], 'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)

for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            print(df.content[i])
        else:
            pass
It works fine but is rather slow. Since I'm dealing with a dataset of millions of rows, it would take months. Is there a faster way to do this?
Edit: To clarify, what I want to create is a dataframe which contains the content of the matches.
import pandas as pd

d = {'child_id': [1, 2, 5, 4],
     'parent_id': [3, 4, 2, 3],
     'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)

df2 = pd.DataFrame(columns=("content_child", "content_parent"))
for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            content_child = str(df["content"][i])
            content_parent = str(df["content"][j])
            s = pd.Series([content_child, content_parent], index=['content_child', 'content_parent'])
            df2 = df2.append(s, ignore_index=True)
        else:
            pass
print(df2)

The fastest way is to use the features of numpy:
import numpy as np
import pandas as pd

d = {
    'child_id': [1, 2, 5, 4],
    'parent_id': [3, 4, 2, 3],
    'content': ["a", "b", "c", "d"]
}
df = pd.DataFrame(data=d)

# vectorised elementwise comparisons instead of the nested Python loops
comp1 = df['child_id'].values == df['parent_id'].values
comp2 = df['child_id'].values[::-1] == df['parent_id'].values
comp3 = df['child_id'].values == df['parent_id'].values[::-1]

comp = np.array([], dtype=object)  # default when none of the cases below applies
if comp1.any() and not comp2.any() and not comp3.any():
    comp = np.c_[df['content'].values[comp1]]
elif comp1.any() and comp2.any() and not comp3.any():
    comp = np.c_[df['content'].values[comp1], df['content'].values[comp2]]
elif comp1.any() and comp2.any() and comp3.any():
    comp = np.c_[df['content'].values[comp1], df['content'].values[comp2], df['content'].values[comp3]]

print(comp)  # comp already holds the selected content values
Which outputs:
[]
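For the edited goal (a dataframe pairing each child's content with its parent's content), a self-merge on the two key columns is another common vectorised route. A minimal sketch, not part of the original answer:
import pandas as pd

d = {'child_id': [1, 2, 5, 4], 'parent_id': [3, 4, 2, 3], 'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)

# join each row (the child side) to the row whose child_id matches its parent_id
pairs = df.merge(df, left_on='parent_id', right_on='child_id', suffixes=('_child', '_parent'))
df2 = pairs[['content_child', 'content_parent']]
print(df2)
This yields the same (child, parent) content pairs as the loop in the question, but as a single vectorised join.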

Related

Python dataframe merge on condition

I have two data frames and the following conditions for building a new data frame:
1) df1["Product"] == df2["Product"] and df2["Date"] >= df1["Date"]
2) For the rows of df2 matching that product, sum df2["Count"] and on each iteration compare the running sum to df1["Count"], checking whether df2["Count"] == df1["Count"].
Example:
df1["Product"][2] = "147326.A", df1["Date"][2] = "1/03/22" and df1["Count"][2] = 4.
Now we check df2: if df2["Product"][1] == df1["Product"][2] and df2["Date"][1] >= df1["Date"][2], the first condition is met, so we sum() the df2["Count"] values and on each iteration compare the sum to df1["Count"]; if df1["Count"] == df2["Count"], the row is added to the new data frame.
df1 = pd.DataFrame({"Date":["11/01/22", "1/02/22", "1/03/22", "1/04/22", "2/02/22"],"Product" :["315114.A", "147326.A", "147326.A", "91106.A", "283214.A"],"Count":[3,1,4,1,2]})
df2 = pd.DataFrame({"Date" : ["15/01/22", "4/02/22", "7/03/22", "1/04/22", "2/02/22", "15/01/22","1/06/22","1/06/22"],"Product" : ["315114.A", "147326.A ", "147326.A", "91106.A", "283214.A", "315114.A","147326.A","147326.A" ],"Count" : [1, 1, 2, 1, 2, 2, 1, 1]})
The following data should be a match:
df1 = pd.DataFrame({"Date" : ["01/03/2022"],"Product":["91106.A"],"Count":[2]})
df2 = pd.DataFrame({"Date" : ["01/03/2022", "7/03/2022", "7/03/2022", "7/03/2022","7/03/2022", "7/03/2022"],"Product" : ["91106.A", "91106.A","91106.A", "91106.A", "91106.A", "91106.A"],"Count" : [1, 1, 1, 1, 1, 1]})
You could solve this in a list comprehension (within a pd.DataFrame):
df3 = pd.DataFrame([j.to_dict() for i, j in df1.iterrows() if
                    j["Count"] == df2[(df2["Product"] == j["Product"]) &
                                      (df2["Date"] >= j["Date"])]["Count"].sum()])
Splitting this up into lots of lines would look like this:
l = []
for i, j in df1.iterrows():
    if j["Count"] == df2[(df2["Product"] == j["Product"]) &
                         (df2["Date"] >= j["Date"])]["Count"].sum():
        x = j.to_dict()
        l.append(x)
df3 = pd.DataFrame(l)
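A vectorised variant of the same check, for larger frames, could merge once and aggregate instead of filtering df2 for every row of df1. A sketch under the same assumptions as the loop above (dates compared as strings, exactly as in the original; df3 here keeps df1's original index):
m = df1.reset_index().merge(df2, on="Product", suffixes=("_1", "_2"))
m = m[m["Date_2"] >= m["Date_1"]]           # the per-row date condition
sums = m.groupby("index")["Count_2"].sum()  # sum of df2 counts per df1 row
df3 = df1[df1["Count"].eq(sums.reindex(df1.index).fillna(0))]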

How do I remove repeated Row Comparisons?

I'm new to coding and need to do pairwise comparisons using Pandas. Hence, I have to find a way to code row-by-row comparisons without any repetitions.
A mock data will be as follows:
I'm comparing males and their ages. However, as seen in the image above, index 1 contains the combination Vyel & Allsebrook, and index 4 shows the same combination again as Allsebrook & Vyel.
Ideally, the desired output would be like:
Desired Results
I have managed to remove rows containing the same person twice, but is there a way I can code this so that overlapping comparisons are avoided? I would appreciate any feedback. Thank you!
Try This:
import pandas as pd
# Create the DF
id_x = [1, 1, 1, 1, 2]
last_name_x = ["Vyel", "Vyel", "Vyel", "Vyel", "Allsebrook"]
gender = ["Male", "Male", "Male", "Male", "Male"]
age_x = [66, 66, 66, 66, 50]
id_y = [1, 2, 3, 4, 1]
last_name_y = ["Vyel", "Allsebrook", "Prinett", "Jinda", "Vyel"]
age_y = [66, 50, 30, 31, 66]
df = pd.DataFrame(id_x, columns=['id_x'])
df['last_name_x'] = last_name_x
df['gender'] = gender
df['age_x'] = age_x
df['id_y'] = id_y
df['last_name_y'] = last_name_y
df['age_y'] = age_y
# Create the keys in order to check duplicates
df["key_x"] = df["id_x"].astype(str) + df["last_name_x"] + df["age_x"].astype(str)
df["key_y"] = df["id_y"].astype(str) + df["last_name_y"] + df["age_y"].astype(str)
df["combined"] = df['key_x'] + "|" +df['key_y']
s = set(df['key_y'] + "|" + df['key_x'])
d = {}
# mark all of the duplicated rows
def check_if_duplicate(row):
    if row in s:
        if row in d:
            return True
        else:
            arr = row.split("|")
            d[arr[1] + "|" + arr[0]] = 1
            return False
df['duplicate'] = df['combined'].apply(check_if_duplicate)
# drop the duplicates and the rows we added in order to check the duplicates
df = df[df['duplicate'] != True]
df.drop(["key_x", "key_y", "combined", "duplicate"], axis=1, inplace=True)
print(df)
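An alternative sketch (not from the answer above) that avoids the helper dictionary: build one order-independent key per row by sorting the two (id, last_name, age) tuples, then drop rows whose key has already been seen. Column names are taken from the mock data above:
pair_key = df.apply(
    lambda r: tuple(sorted([(r["id_x"], r["last_name_x"], r["age_x"]),
                            (r["id_y"], r["last_name_y"], r["age_y"])])),
    axis=1)
df_no_repeats = df[~pair_key.duplicated()]
print(df_no_repeats)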

maintaining pandas df index with selection & groupby (python)

I am having an issue with returning the original df index of a row given a groupby condition after subselecting some of the df. It's easier to understand through code.
So if we start with a toy dataframe:
headers = ['a','b']
nrows = 8
df = pd.DataFrame(columns = headers)
df['a'] = [0]*(nrows//2) + [1]*(nrows//2)
df['b'] = [2]*(nrows//4) + [4]*(nrows//4) + [2]*(nrows//4) + [4]*(nrows//4)
print(df)
Then I select the subset of data I am interested in and check that the index is retained:
sub_df = df[df['a']==1] ## selects for only group 1 (indices 4-7)
print(sub_df.index) ## looks good so far
sub_df.index returns
Int64Index([4, 5, 6, 7], dtype='int64')
Which seems great! I would like to group data from that subset and extract the original df index and that is where the issue occurs:
For example:
g_df = sub_df.groupby('b')
g_df_idx = g_df.indices
print(g_df_idx) ## bad!
when I print(g_df_idx) I want it to return:
{2: array([4,5]), 4: array([6,7])}
Due to the way I will be using this code, I can't just groupby(['a','b']).
I'm going nuts with this thing. Here are some of the many solutions I have tried:
## 1
e1_idx = sub_df.groupby('b').indices
# print(e1_idx) ## issue persists
## 2
e2 = sub_df.groupby('b', as_index = True) ## also tried as_index = False
e2_idx = e2.indices
# print(e2_idx) ## issue persists
## 3
e3 = sub_df.reset_index()
e3_idx = e3.groupby('b').indices
# print(e3_idx) ## issue persists
I'm sure there must be some simple solution I'm just overlooking. Would be very grateful for any advice.
You can do it like this:
g_df_idx = g_df.apply(lambda x: x.index).to_dict()
print(g_df_idx)
# {2: Int64Index([4, 5], dtype='int64'), 4: Int64Index([6, 7], dtype='int64')}
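As a side note, GroupBy.groups (unlike .indices) maps each group key to the original index labels rather than positional offsets, so the subselected index survives without any apply:
g_df_idx = sub_df.groupby('b').groups
print(g_df_idx)
# roughly {2: [4, 5], 4: [6, 7]}, with the values shown as Index objects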

Remove matching values from two columns in Pandas

I have the following data set, generated from an equation:
df.loc[:,51] = [1217.0, -20.0, 13970.0, -74]
I dropped the negative values (specific values) and got this:
df.loc[:,52] = [1217.0, 0, 13970.0, 0]
Now I am trying to get another column with the dropped values:
df.loc[:,53] = df.drop_duplicates(subset=[df.loc[:,51], df.loc[:,52]])
I want this result, i.e. the values that were dropped:
df.loc[:,53] = [0, -20, 0, -74]
But instead I got the following error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Try numpy.where
df.loc[:,53] = np.where(df.loc[:,51] == df.loc[:,52], 0, df.loc[:,51])
Here, I've done it with some sample data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'rating': [4, -4, 3.5, -15, 5]
})
df.loc[(df['rating'] < 0 ), 'new_col'] = 0
df.loc[(df['rating'] > 0 ), 'new_col'] = df['rating']
df['dropped'] = np.where(df['rating'] == df['new_col'], 0, df['rating'])
df
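For the sample above, new_col keeps the positive ratings and zeroes out the negative ones, so the dropped column should come out as the negative values with zeros elsewhere. A quick check (expected values worked out from the code above):
print(df['dropped'].tolist())
# expected: [0.0, -4.0, 0.0, -15.0, 0.0]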

Pandas assignment using nested loops leading to memory error

I am using pandas and trying to do an assignment using nested loops. I iterate over a dataframe and then run a distance function if a row meets a certain criterion. I am faced with two problems:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Memory Error. It doesn't work on large datasets. I end up having to terminate the process.
How should I change my solution to ensure it can scale with a larger dataset of 60,000 rows?
for i, row in df.iterrows():
    listy = 0
    school = []
    if row['LS_Type'] == 'Primary (1-4)':
        a = row['Northing']
        b = row['Easting']
        LS_ID = row['LS_ID']
        for j, row2 in df.iterrows():
            if row2['LS_Type'] == 'Primary (1-8)':
                dist_km = distance(a, b, df.Northing[j], df.Easting[j])
                if (listy == 0):
                    listy = dist_km
                    school.append([df.LS_Name[j], df.LS_ID[j]])
                else:
                    if dist_km < listy:
                        listy = dist_km
                        school[0] = [df.LS_Name[j], int(df.LS_ID[j])]
        df['dist_up_prim'][i] = listy
        df["closest_up_prim"][i] = school[0]
    else:
        df['dist_up_prim'][i] = 0
The double for loop is what's killing you here. See if you can break it up into two separate apply steps.
Here is a toy example of using df.apply() and partial to do a nested for loop:
import math
import pandas as pd
import numpy as np
from functools import partial

df = pd.DataFrame.from_dict({'A': [1, 2, 3, 4, 5, 6, 7, 8],
                             'B': [1, 2, 3, 4, 5, 6, 7, 8]})

def myOtherFunc(row):
    if row['A'] <= 4:
        return row['B'] * row['A']

def myFunc(the_df, row):
    if row['A'] <= 2:
        other_B = the_df.apply(myOtherFunc, axis=1)
        return other_B.mean()
    return np.nan  # pd.np was removed in recent pandas versions

apply_myFunc_on_df = partial(myFunc, df)
df.apply(apply_myFunc_on_df, axis=1)
You can rewrite your code in this form, which will be much faster.
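Applied to the question above, the same idea might look roughly like the sketch below. It assumes the asker's distance() function can take numpy arrays for its last two arguments (so all candidate schools are measured at once); if it can't, it would need to be vectorised first. Column and type names are taken from the question:
import numpy as np

# pre-select the candidate rows once instead of re-scanning df in an inner loop
uppers = df[df['LS_Type'] == 'Primary (1-8)']

def closest_upper(row):
    if row['LS_Type'] != 'Primary (1-4)' or uppers.empty:
        return pd.Series({'dist_up_prim': 0, 'closest_up_prim': None})
    # distance() is assumed to broadcast over arrays and return an array of distances
    dists = distance(row['Northing'], row['Easting'],
                     uppers['Northing'].values, uppers['Easting'].values)
    k = int(np.argmin(dists))
    return pd.Series({'dist_up_prim': dists[k],
                      'closest_up_prim': [uppers['LS_Name'].iloc[k], int(uppers['LS_ID'].iloc[k])]})

df[['dist_up_prim', 'closest_up_prim']] = df.apply(closest_upper, axis=1)
This avoids both the nested iterrows loop and the chained-assignment writes that triggered the SettingWithCopyWarning.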
