I'm new to coding and need to do pairwise comparisons using Pandas, i.e. row-by-row comparisons without any repeated pairings.
Mock data looks as follows:
I'm comparing males and their ages. However, as seen in the image above, index 1 holds the combination Vyel & Allsebrook, and index 4 holds the same combination again as Allsebrook & Vyel.
Ideally, the desired output would look like:
Desired Results
I have managed to remove rows that compare the same person against themselves, but is there a way I can code this so that overlapping comparisons are avoided too? Would appreciate any feedback. Thank you!
Try this:
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({
    'id_x': [1, 1, 1, 1, 2],
    'last_name_x': ["Vyel", "Vyel", "Vyel", "Vyel", "Allsebrook"],
    'gender': ["Male", "Male", "Male", "Male", "Male"],
    'age_x': [66, 66, 66, 66, 50],
    'id_y': [1, 2, 3, 4, 1],
    'last_name_y': ["Vyel", "Allsebrook", "Prinett", "Jinda", "Vyel"],
    'age_y': [66, 50, 30, 31, 66],
})

# Create a key per person on each side in order to check duplicates
df["key_x"] = df["id_x"].astype(str) + df["last_name_x"] + df["age_x"].astype(str)
df["key_y"] = df["id_y"].astype(str) + df["last_name_y"] + df["age_y"].astype(str)
df["combined"] = df['key_x'] + "|" + df['key_y']

# set of all reversed pairs, used to spot mirrored comparisons
s = set(df['key_y'] + "|" + df['key_x'])
d = {}

# mark a row as duplicate if its mirrored pair was already kept
def check_if_duplicate(row):
    if row in s:
        if row in d:
            return True
        else:
            arr = row.split("|")
            d[arr[1] + "|" + arr[0]] = 1
    return False

df['duplicate'] = df['combined'].apply(check_if_duplicate)

# drop the duplicates and the helper columns we added for the check
df = df[df['duplicate'] != True]
df.drop(["key_x", "key_y", "combined", "duplicate"], axis=1, inplace=True)
print(df)
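A shorter alternative, as a sketch building on the same key_x/key_y columns (before they are dropped): sort the two keys inside each row so that "A|B" and "B|A" collapse to the same pair key, then let drop_duplicates remove the mirrored rows.
# Sketch of an order-independent variant (assumes key_x/key_y as above):
# sorting the two keys makes mirrored pairs produce the same "pair" value.
df["pair"] = df.apply(lambda r: "|".join(sorted([r["key_x"], r["key_y"]])), axis=1)
df = df.drop_duplicates(subset="pair").drop(columns="pair")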
I have two data frames and three conditions for building a new data frame:
1) df1["Product"] == df2["Product"] and df2["Date"] >= df1["Date"]
2) For each df1 row, sum df2["Count"] over the df2 rows that meet the first two conditions, and keep the df1 row if the sum equals df1["Count"]
Example:
df1["Product"][2] = "147326.A", df1["Date"][2] = "1/03/22" and df1["Count"][2] = 4.
Now we check df2: df2["Product"][1] == df1["Product"][2] and df2["Date"][1] >= df1["Date"][2], so the first conditions are met. We then sum df2["Count"] over all such matches and compare the result to df1["Count"]; if df1["Count"] == df2["Count"].sum(), the row is added to the new data frame.
df1 = pd.DataFrame({"Date": ["11/01/22", "1/02/22", "1/03/22", "1/04/22", "2/02/22"],
                    "Product": ["315114.A", "147326.A", "147326.A", "91106.A", "283214.A"],
                    "Count": [3, 1, 4, 1, 2]})
df2 = pd.DataFrame({"Date": ["15/01/22", "4/02/22", "7/03/22", "1/04/22", "2/02/22", "15/01/22", "1/06/22", "1/06/22"],
                    "Product": ["315114.A", "147326.A", "147326.A", "91106.A", "283214.A", "315114.A", "147326.A", "147326.A"],
                    "Count": [1, 1, 2, 1, 2, 2, 1, 1]})
The following data should be a match:
df1 = pd.DataFrame({"Date": ["01/03/2022"], "Product": ["91106.A"], "Count": [2]})
df2 = pd.DataFrame({"Date": ["01/03/2022", "7/03/2022", "7/03/2022", "7/03/2022", "7/03/2022", "7/03/2022"],
                    "Product": ["91106.A", "91106.A", "91106.A", "91106.A", "91106.A", "91106.A"],
                    "Count": [1, 1, 1, 1, 1, 1]})
You could solve this in a list comprehension (within a pd.DataFrame):
df3 = pd.DataFrame([j.to_dict() for i, j in df1.iterrows() if
                    j["Count"] == df2[(df2["Product"] == j["Product"]) &
                                      (df2["Date"] >= j["Date"])]["Count"].sum()])
Splitting this up into lots of lines would look like this:
l = []
for i, j in df1.iterrows():
    if j["Count"] == df2[(df2["Product"] == j["Product"]) &
                         (df2["Date"] >= j["Date"])]["Count"].sum():
        x = j.to_dict()
        l.append(x)
df3 = pd.DataFrame(l)
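Note that the string dates above compare lexicographically; for reliable ordering, parse them first with pd.to_datetime(..., dayfirst=True). If df1 is large, a merge-based version avoids iterrows entirely. A sketch, assuming the Date columns have been parsed as real datetimes:
merged = df1.reset_index().merge(df2, on="Product", suffixes=("_1", "_2"))
merged = merged[merged["Date_2"] >= merged["Date_1"]]  # date condition
sums = merged.groupby("index")["Count_2"].sum()        # per-df1-row sums
# df1 rows with no qualifying df2 match map to NaN and are dropped
df3 = df1[df1["Count"] == df1.index.map(sums)]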
I am having an issue with returning the original df index of a row given a groupby condition after subselecting some of the df. It's easier to understand through code.
So if we start with a toy dataframe:
headers = ['a', 'b']
nrows = 8
df = pd.DataFrame(columns=headers)
df['a'] = [0] * (nrows // 2) + [1] * (nrows // 2)
df['b'] = [2] * (nrows // 4) + [4] * (nrows // 4) + [2] * (nrows // 4) + [4] * (nrows // 4)
print(df)
then I select the subset of data I am interested in and check that the index is retained:
sub_df = df[df['a']==1] ## selects for only group 1 (indices 4-7)
print(sub_df.index) ## looks good so far
sub_df.index returns
Int64Index([4, 5, 6, 7], dtype='int64')
Which seems great! I would like to group data from that subset and extract the original df index and that is where the issue occurs:
For example:
g_df = sub_df.groupby('b')
g_df_idx = g_df.indices
print(g_df_idx) ## bad!
When I print(g_df_idx), I want it to return:
{2: array([4,5]), 4: array([6,7])}
Due to the way I will be using this code, I can't just groupby(['a','b']).
I'm going nuts with this thing. Here are some of the many solutions I have tried:
## 1
e1_idx = sub_df.groupby('b').indices
# print(e1_idx) ## issue persists
## 2
e2 = sub_df.groupby('b', as_index = True) ## also tried as_index = False
e2_idx = e2.indices
# print(e2_idx) ## issue persists
## 3
e3 = sub_df.reset_index()
e3_idx = e3.groupby('b').indices
# print(e3_idx) ## issue persists
I'm sure there must be some simple solution I'm just overlooking. Would be very grateful for any advice.
You can do it like this:
g_df_idx = g_df.apply(lambda x: x.index).to_dict()
print(g_df_idx)
# {2: Int64Index([4, 5], dtype='int64'), 4: Int64Index([6, 7], dtype='int64')}
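Alternatively, GroupBy.groups (unlike .indices, which returns integer positions within the group) already maps each group key to the original index labels:
g_df_idx = dict(g_df.groups)
print(g_df_idx)
# {2: Int64Index([4, 5], dtype='int64'), 4: Int64Index([6, 7], dtype='int64')}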
I have a list:
a = [15, 50, 75]
Using this list, I have to create smaller dataframes by slicing rows off the main dataframe, where the list defines the number of rows in each slice.
Let's say my main dataframe is df.
The dataframes I'd like to have are df1 (row index 0-15), df2 (row index 15-65), and df3 (row index 65-125).
Since these are just three, I can easily use something like this:
limit1 = a[0]
limit2 = a[1] + limit1
limit3 = a[2] + limit2
df1 = df.loc[df.index <= limit1]
df2 = df.loc[(df.index > limit1) & (df.index <= limit2)]
df2 = df2.reset_index(drop=True)
df3 = df.loc[(df.index > limit2) & (df.index <= limit3)]
df3 = df3.reset_index(drop=True)
But what if I want to implement this with a long list on the main dataframe df? I am looking for something iterable like the following (which doesn't work):
df1 = df.loc[df.index <= limit1]
for i in range(2, 3):
    for j in range(2, 3):
        for k in range(2, 3):
            df[i] = df.loc[(df.index > limit[j]) & (df.index <= limit[k])]
            df[i] = df[i].reset_index(drop=True)
            print(df[i])
You could build the partitioned dataframes iteratively, cutting slices off the end of the main dataframe:
dfs = []  # this list contains your partitioned dataframes
a = [15, 50, 75]
for idx in a[::-1]:
    dfs.insert(0, df.iloc[idx:])
    df = df.iloc[:idx]
dfs.insert(0, df)  # add the last remaining dataframe
print(dfs)
Another option is to prepend 0, treat the values as slice boundaries, and use a list comprehension:
a = [0, 15, 50, 75]
dfs = [df.iloc[a[i]:a[i + 1]] for i in range(len(a) - 1)]
This does it. It's better to use a dictionary if you want to store multiple dataframes and call them later; creating variables dynamically in a loop is bad practice, so always avoid it.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.linspace(1, 75, 75), columns=['a'])
a = [15, 50, 25]
d = {}
b = 0
for n, i in enumerate(a):
    d[f'df{n}'] = df.iloc[b:b + i]
    b += i
Output: d now holds three slices, d['df0'] (rows 0-14), d['df1'] (rows 15-64) and d['df2'] (rows 65-74, the remainder).
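If, as in the question, the list holds chunk sizes rather than cut positions, a sketch using itertools.accumulate to turn the sizes into boundaries first (the 140-row frame here is hypothetical, sized to fit all three chunks):
from itertools import accumulate
import pandas as pd

df = pd.DataFrame({'a': range(140)})    # hypothetical example frame
sizes = [15, 50, 75]                    # chunk lengths from the question
bounds = [0] + list(accumulate(sizes))  # -> [0, 15, 65, 140]
dfs = [df.iloc[lo:hi].reset_index(drop=True)
       for lo, hi in zip(bounds, bounds[1:])]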
I have a very long and wide dataframe, and I'd like to create a new column whose value depends on many other columns in the df. The calculation needed for the values in this new column ALSO changes, depending on a value in some other column.
The answers to this question and this question come close, but don't quite work out for me.
I'll eventually have about 30 different calculations that could be applied, so I'm not too keen on the np.where function, which is not that readable for that many conditions.
I've also been strongly advised against a for-loop over all rows of a dataframe, because it's supposed to be awful for performance (please correct me if I'm wrong there).
What I've tried to do instead:
import pandas as pd
import numpy as np

# Information in my columns looks something like this:
df = pd.DataFrame()
df['text'] = ['dab', 'def', 'bla', 'zdag', 'etc']
df['values1'] = [3, 4, 2, 5, 2]
df['values2'] = [6, 3, 21, 44, 22]
df['values3'] = [103, 444, 33, 425, 200]

# lists to check against to decide which calculation is required
someList = ['dab', 'bla']
someOtherList = ['def', 'zdag']
someThirdList = ['etc']

conditions = [
    (df['text'] is None),
    (df['text'] in someList),
    (df['text'] in someOtherList),
    (df['text'] in someThirdList)]

choices = [0,
           round(df['values2'] * 0.5 * df['values3'], 2),
           df['values1'] + df['values2'] - df['values3'],
           df['values1'] + 249]

df['mynewvalue'] = np.select(conditions, choices, default=0)
print(df)
I expect that, based on the row values in df['text'], the right calculation is applied to the same row of df['mynewvalue'].
Instead, I get the error The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I program this instead, so that I can use these kind of conditions to define the right calculation for this df['mynewvalue'] column?
The errors come from the conditions:
conditions = [
    ...,
    (df['text'] in someList),
    (df['text'] in someOtherList),
    (df['text'] in someThirdList)]
You are asking whether several elements are in a list, so the answer is one boolean per element, i.e. a whole Series. As the error message suggests, you would have to decide whether the condition counts as verified when at least one element satisfies it (any) or only when all elements do (all).
What you actually want is the element-wise test, which is what isin (doc) provides for pandas Series.
Here using isin:
import pandas as pd
import numpy as np

# Information in my columns looks something like this:
df = pd.DataFrame()
df['text'] = ['dab', 'def', 'bla', 'zdag', 'etc']
df['values1'] = [3, 4, 2, 5, 2]
df['values2'] = [6, 3, 21, 44, 22]
df['values3'] = [103, 444, 33, 425, 200]

# lists to test against to decide which calculation applies
someList = ['dab', 'bla']
someOtherList = ['def', 'zdag']
someThirdList = ['etc']

conditions = [
    (df['text'] is None),
    (df['text'].isin(someList)),
    (df['text'].isin(someOtherList)),
    (df['text'].isin(someThirdList))]

choices = [0,
           round(df['values2'] * 0.5 * df['values3'], 2),
           df['values1'] + df['values2'] - df['values3'],
           df['values1'] + 249]

df['mynewvalue'] = np.select(conditions, choices, default=0)
print(df)
#    text  values1  values2  values3  mynewvalue
# 0   dab        3        6      103       309.0
# 1   def        4        3      444      -437.0
# 2   bla        2       21       33       346.5
# 3  zdag        5       44      425      -376.0
# 4   etc        2       22      200       251.0
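With around 30 calculations, one scalable pattern (a sketch, not from the original answer) is to keep (mask, calculation) pairs in a list and fill the column in one pass; each calculation only ever sees the rows its mask selects:
# Hypothetical rules table: each entry pairs a boolean mask with a
# function that computes the new value for the selected rows.
rules = [
    (df['text'].isin(someList),      lambda d: round(d['values2'] * 0.5 * d['values3'], 2)),
    (df['text'].isin(someOtherList), lambda d: d['values1'] + d['values2'] - d['values3']),
    (df['text'].isin(someThirdList), lambda d: d['values1'] + 249),
]
df['mynewvalue'] = 0.0  # default, as in np.select
for mask, calc in rules:
    df.loc[mask, 'mynewvalue'] = calc(df.loc[mask])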
I'm trying to find matching values in a pandas dataframe. Once a match is found I want to perform some operations on the row of the dataframe.
Currently I'm using this code:
import pandas as pd
d = {'child_id': [1, 2, 5, 4], 'parent_id': [3, 4, 2, 3], 'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)
for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            print(df.content[i])
        else:
            pass
It works fine but is rather slow; since I'm dealing with a dataset of millions of rows, it would take months. Is there a faster way to do this?
Edit: To clarify, what I want to create is a dataframe which contains the content of the matches.
import pandas as pd

d = {'child_id': [1, 2, 5, 4],
     'parent_id': [3, 4, 2, 3],
     'content': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(columns=("content_child", "content_parent"))
for i in range(len(df)):
    for j in range(len(df)):
        if str(df['child_id'][j]) == str(df['parent_id'][i]):
            content_child = str(df["content"][i])
            content_parent = str(df["content"][j])
            s = pd.Series([content_child, content_parent], index=['content_child', 'content_parent'])
            # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
            df2 = df2.append(s, ignore_index=True)
        else:
            pass
print(df2)
The fastest way is to use the features of numpy:
import numpy as np
import pandas as pd

d = {
    'child_id': [1, 2, 5, 4],
    'parent_id': [3, 4, 2, 3],
    'content': ["a", "b", "c", "d"]
}
df = pd.DataFrame(data=d)

# element-wise comparisons of the aligned and reversed columns
comp1 = df['child_id'].values == df['parent_id'].values
comp2 = df['child_id'].values[::-1] == df['parent_id'].values
comp3 = df['child_id'].values == df['parent_id'].values[::-1]

if comp1.any() and not comp2.any() and not comp3.any():
    comp = np.c_[df['content'].values[comp1]]
elif comp1.any() and comp2.any() and not comp3.any():
    comp = np.c_[df['content'].values[comp1], df['content'].values[comp2]]
elif comp1.any() and comp2.any() and comp3.any():
    comp = np.c_[df['content'].values[comp1], df['content'].values[comp2], df['content'].values[comp3]]
else:
    comp = np.array([])  # no aligned matches for this data

print(comp)
Which outputs:
[]
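For completeness, a hedged alternative sketch: a self-merge matches every parent_id against every child_id in one vectorized step, which is the kind of operation that scales to millions of rows (assuming the id columns share a comparable dtype):
# Self-merge: the left row supplies the child content (index i in the
# original loop), the right row supplies the parent content (index j).
pairs = df.merge(df, left_on='parent_id', right_on='child_id',
                 suffixes=('_child', '_parent'))
df2 = pairs[['content_child', 'content_parent']]
print(df2)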