Pandas Merging on condition - python

I am trying to do a conditional merge between pandas DataFrames.
My DataFrames look like this:
df
import numpy as np
import pandas as pd
data = {'Name':['Tom', 'JJ', 'ABC', 'Tom', 'JJ', 'ABC', 'Tom', 'Tom'], 'Age':[10, 20, 25, 15, 25, 30, 30, 50]}
df = pd.DataFrame(data)
df.sort_values(['Name'], ascending = True, inplace = True)
and
data_new = {'Name':['Tom', 'JJ', 'ABC', 'JJ', 'ABC'], 'Start_Age':[24, 18, 24, 25, 29], 'End_Age':[32, 22, 27, 25, 34]}
df_2 = pd.DataFrame(data_new)
df_2["Score"] = np.random.randint(1, 100, df_2.shape[0])
df_2.sort_values(['Name'], ascending = True, inplace = True)
I would like to merge df with df_2 to get the score corresponding to the age present in df.
Below is how I am trying to do it:
df_new_2 = pd.merge(df, df_2, how='left', left_on=["Name"], right_on=["Name"])
df_new_2 = df_new_2[(df_new_2['Age'] >= df_new_2['Start_Age']) & (df_new_2['Age'] <= df_new_2['End_Age'])]
df_final = df.merge(df_new_2, how='left', on=['Name', 'Age'])
df_final[['Name', 'Score']].ffill(axis=0)
My expected output is:
Name  Age  Score
ABC   25   86
ABC   30   87
JJ    20   59
JJ    25   22
Tom   10   NaN
Tom   15   NaN
Tom   30   98
Tom   50   98
But I am getting something else. Where am I wrong?

This would be my solution, based on using np.where() to create the filters and then building a new dataframe from the output: the idea is to broadcast every Age against every [Start_Age, End_Age] interval and keep the matching (row, interval) index pairs. Furthermore, I've renamed the Name column in df_2 to avoid having columns with equal names: df_2 = pd.DataFrame(data_new).rename(columns={'Name':'Name_new'}). Besides that, here is my code:
Age = df['Age'].values
e_age = df_2['End_Age'].values
s_age = df_2['Start_Age'].values
i, j = np.where((Age[:, None] >= s_age) & (Age[:, None] <= e_age))
final_df = pd.DataFrame(
    np.column_stack([df.values[i], df_2.values[j]]),
    columns=df.columns.append(df_2.columns)
)
final_df = final_df[final_df['Name'] == final_df['Name_new']]
df_max = df.merge(final_df,how='left')
df_max['Score'] = df_max.groupby('Name').ffill()['Score']
df_max = df_max[['Name','Age','Score']]
Output:
Name Age Score
0 ABC 25 41
1 ABC 30 46
2 JJ 20 39
3 JJ 25 96
4 Tom 10 NaN
5 Tom 15 NaN
6 Tom 30 78
7 Tom 50 78

Your ffill is incorrect. You first need to sort by name and age to make sure the order is correct, and also group by name so that only scores from the same person are considered; otherwise the forward fill will take the previous score from any person:
df_final = df_final.sort_values(['Name', 'Age'])
df_final['Score'] = df_final.groupby('Name').ffill()['Score']
This is another solution to the problem. It uses a helper function to look up the score; the helper function is then applied to each row to get the score for that name and age.
def get_score(name, age):
    score = df_2.loc[(df_2.Name == name) &
                     (df_2.Start_Age <= age) &
                     (df_2.End_Age >= age)]['Score'].values
    return score[0] if len(score) >= 1 else np.NaN

# use the helper function on each row
df['Score'] = df.apply(lambda x: get_score(x.Name, x.Age), axis=1)
You can still do the forward fill afterwards like this:
df = df.sort_values(['Name', 'Age'])
df['Score'] = df.groupby('Name').ffill()['Score']
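For completeness, here is a minimal end-to-end sketch combining the fixes above: a plain merge on Name, a between filter for the age range, then the grouped forward fill. Variable names follow the question:
import numpy as np
import pandas as pd

# merge on Name, then keep only the rows where Age falls inside the interval
merged = df.merge(df_2, on='Name', how='left')
in_range = merged[merged['Age'].between(merged['Start_Age'], merged['End_Age'])]

# attach the matched scores back to df, then forward-fill within each Name
df_final = df.merge(in_range[['Name', 'Age', 'Score']], on=['Name', 'Age'], how='left')
df_final = df_final.sort_values(['Name', 'Age'])
df_final['Score'] = df_final.groupby('Name')['Score'].ffill()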

Related

How to append new rows into a column of a dataframe based on conditions in python?

I want to add a new category into my existing column based on some conditions
Trial
df.loc[df['cat'] == 'woman','age'].max()-df.loc[df['cat'] == 'man','age'].max().apend{'cat': 'first_child', 'age': age}
import pandas as pd
d = {'cat': ['man', 'man', 'woman', 'woman'], 'age': [30, 40, 50, 55]}
df = pd.DataFrame(data=d)
print(df)
     cat  age
0    man   30
1    man   40
2  woman   50
3  woman   55
output required:
cat age
man 30
man 40
woman 50
woman 55
first_child 15
sec_child 10
It's possible via transpose, but the actual data is very complex:
df.transpose()
cat man man woman woman
age 30 40 50 55
I am looking for a solution that amends the rows.
Try:
import pandas as pd
d = {'cat': ['man1','man2', 'woman1','woman2'], 'age': [30, 40, 50,55]}
df = pd.DataFrame(data=d)
df_man = df[df.cat.str.startswith('man')].reset_index(drop=True)
df_woman = df[df.cat.str.startswith('woman')].reset_index(drop=True)
childs = [f'child{i}' for i in range(1, len(df_woman)+1)]
df_child = pd.DataFrame(data={'cat':childs, 'age': (df_woman['age'].sub(df_man['age'])).values})
df = pd.concat([df_man, df_woman, df_child], ignore_index=True)
print(df)
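With the man1/man2/woman1/woman2 sample above, this should print:
      cat  age
0    man1   30
1    man2   40
2  woman1   50
3  woman2   55
4  child1   20
5  child2   15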
I don't understand the logic behind this, but you can use append (shown here on a simplified gender/age frame):
age = df.loc[df['gender'] == 'female', 'age'].squeeze() \
    - df.loc[df['gender'] == 'male', 'age'].squeeze()
df = df.append({'gender': 'child', 'age': age}, ignore_index=True)
Output:
>>> df
gender age
0 male 3
1 female 4
2 child 1
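Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; an equivalent sketch of the last line with pd.concat would be:
# pd.concat replacement for the deprecated df.append call above
df = pd.concat([df, pd.DataFrame([{'gender': 'child', 'age': age}])],
               ignore_index=True)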

Adding column in a dataframe with 0,1 values based on another column values

In the example dataframe created below:
Name Age
0 tom 10
1 nick 15
2 juli 14
I want to add another column 'Checks' and set its values to 0 or 1 depending on whether the Name appears in the list check, e.g. check = ['nick'].
I have tried the below code:
import numpy as np
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
check = ['nick']
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df['Checks'] = np.where(df['Name']== check[], 1, 0)
#print dataframe.
print(df)
print(check)
You can use str.contains:
phrase = ['tom', 'nick']
df['check'] = df['Name'].str.contains('|'.join(phrase))
Note this does substring matching and returns booleans (chain .astype(int) for 0/1), whereas isin below matches exact values.
You can use pandas.Series.isin:
check = ['nick']
df['check'] = df['Name'].isin(check).astype(int)
Output:
Name Age check
0 tom 10 0
1 nick 15 1
2 juli 14 0
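Incidentally, the np.where attempt from the question also works once the list comparison is replaced with an isin mask; a minimal sketch:
df['Checks'] = np.where(df['Name'].isin(check), 1, 0)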

How to add a dataset identifier (like id column) when append two or more datasets?

I have multiple datasets in csv format that I would like to import by appending. Each dataset has the same columns name (fields), but different values and length.
For example:
df1
date name surname age address
...
df2
date name surname age address
...
I would like to have
df=df1+df2
date name surname age address dataset
(df1) 1
... 1
(df2) 2
... 2
i.e. I would like to add a new column that is an identifier for dataset (where fields come from, if from dataset 1 or dataset 2).
How can I do it?
Is this what you're looking for?
Note: the example has fewer columns than yours, but the method is the same.
import pandas as pd
df1 = pd.DataFrame({
    'name': [f'Name{i}' for i in range(5)],
    'age': range(10, 15)
})
df2 = pd.DataFrame({
    'name': [f'Name{i}' for i in range(20, 22)],
    'age': range(20, 22)
})
combined = pd.concat([df1, df2])
combined['dataset'] = [1] * len(df1) + [2] * len(df2)
print(combined)
Output
name age dataset
0 Name0 10 1
1 Name1 11 1
2 Name2 12 1
3 Name3 13 1
4 Name4 14 1
0 Name20 20 2
1 Name21 21 2
We have keys in concat; resetting that index level turns the key into the identifier column:
combined = pd.concat([df1, df2], keys=[1, 2]).reset_index(level=0).rename(columns={'level_0': 'dataset'})
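With the df1/df2 from the previous answer, this yields:
   dataset    name  age
0        1   Name0   10
1        1   Name1   11
2        1   Name2   12
3        1   Name3   13
4        1   Name4   14
0        2  Name20   20
1        2  Name21   21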
In Spark with Scala, I would do something like this:
import org.apache.spark.sql.functions._

val df1 = sparkSession.read
  .option("inferSchema", "true")
  .json("/home/shredder/Desktop/data1.json")
val df2 = sparkSession.read
  .option("inferSchema", "true")
  .json("/home/shredder/Desktop/data2.json")

val df1New = df1.withColumn("dataset", lit(1))
val df2New = df2.withColumn("dataset", lit(2))
val df3 = df1New.union(df2New)
df3.show()
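For readers staying in Python, a rough PySpark equivalent of the Scala snippet above (same placeholder paths):
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df1 = spark.read.option("inferSchema", "true").json("/home/shredder/Desktop/data1.json")
df2 = spark.read.option("inferSchema", "true").json("/home/shredder/Desktop/data2.json")

# tag each frame before the union, mirroring withColumn + lit in Scala
df3 = df1.withColumn("dataset", lit(1)).union(df2.withColumn("dataset", lit(2)))
df3.show()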

How do you replace duplicate values with multiple unique strings in Pandas?

import pandas as pd
import numpy as np
data = {'Name':['Tom', 'Tom', 'Jack', 'Terry'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(data)
Let's say I have a dataframe that looks like this. I am trying to figure out how to check the Name column for the value 'Tom': the first time it appears I want to replace it with 'FirstTom', and the second time with 'SecondTom'. How do you accomplish this? I've used the replace method before, but only for replacing all Toms with a single value. I don't want to add a 1 to the end of the value; I want to completely change the string to something else.
Edit:
If the df looked more like the one below, how would we check for Tom in the first column and the second column, and then replace the first instance with FirstTom and the second instance with SecondTom?
data = {'Name':['Tom', 'Jerry', 'Jack', 'Terry'], 'OtherName':['Tom', 'John', 'Bob', 'Steve']}
Just adding to the existing solutions: you can use inflect to create a dynamic dictionary.
import inflect
p = inflect.engine()
df['Name'] += df.groupby('Name').cumcount().add(1).map(p.ordinal).radd('_')
print(df)
Name Age
0 Tom_1st 20
1 Tom_2nd 21
2 Jack_1st 19
3 Terry_1st 18
We can use cumcount:
df.Name = df.Name + df.groupby('Name').cumcount().astype(str)
df
Name Age
0 Tom0 20
1 Tom1 21
2 Jack0 19
3 Terry0 18
Update
suf = lambda n: "%d%s" % (n, {1: "st", 2: "nd", 3: "rd"}.get(n if n < 20 else n % 10, "th"))
g = df.groupby('Name')
df.Name = df.Name.radd(g.cumcount().add(1).map(suf).mask(g.Name.transform('count') == 1, ''))
df
Name Age
0 1stTom 20
1 2ndTom 21
2 Jack 19
3 Terry 18
Update 2, for multiple name columns (s is assumed to be the two name columns stacked, matching the pattern in the next answer):
suf = lambda n: "%d%s" % (n, {1: "st", 2: "nd", 3: "rd"}.get(n if n < 20 else n % 10, "th"))
s = df[['Name', 'OtherName']].stack()
g = s.groupby([s.index.get_level_values(0), s])
s = s.radd(g.cumcount().add(1).map(suf).mask(g.transform('count') == 1, ''))
s = s.unstack()
Name OtherName
0 1stTom 2ndTom
1 Jerry John
2 Jack Bob
3 Terry Steve
EDIT: To count duplicates per row, use:
df = pd.DataFrame(data={'Name': ['Tom', 'Jerry', 'Jack', 'Terry'],
                        'OtherName': ['Tom', 'John', 'Bob', 'Steve'],
                        'Age': [20, 21, 19, 18]})
print (df)
Name OtherName Age
0 Tom Tom 20
1 Jerry John 21
2 Jack Bob 19
3 Terry Steve 18
import inflect
p = inflect.engine()
#map by function for dynamic counter
f = lambda i: p.number_to_words(p.ordinal(i))
#columns filled by names
cols = ['Name','OtherName']
#reshaped to MultiIndex Series
s = df[cols].stack()
#counter per groups
count = s.groupby([s.index.get_level_values(0),s]).cumcount().add(1)
#mask for filter duplicates
mask = s.reset_index().duplicated(['level_0',0], keep=False).values
#filter only duplicates and map, reshape back and add to original data
df[cols] = count[mask].map(f).unstack().add(df[cols], fill_value='')
print (df)
Name OtherName Age
0 firstTom secondTom 20
1 Jerry John 21
2 Jack Bob 19
3 Terry Steve 18
Use GroupBy.cumcount with Series.map, but only for duplicated values by Series.duplicated:
data = {'Name':['Tom', 'Tom', 'Jack', 'Terry'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(data)
nth = {
    0: "First",
    1: "Second",
    2: "Third",
    3: "Fourth"
}
mask = df.Name.duplicated(keep=False)
df.loc[mask, 'Name'] = df[mask].groupby('Name').cumcount().map(nth) + df.loc[mask, 'Name']
print (df)
Name Age
0 FirstTom 20
1 SecondTom 21
2 Jack 19
3 Terry 18
A dynamic dictionary would look like this:
import inflect
p = inflect.engine()
mask = df.Name.duplicated(keep=False)
f = lambda i: p.number_to_words(p.ordinal(i))
df.loc[mask, 'Name'] = df[mask].groupby('Name').cumcount().add(1).map(f) + df.loc[mask, 'Name']
print (df)
Name Age
0 firstTom 20
1 secondTom 21
2 Jack 19
3 Terry 18
transform
nth = ['First', 'Second', 'Third', 'Fourth']

def prefix(d):
    n = len(d)
    if n > 1:
        return d.radd([nth[i] for i in range(n)])
    else:
        return d

df.assign(Name=df.groupby('Name').Name.transform(prefix))
The output below reflects a sample extended with three extra 'Steve' rows:
Name Age
0 FirstTom 20
1 SecondTom 21
2 Jack 19
3 Terry 18
4 FirstSteve 17
5 SecondSteve 16
6 ThirdSteve 15

String increment of characters for a column

I've tried researching but didn't get any leads, so I am posting a question.
I have a df, and I want each character of the string values in a column to be incremented by 3 based on its ASCII value:
import pandas as pd

data = [['Tom', 10], ['Nick', 15], ['Juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
Name Age
0 Tom 10
1 Nick 15
2 Juli 14
The final answer should look like this, with each Name shifted by 3 ASCII code points:
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
This action has to be carried out on a df with 32,000 rows. Please suggest how to achieve this result.
Here's one way using Python's built-in chr and ord:
df['Name'] = [''.join(chr(ord(s)+3) for s in i) for i in df.Name]
print(df)
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
Try the code below:
data = [['Tom', 10], ['Nick', 15], ['Juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

def fn(inp_str):
    return ''.join([chr(ord(i) + 3) for i in inp_str])

df['Name'] = df['Name'].apply(fn)
df
Output is
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
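Since this runs on 32,000 rows, a hedged performance sketch: precompute a translation table once and use Series.str.translate, so the chr/ord arithmetic is not repeated in Python for every character of every row (assumes plain-ASCII names):
# map each ASCII code point to the character 3 positions up, built once
table = {c: chr(c + 3) for c in range(128)}
df['Name'] = df['Name'].str.translate(table)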
