How do I match two data frames in pandas with multiple matches? - python

I have 2 data frames. I want to match some data from one data frame and append it to another.
df1 looks like this:
sourceId  firstName   lastName
1234      John        Doe
5678      Sally       Green
9101      Chlodovech  Anderson
df2 looks like this:
sourceId   agentId
123456789  1234,5678
987654321  9101
143216546  1234,5678
I want my Final Data Frame to look like this:
sourceId  firstName   lastName  agentId
1234      John        Doe       123456789,143216546
5678      Sally       Green     123456789,143216546
9101      Chlodovech  Anderson  987654321
Usually appending data is easy, but I'm not quite sure how to match this data up and then join the matches with commas in between. I'm fairly new to pandas, so any help is appreciated.

This works. It's long and not the most elegant, but it works well :)
tmp = (df2.assign(agentId=df2['agentId'].str.split(',')).explode('agentId')
          .set_index('agentId')['sourceId'].astype(str)
          .groupby(level=0).agg(list).str.join(',').reset_index())
df1['sourceId'] = df1['sourceId'].astype(str)
new_df = (df1.merge(tmp, left_on='sourceId', right_on='agentId').drop('agentId', axis=1)
             .rename({'sourceId_x': 'sourceId', 'sourceId_y': 'agentId'}, axis=1))
Output:
>>> new_df
sourceId firstName lastName agentId
0 1234 John Doe 123456789,143216546
1 5678 Sally Green 123456789,143216546
2 9101 Chlodovech Anderson 987654321
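If the one-liner above is hard to follow, here is the same idea as a sketch in smaller steps (same df1/df2 as above; a map is used instead of the merge, but the output matches):
# One row per (agent sourceId, person id) pair
long = df2.assign(agentId=df2['agentId'].str.split(',')).explode('agentId')
# For each person id, collect every matching agent sourceId into one comma-joined string
mapping = long.groupby('agentId')['sourceId'].apply(lambda s: ','.join(s.astype(str)))
# Look the mapping up for each person in df1
df1['agentId'] = df1['sourceId'].astype(str).map(mapping)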

Related

Python remove text if same as another column

I want to drop text from a column in my dataframe if it starts with the same text that is in another column.
Example of dataframe:
name var1
John Smith John Smith Hello world
Mary Jane Mary Jane Python is cool
James Bond My name is James Bond
Peter Pan Nothing happens here
Dataframe that I want:
name var1
John Smith Hello world
Mary Jane Python is cool
James Bond My name is James Bond
Peter Pan Nothing happens here
Something as simple as:
df[~df.var1.str.contains(df.var1)]
does not work. How should I write my Python code?
Try using apply with a lambda:
df["var1"] = df.apply(lambda x: x["var1"][len(x["name"]):].strip()
                      if x["name"] == x["var1"][:len(x["name"])] else x["var1"], axis=1)
How about this?
df['var1'] = [df.loc[i, 'var1'].replace(df.loc[i, 'name'], "") for i in df.index]
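Note that .replace removes the name wherever it occurs, so "My name is James Bond" would become "My name is", which differs from the desired output above. If you only want to strip a leading name, a small variation (a sketch, same df as in the question):
# Strip the name only when var1 actually starts with it,
# so "My name is James Bond" stays unchanged
df['var1'] = [v[len(n):].strip() if v.startswith(n) else v
              for n, v in zip(df['name'], df['var1'])]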

Match name and surname from two data frames, extract the middle name from one data frame and append it to the other

I have two almost identical data frames, A and B. In reality they are two data frames with 1000+ names each.
I want to match name and surname from both data frames and then bring the middle name from data frame B into data frame A.
data frame A
name surname
John Doe
Tom Sawyer
Huckleberry Finn
data frame B
name middle_name surname
John `O Doe
Tom Philip Sawyer
Lilly Tomas Finn
The result I seek:
name middle_name surname
John `O Doe
Tom Philip Sawyer
You can use df.merge with the parameters how='inner' and on=['name','surname']. To get the correct column order, use df.reindex over axis 1.
df = df.merge(df1, how='inner', on=['name', 'surname'])
df = df.reindex(['name', 'middle_name', 'surname'], axis=1)
name middle_name surname
0 John `O Doe
1 Tom Philip Sawyer
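For completeness, a self-contained sketch (the frame names A and B are taken from the question; the values are re-typed here) showing that the inner merge keeps only the rows present in both frames:
import pandas as pd

A = pd.DataFrame({'name': ['John', 'Tom', 'Huckleberry'],
                  'surname': ['Doe', 'Sawyer', 'Finn']})
B = pd.DataFrame({'name': ['John', 'Tom', 'Lilly'],
                  'middle_name': ['`O', 'Philip', 'Tomas'],
                  'surname': ['Doe', 'Sawyer', 'Finn']})

out = A.merge(B, how='inner', on=['name', 'surname'])
out = out.reindex(['name', 'middle_name', 'surname'], axis=1)
print(out)  # Huckleberry/Lilly Finn drop out; only rows present in both frames remain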

Generate a UUID in a PySpark dataframe based on unique value of a field

Is there currently no way to generate a UUID in a PySpark dataframe based on the unique value of a field?
I understand that pandas can do what I want very easily, but if I want to give a unique UUID to each row of my PySpark dataframe based on a specific column attribute, how do I do that?
Say I have a pandas DataFrame like so:
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
Name
0 John Doe
1 Jane Smith
2 John Doe
3 Jane Smith
4 Jack Dawson
5 John Doe
And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:
df:
Name UUID
0 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
1 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
2 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
3 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
4 Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
5 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
import uuid
for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
I searched all over but could not find an example of doing this with PySpark.
What you actually want is to apply a hash function. A hash function applied to the same value will always output the same result. A UUID, on the other hand, is simply a 128-bit integer, so just apply a 128-bit hash function and interpret the result as a UUID. MD5, for instance, is such a hash function.
import hashlib
import uuid

def compute_uuid(name: str) -> uuid.UUID:
    digest = hashlib.md5(name.encode()).digest()
    return uuid.UUID(bytes=digest)
assert compute_uuid('alice') != compute_uuid('bob')
You can apply this new function to your dataframe:
df['UUID'] = [compute_uuid(name) for name in df['Name']]
Applied to your example dataframe, I get:
Name UUID
0 John Doe 4c2a904b-afba-0659-1225-113ad17b5cec
1 Jane Smith 71768b5e-2a0b-3697-eb3c-0c6d4ebbbaf8
2 John Doe 4c2a904b-afba-0659-1225-113ad17b5cec
3 Jane Smith 71768b5e-2a0b-3697-eb3c-0c6d4ebbbaf8
4 Jack Dawson ba4f82d8-ef72-6e37-eb87-e5c3b0dce9e3
5 John Doe 4c2a904b-afba-0659-1225-113ad17b5cec
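Since the question is really about PySpark, here is a minimal sketch of the same deterministic-UUID idea as a UDF. It assumes an existing SparkSession named spark and a DataFrame with a Name column; it is an illustration, not the only way to do this:
import hashlib
import uuid
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def name_to_uuid(name):
    # Same MD5-based UUID as the pandas version above, returned as a string
    if name is None:
        return None
    return str(uuid.UUID(bytes=hashlib.md5(name.encode()).digest()))

sdf = spark.createDataFrame([('John Doe',), ('Jane Smith',), ('John Doe',)], ['Name'])
sdf = sdf.withColumn('UUID', name_to_uuid(F.col('Name')))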

Concatenate multiple column strings into one column

I have the following dataframe with firstname and lastname columns. I want to create a fullname column.
import pandas as pd
import numpy as np  # pd.np was removed from pandas; use numpy directly
df1 = pd.DataFrame({'firstname': ['jack', 'john', 'donald'],
                    'lastname': [np.nan, 'obrien', 'trump']})
print(df1)
firstname lastname
0 jack NaN
1 john obrien
2 donald trump
This works if there are no NaN values:
df1['fullname'] = df1['firstname']+df1['lastname']
However, since there are NaNs in my dataframe, I decided to cast to string first, but that causes a problem in the fullname column:
df1['fullname'] = str(df1['firstname'])+str(df1['lastname'])
firstname lastname fullname
0 jack NaN 0 jack\n1 john\n2 donald\nName: f...
1 john obrien 0 jack\n1 john\n2 donald\nName: f...
2 donald trump 0 jack\n1 john\n2 donald\nName: f...
I can write some function that checks for NaNs and inserts the data into the new frame, but before I do that: is there another fast method to combine these strings into one column?
You need to handle the NaNs using .fillna(). Here, you can fill them with ''.
df1['fullname'] = df1['firstname'] + ' ' +df1['lastname'].fillna('')
Output:
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
You may also use .add and specify a fill_value
df1.firstname.add(" ").add(df1.lastname, fill_value="")
PS: Chaining too many adds or + is not recommended for strings, but for one or two columns you should be fine
df1['fullname'] = df1['firstname']+df1['lastname'].fillna('')
There is also Series.str.cat, which handles NaN and lets you specify the separator.
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="")
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
What I would do (for the case where more than two columns need to be joined):
df1.stack().groupby(level=0).agg(' '.join)
Out[57]:
0 jack
1 john obrien
2 donald trump
dtype: object
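If more than two columns need to be joined, the same stack/groupby pattern extends directly. A small illustration (the middlename column is hypothetical, not from the question; stack() drops the NaN values here, which is what makes the missing parts disappear from the join):
cols = df1[['firstname', 'lastname']].copy()
cols.insert(1, 'middlename', ['h.', None, 'j.'])  # hypothetical middle names
# stack() drops the missing middlename, so each row joins only its non-null parts
df1['fullname'] = cols.stack().groupby(level=0).agg(' '.join)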

Dropping selected rows in Pandas with duplicated columns

Suppose I have a dataframe like this:
fname  lname  email
Joe    Aaron
Joe    Aaron  some#some.com
Bill   Smith
Bill   Smith
Bill   Smith  some2#some.com
Is there a terse and convenient way to drop rows where {fname, lname} is duplicated and email is blank?
You should first check whether your "empty" data is NaN or empty strings. If they are a mixture, you may need to modify the below logic.
If empty rows are NaN
Using pd.DataFrame.sort_values and pd.DataFrame.drop_duplicates:
df = df.sort_values('email')\
.drop_duplicates(['fname', 'lname'])
If empty rows are strings
If your empty rows are strings, you need to specify ascending=False when sorting:
df = df.sort_values('email', ascending=False)\
.drop_duplicates(['fname', 'lname'])
Result
print(df)
fname lname email
4 Bill Smith some2#some.com
1 Joe Aaron some#some.com
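If your blanks turn out to be a mixture of NaN and empty strings (the caveat above), one way to handle it is to normalise the blanks first; a small sketch, assuming empty strings should be treated as missing:
import numpy as np

# Turn empty strings into NaN so the NaN-based logic above applies uniformly
df['email'] = df['email'].replace('', np.nan)
df = df.sort_values('email').drop_duplicates(['fname', 'lname'])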
You can use first with groupby (notice the empty strings are replaced with np.nan, since first returns the first non-null value in each column):
import numpy as np
df.replace('', np.nan).groupby(['fname', 'lname']).first().reset_index()
Out[20]:
fname lname email
0 Bill Smith some2#some.com
1 Joe Aaron some#some.com
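A more literal reading of the question is also possible: drop a row only when its (fname, lname) pair is duplicated and its email is blank. A sketch of that, assuming blanks may be NaN or empty strings:
blank = df['email'].fillna('').eq('')
dup = df.duplicated(['fname', 'lname'], keep=False)
# Note: a name that never has a non-blank email would lose all of its rows here
df = df[~(blank & dup)]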
