I am trying to merge two fairly large dataframes of different sizes based on partial string matches.
df1$code_1 contains only 12-digit codes, while df2$code_2 contains a mix of 10- to 12-digit codes, where some of the shorter codes are substrings of the 12-digit codes in df1.
I therefore need to merge all exact 12-digit matches between the two dataframes, as well as those records in df2 whose 10- or 11-digit codes are substrings of codes in df1.
Example dataframes:
df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'))
df2 <- data.frame(code_2 = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))
df3 (merged)
code_1 code_2 name color
123456789012 123456789012 bob blue
210987654321 2109876543 joe red
567890543211 7890543211 sally green
987656789001 98765678900 john purple
123456654321 12345665432 lucy orange
678905432156 678905432156 alan brown
Try this SQL join.
library(sqldf)
sqldf("select a.code_1, b.code_2, a.name, b.color
from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")
giving:
code_1 code_2 name color
1 123456789012 123456789012 bob blue
2 210987654321 2109876543 joe red
3 567890543211 7890543211 sally green
4 987656789001 98765678900 john purple
5 123456654321 12345665432 lucy orange
6 678905432156 678905432156 alan brown
Update: revised answer to reflect the change in the question, so that (1) the substring can appear anywhere in the target string and (2) the code columns are now named code_1 and code_2.
We can use grep + sapply to find, for each df2$code, the index of its match in df1$code, and store it as a matchID. Next, we merge on matchID to get the desired output:
df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))
df_merge = merge(df1, df2, by = "matchID")[-1]
Note that if a df2$code does not match any df1$code, its matchID will be empty, and so that row will not merge with df1.
Results:
> df2
code color matchID
1 123456789012 blue 1
2 2109876543 red 2
3 7890543211 green 3
4 98765678900 purple 4
5 12345665432 orange 5
6 678905432156 brown 6
7 14124124124 black
> df_merge
code.x name code.y color
1 123456789012 bob 123456789012 blue
2 210987654321 joe 2109876543 red
3 567890543211 sally 7890543211 green
4 987656789001 john 98765678900 purple
5 123456654321 lucy 12345665432 orange
6 678905432156 alan 678905432156 brown
Data (added a non-match row for a better demo):
df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'),
stringsAsFactors = FALSE)
df2 <- data.frame(code = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156', '14124124124'),
color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
stringsAsFactors = FALSE)
Updated per new info. This should work:
df2$New <- sapply(df2$code_2, grep, df1$code_1, value = TRUE)
combined <- merge(df1, df2, by.x = "code_1", by.y = "New")
code_1 name code_2 color
1 123456654321 lucy 12345665432 orange
2 123456789012 bob 123456789012 blue
3 210987654321 joe 2109876543 red
4 567890543211 sally 7890543211 green
5 678905432156 alan 678905432156 brown
6 987656789001 john 98765678900 purple
In Python/pandas, you can do:
from pandas import DataFrame, Series
df1 = DataFrame(dict(
code1 = ('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = ('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom')))
df2 = DataFrame(dict(
code2 = ('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
color = ('blue', 'red', 'green', 'purple', 'orange', 'brown')))
matches = [df1[df1['code1'].str.contains(x)].index[0] for x in df2['code2']]
print(
df1.assign(subcode=Series(data=df2['code2'], index=matches))
.merge(df2, left_on='subcode', right_on='code2')
.drop('subcode', axis='columns')
)
And that dumps:
code1 name code2 color
0 123456789012 bob 123456789012 blue
1 210987654321 joe 2109876543 red
2 567890543211 sally 7890543211 green
3 987656789001 john 98765678900 purple
4 123456654321 lucy 12345665432 orange
5 678905432156 alan 678905432156 brown
Note: I hate using loops with dataframes, but this, uh, works, I guess.
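A loop-free variant is also possible (a sketch, assuming pandas >= 1.2 for how='cross'): cross-join the two frames and keep only the pairs where the shorter code is a substring of the 12-digit code.

```python
import pandas as pd

df1 = pd.DataFrame({
    'code1': ['123456789012', '210987654321', '567890543211', '987656789001',
              '123456654321', '678905432156', '768927461037', '780125634701',
              '673940175372', '167438501473'],
    'name': ['bob', 'joe', 'sally', 'john', 'lucy', 'alan',
             'fred', 'stephanie', 'greg', 'tom']})
df2 = pd.DataFrame({
    'code2': ['123456789012', '2109876543', '7890543211', '98765678900',
              '12345665432', '678905432156'],
    'color': ['blue', 'red', 'green', 'purple', 'orange', 'brown']})

# Pair every row of df1 with every row of df2, then keep only the
# pairs where code2 is a substring of code1.
pairs = df1.merge(df2, how='cross')
mask = pairs.apply(lambda r: r['code2'] in r['code1'], axis=1)
merged = pairs[mask].reset_index(drop=True)
print(merged)
```

This avoids the explicit Python loop at the cost of materializing the full cross product, so it is only practical when the frames are of modest size.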
I have a dataframe:
data = {'label': ['cat','dog','dog','cat','cat'],
        'breeds': ['bengal','shar pei','pug','maine coon','maine coon'],
        'nicknames': [['Loki','Loki'],['Max'],['Toby','Zeus ','Toby'],['Marty'],['Erin ','Erin']],
        'eye color': [['blue','green'],['green'],['brown','brown','brown'],['blue'],['green','brown']]}
frame = pd.DataFrame(data)
Output:
label breeds nicknames eye color
0 cat bengal [Loki,Loki] [blue, green]
1 dog shar pei [Max] [green]
2 dog pug [Toby,Zeus,Toby] [brown, brown, brown]
3 cat maine coon [Marty] [blue]
4 cat maine coon [Erin,Erin] [green, brown]
I want to group by ['label', 'breeds'] and count the unique values of nicknames and eye color, but output them in two separate columns, 'nickname_count' and 'eye_count'.
This code outputs everything in one column; how do I output them separately?
frame2=frame.groupby(['label','breeds'])['nicknames','eye color'].apply(lambda x: x.astype('str').value_counts().to_dict())
First, we use groupby with sum on the list columns, since sum concatenates the lists together:
>>> df_grouped = df.groupby(['label', 'breeds']).agg({'nicknames': sum, 'eye color': sum}).reset_index()
>>> df_grouped
label breeds nicknames eye color
0 cat bengal [Loki, Loki] [blue, green]
1 cat maine coon [Marty, Erin , Erin] [blue, green, brown]
2 dog pug [Toby, Zeus , Toby] [brown, brown, brown]
3 dog shar pei [Max] [green]
Then, we count the number of unique values in each list by converting it to a set and taking its length, saving the results in two new columns to get the expected result:
>>> df_grouped['nickname_count'] = df_grouped['nicknames'].apply(lambda x: len(set(x)))
>>> df_grouped['eye_count'] = df_grouped['eye color'].apply(lambda x: len(set(x)))
>>> df_grouped
label breeds nicknames eye color nickname_count eye_count
0 cat bengal [Loki, Loki] [blue, green] 1 2
1 cat maine coon [Marty, Erin , Erin] [blue, green, brown] 3 3
2 dog pug [Toby, Zeus , Toby] [brown, brown, brown] 2 1
3 dog shar pei [Max] [green] 1 1
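The same counts can also be computed in one pass with explode + nunique (a sketch, rebuilding the frame from the data above):

```python
import pandas as pd

frame = pd.DataFrame({
    'label': ['cat', 'dog', 'dog', 'cat', 'cat'],
    'breeds': ['bengal', 'shar pei', 'pug', 'maine coon', 'maine coon'],
    'nicknames': [['Loki', 'Loki'], ['Max'], ['Toby', 'Zeus ', 'Toby'],
                  ['Marty'], ['Erin ', 'Erin']],
    'eye color': [['blue', 'green'], ['green'], ['brown', 'brown', 'brown'],
                  ['blue'], ['green', 'brown']]})

# Explode each list column to one element per row, then count distinct
# elements per (label, breeds) group.
nick = frame.explode('nicknames').groupby(['label', 'breeds'])['nicknames'].nunique()
eyes = frame.explode('eye color').groupby(['label', 'breeds'])['eye color'].nunique()
counts = pd.DataFrame({'nickname_count': nick, 'eye_count': eyes}).reset_index()
print(counts)
```

Note that the trailing-space variants ('Erin ' vs 'Erin') still count as distinct values, matching the counts above.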
I am new to this, and I need to split a column that contains two strings into 2 columns, like this:
Initial dataframe:
Full String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
Final dataframe:
First String Second String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
I tried this, but it doesn't work:
df['First String'] , df['Second String'] = df['Full String'].str.split()
and this:
df['First String', 'Second String'] = df['Full String'].str.split()
How can I make it work? Thank you!
The key here is to include the parameter expand=True in your str.split() to expand the split strings into separate columns.
Type it like this:
df[['First String','Second String']] = df['Full String'].str.split(expand=True)
Output:
Full String First String Second String
0 Orange Juice Orange Juice
1 Pink Bird Pink Bird
2 Blue Ball Blue Ball
3 Green Tea Green Tea
4 Yellow Sun Yellow Sun
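If some rows can contain more than two words, limiting the split with n=1 keeps everything after the first space in the second column (a sketch with a made-up multi-word string):

```python
import pandas as pd

df = pd.DataFrame({'Full String': ['Orange Juice', 'Fresh Green Tea']})
# n=1 splits at the first space only, so multi-word remainders stay intact.
df[['First String', 'Second String']] = df['Full String'].str.split(n=1, expand=True)
print(df)
```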
Have you tried this solution?
https://stackoverflow.com/a/14745484/15320403
df = pd.DataFrame(df['Full String'].str.split(' ', n=1).tolist(), columns=['First String', 'Second String'])
I am a beginner in Python and trying to find a solution for the following problem.
I have a csv file:
name, mark
Anna,24
John,19
Mike,22
Monica,20
Alex, 17
Daniel, 26
And an xls file:
name, group
John, red
Anna, blue
Monica, blue
Mike, yellow
Alex, red
I am trying to get the result:
group, mark
Red, 36
Blue, 44
Yellow, 22
The number in result shows the total mark for the whole group.
I tried to find similar problems but was not successful, and I do not have enough experience to work out exactly what I have to do or which commands to use.
Use pd.read_csv with df.merge and GroupBy.sum (for the xls file, use pd.read_excel in the same way):
In [89]: df1 = pd.read_csv('file1.csv')
In [89]: df1
Out[89]:
name mark
0 Anna 24
1 John 19
2 Mike 22
3 Monica 20
4 Alex 17
5 Daniel 26
In [90]: df2 = pd.read_csv('file2.csv')
In [90]: df2
Out[90]:
name group
0 John red
1 Anna blue
2 Monica blue
3 Mike yellow
4 Alex red
In [94]: df = df1.merge(df2).groupby('group').sum().reset_index()
In [95]: df
Out[95]:
group mark
0 blue 44
1 red 36
2 yellow 22
EDIT: if you have other columns that you don't want to sum, do this:
In [284]: df1.merge(df2).groupby('group').agg({'mark': 'sum'}).reset_index()
Out[284]:
group mark
0 blue 44
1 red 36
2 yellow 22
I am trying to change a value in one column when certain strings appear in another column of the same row. I am new to Pandas.
I need to change the price of some oranges to 200, but not the price of 'Red Orange'. I cannot change the names in "fruits"; the real strings are much longer, and I shortened them here for convenience.
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 16
Red Orange from Costa 15
Pink Orange from Brazil 19
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
so that the final result would be
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 200
Red Orange from Costa 15
Pink Orange from Brazil 200
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
I tried
df.loc[df['fruits'].str.lower().str.contains('orange'), 'price'] = 200
But this changes the price of 4 items instead of only 2.
I also tried a for loop once, and it changed the price of the entire column.
You can use regex:
df.loc[df['fruits'].str.contains(r'(?<!Red) Orange', regex=True), 'price'] = 200
(?<!Red) is a negative lookbehind, so 'Orange' is not matched when 'Red' comes right before it. The mandatory space before 'Orange' also ensures it is matched as the second word, so the leading 'Orange' in 'Orange Apple' is left alone, and the case-sensitive match skips the lowercase 'orange' in 'Purple orange'.
df.loc[(df['fruits'].str.contains(' Orange')) & (~df['fruits'].str.contains('Red')), 'price'] = 200
We check for ' Orange' as a second word (the case-sensitive match skips 'Purple orange') and use ~ to confirm 'Red' is not present in the string. If both conditions are true, the price changes to 200.
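A self-contained sanity check on the sample rows, using a case-sensitive lookbehind mask (an assumption here: per the expected output, only the capitalized, second-word 'Orange' rows not preceded by 'Red' should change):

```python
import pandas as pd

df = pd.DataFrame({
    'fruits': ['Green apple from us', 'Orange Apple from US', 'Mango from Canada',
               'Blue Orange from Mexico', 'Red Orange from Costa',
               'Pink Orange from Brazil', 'Yellow Pear from Guatemala',
               'Black Melon from Guatemala', 'Purple orange from Honduras'],
    'price': [10, 11, 15, 16, 15, 19, 32, 4, 5]})

# ' Orange' as the second word, not preceded by 'Red', case-sensitive.
mask = df['fruits'].str.contains(r'(?<!Red) Orange', regex=True)
df.loc[mask, 'price'] = 200
print(df)
```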
I want to assign some selected nominal values randomly to rows. For example:
I have three nominal values ["apple", "orange", "banana"].
Before assign these values randomly to rows:
**Name Fruit**
Jack
Julie
Juana
Jenny
Christina
Dickens
Robert
Cersei
After assign these values randomly to rows:
**Name Fruit**
Jack Apple
Julie Orange
Juana Apple
Jenny Banana
Christina Orange
Dickens Orange
Robert Apple
Cersei Banana
How can I do this using pandas dataframe?
You can use np.random.choice with your values:
import numpy as np

vals = ["apple", "orange", "banana"]
df['Fruit'] = np.random.choice(vals, len(df))
>>> df
Name Fruit
0 Jack apple
1 Julie orange
2 Juana apple
3 Jenny orange
4 Christina apple
5 Dickens banana
6 Robert orange
7 Cersei orange
You can create a DataFrame in pandas and then assign random choices using NumPy:
import numpy as np
import pandas as pd

ex2 = pd.DataFrame({'Name': ['Jack','Julie','Juana','Jenny','Christina','Dickens','Robert','Cersei']})
ex2['Fruits'] = np.random.choice(['Apple','Orange','Banana'], ex2.shape[0])
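For reproducible assignments, the newer Generator API can be seeded (a sketch; the seed value is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed -> the same picks every run
df = pd.DataFrame({'Name': ['Jack', 'Julie', 'Juana', 'Jenny',
                            'Christina', 'Dickens', 'Robert', 'Cersei']})
df['Fruit'] = rng.choice(['Apple', 'Orange', 'Banana'], size=len(df))
print(df)
```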