How to match between row values - python

serial  name    match
1       John    5,6,8
2       Steve   1,7
3       Kevin   4
4       Kevin   3
5       John    1,6,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6
I need to compare the name in each row with the names of the rows whose serial numbers are listed in its match column. Keep the serials whose names match and remove the others. If nothing matches, put null.
serial  name    match   updated_match
1       John    5,6,8   5,8
2       Steve   1,7
3       Kevin   4       4
4       Kevin   3       3
5       John    1,6,8   1,8
6       Johnn   1,5,8
7       Steves  2
8       John    1,5,6   1,5

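Convert the values of the serial column to strings, then aggregate the serials into one set per name and map those sets back to a Series of the same size as the original. Split the match column and, for each row, remove the row's own serial from its name's set and intersect the result with the row's matches. Finally, sort and join back into strings: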
s = df['serial'].astype(str)
sets = df['name'].map(s.groupby(df['name']).agg(set))
match = df['match'].str.split(',')
df['updated_match'] = [','.join(sorted(b.difference([c]).intersection(a)))
                       for a, b, c in zip(match, sets, s)]
print (df)
   serial    name  match  updated_match
0       1    John  5,6,8            5,8
1       2   Steve    1,7
2       3   Kevin      4              4
3       4   Kevin      3              3
4       5    John  1,6,8            1,8
5       6   Johnn  1,5,8
6       7  Steves      2
7       8    John  1,5,6            1,5

You can use a mapping to determine which matches have the same name:
# ensure we have strings in serial
df = df.astype({'serial': 'str'})
# split and explode the individual serials
s = df['match'].str.split(',').explode()
# make a mapper: serial -> name
mapper = df.set_index('serial')['name']
# get exploded names
s2 = df.loc[s.index, 'serial'].map(mapper)
# keep only match with matching names
# aggregate back as string
df['updated_match'] = (s[s.map(mapper).eq(s2)]
                       .groupby(level=0).agg(','.join)
                       .reindex(df.index, fill_value='')
                       )
output:
   serial    name  match  updated_match
0       1    John  5,6,8            5,8
1       2   Steve    1,7
2       3   Kevin      4              4
3       4   Kevin      3              3
4       5    John  1,6,8            1,8
5       6   Johnn  1,5,8
6       7  Steves      2
7       8    John  1,5,6            1,5
Alternative with a groupby.apply (this relies on serial already holding strings, as in the cast above):
df['updated_match'] = (df.groupby('name', group_keys=False)
                       .apply(lambda g: g['match'].str.split(',').explode()
                              .loc[lambda x: x.isin(g['serial'])]
                              .groupby(level=0).agg(','.join))
                       .reindex(df.index, fill_value='')
                       )
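For readers who want to run the example end to end, here is a minimal self-contained sketch of the same idea using a plain dictionary lookup instead of set operations or explode. The DataFrame is rebuilt from the question's table; the names `name_of` and `keep_same_name` are illustrative and do not come from the answers above.

```python
import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    "serial": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["John", "Steve", "Kevin", "Kevin", "John", "Johnn", "Steves", "John"],
    "match": ["5,6,8", "1,7", "4", "3", "1,6,8", "1,5,8", "2", "1,5,6"],
})

# Lookup table: serial (as a string, to compare with the split pieces) -> name.
name_of = dict(zip(df["serial"].astype(str), df["name"]))


def keep_same_name(name, match):
    """Keep only the serials in `match` whose row carries the same name."""
    kept = [m for m in match.split(",") if name_of.get(m) == name]
    return ",".join(kept)


df["updated_match"] = [
    keep_same_name(n, m) for n, m in zip(df["name"], df["match"])
]
print(df)
```

Rows without any matching serial end up with an empty string; replace it with `None` or `pd.NA` afterwards if a real null is required.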

Related

How to loop pandas dataframe with subset of data (like group by) [duplicate]

I have a pandas dataframe that, after sorting, looks like the one below (think of a few people working shifts at a shop):
A B C D
1 1 1 Anna
2 3 1 Anna
3 1 2 Anna
4 3 2 Tom
5 3 2 Tom
6 3 2 Tom
7 3 2 Tom
8 1 1 Anna
9 3 1 Anna
10 1 2 Tom
...
I want to loop over the dataframe, split it into subsets, and call another function of mine on each subset, e.g.:
first subset df would be
A B C D
1 1 1 Anna
2 3 1 Anna
3 1 2 Anna
second subset df would be
4 3 2 Tom
5 3 2 Tom
6 3 2 Tom
7 3 2 Tom
third subset df would be
8 1 1 Anna
9 3 1 Anna
Is there a good way to loop over the main dataframe and split it?
for x in some_magic_here:
    sub_df = some_magic_here_too()
    my_fun(sub_df)
Thanks!
You need to loop over a groupby object whose groups are the consecutive runs. Build the group key by comparing D with its shifted values for inequality and taking the cumulative sum:
for i, sub_df in df.groupby(df.D.ne(df.D.shift()).cumsum()):
    print(sub_df)
    my_fun(sub_df)
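To make the shift/cumsum trick concrete, the sketch below rebuilds the question's sample frame and shows the run identifiers it produces. The body of `my_fun` is a hypothetical stand-in, since the question does not say what the function does.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "A": range(1, 11),
    "B": [1, 3, 1, 3, 3, 3, 3, 1, 3, 1],
    "C": [1, 1, 2, 2, 2, 2, 2, 1, 1, 2],
    "D": ["Anna", "Anna", "Anna", "Tom", "Tom", "Tom", "Tom",
          "Anna", "Anna", "Tom"],
})

# True wherever D differs from the previous row; the cumulative sum
# turns those change points into an increasing id per consecutive run.
run_id = df["D"].ne(df["D"].shift()).cumsum()


def my_fun(sub_df):
    """Hypothetical placeholder: report who worked and for how many rows."""
    return sub_df["D"].iloc[0], len(sub_df)


results = [my_fun(sub_df) for _, sub_df in df.groupby(run_id)]
print(run_id.tolist())
print(results)
```

A plain `df.groupby("D")` would merge both Anna runs into one group, which is why the run id is used as the key instead.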

Pandas: how to compare two columns in different sheets and return matched value

I have two dataframes with multiple columns.
I would like to compare df1['id'] and df2['id'] and return a new df with an additional column that holds the matched value.
example:
df1
   id  Name
1  1   Paul
2  2   Jean
3  3   Alicia
4  4   Jennifer
df2
   id  Name
1  1   Paul
2  6   Jean
3  3   Alicia
4  7   Jennifer
output
   id  Name      correct_id
1  1   Paul      1
2  2   Jean      N/A
3  3   Alicia    3
4  4   Jennifer  N/A
Note: the two columns I want to match do not have the same length.
Try:
df1["correct_id"] = (df1["id"].isin(df2["id"]) * df1["id"]).replace(0, "N/A")
print(df1)
Prints:
id Name correct_id
0 1 Paul 1
1 2 Jean N/A
2 3 Alicia 3
3 4 Jennifer N/A
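As a possible variation on the answer above, the sketch below uses `Series.where` with the same `isin` mask. It leaves a real missing value instead of the string "N/A", and an id of 0 would survive, whereas multiplying by the boolean mask and replacing 0 would turn it into "N/A". The frames are rebuilt from the question; the nullable `Int64` dtype is my choice for keeping the ids as integers, not something the original answer uses.

```python
import pandas as pd

# Sample data reconstructed from the question.
df1 = pd.DataFrame({"id": [1, 2, 3, 4],
                    "Name": ["Paul", "Jean", "Alicia", "Jennifer"]})
df2 = pd.DataFrame({"id": [1, 6, 3, 7],
                    "Name": ["Paul", "Jean", "Alicia", "Jennifer"]})

# Membership test: is each df1 id present anywhere in df2's id column?
mask = df1["id"].isin(df2["id"])

# Keep the id where the mask is True, leave <NA> elsewhere.
df1["correct_id"] = df1["id"].astype("Int64").where(mask)
print(df1)
```

Because `isin` checks membership rather than row-by-row equality, this also works when the two columns differ in length.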

Pandas - Data transformation of column using now delimiters

I have a pandas dataframe which consists of players names and statistics from a sporting match. The only source of data lists them in the following format:
# PLAYER M FG 3PT FT REB AST STL PTS
34 BLAKE Brad 38 17 5 6 3 0 3 0 24
12 JONES Ben 42 10 2 6 1 0 4 1 12
8 SMITH Todd J. 16 9 1 4 1 0 3 2 18
5 MAY-DOUGLAS James 9 9 0 3 1 0 2 1 6
44 EDLIN Taylor 12 6 0 5 1 0 0 1 8
The players' names are in reverse order: Surname Firstname. I need to transform the names to the correct order of firstname lastname. So, specifically:
BLAKE Brad -> Brad BLAKE
SMITH Todd J. -> Todd J. SMITH
MAY-DOUGLAS James -> James MAY-DOUGLAS
The case of the letters does not matter; however, I thought it could potentially be used to differentiate the first name from the last name. I know all last names will always be in uppercase, even if they include a hyphen. The first name will always be in sentence case (first letter uppercase and the rest lowercase). However, some names include a middle initial to differentiate players with the same name. I see how a space character could be used as a delimiter with a "split" transformation, but it gets difficult with the middle initial.
Are there any suggestions for a Pandas function I can use to achieve this?
The desired output is:
# PLAYER M FG 3PT FT REB AST STL PTS
34 Brad BLAKE 38 17 5 6 3 0 3 0 24
12 Ben JONES 42 10 2 6 1 0 4 1 12
8 Todd J. SMITH 16 9 1 4 1 0 3 2 18
5 James MAY-DOUGLAS 9 9 0 3 1 0 2 1 6
44 Taylor EDLIN 12 6 0 5 1 0 0 1 8
Split on the first whitespace only, then reverse the resulting list and join its values with a space (n must be passed by keyword in recent pandas versions).
df['PLAYER'] = df['PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
To reverse only certain names, you can use isin and then boolean indexing
names = ['BLAKE Brad', 'SMITH Todd J.', 'MAY-DOUGLAS James']
mask = df['PLAYER'].isin(names)
df.loc[mask, 'PLAYER'] = df.loc[mask, 'PLAYER'].str.split(' ', n=1).str[::-1].str.join(' ')
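A short runnable sketch of the split-and-reverse approach follows. It uses only the PLAYER column, rebuilt from the question, and assumes that the surname is always the first whitespace-free token, which holds for the sample rows, including the hyphenated one.

```python
import pandas as pd

# Only the PLAYER column from the question's table.
df = pd.DataFrame({"PLAYER": ["BLAKE Brad", "JONES Ben", "SMITH Todd J.",
                              "MAY-DOUGLAS James", "EDLIN Taylor"]})

df["PLAYER"] = (
    df["PLAYER"]
    .str.split(" ", n=1)  # ['SMITH', 'Todd J.']: split on the first space only
    .str[::-1]            # ['Todd J.', 'SMITH']: reverse the two pieces
    .str.join(" ")        # 'Todd J. SMITH'
)
print(df)
```

Limiting the split to one cut is what keeps "Todd J." together; an unlimited split followed by a reversal would also reorder the middle initial.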

Grouping values in a dataframe

I have a dataframe like this
Number Names
0 1 Josh
1 2 Jon
2 3 Adam
3 4 Barsa
4 5 Fekse
5 6 Bravo
6 7 Barsa
7 8 Talyo
8 9 Jon
9 10 Zidane
How can I group these numbers based on names?
for Number, Names in zip(dsa['Number'], dsa['Names']):
    print(Number, Names)
The above code gives me the following output
1 Josh
2 Jon
3 Adam
4 Barsa
5 Fekse
6 Bravo
7 Barsa
8 Talyo
9 Jon
10 Zidane
How can I get an output like the one below?
1 Josh
2,9 Jon
3 Adam
4,7 Barsa
5 Fekse
6 Bravo
8 Talyo
10 Zidane
I want to group the numbers based on names
Something like this?
df.groupby("Names")["Number"].unique()
This will return you a series and then you can transform as you wish.
Use pandas' groupby function with agg which aggregates columns. Assuming your dataframe is called df:
grouped_df = df.groupby(['Names']).agg({'Number' : ['unique']})
This is grouping by Names and within those groups reporting the unique values of Number.
Let's say the DataFrame is:
A = pd.DataFrame({'n':[1,2,3,4,5], 'name':['a','b','a','c','c']})
n name
0 1 a
1 2 b
2 3 a
3 4 c
4 5 c
You can use groupby to group by name, and then apply 'list' to the n of those names:
A.groupby('name')['n'].apply(list)
name
a [1, 3]
b [2]
c [4, 5]
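To get the exact comma-separated layout the question asks for, one option is to join the numbers inside the aggregation. This is a minimal sketch on the question's data; `sort=False` is my addition to keep the names in order of first appearance, as in the desired output.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Number": range(1, 11),
    "Names": ["Josh", "Jon", "Adam", "Barsa", "Fekse",
              "Bravo", "Barsa", "Talyo", "Jon", "Zidane"],
})

# One comma-separated string of numbers per name.
grouped = (
    df.groupby("Names", sort=False)["Number"]
      .agg(lambda s: ",".join(map(str, s)))
)

for name, numbers in grouped.items():
    print(numbers, name)
```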

How to strip the string and replace the existing elements in DataFrame

I have a df as below:
Index Site Name
0 Site_1 Tom
1 Site_2 Tom
2 Site_4 Jack
3 Site_8 Rose
5 Site_11 Marrie
6 Site_12 Marrie
7 Site_21 Jacob
8 Site_34 Jacob
I would like to strip the 'Site_' and only leave the number in the "Site" column, as shown below:
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
What is the best way to do this operation?
Using pd.Series.str.extract
This produces a copy with an updated column (the pattern is a raw string so that \D and \d are not treated as string escapes)
df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
To persist the results, reassign to the data frame name
df = df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Using pd.Series.str.split
df.assign(Site=df.Site.str.split('_', n=1).str[1])
Alternative
Update instead of producing a copy
df.update(df.Site.str.extract(r'\D+(\d+)', expand=False))
# Or
# df.update(df.Site.str.split('_', n=1).str[1])
df
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
Make an array consisting of the names you want. Then call
yourarray = pd.DataFrame(yourpd, columns=yournamearray)
Just call replace on the column to replace all instances of "Site_":
df['Site'] = df['Site'].str.replace('Site_', '')
Use .apply() to apply a function to each element in a series:
df['Site'] = df['Site'].apply(lambda x: x.split('_')[-1])
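You can use exactly what you wanted (the strip method). Note that strip removes any leading and trailing characters found in the given set ('S', 'i', 't', 'e', '_') rather than the literal prefix; it works here because only digits remain.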
>>> df["Site"] = df.Site.str.strip("Site_")
Output
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
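The sketch below pulls the approaches together on the question's data: it extracts the trailing digits with a raw-string pattern and converts them to integers. The integer conversion is an extra step not requested in the question, so drop `.astype(int)` if the column should stay as strings.

```python
import pandas as pd

# Sample data reconstructed from the question (note the gap at index 4).
df = pd.DataFrame(
    {
        "Site": ["Site_1", "Site_2", "Site_4", "Site_8",
                 "Site_11", "Site_12", "Site_21", "Site_34"],
        "Name": ["Tom", "Tom", "Jack", "Rose",
                 "Marrie", "Marrie", "Jacob", "Jacob"],
    },
    index=[0, 1, 2, 3, 5, 6, 7, 8],
)

# Capture the run of digits at the end of each value, then cast to int.
df["Site"] = df["Site"].str.extract(r"(\d+)$", expand=False).astype(int)
print(df)
```

Anchoring the pattern on the digits rather than on the "Site_" prefix means the same line would still work if the prefix text varied between rows.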
