Update DataFrame based on matching rows in another DataFrame - python

Say there is a group of people who can choose an English and / or a Spanish word. Let's say they chose like this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],english=['water',None,'hello','thanks',None,'green'],spanish=[None,'agua',None,None,'bienvenido','verde']))
person english spanish
0 mary water None
1 james None agua
2 patricia hello None
3 robert thanks None
4 jennifer None bienvenido
5 michael green verde
Say I also have an English-Spanish dictionary (assume no duplicates, i.e. one-to-one relationship):
>>> pandas.DataFrame(dict(english=['hello','bad','green','thanks','welcome','water'],spanish=['hola','malo','verde','gracias','bienvenido','agua']))
english spanish
0 hello hola
1 bad malo
2 green verde
3 thanks gracias
4 welcome bienvenido
5 water agua
How can I fill in any missing words, i.e. update the first DataFrame using the second DataFrame where either english or spanish is None, to arrive at this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],english=['water','water','hello','thanks','welcome','green'],spanish=['agua','agua','hola','gracias','bienvenido','verde']))
person english spanish
0 mary water agua
1 james water agua
2 patricia hello hola
3 robert thanks gracias
4 jennifer welcome bienvenido
5 michael green verde

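You can combine map with fillna: look up each missing word through the dictionary DataFrame (df2), keyed on the column that is present.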
df['english'] = df['english'].fillna(df['spanish'].map(df2.set_index('spanish')['english']))
df['spanish'] = df['spanish'].fillna(df['english'].map(df2.set_index('english')['spanish']))
df
Out[200]:
person english spanish
0 mary water agua
1 james water agua
2 patricia hello hola
3 robert thanks gracias
4 jennifer welcome bienvenido
5 michael green verde
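For reference, here is a self-contained sketch of the same map/fillna idea using the sample data from the question. The names es_to_en and en_to_es are only illustrative, and the sketch assumes the dictionary is one-to-one, as the question states.

```python
import pandas as pd

df = pd.DataFrame({
    "person": ["mary", "james", "patricia", "robert", "jennifer", "michael"],
    "english": ["water", None, "hello", "thanks", None, "green"],
    "spanish": [None, "agua", None, None, "bienvenido", "verde"],
})
df2 = pd.DataFrame({
    "english": ["hello", "bad", "green", "thanks", "welcome", "water"],
    "spanish": ["hola", "malo", "verde", "gracias", "bienvenido", "agua"],
})

# Lookup Series in both directions (illustrative names, not from the original answer).
es_to_en = df2.set_index("spanish")["english"]
en_to_es = df2.set_index("english")["spanish"]

# Fill missing English words from the Spanish column first, then fill
# missing Spanish words from the (now complete) English column.
df["english"] = df["english"].fillna(df["spanish"].map(es_to_en))
df["spanish"] = df["spanish"].fillna(df["english"].map(en_to_es))
print(df)
```

The order matters only in that the second line reads the English column after it has been filled; a row where both words are missing would stay empty.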

Related

How to deal with long names in data cleaning?

I have a users database. I want to separate them into two columns to have user1 and user2.
The way I was solving this was to split the names into multiple columns then merge the names to have the two columns of users.
The issue I run into is that some names are long, so after the split they take up extra columns in the data frame, which makes it harder to merge them properly.
Users
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi
After I split the user columns
0 1 2 3 4 5 6 7 8
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi
The expected result is the following
User1 User2
Maria Melinda Del Valle Justin Howard
Devin Craig Jr. Michael Carter III
Jeanne De Bordeaux Alhamdi
You can use:
def f(sr):
    m = sr.isna().cumsum().loc[lambda x: x < 2]
    return sr.dropna().groupby(m).apply(' '.join)

out = df.apply(f, axis=1).rename(columns=lambda x: f'User{x+1}')
Output:
>>> out
User1 User2
0 Maria Melinda Del Valle Justin Howard
1 Devin Craig Jr. Michael Carter III
2 Jeanne De Bordeaux Alhamdi
As suggested by @Barmar, if you know where to put the blank columns in the first split, you should already know how to create both columns.
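To make the answer reproducible, here is a sketch with a hypothetical split result. The positions of the blank (None) cells are an assumption, since the question does not show them; the approach only relies on one blank cell separating the two names in each row.

```python
import pandas as pd

# Hypothetical split result: a None cell is assumed to separate the two names.
df = pd.DataFrame([
    ["Maria", "Melinda", "Del", "Valle", None, "Justin", "Howard", None, None],
    ["Devin", "Craig", "Jr.", None, "Michael", "Carter", "III", None, None],
    ["Jeanne", "De", "Bordeaux", None, "Alhamdi", None, None, None, None],
])

def f(sr):
    # Running count of blanks; keep only the cells before the second blank.
    m = sr.isna().cumsum().loc[lambda x: x < 2]
    # Group the non-blank words by that count: 0 is the first name, 1 the second.
    return sr.dropna().groupby(m).apply(" ".join)

out = df.apply(f, axis=1).rename(columns=lambda x: f"User{x + 1}")
print(out)
```

The cumulative count of blanks acts as a group label, so every word before the first blank lands in User1 and every word between the first and second blank lands in User2.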

How to slice pandas column with index list?

I'm trying to extract the first two words from a string column in a DataFrame:
df["Name"]
Name
Anthony Frank Hawk
John Rodney Mullen
Robert Dean Silva Burnquis
Geoffrey Joseph Rowley
To get the index of the second space (" "), I tried the following, but find returns NaN instead of the number of characters up to the second space.
df["temp"] = df["Name"].str.find(" ")+1
df["temp"] = df["Status"].str.find(" ", start=df["Status"], end=None)
df["temp"]
0 NaN
1 NaN
2 NaN
3 NaN
The last step is to slice those names. I tried this code, but it doesn't work either.
df["Status"] = df["Status"].str.slice(0,df["temp"])
df["Status"]
0 NaN
1 NaN
2 NaN
3 NaN
Expected result:
0 Anthony Frank
1 John Rodney
2 Robert Dean
3 Geoffrey Joseph
If you have a more efficient way to do this, please let me know!
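You can use str.rpartition, which splits each string at the last space; column 0 of the result is everything before that space:
df['temp'] = df.Name.str.rpartition().get(0)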
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean Silva
3 Geoffrey Joseph Rowley Geoffrey Joseph
EDIT
If only the first two words are required in the output:
df['temp'] = df.Name.str.split().str[:2].str.join(' ')
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join(x[:2]))
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join([x[0], x[1]]))
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean
3 Geoffrey Joseph Rowley Geoffrey Joseph
You can also use str.index(substring) instead of str.find. It returns the lowest index at which the substring (such as " ", a single space) is found, and raises ValueError if the substring is absent. You can then split the string at that index and repeat the search on the second part to locate the second space.
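As an illustration of the index-based idea, here is a sketch of a plain-Python helper applied row by row. The function name first_two_words is hypothetical, and the sketch uses str.find rather than str.index so that names with fewer than two spaces are returned unchanged instead of raising ValueError.

```python
import pandas as pd

def first_two_words(name: str) -> str:
    """Return everything before the second space (hypothetical helper)."""
    first = name.find(" ")                # -1 when there is no space at all
    if first == -1:
        return name
    second = name.find(" ", first + 1)    # search again after the first space
    return name if second == -1 else name[:second]

df = pd.DataFrame({"Name": [
    "Anthony Frank Hawk",
    "John Rodney Mullen",
    "Robert Dean Silva Burnquis",
    "Geoffrey Joseph Rowley",
]})
df["temp"] = df["Name"].map(first_two_words)
print(df)
```

For larger frames the vectorized split/join variants shown earlier are likely to be more convenient; this version mainly shows why the slice position has to be computed per string.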

Remove some strings in dataframe

I'm trying to remove some strings in a dataframe that start with System:
My dataframe:
A B C
French house Blablabla System:Microsoft Windows XP; Browser:Chrome 32.0.1700;
English house my address: 101-102 bd Charles de Gaulle 75001 Paris
French apartment my name is Liam
French house Hello George!
English apartment System:Microsoft Windows XP; Browser:Chrome 32.0.1700;
I tried:
def remove_lines():
    df['C'] = df['C'].str.replace(r'(\s+)(System:).+','')
    return df
Nothing happens...
Good output:
A B C
French house Blablabla
English house my address: 101-102 bd Charles de Gaulle 75001 Paris
French apartment my name is Liam
French house Hello George!
English apartment
Use:
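# regex=True is required in pandas 2.0+, where str.replace matches literally by default
df.C = df.C.str.replace('System:.*', '', regex=True)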
df.C
# 0 Blablabla
# 1 my address: 101-102 bd Charles de Gaulle 75001...
# 2 my name is Liam
# 3 Hello George!
# 4
# Name: C, dtype: object
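A self-contained version of the regex approach, built from the sample rows in the question, might look like the sketch below. The `\s*` prefix is an addition here (not part of the answer above) so that the space before "System:" is removed as well.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["French", "English", "French", "French", "English"],
    "B": ["house", "house", "apartment", "house", "apartment"],
    "C": [
        "Blablabla System:Microsoft Windows XP; Browser:Chrome 32.0.1700;",
        "my address: 101-102 bd Charles de Gaulle 75001 Paris",
        "my name is Liam",
        "Hello George!",
        "System:Microsoft Windows XP; Browser:Chrome 32.0.1700;",
    ],
})

# Drop "System:" and everything after it, plus any whitespace just before it.
# regex=True is passed explicitly because the default differs across pandas versions.
df["C"] = df["C"].str.replace(r"\s*System:.*", "", regex=True)
print(df["C"].tolist())
```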
You can simply use split function on System and pick the first part, like this:
In [1936]: df.C = pd.DataFrame(df.C.str.split('System').tolist())[0]
In [1937]: df
Out[1937]:
A B C
0 French house Blablabla
1 English house my address: 101-102 bd Charles de Gaulle 75001...
2 French apartment my name is Liam
3 French house Hello George!
4 English apartment

Extract certain elements based on element location from another column

I have two columns in a DataFrame: crewname is a list of the crew members who worked on a film, and Director_loc is the position of the director within that list.
I want to create a new column which has the name of the director.
crewname Director_loc
[John Lasseter, Joss Whedon, Andrew Stanton, J... 0
[Larry J. Franco, Jonathan Hensleigh, James Ho... 3
[Howard Deutch, Mark Steven Johnson, Mark Stev... 0
[Forest Whitaker, Ronald Bass, Ronald Bass, Ez... 0
[Alan Silvestri, Elliot Davis, Nancy Meyers, N... 5
[Michael Mann, Michael Mann, Art Linson, Micha... 0
[Sydney Pollack, Barbara Benedek, Sydney Polla... 0
[David Loughery, Stephen Sommers, Peter Hewitt... 2
[Peter Hyams, Karen Elise Baldwin, Gene Quinta... 0
[Martin Campbell, Ian Fleming, Jeffrey Caine, ... 0
I've tried a number of codes using list comprehension, enumerate etc. I'm a bit embarrassed to put them here.
Any help will be appreciated.
Use indexing with list comprehension:
df['name'] = [a[b] for a, b in zip(df['crewname'], df['Director_loc'])]
print(df)
crewname Director_loc \
0 [John Lasseter, Joss Whedon, Andrew Stanton] 2
1 [Larry J. Franco, Jonathan Hensleigh] 1
name
0 Andrew Stanton
1 Jonathan Hensleigh
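The output above can be reproduced with the following sketch, which uses the same two shortened rows as the answer's sample data (not the full lists from the question).

```python
import pandas as pd

df = pd.DataFrame({
    "crewname": [
        ["John Lasseter", "Joss Whedon", "Andrew Stanton"],
        ["Larry J. Franco", "Jonathan Hensleigh"],
    ],
    "Director_loc": [2, 1],
})

# Walk both columns in parallel and index each list with its own position.
df["name"] = [crew[loc] for crew, loc in zip(df["crewname"], df["Director_loc"])]
print(df["name"].tolist())
```

A position outside the list would raise IndexError, so this assumes Director_loc is always a valid index for its row.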

Pandas: Concatenate two dataframes with different column names

I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
However, I want an easy way to specify that df1.actorName should be joined with df2.directorName, and actorID with directorID. How can I do this?
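One possible approach, sketched here and not taken from the thread, is to rename the columns of both frames to shared names before calling pd.concat. The pairs mapping and the combined column names are assumptions chosen to match the desired output shown above, and only the first two rows of each frame are used to keep the example short.

```python
import pandas as pd

df1 = pd.DataFrame({
    "actorID": ["annie_potts", "bill_farmer"],
    "actorName": ["Annie Potts", "Bill Farmer"],
})
df2 = pd.DataFrame({
    "directorID": ["john_lasseter", "joe_johnston"],
    "directorName": ["John Lasseter", "Joe Johnston"],
})

# Hypothetical mapping: which df1 column lines up with which df2 column.
pairs = {"actorID": "directorID", "actorName": "directorName"}

combined = pd.concat(
    [
        df1.rename(columns={a: f"{a}-{d}" for a, d in pairs.items()}),
        df2.rename(columns={d: f"{a}-{d}" for a, d in pairs.items()}),
    ],
    ignore_index=True,  # renumber rows 0..n-1 instead of repeating 0, 1, ...
)
print(combined)
```

Because pd.concat aligns on column labels, giving both frames the same labels first is what makes the rows stack into two columns rather than four.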
