Combine two columns of string data into one with a selection rule - python

I have to combine two columns of string data into one (in the same DataFrame), and I also need some sort of selection rule. Here is an example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'nameA': ['martin', 'peter', 'john', 'tom', 'bill'],
                   'nameB': [np.NaN, np.NaN, 'jhon', 'tomX', 'billX']})
df
    nameA  nameB
0  martin    NaN
1   peter    NaN
2    john   jhon
3     tom   tomX
4    bill  billX
This output is the expected behavior:
    nameA  nameB  nameAB
0  martin    NaN  martin
1   peter    NaN   peter
2    john   jhon    jhon
3     tom   tomX    tomX
4    bill  billX   billX
The rule should be something like this:
if A and B are different, write B
if B is NaN, write A
if A and B are both NaN, write NaN
I have found tips for numbers but not for strings. I think I have to test row by row, get a true or false value, and then write the appropriate value.
Any guidance or assistance would be greatly appreciated!

You can use df.combine_first():
In [1972]: df['nameAB'] = df.nameB.combine_first(df.nameA)
In [1973]: df
Out[1973]:
    nameA  nameB  nameAB
0  martin    NaN  martin
1   peter    NaN   peter
2    john   jhon    jhon
3     tom   tomX    tomX
4    bill  billX   billX
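For reference, Series.combine_first keeps the caller's non-NaN values, fills its NaN positions from the argument, and aligns on the index (taking the union of both indexes). A minimal sketch of that fallback behavior, using two hypothetical series:
import numpy as np
import pandas as pd

s1 = pd.Series([np.nan, 'tomX'], index=[0, 1])
s2 = pd.Series(['tom', 'bill'], index=[1, 2])

# Index 0: NaN in s1 and absent from s2 -> stays NaN.
# Index 1: s1's non-NaN value wins over s2's.
# Index 2: only present in s2 -> taken from s2.
print(s1.combine_first(s2))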

Use np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame({'nameA': ['martin', 'peter', 'john', 'tom', 'bill'],
                   'nameB': [np.NaN, np.NaN, 'jhon', 'tomX', 'billX']})
df['nameAB'] = np.where(pd.isna(df['nameB']), df['nameA'], df['nameB'])
print(df)
Output
    nameA  nameB  nameAB
0  martin    NaN  martin
1   peter    NaN   peter
2    john   jhon    jhon
3     tom   tomX    tomX
4    bill  billX   billX
Given your conditions, you only ever return nameA when nameB is NaN, so a single np.where suffices.
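If the rule ever grows beyond two branches (for instance, a separate case when both columns are NaN), np.select generalizes the same idea. A minimal sketch under that assumption, reusing the example frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'nameA': ['martin', 'peter', 'john', 'tom', 'bill'],
                   'nameB': [np.nan, np.nan, 'jhon', 'tomX', 'billX']})

# Conditions are checked in order; the first match wins.
conditions = [
    df['nameB'].notna(),  # B present -> take B (covers "A and B differ")
    df['nameA'].notna(),  # only A present -> take A
]
choices = [df['nameB'], df['nameA']]
# Neither present -> NaN.
df['nameAB'] = np.select(conditions, choices, default=np.nan)
print(df)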

Use Series.fillna:
df['nameAB'] = df['nameB'].fillna(df['nameA'])
print(df)
Output
    nameA  nameB  nameAB
0  martin    NaN  martin
1   peter    NaN   peter
2    john   jhon    jhon
3     tom   tomX    tomX
4    bill  billX   billX

Related

Keep original string values after pandas str.extract() if the regex doesn't match

My input data:
df = pd.DataFrame({'A': ['adam', 'monica', 'joe doe', 'michael mo'],
                   'B': ['david', 'valenti', np.nan, np.nan]})
print(df)
            A        B
0        adam    david
1      monica  valenti
2     joe doe      NaN
3  michael mo      NaN
I need to extract the string after the space into the second column, but when I use my code...:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')
print(df)
            A    B
0        adam  NaN
1      monica  NaN
2     joe doe  doe
3  michael mo   mo
...I receive NaN in each cell where a value has not been extracted. How can I avoid that?
I tried to extract only from the rows where NaN exists using this code:
df.loc[df.B.isna(),'B'] = df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
ValueError: Incompatible indexer with DataFrame
Expected output:
            A        B
0        adam    david
1      monica  valenti
2     joe doe      doe
3  michael mo       mo
I think the solution can be simpler - split by whitespace, take the second element of each split list, and pass it to Series.fillna:
df['B'] = df['B'].fillna(df['A'].str.split().str[1])
print (df)
            A        B
0        adam    david
1      monica  valenti
2     joe doe      doe
3  michael mo       mo
Detail:
print (df['A'].str.split().str[1])
0    NaN
1    NaN
2    doe
3     mo
Name: A, dtype: object
Your solution should be changed:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')[0].fillna(df.B)
print (df)
            A        B
0        adam    david
1      monica  valenti
2     joe doe      doe
3  michael mo       mo
A better solution is to change the regex and pass expand=False so a Series is returned:
df['B'] = df['A'].str.extract(r'( [a-zA-Z].*)', expand=False).fillna(df.B)
print (df)
            A        B
0        adam    david
1      monica  valenti
2     joe doe      doe
3  michael mo       mo
Detail:
print (df['A'].str.extract(r'( [a-zA-Z].*)', expand=False))
0     NaN
1     NaN
2     doe
3      mo
Name: A, dtype: object
EDIT:
To extract values for the first column as well, the simplest approach is to use:
df1 = df['A'].str.split(expand=True)
df['A'] = df1[0]
df['B'] = df['B'].fillna(df1[1])
print (df)
         A        B
0     adam    david
1   monica  valenti
2      joe      doe
3  michael       mo
Your approach doesn't work because the left and right sides of your statement have different shapes. The left part has shape (2,) and the right part (2, 2):
df.loc[df.B.isna(),'B']
Returns:
2    NaN
3    NaN
And you want to fill this with:
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
Returns:
      0   1
2   doe  oe
3    mo   o
If you take column 1, it has the same shape (2,) as the left part and the assignment fits:
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')[1]
Returns:
2    oe
3     o
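Putting the two points together: selecting a single group with expand=False (the regex from the answer above) keeps both sides the same shape, and stripping the leading space gives the expected values. A sketch, assuming the same df as in the question:
# Fill only the missing B values; expand=False returns a Series, so the
# shapes on both sides of the assignment line up.
mask = df['B'].isna()
extracted = df.loc[mask, 'A'].str.extract(r'( [a-zA-Z].*)', expand=False)
df.loc[mask, 'B'] = extracted.str.strip()  # drop the leading space
print(df)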

How to melt a dataframe - get the column name into a field of the melted dataframe

I have a df as below:
     name    0    1    2    3    4
0    alex  NaN  NaN   aa   bb  NaN
1    mike  NaN   rr  NaN  NaN  NaN
2  rachel   ss  NaN  NaN  NaN   ff
3    john  NaN   ff  NaN  NaN  NaN
The melt function should return the below:
     name code
0    alex    2
1    alex    3
2    mike    1
3  rachel    0
4  rachel    4
5    john    1
Any suggestion is helpful. Thanks.
Just follow these steps: melt, dropna, sort by name, reset the index, and finally drop any unwanted columns:
In [1171]: df.melt(['name'], var_name='code').dropna().sort_values('name').reset_index().drop(['index', 'value'], axis=1)
Out[1171]:
     name code
0    alex    2
1    alex    3
2    john    1
3    mike    1
4  rachel    0
5  rachel    4
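A slightly shorter spelling of the same chain, assuming a reasonably recent pandas (drop with columns= and reset_index(drop=True) instead of dropping the old index column afterwards):
out = (df.melt('name', var_name='code')
         .dropna(subset=['value'])   # keep only cells that held a value
         .drop(columns='value')
         .sort_values('name')
         .reset_index(drop=True))
print(out)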
This should also work - the core idea is unstack:
df.unstack().reset_index().dropna()
With the full cleanup:
df.set_index('name').unstack().reset_index().rename(columns={'level_0': 'Code'}).dropna().drop(0, axis=1)[['name', 'Code']].sort_values('name')
The output will be:
     name  Code
     alex     2
     alex     3
     john     1
     mike     1
   rachel     0
   rachel     4

Pandas: how to merge two dataframes on multiple columns?

I have 2 dataframes, df1 and df2.
df1 contains information about interactions between some people:
df1
   Name1  Name2
0   Jack   John
1  Sarah   Jack
2  Sarah    Eva
3    Eva    Tom
4    Eva   John
df2 contains the status of people in general, including some of the people in df1:
df2
    Name  Y
0   Jack  0
1   John  1
2  Sarah  0
3    Tom  1
4  Laura  0
I would like df2 to contain only the people that are in df1 (Laura disappears), and to keep NaN for those that are not in df2 (e.g. Eva), such as:
df2
    Name    Y
0   Jack    0
1   John    1
2  Sarah    0
3    Tom    1
4    Eva  NaN
Create a DataFrame from the unique values of df1 and map it against df2:
df = pd.DataFrame(np.unique(df1.values),columns=['Name'])
df['Y'] = df.Name.map(df2.set_index('Name')['Y'])
print(df)
    Name    Y
0    Eva  NaN
1   Jack  0.0
2   John  1.0
3  Sarah  0.0
4    Tom  1.0
Note: the original order is not preserved.
You can also create an array of the unique names in df1 and use isin:
names = np.unique(df1[['Name1', 'Name2']].values.ravel())
df2.loc[~df2['Name'].isin(names), 'Y'] = np.nan
    Name    Y
0   Jack  0.0
1   John  1.0
2  Sarah  0.0
3    Tom  1.0
4  Laura  NaN
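Note that the isin approach blanks out Laura's status but keeps her row and never adds Eva. If you want the expected set of rows (Laura dropped, Eva kept as NaN, though in alphabetical order), one option is a left merge on the unique-names frame; a sketch assuming the df1 and df2 from the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name1': ['Jack', 'Sarah', 'Sarah', 'Eva', 'Eva'],
                    'Name2': ['John', 'Jack', 'Eva', 'Tom', 'John']})
df2 = pd.DataFrame({'Name': ['Jack', 'John', 'Sarah', 'Tom', 'Laura'],
                    'Y': [0, 1, 0, 1, 0]})

# Every name appearing anywhere in df1, deduplicated.
names = pd.DataFrame({'Name': np.unique(df1[['Name1', 'Name2']].values.ravel())})

# Left merge: names missing from df2 (Eva) get NaN,
# names missing from df1 (Laura) are dropped.
result = names.merge(df2, on='Name', how='left')
print(result)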

Pandas data frame spread function or similar?

Here's a pandas df:
df = pd.DataFrame({'First': ['John', 'Jane', 'Mary'],
                   'Last': ['Smith', 'Doe', 'Johnson'],
                   'Group': ['A', 'B', 'A'],
                   'Measure': [2, 11, 1]})
df
Out[38]:
  First     Last Group  Measure
0  John    Smith     A        2
1  Jane      Doe     B       11
2  Mary  Johnson     A        1
I would like to "spread" the Group variable with the values in Measure.
df_desired
Out[39]:
  First     Last  A   B
0  John    Smith  2   0
1  Jane      Doe  0  11
2  Mary  Johnson  1   0
Each level within the Group variable becomes its own column, populated with the values contained in the Measure column. How can I achieve this?
Using pivot_table:
df.pivot_table(index=['First','Last'],columns='Group',values='Measure',fill_value=0)
Out[247]:
Group          A   B
First Last          
Jane  Doe      0  11
John  Smith    2   0
Mary  Johnson  1   0
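pivot_table leaves First and Last in the index and names the columns 'Group'; if you want the flat layout from df_desired, a small follow-up is to reset the index and clear that columns name:
out = (df.pivot_table(index=['First', 'Last'], columns='Group',
                      values='Measure', fill_value=0)
         .reset_index()
         .rename_axis(columns=None))  # drop the 'Group' label on the columns
print(out)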
If your order doesn't matter, you can do something along these lines:
df.set_index(['First','Last', 'Group']).unstack('Group').fillna(0).reset_index()
  First     Last Measure      
Group                  A     B
0  Jane      Doe     0.0  11.0
1  John    Smith     2.0   0.0
2  Mary  Johnson     1.0   0.0
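The unstack route leaves a two-level column header (Measure over Group). One way to avoid it is to select the single Measure column before unstacking; a sketch along these lines:
out = (df.set_index(['First', 'Last', 'Group'])['Measure']
         .unstack('Group', fill_value=0)  # one column per group, ints kept
         .reset_index()
         .rename_axis(columns=None))
print(out)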

How to assign a unique ID to detect repeated rows in a pandas dataframe?

I am working with a large pandas dataframe, with several columns pretty much like this:
      A       B  C  D
   John     Tom  0  1
  Homer    Bart  2  3
    Tom  Maggie  1  4
   Lisa    John  5  0
  Homer    Bart  2  3
   Lisa    John  5  0
  Homer    Bart  2  3
  Homer    Bart  2  3
    Tom  Maggie  1  4
How can I assign a unique id to each repeated row? For example:
      A       B  C    D  new_id
   John     Tom  0  1.2       1
  Homer    Bart  2  3.0       2
    Tom  Maggie  1  4.2       3
   Lisa    John  5  0         4
  Homer    Bart  2  3         5
   Lisa    John  5  0         4
  Homer    Bart  2  3.0       2
  Homer    Bart  2  3.0       2
    Tom  Maggie  1  4.1       6
I know that I can use duplicated to detect the duplicated rows, but I cannot visualize where those rows are repeated. I tried:
df.assign(id=(df.columns).astype('category').cat.codes)
df
However, it is not working. How can I get a unique id for detecting groups of duplicated rows?
For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize.
df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1
groupby is more efficient for larger dataframes:
df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
Group by the columns you are trying to find duplicates over and use ngroup:
df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
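Note that ngroup numbers groups in sorted key order and starts at 0. If you want ids assigned in order of first appearance and starting at 1, as in the expected output, pass sort=False (the same trick used in the previous answer):
df['new_id'] = df.groupby(['A', 'B', 'C', 'D'], sort=False).ngroup() + 1
print(df)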
