IOB format merge - python

I have a dataframe in IOB format as below:

Name     Label
Alan     B-PERSON
Smith    I-PERSON
is       O
Alice's  B-PERSON
uncle    O
from     O
New      B-LOCATION
York     I-LOCATION
city     I-LOCATION
I would like to convert it into a new dataframe as below:

Name           Label
Alan Smith     PERSON
Alice's        PERSON
New York city  LOCATION
Any help is much appreciated!

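First, a runnable reconstruction of the sample data (a minimal sketch built from the table in the question):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alan', 'Smith', 'is', "Alice's", 'uncle', 'from', 'New', 'York', 'city'],
    'Label': ['B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'O', 'O',
              'B-LOCATION', 'I-LOCATION', 'I-LOCATION'],
})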
You can create helper groups by comparing the Label column against O and taking the cumulative sum, remove the O rows and the B-/I- prefixes, and then aggregate with join:
m = df['Label'].eq('O')
# str.replace needs regex=True here (patterns are no longer regex by default in pandas 2.0)
df = (df[~m].assign(Label=lambda x: x['Label'].str.replace('^[IB]-', '', regex=True))
            .groupby([m.cumsum(), 'Label'])['Name']
            .agg(' '.join)
            .droplevel(0)
            .reset_index()
            .reindex(df.columns, axis=1))
print(df)
Name Label
0 Alan Smith PERSON
1 Alice's PERSON
2 New York city LOCATION
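For intuition, here is what the helper m.cumsum() looks like when run on the original df (before it is overwritten above): the counter only increments on O rows, so each run of consecutive non-O rows, i.e. each entity, shares one group number.
m = df['Label'].eq('O')
print(m.cumsum().tolist())
# [0, 0, 1, 1, 2, 3, 3, 3, 3] -> rows 0-1, row 3, and rows 6-8 form the three entities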

How to replace a part of column value with values from another two columns based on a condition in pandas

I have a dataframe df as shown below. I want to replace all the temp_id column values containing an underscore (_) with a combination of the numerical part of the temp_id + the city + country column values.
df

temp_id        city      country
12225IND       DELHI     IND
14445UX_TY     AUSTIN    US
56784SIN       BEDOK     SIN
72312SD_IT_UZ  NEW YORK  US
47853DUB       DUBAI     UAE
80976UT_IS_SZ  SYDENY    AUS
89012TY_JP_IS  TOKOYO    JPN
51309HJ_IS_IS
42087IND       MUMBAI    IND
Expected Output

temp_id         city      country
12225IND        DELHI     IND
14445AUSTINUS   AUSTIN    US
56784SIN        BEDOK     SIN
72312NEWYORKUS  NEW YORK  US
47853DUB        DUBAI     UAE
80976SYDENYAUS  SYDENY    AUS
89012TOKOYOJPN  TOKOYO    JPN
51309HJ_IS_IS
42087IND        MUMBAI    IND
How can this be done in pandas/Python?
Use boolean indexing:
# find rows with a value in both city and country
m1 = df[['city', 'country']].notna().all(axis=1)
# find rows with a "_"
m2 = df['temp_id'].str.contains('_')
# both conditions above
m = m1 & m2
# replace matching rows by number + city + country
df.loc[m, 'temp_id'] = (df.loc[m, 'temp_id'].str.extract(r'^(\d+)', expand=False)
                        + df.loc[m, 'city'].str.replace(' ', '')
                        + df.loc[m, 'country'])
Output:
temp_id city country
0 12225IND DELHI IND
1 14445AUSTINUS AUSTIN US
2 56784SIN BEDOK SIN
3 72312NEWYORKUS NEW YORK US
4 47853DUB DUBAI UAE
5 80976SYDENYAUS SYDENY AUS
6 89012TOKOYOJPN TOKOYO JPN
7 51309HJ_IS_IS None None
8 42087IND MUMBAI IND
You can also do this row-wise with apply() and re.sub(). Here is an example:
import re
fix = lambda r: re.sub(r'^(\d+).*', r'\1', r['temp_id']) + r['city'].replace(' ', '') + r['country']
df['temp_id'] = df.apply(lambda r: fix(r) if '_' in r['temp_id'] and pd.notna(r['city'])
                         else r['temp_id'], axis=1)
This uses a regular expression to keep only the leading digits of any temp_id containing an underscore, and concatenates them with the corresponding city (spaces removed) and country values; rows without an underscore, or without city/country data, are left unchanged.
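Note that this row-wise apply is easy to read but will be slower than the vectorized boolean-indexing approach above on large frames, since the regex runs once per row in Python rather than as a column-wise string operation.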

Pandas: set group id based on identical columns and same elements in list

Updated!
Hi,
my data contains the names of persons and a list of cities they lived in. I want to group them together under these conditions:

1. first_name and last_name are identical, or
2. (if 1. doesn't hold) their last_name is the same and they have lived in at least one identical city.
The result should be a new column indicating the group id that each person belongs to.
The DataFrame df looks like this:
>>> df
last_name first_name cities
0 Dorsey Nancy [Moscow, New York]
1 Harper Max [Munich, Paris, Shanghai]
2 Mueller Max [New York, Los Angeles]
3 Dorsey Nancy [New York, Miami]
4 Harper Maxwell [Munich, Miami]
The new dataframe df_id should look like this. The order of id is irrelevant (i.e., which group gets id=1), but only observations that fulfill either condition 1 or 2 should get the same id.
>>> df_id
last_name first_name cities id
0 Dorsey Nancy [Moscow, New York] 1
1 Harper Max [Munich, Paris, Shanghai] 2
2 Mueller Max [New York, Los Angeles] 3
3 Dorsey Nancy [New York, Miami] 1
4 Harper Maxwell [Munich, Miami] 2
My current code:
df = df.reset_index(drop=True)
# explode lists to rows
df_exploded = df.explode('cities')
# define id_counter and dictionary to match index to id
id_counter = 1
id_matched = dict()

# define id function
def match_id(row):
    global id_counter
    # check if index already matched
    if row.name not in id_matched.keys():
        # get all persons with identical names (condition 1)
        select = df_exploded[(df_exploded['first_name'] == row['first_name'])
                             & (df_exploded['last_name'] == row['last_name'])]
        # get all persons with same last_name and city (condition 2)
        if select.empty:
            select_2 = df_exploded[(df_exploded['last_name'] == row['last_name'])
                                   & (df_exploded['cities'].isin(row['cities']))]
            # create new id for this specific person
            if select_2.empty:
                id_matched[row.name] = id_counter
            # create new id for group of persons and record in dictionary
            else:
                for i in select_2.index.unique().tolist():
                    id_matched[i] = id_counter
        # create new id for group of persons and record in dictionary
        else:
            for i in select.index.unique().tolist():
                id_matched[i] = id_counter
        # set next id
        id_counter += 1

# run function (progress_apply requires tqdm's pandas integration)
df.progress_apply(match_id, axis=1)
# convert dict to DataFrame
df_id_matched = pd.DataFrame.from_dict(id_matched, orient='index', columns=['id'])
# merge back together with df to create df_id
Does anyone have a more efficient way to perform this task? The data set is huge and it would take several days...
Thanks in advance!
Use:
# sample data uses a list of cities per row,
# e.g. 'Moscow, New York' becomes ['Moscow', 'New York']
df_id = pd.DataFrame({'last_name': ['Dorsey', 'Harper', 'Mueller', 'Dorsey'],
                      'first_name': ['Nancy', 'Max', 'Max', 'Nancy'],
                      'cities': [['Moscow', 'New York'], ['Munich', 'Paris', 'Shanghai'],
                                 ['New York', 'Los Angeles'], ['New York', 'Miami']]})
# create default index values
df_id = df_id.reset_index(drop=True)
# explode lists to rows
df = df_id.explode('cities')
# flag rows duplicated across all 3 columns, check per original index (level 0)
# whether at least one duplicate exists, and sort so duplicated groups come first
s = (df.duplicated(['last_name', 'first_name', 'cities'], keep=False)
       .any(level=0)
       .sort_values(ascending=False))
# create new column by cumulative sum of the inverted mask
df_id['id'] = (~s).cumsum().add(1)
print(df_id)
last_name first_name cities id
0 Dorsey Nancy [Moscow, New York] 1
1 Harper Max [Munich, Paris, Shanghai] 2
2 Mueller Max [New York, Los Angeles] 3
3 Dorsey Nancy [New York, Miami] 1
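Note: the level keyword of Series.any was deprecated and removed in pandas 2.0; on recent versions the equivalent per-index reduction is a groupby on the index level:
s = (df.duplicated(['last_name', 'first_name', 'cities'], keep=False)
       .groupby(level=0).any()
       .sort_values(ascending=False))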

searching substring for match in dataframe

I am trying to use my df as a lookup table, and trying to determine if my string contains a value in that df. Simple example
str = 'John Smith Business Analyst'
df = pd.read_pickle('job_titles.pickle')
The df would be one column with several job titles.
df = accountant, lawyer, CFO, business analyst, etc..
Now I want to somehow determine that str contains the substring 'Business Analyst', because that value is contained in my df.
The return result would be the substring = 'Business Analyst'
If the original str was:
str = 'John Smith Business'
Then the return would be empty since no substring matches a string in the df.
I have it working if it is for one word. For example:
df = pd.read_pickle('cities.pickle')
df = Calgary, Edmonton, Toronto, etc
str = 'John Smith Business Analyst Calgary AB Canada'
str_list = str.split()
for word in str_list:
    df_location = df[df['name'].str.match(word)]
    if not df_location.empty:
        break
df_location = Calgary
The city will be found in the df, and return that one row. Just not sure how when it is more than one word.
I am not sure what you want to do with the returned value exactly, but here is a way to identify it at least. First, I made a toy dataframe:
import pandas as pd
titles_df = pd.DataFrame({'title' : ['Business Analyst', 'Data Scientist', 'Plumber', 'Baker', 'Accountant', 'CEO']})
search_name = 'John Smith Business Analyst'
titles_df
title
0 Business Analyst
1 Data Scientist
2 Plumber
3 Baker
4 Accountant
5 CEO
Then, I loop through the values in the title column to see if any of them are in the search term:
for val in titles_df['title'].values:
    if val in search_name:
        print(val)
If you want to do this over all the names in a dataframe column and assign a new column with the title you can do the following:
First, I create a dataframe with some names:
names_df = pd.DataFrame({'name' : ['John Smith Business Analyst', 'Dorothy Roberts CEO', 'Jim Miller Dancer', 'Samuel Adams Accountant']})
Then, I loop through the values of names and values of titles and assign the matched titles to a title column in the names dataframe (unmatched ones will have an empty string):
names_df['title'] = ''
for name in names_df['name'].values:
    for title in titles_df['title'].values:
        if title in name:
            # assign via .loc to avoid chained-assignment issues
            names_df.loc[names_df['name'] == name, 'title'] = title
names_df
name title
0 John Smith Business Analyst Business Analyst
1 Dorothy Roberts CEO CEO
2 Jim Miller Dancer
3 Samuel Adams Accountant Accountant
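A vectorized alternative (a sketch, assuming the titles are plain strings so re.escape can neutralize any regex metacharacters) is to build a single alternation pattern and let str.extract pull out the first matching title:

import re

# one capture group with all titles as alternatives; non-matches become NaN
pattern = '(' + '|'.join(map(re.escape, titles_df['title'])) + ')'
names_df['title'] = names_df['name'].str.extract(pattern, expand=False).fillna('')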

Modify series from other series objects

So I have data like this:
Id  Title                 Fname   lname  email
1   meeting with Jay, Aj  Jay     kay    jk#something.com
1   meeting with Jay, Aj  Aj      xyz    aj#something.com
2   call with Steve       Steve   Jack   st#something.com
2   call with Steve       Harvey  Ray    h#something.com
3   lunch Mike            Mil     Mike   m#something.com
I want to remove the first name & last name for each unique Id from the Title.
I tried grouping by Id, which gives Series objects for Title, Fname, lname, etc.:
df.groupby('Id')
I've concatenated Fname with .agg(lambda x: x.sum() if x.dtype == 'float64' else ','.join(x))
and kept it in a concated dataframe.
Likewise all other columns get aggregated. The question is how do I replace values in Title based on this aggregated series.
concated['newTitle'] = [concated.Title.str.replace(e[0], '').replace(e[1], '')
                        for e in zip(concated.FName.str.split(','),
                                     concated.LName.str.split(','))]
I want something like this, or some other way, by which for each Id, I could get newTitle, with replaced values.
The output should be like:
Id Title
1 Meeting with ,
2 call with
3 lunch
Create a mapper series by joining Fname and lname per Id, and replace:
s = df.groupby('Id')[['Fname', 'lname']].apply(lambda x: '|'.join(x.stack()))
df.set_index('Id')['Title'].replace(s, '', regex=True).drop_duplicates()
Id
1 meeting with ,
2 call with
3 lunch
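For clarity, the mapper s built above looks roughly like this for the sample data; each value is the group's names joined with |, which replace(..., regex=True) treats as a regex alternation:
print(s)
Id
1           Jay|kay|Aj|xyz
2    Steve|Jack|Harvey|Ray
3                 Mil|Mike
dtype: object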

Comparing columns from two data frames

I am relatively new to Python. Suppose I have the following two dataframes, df1 and df2 respectively:

df1:
Id  Name  Job
1   Jim   Tester
2   Bob   Developer
3   Sam   Support

df2:
Name  Salary  Location
Jim   100     Japan
Bob   200     US
Si    300     UK
Sue   400     France
I want to compare the 'Name' column in df2 to df1 such that if the name of a person (in df2) does not exist in df1, then that row of df2 is output to another dataframe. So for the example above the output would be:
Name Salary Location
Si 300 UK
Sue 400 France
Si and Sue are output because they do not exist in the 'Name' column of df1.
You can use Boolean indexing:
res = df2[~df2['Name'].isin(df1['Name'].unique())]
We use hashing via pd.Series.unique as an optimization in case you have duplicate names in df1.
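As a quick end-to-end check, here is a sketch reconstructing the sample frames above:

import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3],
                    'Name': ['Jim', 'Bob', 'Sam'],
                    'Job': ['Tester', 'Developer', 'Support']})
df2 = pd.DataFrame({'Name': ['Jim', 'Bob', 'Si', 'Sue'],
                    'Salary': [100, 200, 300, 400],
                    'Location': ['Japan', 'US', 'UK', 'France']})

res = df2[~df2['Name'].isin(df1['Name'].unique())]
print(res)
#   Name  Salary Location
# 2   Si     300       UK
# 3  Sue     400   France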
