searching substring for match in dataframe - python

I am trying to use my df as a lookup table, and to determine whether my string contains a value from that df. A simple example:
str = 'John Smith Business Analyst'
df = pd.read_pickle('job_titles.pickle')
The df would be one column with several job titles.
df = accountant, lawyer, CFO, business analyst, etc.
Now I want to somehow determine that str contains the substring 'Business Analyst', because that value is contained in my df.
The return result would be the substring = 'Business Analyst'
If the original str was:
str = 'John Smith Business'
Then the return would be empty since no substring matches a string in the df.
I have it working if it is for one word. For example:
df = pd.read_pickle('cities.pickle')
df = Calgary, Edmonton, Toronto, etc
str = 'John Smith Business Analyst Calgary AB Canada'
str_list = str.split()
for word in str_list:
    df_location = df[df['name'].str.match(word)]
    if not df_location.empty:
        break
# df_location now holds the Calgary row
The city will be found in the df, and that one row is returned. I'm just not sure how to do this when the value is more than one word.

I am not sure what you want to do with the returned value exactly, but here is a way to identify it at least. First, I made a toy dataframe:
import pandas as pd
titles_df = pd.DataFrame({'title' : ['Business Analyst', 'Data Scientist', 'Plumber', 'Baker', 'Accountant', 'CEO']})
search_name = 'John Smith Business Analyst'
titles_df
title
0 Business Analyst
1 Data Scientist
2 Plumber
3 Baker
4 Accountant
5 CEO
Then, I loop through the values in the title column to see if any of them are in the search term:
for val in titles_df['title'].values:
    if val in search_name:
        print(val)
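If you want the matched value back rather than just printing it, you can collect the hits into a list; an empty list then means no title matched:
matches = [val for val in titles_df['title'].values if val in search_name]
# matches == ['Business Analyst']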
If you want to do this over all the names in a dataframe column and assign a new column with the title you can do the following:
First, I create a dataframe with some names:
names_df = pd.DataFrame({'name' : ['John Smith Business Analyst', 'Dorothy Roberts CEO', 'Jim Miller Dancer', 'Samuel Adams Accountant']})
Then, I loop through the values of names and values of titles and assign the matched titles to a title column in the names dataframe (unmatched ones will have an empty string):
names_df['title'] = ''
for name in names_df['name'].values:
    for title in titles_df['title'].values:
        if title in name:
            # use .loc to avoid assigning through a chained index
            names_df.loc[names_df['name'] == name, 'title'] = title
names_df
                           name             title
0  John Smith Business Analyst  Business Analyst
1           Dorothy Roberts CEO               CEO
2             Jim Miller Dancer
3       Samuel Adams Accountant        Accountant
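If the lookup table is large, the nested Python loops can get slow. As a vectorized sketch of the same idea (using re.escape on the assumption that titles may contain regex metacharacters), you can build a single alternation pattern and let str.extract do the matching:
import re
import pandas as pd

titles_df = pd.DataFrame({'title': ['Business Analyst', 'Data Scientist', 'Plumber',
                                    'Baker', 'Accountant', 'CEO']})
names_df = pd.DataFrame({'name': ['John Smith Business Analyst', 'Dorothy Roberts CEO',
                                  'Jim Miller Dancer', 'Samuel Adams Accountant']})

# one alternation of all titles; extract returns NaN where nothing matches
pattern = '({})'.format('|'.join(map(re.escape, titles_df['title'])))
names_df['title'] = names_df['name'].str.extract(pattern, expand=False).fillna('')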

Related

Count the number of strings with length in pandas

I am trying to calculate the number of strings in a column with length of 5 or more. These strings are in a column separated by comma.
df = pd.DataFrame(columns=['first'])
df['first'] = ['Jack Ryan, Tom O', 'Stack Over Flow, StackOverFlow', 'Jurassic Park, IT', 'GOT']
The code I have used till now, which is not creating a new column with counts of strings of 5 or more characters:
df['countStrings'] = df['first'].str.split(',').count(r'[a-zA-Z0-9]{5,}')
Expected output: counting strings of length 5 or more.

first                           countStrings
Jack Ryan, Tom O                           0
Stack Over Flow, StackOverFlow             2
Jurassic Park, IT                          1
GOT                                        0
Edge case: strings of length more than 5 separated by a comma and containing multiple spaces.

first                                     wrongCounts  rightCounts
Accounts Payable Goods for Resale                   4            1
Corporate Finance, Financial Engineering            4            2
TBD                                                 0            0
Goods for Not Resale, SAP                           2            1
Pandas' str.len() method determines the length of each string in a Pandas series; it only works on a series of strings. Since it is a string method, .str has to be prefixed every time before calling it.
You can try this:
import pandas as pd

df = pd.DataFrame(columns=['first'])
df['first'] = ['jack,utah,TOMHAWK,Somer,SORITNO', 'jill', 'bob,texas', 'matt,AR', 'john']
# turn the commas into spaces, then count the words in each row and sum them
df['first'] = df['first'].replace(',', ' ', regex=True)
df['first'].str.count(r'\w+').sum()
You can match 5 alphanumeric characters, and on the left and right match optional characters other than a comma:
[^,]*[A-Za-z0-9]{5}[^,]*
Example
import pandas as pd

df = pd.DataFrame(columns=['first'])
df['first'] = [
    'Accounts Payable Goods for Resale',
    'Corporate Finance, Financial Engineering',
    'TBD',
    'Goods for Not Resale, SAP',
    'Jack Ryan, Tom O',
    'Stack Over Flow, StackOverFlow',
    'Jurassic Park, IT',
    'GOT'
]
df['countStrings'] = df['first'].str.count(r'[^,]*[A-Za-z0-9]{5}[^,]*')
print(df)
Output
first countStrings
0 Accounts Payable Goods for Resale 1
1 Corporate Finance, Financial Engineering 2
2 TBD 0
3 Goods for Not Resale, SAP 1
4 Jack Ryan, Tom O 0
5 Stack Over Flow, StackOverFlow 2
6 Jurassic Park, IT 1
7 GOT 0
This is how I would try to get the number of strings with len >= 5 in a column:
data = [i for k in df['first']
        for i in k.split(',')
        if len(i) >= 5]
result = len(data)
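If you need the counts per row (as in the expected output) rather than one total, here is a sketch of the same comma-chunk idea applied row by row, counting the chunks that contain a run of five or more alphanumeric characters:
import re
import pandas as pd

df = pd.DataFrame({'first': ['Jack Ryan, Tom O',
                             'Stack Over Flow, StackOverFlow',
                             'Jurassic Park, IT',
                             'GOT']})

# count the comma-separated chunks that contain 5+ consecutive alphanumerics
pattern = re.compile(r'[A-Za-z0-9]{5,}')
df['countStrings'] = df['first'].apply(
    lambda s: sum(bool(pattern.search(chunk)) for chunk in s.split(',')))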

IOB format merge

I have a dataframe in IOB format as below:

Name     Label
Alan     B-PERSON
Smith    I-PERSON
is       O
Alice's  B-PERSON
uncle    O
from     O
New      B-LOCATION
York     I-LOCATION
city     I-LOCATION
I would like to convert it into a new dataframe as below:

Name           Label
Alan Smith     PERSON
Alice's        PERSON
New York city  LOCATION
Any help is much appreciated!
You can create groups by comparing the values with O, remove the I-/B- prefixes from the Label column, and then aggregate Name with a join, using helper groups created by a cumulative sum:
m = df['Label'].eq('O')
df = (df[~m].assign(Label=lambda x: x['Label'].str.replace('^[IB]-', '', regex=True))
            .groupby([m.cumsum(), 'Label'])['Name']
            .agg(' '.join)
            .droplevel(0)
            .reset_index()
            .reindex(df.columns, axis=1))
print(df)
            Name     Label
0     Alan Smith    PERSON
1        Alice's    PERSON
2  New York city  LOCATION
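To see why the cumulative sum produces the right groups, it can help to print the helper series on its own: the counter only increments on 'O' rows, so every run of consecutive entity rows shares one group id. A self-contained sketch on the sample data:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alan', 'Smith', 'is', "Alice's", 'uncle', 'from', 'New', 'York', 'city'],
    'Label': ['B-PERSON', 'I-PERSON', 'O', 'B-PERSON', 'O', 'O',
              'B-LOCATION', 'I-LOCATION', 'I-LOCATION']})

m = df['Label'].eq('O')
print(m.cumsum().tolist())  # [0, 0, 1, 1, 2, 3, 3, 3, 3]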

Pandas: set group id based on identical columns and same elements in list

Updated!
My data contains the names of persons and a list of cities they have lived in. I want to group them together following these conditions:
1. first_name and last_name are identical, or
2. (if 1. doesn't hold) their last_name is the same and they have lived in at least one identical city.
The result should be a new column indicating the group id that each person belongs to.
The DataFrame df looks like this:
>>> df
last_name first_name cities
0 Dorsey Nancy [Moscow, New York]
1 Harper Max [Munich, Paris, Shanghai]
2 Mueller Max [New York, Los Angeles]
3 Dorsey Nancy [New York, Miami]
4 Harper Maxwell [Munich, Miami]
The new dataframe df_id should look like this. The order of id is irrelevant (i.e., which group gets id=1), but only observations that fulfill either condition 1 or 2 should get the same id.
>>> df_id
last_name first_name cities id
0 Dorsey Nancy [Moscow, New York] 1
1 Harper Max [Munich, Paris, Shanghai] 2
2 Mueller Max [New York, Los Angeles] 3
3 Dorsey Nancy [New York, Miami] 1
4 Harper Maxwell [Munich, Miami] 2
My current code:
df = df.reset_index(drop=True)
# explode lists to rows
df_exploded = df.explode('cities')

# define id counter and dictionary to match index to id
id_counter = 1
id_matched = dict()

# define id function (applied row-wise)
def match_id(row):
    global id_counter
    # check if this row's index was already matched
    if row.name not in id_matched:
        # get all persons with identical names (condition 1)
        select = df_exploded[(df_exploded['first_name'] == row['first_name'])
                             & (df_exploded['last_name'] == row['last_name'])]
        # get all persons with the same last_name and city (condition 2)
        if select.empty:
            select_2 = df_exploded[(df_exploded['last_name'] == row['last_name'])
                                   & (df_exploded['cities'].isin(row['cities']))]
            # create a new id for this specific person
            if select_2.empty:
                id_matched[row.name] = id_counter
            # create a new id for the group of persons and record it in the dictionary
            else:
                for i in select_2.index.unique():
                    id_matched[i] = id_counter
        # create a new id for the group of persons and record it in the dictionary
        else:
            for i in select.index.unique():
                id_matched[i] = id_counter
        # set next id
        id_counter += 1

# run function
df.progress_apply(match_id, axis=1)
# convert dict to DataFrame
df_id_matched = pd.DataFrame.from_dict(id_matched, orient='index', columns=['id'])
# merge back together with df to create df_id
Does anyone have a more efficient way to perform this task? The data set is huge and it would take several days...
Thanks in advance!
Use:
# sample data was changed to use a list for each row's cities,
# e.g. 'Moscow, New York' became ['Moscow', 'New York']
df_id = pd.DataFrame({'last_name': ['Dorsey', 'Harper', 'Mueller', 'Dorsey'],
                      'first_name': ['Nancy', 'Max', 'Max', 'Nancy'],
                      'cities': [['Moscow', 'New York'],
                                 ['Munich', 'Paris', 'Shanghai'],
                                 ['New York', 'Los Angeles'],
                                 ['New York', 'Miami']]})

# create default index values
df_id = df_id.reset_index(drop=True)
# explode lists to rows
df = df_id.explode('cities')
# flag duplicates across the 3 columns, reduce to "at least one dupe" per index,
# and sort so the duplicated groups come first
s = (df.duplicated(['last_name', 'first_name', 'cities'], keep=False)
       .any(level=0)
       .sort_values(ascending=False))
# create new column with cumulative sum of the inverted mask
df_id['id'] = (~s).cumsum().add(1)
print(df_id)
last_name first_name cities id
0 Dorsey Nancy [Moscow, New York] 1
1 Harper Max [Munich, Paris, Shanghai] 2
2 Mueller Max [New York, Los Angeles] 3
3 Dorsey Nancy [New York, Miami] 1
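Note that Series.any(level=0) was deprecated and later removed from pandas; on recent versions the same helper can be built with a groupby on the index level (a sketch with otherwise identical logic):
s = (df.duplicated(['last_name', 'first_name', 'cities'], keep=False)
       .groupby(level=0).any()
       .sort_values(ascending=False))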

Modify series from other series objects

So I have data like this:

Id  Title                 Fname   lname  email
1   meeting with Jay, Aj  Jay     kay    jk#something.com
1   meeting with Jay, Aj  Aj      xyz    aj#something.com
2   call with Steve       Steve   Jack   st#something.com
2   call with Steve       Harvey  Ray    h#something.com
3   lunch Mike            Mike    Mil    m#something.com
I want to remove the first name & last name from Title for each unique Id.
I tried grouping by Id, which gives Series objects for Title, Fname, lname, etc.:
df.groupby('Id')
I've concatenated Fname with .agg(lambda x: x.sum() if x.dtype == 'float64' else ','.join(x)) and kept it in a concated dataframe; likewise all the other columns get aggregated. The question is how do I replace values in Title based on this aggregated series.
concated['newTitle'] = [concated.Title.str.replace(e[0]).replace(e[1]).replace(e[1])
                        for e in zip(concated.FName.str.split(','),
                                     concated.LName.str.split(','))]
I want something like this, or some other way, so that for each Id I get a newTitle with the names replaced.
The output should be like:

Id  Title
1   Meeting with ,
2   call with
3   lunch
Create a mapper series by joining Fname and lname, and replace:
s = df.groupby('Id')[['Fname', 'lname']].apply(lambda x: '|'.join(x.stack()))
df.set_index('Id')['Title'].replace(s, '', regex = True).drop_duplicates()
Id
1 meeting with ,
2 call with
3 lunch
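If the names may contain regex metacharacters, a variation of the same mapper idea (a sketch using re.escape and re.sub; the newTitle column name is just illustrative):
import re

# escape each name before building the alternation pattern
s = (df.groupby('Id')[['Fname', 'lname']]
       .apply(lambda x: '|'.join(map(re.escape, x.stack()))))
df['newTitle'] = [re.sub(s[i], '', t) for i, t in zip(df['Id'], df['Title'])]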

Splitting DataFrame into DataFrame's

I have one DataFrame where different rows can have the same value for one column.
As an example:
import pandas as pd
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "John", "Mark", "Emma", "Mary"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
City Name
0 Seattle Alice
1 Seattle Bob
2 Portland John
3 Seattle Mark
4 Seattle Emma
5 Portland Mary
Here, a given value for "City" (e.g. "Portland") is shared by several rows.
I want to create from this data frame several data frames that have in common the value of one column. For the example above, I want to get the following data frames:
City Name
0 Seattle Alice
1 Seattle Bob
3 Seattle Mark
4 Seattle Emma
and
City Name
2 Portland John
5 Portland Mary
From this answer, I am creating a mask that can be used to generate one data frame:
import numpy as np

def mask_with_in1d(df, column, val):
    mask = np.in1d(df[column].values, [val])
    return df[mask]

# Return the last data frame above
mask_with_in1d(df, 'City', 'Portland')
The problem is to efficiently create all the data frames and assign each to its own name. I am doing it this way:
unique_values = np.sort(df['City'].unique())
for city_value in unique_values:
    exec("df_{0} = mask_with_in1d(df, 'City', '{0}')".format(city_value))
which gives me the data frames df_Seattle and df_Portland that I can further manipulate.
Is there a better way of doing this?
Have you got a fixed list of cities you want to do this for? The simplest solution is to group by city and then loop over the groups:
for city, names in df.groupby("City"):
    print(city)
    print(names)
Portland
City Name
2 Portland John
5 Portland Mary
Seattle
City Name
0 Seattle Alice
1 Seattle Bob
3 Seattle Mark
4 Seattle Emma
You could then assign the groups to a dictionary or some such (df_city[city] = names) if you wanted df_city["Portland"] to work, as sketched below. It depends what you want to do with the groups once split.
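A minimal sketch of that dictionary approach:
# collect the groups into a dictionary keyed by city
df_city = {city: names for city, names in df.groupby("City")}
df_city["Portland"]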
You can use groupby for this:
dfs = [gb[1] for gb in df.groupby('City')]
This will construct a list of dataframes, one per value of the 'City' column.
In case you want tuples with the value of the dataframe, you can use:
dfs = list(df.groupby('City'))
Note that assigning by name is usually an anti-pattern, and exec and eval are definitely anti-patterns.
