Replace a string with a string out of many in Pandas - python

So, I have a pandas data frame where one column contains the description of the nationality of a user and I want to replace this whole description with the country he's from.
My inputs are the df and the list of countries:
| Description | ID |
|:------------|:---|
|I am from Atlantis|1|
|My family comes from Narnia|2|
["narnia","uzbekistan","Atlantis",...]
I know that:
I only have one country per description
the description contains the name of the country or does not, there is no necessity to infer the country from what he says, I only want to map [phrase containing name of country] to [country].
If I had only one country to replace I could use something like
df.loc[df['description'].str.contains('Atlantis', case=False), 'description'] = 'Atlantis'
I know that, because the country names are organised in a list, I could cycle through it and apply this to all the elements, something like:
for country in country_list:
    df.loc[df['description'].str.contains(country, case=False), 'description'] = country
but this seems quite unpythonic to me, so I was wondering if anyone could help me find a better way (which I'm sure exists).
The output should be:
| Description | ID |
|:------------|:---|
|Atlantis|1|
|Narnia|2|

You can use pd.Series.str.extract:
import re
import pandas as pd
country_list = ["narnia","uzbekistan","Atlantis"]
df = pd.DataFrame({'Description': {0: 'I am from Atlantis',
                                   1: 'My family comes from Narnia'},
                   'ID': {0: 1, 1: 2}})
# One alternation pattern built from the list; re.I makes the match case-insensitive
print(df["Description"].str.extract(f"({'|'.join(country_list)})", flags=re.I))
          0
0  Atlantis
1    Narnia
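Note that str.extract returns the text as it is spelled in the description, not as it is spelled in country_list. If the spelling from the list is wanted, something along these lines might work. This is only a sketch: the Country column, the canonical dict and the third sample row are made up for illustration, and re.escape is added on the assumption that some country names could contain regex metacharacters.

```python
import re
import pandas as pd

country_list = ["narnia", "uzbekistan", "Atlantis"]
df = pd.DataFrame({
    "Description": ["I am from Atlantis",
                    "My family comes from NARNIA",
                    "No country here"],
    "ID": [1, 2, 3],
})

# Escape each name so regex metacharacters in a name are matched literally.
pattern = "(" + "|".join(re.escape(c) for c in country_list) + ")"

# Lower-cased lookup that maps any casing back to the spelling in country_list.
canonical = {c.lower(): c for c in country_list}

# expand=False returns a Series instead of a one-column DataFrame.
found = df["Description"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Rows without a match stay missing (NaN).
df["Country"] = found.str.lower().map(canonical)
print(df)
```

Rows whose description contains none of the countries end up with a missing value rather than keeping the original text, which makes them easy to spot afterwards.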

Related

Python - Matching and extracting data from excel with pandas

I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():
Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
    code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
Then, if you want to parse phone numbers, don't reinvent the wheel: use the phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'
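One caveat: Barbados shares calling code 1 with the rest of the North American Numbering Plan, so a lookup on the country calling code alone may not separate LIME (1246) from AT&T (1). A longest-prefix lookup over the same dictionary is one way around that. The following is a minimal sketch, not part of the phonenumbers API; carrier_for is a hypothetical helper name and the dictionary is typed in by hand to mirror carriers.xlsx.

```python
# Hand-typed copy of the mapping built from carriers.xlsx (assumption).
code_carrier_map = {1246: "LIME", 1: "AT&T", 81: "Softbank", 52: "Telmex", 966: "Zain"}


def carrier_for(number, mapping):
    """Return the carrier whose code is the longest prefix of `number`, or None."""
    digits = str(number).lstrip("+")
    # Compare as strings so that prefixes can be sliced off the number.
    prefixes = {str(code): carrier for code, carrier in mapping.items()}
    longest = max((len(p) for p in prefixes), default=0)
    # Try the longest candidate prefix first, then progressively shorter ones.
    for length in range(min(longest, len(digits)), 0, -1):
        carrier = prefixes.get(digits[:length])
        if carrier is not None:
            return carrier
    return None


for n in ["8155555555", "12465555555", "+12145221414", "96655555555", "525555555555"]:
    print(n, carrier_for(n, code_carrier_map))
```

Because the loop starts with the longest prefix, 12465555555 resolves to LIME before the shorter code 1 is ever tried.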
Let's assume the following input:
>>> df1
         Number
0    8155555555
1   12465555555
2   12135555555
3   96655555555
4  525555555555
>>> df2
   countryCode   Carrier
0         1246      LIME
1            1      AT&T
2           81  Softbank
3           52    Telmex
4          966      Zain
First we need to rework df2 a bit: sort countryCode in descending order, convert it to strings and set it as the index.
The trick for later is the descending sort. It ensures that a longer country code, such as "1246", is matched before a shorter one that is its prefix, like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
              Carrier
countryCode
1246             LIME
966              Zain
81           Softbank
52             Telmex
1                AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
    .str.extract('^(%s)'%'|'.join(df2.index))[0]
    .map(df2['Carrier'])
)
output:
0 Softbank
1 LIME
2 AT&T
3 Zain
4 Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
                  .str.extract('^(%s)'%'|'.join(df2.index))[0]
                  .map(df2['Carrier'])
                 )
output:
Number carrier
0 8155555555 Softbank
1 12465555555 LIME
2 12135555555 AT&T
3 96655555555 Zain
4 525555555555 Telmex
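To see why the order of the codes inside the regex matters, here is a small standalone check using only the re module. Python tries the alternatives of a|b from left to right and keeps the first one that matches, so a short code listed first shadows a longer one. The codes list and the variable names are just for illustration.

```python
import re

codes = ["1", "52", "81", "966", "1246"]

# Short codes first: "1" is tried before "1246" and wins.
unsorted_pattern = "^(" + "|".join(codes) + ")"

# Longest codes first: "1246" gets the chance to match before "1".
sorted_pattern = "^(" + "|".join(sorted(codes, key=len, reverse=True)) + ")"

number = "12465555555"
print(re.match(unsorted_pattern, number).group(1))  # 1
print(re.match(sorted_pattern, number).group(1))    # 1246
```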
If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
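Extract the first characters of the Number column. Hint: the nbr_of_chars variable should be based on the maximum character length of the countryCode column in carriers.xlsx.
nbr_of_chars = 4
# .str only works on strings, so convert the non-null numbers first
mask = df['Number'].notnull()
df.loc[mask, 'FirstCharsColumn'] = df.loc[mask, 'Number'].astype(str).str[:nbr_of_chars]
Then the matching should be fairly easy with dataframe joins. Since the codes vary in length from 1 to 4 digits, a prefix of one fixed length only matches codes of exactly that length, so the join has to be repeated for each length.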
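A rough sketch of that idea, trying the longest prefix first and filling the remaining gaps with shorter ones. The sample frames are typed in by hand to mirror the two Excel files, and Series.map is used here in place of an explicit join, so treat it as one possible reading of the approach rather than the only one.

```python
import pandas as pd

df = pd.DataFrame({"Number": [8155555555, 12465555555, 12135555555,
                              96655555555, 525555555555]})
carriers = pd.DataFrame({
    "countryCode": [1246, 1, 81, 52, 966],
    "Carrier": ["LIME", "AT&T", "Softbank", "Telmex", "Zain"],
})

numbers = df["Number"].astype(str)

# Series indexed by the country code (as string) -> carrier name.
lookup = carriers.astype({"countryCode": str}).set_index("countryCode")["Carrier"]
max_len = int(lookup.index.str.len().max())

# Start with the longest prefix, then fill rows that are still missing
# using progressively shorter prefixes.
carrier = numbers.str[:max_len].map(lookup)
for n in range(max_len - 1, 0, -1):
    carrier = carrier.fillna(numbers.str[:n].map(lookup))

df["Carrier"] = carrier
print(df)
```

Since a row is only filled when it is still missing, a 4-digit match such as 1246 is never overwritten by the 1-digit code 1.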
I can think only of an inefficient solution.
First, sort the data frame of carriers in reverse alphabetical order of country codes (as strings). That way, a longer code comes before any shorter code that is its prefix.
codes = car_df.astype({'countryCode': str}).sort_values('countryCode', ascending=False)
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
import numpy as np
def cc2carrier(num):
    # Boolean Series: True where the country code is a prefix of the number
    matches = codes['countryCode'].apply(num.startswith)
    if not matches.any():  # Not found
        return np.nan
    # idxmax returns the label of the first True, i.e. the longest matching code
    return codes.loc[matches.idxmax(), 'Carrier']
Now, apply the function to the numbers dataframe (converted to strings so that startswith works):
num_df['Number'].astype(str).apply(cc2carrier)
#1 Softbank
#2 LIME
#3 AT&T
#4 Zain
#5 Telmex
#Name: Number, dtype: object

How to split two first names that are joined together into two different words in Python

I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are together into two different words.
For example, if the misspelled name is trujillohernandez, it should be separated into trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work, given that these are first names and they are Hispanic names.
I would be really grateful if you could help me develop some sort of function to make this happen.
As noted in the comments above not having a list of possible names will cause a problem. However, and perhaps not perfect, but to offer something try...
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required but perhaps it gets the majority of names split.
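For trying the idea out without downloading anything, the same suffix matching can be sketched with the re module alone. The surnames list below is a tiny made-up sample and split_name is a hypothetical helper, so this only illustrates the mechanism.

```python
import re

# Tiny hand-made sample (assumption); a real list would be much longer.
surnames = ["gomez", "delgado", "hernandez", "trujillo", "ramos", "santana"]

# The lazy (.+?) keeps the first part as short as possible, so the longest
# known surname that ends the string is the one that gets split off.
alternatives = "|".join(re.escape(s) for s in surnames)
pattern = re.compile(rf"^(.+?)({alternatives})$")


def split_name(joined):
    """Return (first, last) if a known surname ends the string, else (joined, None)."""
    match = pattern.match(joined)
    if match is None:
        return joined, None
    return match.group(1), match.group(2)


for name in ["trujillohernandez", "sofíagomez", "mariaxyz"]:
    print(name, "->", split_name(name))
```

Names that do not end in a known surname come back unchanged with None as the second element, which plays the same role as the '--not found--' marker above.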

Update one column's value based on another column's value in Pandas using regular expression

Suppose I have a dataframe like below:
>>> df = pd.DataFrame({'Category':['Personal Care', 'Home Care', 'Pharma', 'Pet'], 'SubCategory':['Shampoo', 'Floor Wipe', 'Veterinary', 'Animal Feed']})
>>> df
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pharma Veterinary
3 Pet Animal Feed
I'd like to update the value in 'Category' column whenever the 'Subcategory' column's value contains either 'Veterinary' or 'Animal' (case-insensitive). To do that, I devised a method like below:
def update_col1_values_based_on_values_in_col2_using_regex_mappings(
        df,
        col1_name: str,
        col2_name: str,
        dictionary_of_regex_mappings: dict):
    for pattern, new_str_value in dictionary_of_regex_mappings.items():
        mask = df[col2_name].str.contains(pattern)
        df.loc[mask, col1_name] = new_str_value
    return df
This method works as expected as shown below:
>>> df1 = update_col1_values_based_on_values_in_col2_using_regex_mappings(df, 'Category', 'SubCategory', {"(?i).*Veterinary.*": "Pet Related", "(?i).*Animal.*": "Pet Related"})
>>> df1
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pet Related Veterinary
3 Pet Related Animal Feed
In practice, there will be more than 'Veterinary' and 'Animal Feed' to map from, so some of the suggestions below, although they read elegantly, are not going to be practical for the actual use case. In other words, please assume that the mapping is going to be more like this:
{
    "(?i).*Veterinary.*": "Pet Related",
    "(?i).*Animal.*": "Pet Related",
    "(?i).*Pharma.*": "Pharmaceutical",
    "(?i).*Diary.*": "Other",
    ... # lots and lots more mapping here
}
I'm wondering if there's a more elegant (Pandas-ish) way to accomplish this. Thank you in advance for your suggestions!
EDIT: I didn't clarify in the beginning that the mapping between 'Category' and 'Subcategory' columns wouldn't be restricted to just 'Veterinary' and 'Animal'.
You can use the following code, which is intuitive. Rows that do not match keep their existing Category:
df['Category'] = df.apply(lambda r: "Pet Related" if "Animal" in r['SubCategory'] or "Veterinary" in r['SubCategory'] else r['Category'], axis=1)
You could do it with pd.Series.where, and re to add the case-insensitive flag. Assigning the result back avoids relying on inplace=True on a column selection:
import re
df['Category'] = df['Category'].where(~df['SubCategory'].str.contains('Veterinary|Animal', flags=re.IGNORECASE), 'Pet Related')
Output:
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pet Related Veterinary
3 Pet Related Animal Feed
Not sure if this is the best way, but you can do this:
df.loc[df.SubCategory.str.contains('Veterinary|Animal'), 'Category']='Pet Related'
str.contains() treats the pattern as a regex by default, so you can also make the match case-insensitive with an inline flag:
pattern = r'(?i)veterinary|animal'
df.loc[df.SubCategory.str.contains(pattern, regex=True), 'Category']='Pet Related'
And this is the result
In [3]: df
Out[3]:
Category SubCategory
0 Personal Care Shampoo
1 Home Care Floor Wipe
2 Pet Related Veterinary
3 Pet Related Animal Feed
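For a longer mapping like the one in the question, one vectorised option might be numpy.select, which takes a list of boolean conditions and a parallel list of values. This is a sketch under a few assumptions: the "Hair Care" rule is invented purely to show a second mapping, and with np.select the first matching pattern wins, whereas the loop in the question lets later patterns overwrite earlier ones.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Category": ["Personal Care", "Home Care", "Pharma", "Pet"],
    "SubCategory": ["Shampoo", "Floor Wipe", "Veterinary", "Animal Feed"],
})

regex_to_category = {
    r"veterinary|animal": "Pet Related",
    r"shampoo": "Hair Care",  # hypothetical extra rule, for illustration only
}

# One boolean array per pattern; na=False treats missing values as "no match".
conditions = [
    df["SubCategory"].str.contains(p, case=False, regex=True, na=False).to_numpy()
    for p in regex_to_category
]
choices = list(regex_to_category.values())

# Rows matching no pattern keep their current Category.
df["Category"] = np.select(conditions, choices, default=df["Category"].to_numpy())
print(df)
```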

How can I eliminate a duplicated row of a form submission in Python with Pandas?

I've got a dataset of form submissions - and some of the forms have been submitted multiple times.
The same person, same selections in the form, but slightly different submission_ids and submission dates.
I want to remove one of the submissions (I'll say the 2nd one, but it shouldn't matter because they are identical). If I do :
lit_subset[lit_subset.duplicated()]
I either don't get what I want (because the submission_ids are unique) or if I subset the columns (remove the submission_id and submission_date) then I can see which records are duped up, but I don't know how to grab one of the submission_ids and remove it from the original dataset. This is an easy thing for me to do in SQL Server:
select first_name
,last_name
,email
,telephone
,accountNumber
,refund_option
,max(submission_id) as 'max_submission'
from #refund_form_data
group by first_name
,last_name
,email
,telephone
,accountNumber
,refund_option
having count(*) > 1
Here's a sample dataset:
import pandas as pd
data = {'submission_id': ['abc456', 'abc123','def456','ghi789'],
'first_name': ['Mark', 'Mark','Andrew','Allie'],
'last_name': ['Baseball', 'Baseball','football','hockey'],
'choice': ['Athletics', 'Athletics','Falcons','Canucks'],
}
df = pd.DataFrame (data, columns = ['submission_id', 'first_name','last_name','choice'])
print(df)
I'd like an output that looks like this:
submission_id first_name last_name choice
0 abc123 Mark Baseball Athletics
1 def456 Andrew football Falcons
2 ghi789 Allie hockey Canucks
In your example, you can do something like:
_df = df.groupby(['first_name','last_name','choice'],as_index=False)['submission_id'].head(1)
df = df.merge(_df,how='inner')
Or do this if you want max:
df = df.groupby(['first_name','last_name','choice'],as_index=False)['submission_id'].max()
Using the drop_duplicates method you can choose which columns to consider using the subset argument:
df.drop_duplicates(subset=['first_name', 'last_name', 'choice'], inplace=True)
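By default drop_duplicates keeps the first row of each group in the current row order, which for the sample data would be abc456 rather than the abc123 shown in the desired output. If it matters which submission survives, sorting before dropping is one way to control it. A small sketch on the sample data from the question (key_cols and deduped are just illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "submission_id": ["abc456", "abc123", "def456", "ghi789"],
    "first_name": ["Mark", "Mark", "Andrew", "Allie"],
    "last_name": ["Baseball", "Baseball", "football", "hockey"],
    "choice": ["Athletics", "Athletics", "Falcons", "Canucks"],
})

# Columns that define "the same submission".
key_cols = ["first_name", "last_name", "choice"]

# Sort so the smallest submission_id comes first in each group, keep that row,
# then renumber the index.
deduped = (
    df.sort_values("submission_id")
      .drop_duplicates(subset=key_cols, keep="first")
      .reset_index(drop=True)
)
print(deduped)
```

Using keep="last" instead would retain the largest submission_id per group, which is closer to the max(submission_id) in the SQL version.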

Pandas: Replacing column values with ones as retrieved from other dataframe

I have stumbled upon a seemingly trivial problem in pandas. I have two dataframes. The first one, df_1, is as follows:
| vendor_name | date | company_name | state |
|:------------|:-----|:-------------|:------|
|PERTH|is june 2019|Abc enterprise|Kentucky|
|Megan Ent|25-april-2019|Xyz Fincorp|Texas|
The second one df_2 contains the correct values for each column in df_1.
df_2
| Field | wrong value | correct value |
|:------|:------------|:--------------|
|vendor_name|PERTH|Perth Enterprise|
|date|is|15|
|company_name|Abc enterprise|ABC International Enterprise Inc.|
(In the date row, this means that "is" should be read as 15.)
In order to replace the values with correct ones in df_1 (except date field) I am using pandas.loc method. Below is the code snippet
vend = df_1['vendor_name'].tolist()
comp = df_1['company_name'].tolist()
state = df_1['state'].tolist()
for i in vend:
    if df_2['wrong value'].str.contains(i):
        crct = df_2.loc[df_2['wrong value'] == i,'correct value'].tolist()
Similarly, for company and state I have followed the above way.
However, the crct is returning a blank series. Ideally it should return
['Perth Enterprise','Abc International Enterprise Inc']
The next step would be to replace the respective field values by the above list.
With the above, I have three questions:
Why is the above code generating a blank list? What am I missing here?
How can I replace the respective fields using the df_1.replace method?
What would be a correct approach to replace the relevant portion of the date in df_1 with the correct one from df_2?
Edit: when the data has looping replacements (i.e. overlapping keys and values), replacement on the whole dataframe will fail. In this case, do it column by column and concat the results together. Finally, use join to add any missing columns from df1:
df_replace = pd.concat([df1[k].replace(val, regex=True) for k, val in d.items()], axis=1).join(df1.state)
Original:
I tried your code in my interactive session and it gives the error ValueError: The truth value of a Series is ambiguous on df_2['wrong value'].str.contains(i).
I assume you have multiple vendor names, so the simple way is to construct a dictionary from a groupby of df2 and use it with df.replace on df1.
d = {k: gp.set_index('wrong value')['correct value'].to_dict()
     for k, gp in df2.groupby('Field')}
Out[64]:
{'company_name': {'Abc enterprise': 'ABC International Enterprise Inc. '},
'date': {'is': '15'},
'vendor_name': {'PERTH': 'Perth Enterprise'}}
df_replace = df1.replace(d, regex=True)
print(df_replace)
In [68]:
vendor_name date company_name \
0 Perth Enterprise 15 june 2019 ABC International Enterprise Inc.
1 Megan Ent 25-april-2019 Xyz Fincorp
state
0 Kentucky
1 Texas
Note: your sample df2 only has a value for vendor PERTH, so it only replaces the first row. When you have all the vendor names in df2, it will replace them all in df1.
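On the ValueError mentioned above: str.contains returns one boolean per row, and a whole Series cannot be used directly as the condition of an if statement. Reducing it with .any() is the usual way to ask "does at least one row match". A minimal reproduction, with df_2 typed in by hand from the question:

```python
import pandas as pd

df_2 = pd.DataFrame({
    "Field": ["vendor_name", "date", "company_name"],
    "wrong value": ["PERTH", "is", "Abc enterprise"],
    "correct value": ["Perth Enterprise", "15", "ABC International Enterprise Inc."],
})

# One True/False per row of df_2 (literal substring match).
mask = df_2["wrong value"].str.contains("PERTH", regex=False)

raised = False
try:
    if mask:  # a Series has no single truth value
        pass
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Reduce the Series to a single boolean before testing it.
crct = []
if mask.any():
    crct = df_2.loc[mask, "correct value"].tolist()
print(crct)
```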
A simple way to do that is to iterate over the first dataframe and replace the wrong values:
rows = []
for i in range(len(df1)):
    vendor_name = df1.iloc[i]['vendor_name']
    date = df1.iloc[i]['date']
    company_name = df1.iloc[i]['company_name']
    # Replace a value only when it appears verbatim in the 'wrong value' column
    if vendor_name in df2['wrong value'].values:
        vendor_name = df2.loc[df2['wrong value'] == vendor_name, 'correct value'].values[0]
    if company_name in df2['wrong value'].values:
        company_name = df2.loc[df2['wrong value'] == company_name, 'correct value'].values[0]
    rows.append({'vendor_name': vendor_name, 'date': date, 'company_name': company_name})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
Result = pd.DataFrame(rows, columns=['vendor_name', 'date', 'company_name'])
Result :
Define the following replace function:
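import re
def repl(row):
    fld = row.Field
    v1 = row['wrong value']
    v2 = row['correct value']
    # Rows of df_1 whose field contains the wrong value (literal match)
    updInd = df_1[df_1[fld].str.contains(v1, regex=False)].index
    # regex=True is needed because the pattern has been escaped with re.escape
    df_1.loc[updInd, fld] = df_1.loc[updInd, fld]\
        .str.replace(re.escape(v1), v2, regex=True)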
Then call it for each row in df_2:
for _, row in df_2.iterrows():
repl(row)
Note that str.replace alone does not require you to import re (Pandas
imports it under the hood).
But in the above function re.escape is called explicitly, from our code,
hence import re is required.
