Python - Matching and extracting data from excel with pandas - python

I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():

Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
countryCode
Carrier
1246
LIME
1
AT&T
81
Softbank
52
Telmex
966
Zain
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
Then if you want to parse phone numbers, don't reinvent the wheel, just use this phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'

Let's assume the following input:
>>> df1
Number
0 8155555555
1 12465555555
2 12135555555
3 96655555555
4 525555555555
>>> df2
countryCode Carrier
0 1246 LIME
1 1 AT&T
2 81 Softbank
3 52 Telmex
4 966 Zain
First we need to rework a bit df2 to sort the countryCode in descending order, make it as string and set it to index.
The trick for later is to sort countryCode in descending order. This will ensure that a longer country codes, such as "1246" is matched before a shorter one like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
Carrier
countryCode
1246 LIME
966 Zain
81 Softbank
52 Telmex
1 AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
)
output:
0 Softbank
1 LIME
2 AT&T
3 Zain
4 Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
).to_clipboard(0)
output:
Number carrier
0 8155555555 Softbank
1 12465555555 LIME
2 12135555555 AT&T
3 96655555555 Zain
4 525555555555 Telmex

If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
Extract first characters of a Number column. Hint: The nbr_of_chars variable should be based on the maximum character length of the column countryCode in the carriers.xlsx
nbr_of_chars = 4
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].str[:nbr_of_chars]
Then the matching should be fairly easy with dataframe joins.

I can think only of an inefficient solution.
First, sort the data frame of carriers in the reverse alphabetical order of country codes. That way, longer prefixes will be closer to the beginning.
codes = xl_2.sort_values('countryCode', ascending=False)
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
def cc2carrier(num):
matches = codes['countryCode'].apply(lambda x: num.startswith(x))
if not matches.any(): #Not found
return np.nan
return codes.loc[matches.idxmax()]['Carrier']
Now, apply the function to the numbers dataframe:
xl_1['Number'].apply(cc2carrier)
#1 Softbank
#2 LIME
#3 AT&T
#4 Zain
#5 Telmex
#Name: Number, dtype: object

Related

How to split two first names that together in two different words in python

I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are together into two different words.
For example, if the misspelled name is trujillohernandez then to be separated to trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checkers libraries do not work given that these are first names and they are Hispanic names.
I would be really grateful if you can help to develop some sort of function to make it happen.
As noted in the comments above not having a list of possible names will cause a problem. However, and perhaps not perfect, but to offer something try...
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required but perhaps it gets the majority of names split.

Add string in a certain position in column in dataframe

Basically this:
hash = "355879ACB6"
hash = hash[:4] + '-' + hash[4:]
print (hash)
3558-79ACB6
I got this part above from another stackoverflow post here
but for a DataFrame.
I am only able to successfully add strings before and after, like this:
data ['col1'] = data['col1'] + 'teststring'
If I try the solution from the link above [:amountofcharacterstocutafter] to add values at a certain position, which would be something like:
test = data[:2] + 'zz'
print (test)
It does not seem to be applicable, as the [:2] operator works different for dataframes as it does for strings. It cuts the ouput after the first 2 rows.
Goal:
I want to add a ' - ' at a certain position. Let's say the input row value is 'TTTT1234', output should be 'TTTT-1234'. For every row.
You can perform the operation you presented on a list but you have a column in a dataframe so its (a bit) different.
So while you can do this:
hash = "355879ACB6"
hash = hash[:4] + '-' + hash[4:]
in order to do this on a dataframe you can do it in at least 2 ways:
consider this dummy df:
LOCATION Hash
0 USA 355879ACB6
1 USA 455879ACB6
2 USA 388879ACB6
3 USA 800879ACB6
4 JAPAN 355870BCB6
5 JAPAN 355079ACB6
A. vectorization: the most efficient way
df['new_hash']=df['Hash'].str[:4]+'-'+df['Hash'].str[4:]
LOCATION Hash new_hash
0 USA 355879ACB6 3558-79ACB6
1 USA 455879ACB6 4558-79ACB6
2 USA 388879ACB6 3888-79ACB6
3 USA 800879ACB6 8008-79ACB6
4 JAPAN 355870BCB6 3558-70BCB6
5 JAPAN 355079ACB6 3550-79ACB6
B. apply lambda: intuitive to implement but less attractive in terms of performance
df['new_hash'] = df.apply(lambda x: x['Hash'][:4]+'-'+x['Hash'][4:], axis=1)
Use pd.Series.str. For example:
import pandas as pd
df = pd.DataFrame({
"c": ["TTTT1234"]
})
df["c"].str[:4] + "-" + df["c"].str[4:] # It will output 'TTTT-1234'
pd.Series.str gives vectorized string functions.

How to extract specific codes from string in separate columns?

I have data in the following format.
Data
Data Sample Excel
I want to extract the codes from the column "DIAGNOSIS" and paste each code in a separate column after the "DIAGNOSIS" column. I Know the regular expression to be used to match this which is
[A-TV-Z][0-9][0-9AB].?[0-9A-TV-Z]{0,4}
source: https://www.johndcook.com/blog/2019/05/05/regex_icd_codes/
These are called ICD10 codes represented like Z01.2, E11, etc. The Above expression is meant to match all ICD10 codes.
But I am not sure how to use this expression in python code to do the above task.
The problem that I am trying to solve is?
Count the Total number of Codes assigned for all patients?
Count Total number of UNIQUE code assigned (since multiple patients might have same code assigned)
Generate data Code wise - i.e if I select code Z01.2, I want to extract Patient data (maybe PATID, MOBILE NUMBER OR ANY OTHER COLUMN OR ALL) who have been assigned this code.
Thanks in advance.
Using Python Pandas as follows.
Code
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t')
pattern = '([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['Length'] = df['CODES'].str.len()
print(f"Total Codes: {df['Length'].sum()}")
all_codes = df['CODES'].sum()#.set()
unique_codes = set(all_codes)
print(f'all codes {all_codes}\nCount: {len(all_codes)}')
print(f'unique codes {unique_codes}\nCount: {len(unique_codes)}')
# Select patients with code Z01.2
patients=df[df['CODES'].apply(', '.join).str.contains('Z01.2')]
# Show selected columns
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Explanation
Imported data as tab-delimited CSV
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t'
Resulting DataFrame df
PATID PATIENT_NAME MOBILE_NUMBER EMAIL_ADDRESS GENDER PATIENT_AGE \
0 11 Mac 98765 ab1#gmail.com F 51 Y
1 22 Sac 98766 ab1#gmail.com F 24 Y
2 33 Tac 98767 ab1#gmail.com M 43 Y
3 44 Lac 98768 ab1#gmail.com M 54 Y
DISTRICT CLINIC DIAGNOSIS
0 Mars Clinic1 Z01.2 - Dental examinationC50 - Malignant neop...
1 Moon Clinic2 S83.6 - Sprain and strain of other and unspeci...
2 Earth Clinic3 K60.1 - Chronic anal fissureZ20.9 - Contact wi...
3 Saturn Clinic4 E11 - Type 2 diabetes mellitusE78.5 - Hyperlip...
Extract from DIAGNOSIS column using the specified pattern
Add an escape character before . otherwise, it would be a wildcard and match any character (no difference on data supplied).
pattern = '([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['CODES'] each row in the column is a list of codes
0 [Z01.2, C50 , Z10.0]
1 [S83.6, L05.0, Z20.9]
2 [K60.1, Z20.9, J06.9, C50 ]
3 [E11 , E78.5, I10 , E55 , E79.0, Z24.0, Z01.2]
Name: CODES, dtype: object
Add length column to df DataFrame
df['Length'] = df['CODES'].str.len()
df['Length']--correspond to length of each code list
0 3
1 3
2 4
3 7
Name: Length, dtype: int64
Total Codes Used--sum over the length of codes
df['Length'].sum()
Total Codes: 17
All Codes Used--concatenating all the code lists
all_codes = df['CODES'].sum()
['Z01.2', 'C50 ', 'Z10.0', 'S83.6', 'L05.0', 'Z20.9', 'K60.1', 'Z20.9', 'J06.9', 'C50
', 'E11 ', 'E78.5', 'I10 ', 'E55 ', 'E79.0', 'Z24.0', 'Z01.2']
Count: 17
Unique Codes Used--take the set() of the list of all codes
unique_codes = set(all_codes)
{'L05.0', 'S83.6', 'E79.0', 'Z01.2', 'I10 ', 'J06.9', 'K60.1', 'E11 ', 'Z24.0', 'Z
10.0', 'E55 ', 'E78.5', 'Z20.9', 'C50 '}
Count: 14
Select patients by code (i.e. Z01.2)
patients=df[df['CODES'].apply(', '.join).str.contains('Z01.2')]
Show PATIE, PATIENT_NAME and MOBILE_NUMBER for these patients
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Result
PATID PATIENT_NAME MOBILE_NUMBER
0 11 Mac 98765
3 44 Lac 98768

How to group data in a DataFrame and also show the number of row in that group?

first of all, I have no background in computer language and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
Actually, I want to create a new data frame shows the 'city_number', 'city' (which is a name), and also the number of cafes in the same city. So, it should have 3 columns; 'city_number', 'city' and 'number_of_cafe'
However, I have tried to use the group by but the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']]*5, [['bean_dream', 'boston', '3456']]*4, \
[['coffee_today', 'jersey', '7643']]*3, [['coffee_today', 'DC', '8902']]*3, \
[['starbucks', 'nowwhere', '2674']]*2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
city id cafe_count
0 new_york 1234 5
1 boston 3456 4
2 jersey 7643 3
3 DC 8902 3
4 nowwhere 2674 2
And part of the example dataframe, city_directory.tail(), looks like this:
shop city id
12 coffee_today DC 8902
13 coffee_today DC 8902
14 coffee_today DC 8902
15 starbucks nowwhere 2674
16 starbucks nowwhere 2674
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.

fuzzy matching for SQL using fuzzy wuzzy and pandas

i have the following table in SQL and want to use Fuzzy Wuzzy to compare all the records in the table for any potential duplicates which in this instance line 1 is a duplicate of line 2 (or vice versa). can someone explain how i can add two additional columns to this table (Highest Score and Record Line Num) using Fuzzy Wuzzy and pandas? thanks.
Input:
Vendor Doc Date Invoice Date Invoice Ref Num Invoice Amount
ABC 5/12/2019 5/10/2019 ABCDE56. 56
ABC 5/13/2019 5/10/2019 ABCDE56 56
TIM 4/15/2019 4/10/2019 RTET5SDF 100
Desired Output:
Vendor Doc Date Invoice Date Invoice Ref Num Invoice Amount Highest Score Record Line Num
ABC 5/12/2019 5/10/2019 ABCDE56. 56 96 2
ABC 5/13/2019 5/10/2019 ABCDE56 56 96 1
TIM 4/15/2019 4/10/2019 RTET5SDF 100 0 N/A
Since you are looking for duplicates, you should filter your data frame first using the vendor name. This is to ensure it doesn't match with invoices of other vendors and reduce the processing time. However, since you didn't mention anything about it, you can skip it.
Decide on a threshold for duplicates based on the length of your invoice references. For example if the average is 5 chars, make the threshould 80%. Then, use fuzzywuzzy to get the best match.
from fuzzywuzzy import fuzz, process
# Assuming no NaNs in invoices references
inv_list = df['Invoice Ref'].to_list()
for i, inv in enumerate(inv_list)
result = process.extractOne(inv, inv_list, scorer=fuzz.token_sort_ratio)
if result[1] >= your_threshould:
df.loc[i, 'Highest Score'] = result[1]
df.loc[i, 'Record Line Num'] = inv_list.index(result[0])

Categories