How to extract specific codes from string in separate columns?

How to extract specific codes from string in separate columns? - python

I have data in the following format.
Data
Data Sample Excel
I want to extract the codes from the column "DIAGNOSIS" and paste each code in a separate column after the "DIAGNOSIS" column. I Know the regular expression to be used to match this which is
[A-TV-Z][0-9][0-9AB].?[0-9A-TV-Z]{0,4}
source: https://www.johndcook.com/blog/2019/05/05/regex_icd_codes/
These are called ICD10 codes represented like Z01.2, E11, etc. The Above expression is meant to match all ICD10 codes.
But I am not sure how to use this expression in python code to do the above task.
The problem that I am trying to solve is?
Count the Total number of Codes assigned for all patients?
Count Total number of UNIQUE code assigned (since multiple patients might have same code assigned)
Generate data Code wise - i.e if I select code Z01.2, I want to extract Patient data (maybe PATID, MOBILE NUMBER OR ANY OTHER COLUMN OR ALL) who have been assigned this code.
Thanks in advance.

Using Python Pandas as follows.
Code
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t')
pattern = '([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['Length'] = df['CODES'].str.len()
print(f"Total Codes: {df['Length'].sum()}")
all_codes = df['CODES'].sum()#.set()
unique_codes = set(all_codes)
print(f'all codes {all_codes}\nCount: {len(all_codes)}')
print(f'unique codes {unique_codes}\nCount: {len(unique_codes)}')
# Select patients with code Z01.2
patients=df[df['CODES'].apply(', '.join).str.contains('Z01.2')]
# Show selected columns
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Explanation
Imported data as tab-delimited CSV
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t'
Resulting DataFrame df
PATID PATIENT_NAME MOBILE_NUMBER EMAIL_ADDRESS GENDER PATIENT_AGE \
0 11 Mac 98765 ab1#gmail.com F 51 Y
1 22 Sac 98766 ab1#gmail.com F 24 Y
2 33 Tac 98767 ab1#gmail.com M 43 Y
3 44 Lac 98768 ab1#gmail.com M 54 Y
DISTRICT CLINIC DIAGNOSIS
0 Mars Clinic1 Z01.2 - Dental examinationC50 - Malignant neop...
1 Moon Clinic2 S83.6 - Sprain and strain of other and unspeci...
2 Earth Clinic3 K60.1 - Chronic anal fissureZ20.9 - Contact wi...
3 Saturn Clinic4 E11 - Type 2 diabetes mellitusE78.5 - Hyperlip...
Extract from DIAGNOSIS column using the specified pattern
Add an escape character before . otherwise, it would be a wildcard and match any character (no difference on data supplied).
pattern = '([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['CODES'] each row in the column is a list of codes
0 [Z01.2, C50 , Z10.0]
1 [S83.6, L05.0, Z20.9]
2 [K60.1, Z20.9, J06.9, C50 ]
3 [E11 , E78.5, I10 , E55 , E79.0, Z24.0, Z01.2]
Name: CODES, dtype: object
Add length column to df DataFrame
df['Length'] = df['CODES'].str.len()
df['Length']--correspond to length of each code list
0 3
1 3
2 4
3 7
Name: Length, dtype: int64
Total Codes Used--sum over the length of codes
df['Length'].sum()
Total Codes: 17
All Codes Used--concatenating all the code lists
all_codes = df['CODES'].sum()
['Z01.2', 'C50 ', 'Z10.0', 'S83.6', 'L05.0', 'Z20.9', 'K60.1', 'Z20.9', 'J06.9', 'C50
', 'E11 ', 'E78.5', 'I10 ', 'E55 ', 'E79.0', 'Z24.0', 'Z01.2']
Count: 17
Unique Codes Used--take the set() of the list of all codes
unique_codes = set(all_codes)
{'L05.0', 'S83.6', 'E79.0', 'Z01.2', 'I10 ', 'J06.9', 'K60.1', 'E11 ', 'Z24.0', 'Z
10.0', 'E55 ', 'E78.5', 'Z20.9', 'C50 '}
Count: 14
Select patients by code (i.e. Z01.2)
patients=df[df['CODES'].apply(', '.join).str.contains('Z01.2')]
Show PATIE, PATIENT_NAME and MOBILE_NUMBER for these patients
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Result
PATID PATIENT_NAME MOBILE_NUMBER
0 11 Mac 98765
3 44 Lac 98768

Related

group by similar value of column in dataframe

I'm using a DataFrame that contains sample data on rocks and soils. I want to create 2 separate plots, one for rocks and one for soils, showing SO3 composition relative to SIO2. I created a dictionary of rocks only, but there are still 90+ samples. As it's shown in the figure, some have similar names. For example 'Adirondack' appears 3 times. I could manually go through them all, but that would take a while (P.S. I did, but I would still like to know the easier way than if ... elif ... statements, since I had to manually create a legend entry to avoid many duplicate entries).
How can I just group together the ones with the same x letters and save them in a new dataframe or my dictionary as just 'Adirondack (all)', for example (take the part of the name before the '_' perhaps, so that it will appear in the legend that way), and have the three sets of values for 'Adirondack_' etc. in one dictionary entry.
Rocks = APXSData[APXSData.Type.str.contains('R')]
RockLabels = Rocks['Sample'].to_list()
RockDict = {}
for i in RockLabels:
SiO2val = np.extract(Rocks["Sample"]==i, Rocks["SiO2"])
SO3val = np.extract(Rocks["Sample"]==i, Rocks["SO3"])
newKey = i
RockDict[newKey] = {'SiO2':SiO2val, 'SO3':SO3val}
DatabyRockSample = pd.DataFrame(RockDict)
fig = plt.figure()
for i in RockLabels:
plt.scatter(
DatabyRockSample[i]["SiO2"],
DatabyRockSample[i]["SO3"],
marker='o',
label = i) #, color = colors[count], edgecolors = edgecolor[count],
plt.xlabel("SiO$_2$", labelpad = 10)
plt.ylabel("SO$_3$", labelpad = 10)
plt.title('Composition of all rocks \n at Gusev Crater')
plt.legend()

Let's prepare some dummy data:
df = pd.DataFrame({
'Sol': [14,18,33,34,41],
'Type': ['SU','RU','RB','RR','SU'],
'Sample': ['Gusev_Soil','Adirondack_asis','Adirondack_brush','Adirondack_RAT','Gusev_Other'],
'N': [45,126,129,128,76],
'Na2O': [2.8,2.3,2.8,2.4,2.7],
# ...
})
So here's our data frame:
Sol Type Sample N Na2O
0 14 SU Gusev_Soil 45 2.8
1 18 RU Adirondack_asis 126 2.3
2 33 RB Adirondack_brush 129 2.8
3 34 RR Adirondack_RAT 128 2.4
4 41 SU Gusev_Other 76 2.7
We can use grouping in this way.
If the only option we have is matching first n letters, then:
n = 5
grouper = df['Sample'].str[:n]
groups = {name: group for name, group in df.groupby(grouper)}
If we can extract meaningful data by splitting, which is better I think, then:
# in this case we can split by '_' and get the first word
grouper = df['Sample'].str.split('_').str.get(0)
groups = {name: group for name, group in df.groupby(grouper)}
If splitting isn't that simple, say our words are separated by space, underscore or hyphen, then we could use str.extract method:
grouper = df['Sample'].str.extract('\A(.*)(?=[ _-])')
groups = {name: group for name, group in df.groupby(grouper)}
We can also avoid creating dictionaries. Let's see how we can iterate over the groups obtained by splitting as an example:
grouper = df['Sample'].str.split('_').str.get(0)
groups = df.groupby(grouper)
for name, dataframe in groups:
print(f'name: {name}')
print(dataframe, '\n')
Output:
name: Adirondack
Sol Type Sample N Na2O
1 18 RU Adirondack_asis 126 2.3
2 33 RB Adirondack_brush 129 2.8
3 34 RR Adirondack_RAT 128 2.4
name: Gusev
Sol Type Sample N Na2O
0 14 SU Gusev_Soil 45 2.8
4 41 SU Gusev_Other 76 2.7
The same with rocks. IMO we can do better than APXSData.Type.str.contains('R'):
APXSData['Type'].str[0] == 'R'
APXSData['Type'].str.startswith('R')
Let's separate rocks and group them by the leading name:
is_rock = df['Type'].str.startswith('R')
grouper = df['Sample'].str.split('_').str.get(0)
groups_of_rocks = df[is_rock].groupby(grouper)
for k,v in groups_of_rocks:
print(k)
print(v)
Output:
Adirondack
Sol Type Sample N Na2O
1 18 RU Adirondack_asis 126 2.3
2 33 RB Adirondack_brush 129 2.8
3 34 RR Adirondack_RAT 128 2.4
To plot data for some group of interest only, we can use get_group(name):
groups.get_group('Adirondack').plot.bar(x='Sample', y=['N','Na2O'])
See also:
detail about str in pandas
pandas.Series.split
pandas.Series.str.get
pandas.Series.str.extract
regex in python
run help('pandas.core.strings.StringMethods') to see help offline

Pandas: How to remove words in string which appear before a certain word from another column

I have a large csv file with a column containing strings. At the beginning of these strings there are a set of id numbers which appear in another column as below.
0 Home /buy /York /Warehouse /P000166770Ou... P000166770
1 Home /buy /York /Plot /P000165923A plot of la... P000165923
2 Home /buy /London /Commercial /P000165504A str... P000165504
...
804 Brand new apartment on the first floor, situat... P000185616
I want to remove all text which appears before the ID number so here we would get:
0 Ou...
1 A plot of la...
2 A str...
...
804 Brand new apartment on the first floor, situat...
I tried something like
df['column_one'].str.split(df['column_two'])
and
df['column_one'].str.replace(df['column_two'],'')

You could replace the pattern using regex as follows:
>> my_pattern = "^(Alpha|Beta|QA|Prod)\s[A-Z0-9]{7}"
>> my_series = pd.Series(['Alpha P17089OText starts here'])
>> my_series.str.replace(my_pattern, '', regex=True)
0 Text starts here
There is a bit of work to be done to determine the nature of your pattern. I would suggest experimenting a bit with https://regex101.com/

To extend your split() idea:
df.apply(lambda x: x['column_one'].split(x['column_two'])[1], axis=1)
0 Text starts here

I managed to get it to work using:
df.apply(lambda x: x['column1'].split(x['column2'])[1] if x['column2'] in x['column1'] else x['column1'], axis=1)
This also works when the ID is not in the description. Thanks for the help!

Here is one way to do it, by applying regex to each of the row based on the code
import re
def ext(row):
mch = re.findall(r"{0}(.*)".format(row['code']), row['txt'])
if len(mch) >0:
rtn = mch.pop()
else:
rtn = row['txt']
return rtn
df['ext'] = df.apply(ext, axis=1)
df
0 Ou...
1 A plot of la...
2 A str...
3 Brand new apartment on the first floor situat...
x txt code ext
0 0 Home /buy /York /Warehouse / P000166770 Ou... P000166770 Ou...
1 1 Home /buy /York /Plot /P000165923A plot of la... P000165923 A plot of la...
2 2 Home /buy /London /Commercial /P000165504A str... P000165504 A str...
3 804 Brand new apartment on the first floor situat... P000185616 Brand new apartment on the first floor situat...

Python - Matching and extracting data from excel with pandas

I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():

Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
countryCode
Carrier
1246
LIME
1
AT&T
81
Softbank
52
Telmex
966
Zain
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
Then if you want to parse phone numbers, don't reinvent the wheel, just use this phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'

Let's assume the following input:
>>> df1
Number
0 8155555555
1 12465555555
2 12135555555
3 96655555555
4 525555555555
>>> df2
countryCode Carrier
0 1246 LIME
1 1 AT&T
2 81 Softbank
3 52 Telmex
4 966 Zain
First we need to rework a bit df2 to sort the countryCode in descending order, make it as string and set it to index.
The trick for later is to sort countryCode in descending order. This will ensure that a longer country codes, such as "1246" is matched before a shorter one like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
Carrier
countryCode
1246 LIME
966 Zain
81 Softbank
52 Telmex
1 AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
)
output:
0 Softbank
1 LIME
2 AT&T
3 Zain
4 Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
).to_clipboard(0)
output:
Number carrier
0 8155555555 Softbank
1 12465555555 LIME
2 12135555555 AT&T
3 96655555555 Zain
4 525555555555 Telmex

If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
Extract first characters of a Number column. Hint: The nbr_of_chars variable should be based on the maximum character length of the column countryCode in the carriers.xlsx
nbr_of_chars = 4
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].str[:nbr_of_chars]
Then the matching should be fairly easy with dataframe joins.

I can think only of an inefficient solution.
First, sort the data frame of carriers in the reverse alphabetical order of country codes. That way, longer prefixes will be closer to the beginning.
codes = xl_2.sort_values('countryCode', ascending=False)
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
def cc2carrier(num):
matches = codes['countryCode'].apply(lambda x: num.startswith(x))
if not matches.any(): #Not found
return np.nan
return codes.loc[matches.idxmax()]['Carrier']
Now, apply the function to the numbers dataframe:
xl_1['Number'].apply(cc2carrier)
#1 Softbank
#2 LIME
#3 AT&T
#4 Zain
#5 Telmex
#Name: Number, dtype: object

How to split two first names that together in two different words in python

I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are together into two different words.
For example, if the misspelled name is trujillohernandez then to be separated to trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checkers libraries do not work given that these are first names and they are Hispanic names.
I would be really grateful if you can help to develop some sort of function to make it happen.

As noted in the comments above not having a list of possible names will cause a problem. However, and perhaps not perfect, but to offer something try...
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required but perhaps it gets the majority of names split.

Pandas - Extract a string starting with a particular character

It should be fairly simple yet I'm not able to achieve it.
I have a dataframe df1, having a column "name_str". Example below:
name_str
0 alp:ha
1 bra:vo
2 charl:ie
I have to create another column that would comprise - say 5 characters - that start after the colon (:). I've written the following code:
import pandas as pd
data = {'name_str':["alp:ha", "bra:vo", "charl:ie"]}
#indx = ["name_1",]
df1 = pd.DataFrame(data=data)
n= df1['name_str'].str.find(":")+1
df1['slize'] = df1['name_str'].str.slice(n,2)
print(df1)
But the output is disappointing: NaanN
name_str slize
0 alp:ha NaN
1 bra:vo NaN
2 charl:ie NaN
The output should've been:
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Would anyone please help? Appreciate it.

You can use str.extract to extract everything after the colon with this regular expression: :(.*)
df1['slize'] = df1.name_str.str.extract(':(.*)')
>>> df1
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Edit, based on your updated question
If you'd like to extract up to 5 characters after the colon, then you can use this modification:
df['slize'] = df1.name_str.str.extract(':(.{,5})')

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to extract specific codes from string in separate columns? - python

Related

group by similar value of column in dataframe

Pandas: How to remove words in string which appear before a certain word from another column

Python - Matching and extracting data from excel with pandas

How to split two first names that together in two different words in python

Pandas - Extract a string starting with a particular character

Categories

Resources