group by similar value of column in dataframe

group by similar value of column in dataframe - python

I'm using a DataFrame that contains sample data on rocks and soils. I want to create 2 separate plots, one for rocks and one for soils, showing SO3 composition relative to SIO2. I created a dictionary of rocks only, but there are still 90+ samples. As it's shown in the figure, some have similar names. For example 'Adirondack' appears 3 times. I could manually go through them all, but that would take a while (P.S. I did, but I would still like to know the easier way than if ... elif ... statements, since I had to manually create a legend entry to avoid many duplicate entries).
How can I just group together the ones with the same x letters and save them in a new dataframe or my dictionary as just 'Adirondack (all)', for example (take the part of the name before the '_' perhaps, so that it will appear in the legend that way), and have the three sets of values for 'Adirondack_' etc. in one dictionary entry.
Rocks = APXSData[APXSData.Type.str.contains('R')]
RockLabels = Rocks['Sample'].to_list()
RockDict = {}
for i in RockLabels:
SiO2val = np.extract(Rocks["Sample"]==i, Rocks["SiO2"])
SO3val = np.extract(Rocks["Sample"]==i, Rocks["SO3"])
newKey = i
RockDict[newKey] = {'SiO2':SiO2val, 'SO3':SO3val}
DatabyRockSample = pd.DataFrame(RockDict)
fig = plt.figure()
for i in RockLabels:
plt.scatter(
DatabyRockSample[i]["SiO2"],
DatabyRockSample[i]["SO3"],
marker='o',
label = i) #, color = colors[count], edgecolors = edgecolor[count],
plt.xlabel("SiO$_2$", labelpad = 10)
plt.ylabel("SO$_3$", labelpad = 10)
plt.title('Composition of all rocks \n at Gusev Crater')
plt.legend()

Let's prepare some dummy data:
df = pd.DataFrame({
'Sol': [14,18,33,34,41],
'Type': ['SU','RU','RB','RR','SU'],
'Sample': ['Gusev_Soil','Adirondack_asis','Adirondack_brush','Adirondack_RAT','Gusev_Other'],
'N': [45,126,129,128,76],
'Na2O': [2.8,2.3,2.8,2.4,2.7],
# ...
})
So here's our data frame:
Sol Type Sample N Na2O
0 14 SU Gusev_Soil 45 2.8
1 18 RU Adirondack_asis 126 2.3
2 33 RB Adirondack_brush 129 2.8
3 34 RR Adirondack_RAT 128 2.4
4 41 SU Gusev_Other 76 2.7
We can use grouping in this way.
If the only option we have is matching first n letters, then:
n = 5
grouper = df['Sample'].str[:n]
groups = {name: group for name, group in df.groupby(grouper)}
If we can extract meaningful data by splitting, which is better I think, then:
# in this case we can split by '_' and get the first word
grouper = df['Sample'].str.split('_').str.get(0)
groups = {name: group for name, group in df.groupby(grouper)}
If splitting isn't that simple, say our words are separated by space, underscore or hyphen, then we could use str.extract method:
grouper = df['Sample'].str.extract('\A(.*)(?=[ _-])')
groups = {name: group for name, group in df.groupby(grouper)}
We can also avoid creating dictionaries. Let's see how we can iterate over the groups obtained by splitting as an example:
grouper = df['Sample'].str.split('_').str.get(0)
groups = df.groupby(grouper)
for name, dataframe in groups:
print(f'name: {name}')
print(dataframe, '\n')
Output:
name: Adirondack
Sol Type Sample N Na2O
1 18 RU Adirondack_asis 126 2.3
2 33 RB Adirondack_brush 129 2.8
3 34 RR Adirondack_RAT 128 2.4
name: Gusev
Sol Type Sample N Na2O
0 14 SU Gusev_Soil 45 2.8
4 41 SU Gusev_Other 76 2.7
The same with rocks. IMO we can do better than APXSData.Type.str.contains('R'):
APXSData['Type'].str[0] == 'R'
APXSData['Type'].str.startswith('R')
Let's separate rocks and group them by the leading name:
is_rock = df['Type'].str.startswith('R')
grouper = df['Sample'].str.split('_').str.get(0)
groups_of_rocks = df[is_rock].groupby(grouper)
for k,v in groups_of_rocks:
print(k)
print(v)
Output:
Adirondack
Sol Type Sample N Na2O
1 18 RU Adirondack_asis 126 2.3
2 33 RB Adirondack_brush 129 2.8
3 34 RR Adirondack_RAT 128 2.4
To plot data for some group of interest only, we can use get_group(name):
groups.get_group('Adirondack').plot.bar(x='Sample', y=['N','Na2O'])
See also:
detail about str in pandas
pandas.Series.split
pandas.Series.str.get
pandas.Series.str.extract
regex in python
run help('pandas.core.strings.StringMethods') to see help offline

Related

Pandas Fuzzy Matching

I want to check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. However, it seems that it takes a long time to go through the addresses and perform the calculations. There are 15000+ addresses in my main dataframe and around 50 addresses in my reference dataframe. It ran for 5 minutes and still hadn't finished.
My code is:
import pandas as pd
from fuzzywuzzy import fuzz, process
### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")
#### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")
### Variable for accuracy scoring
accuracy = 0
for index, value in df["address"].iteritems():
### This gathers the index from the correct address column in the reference df
ref_index = ref_df["correct_address"][
ref_df["correct_address"]
== process.extractOne(value, ref_df["correct_address"])[0]
].index.toList()[0]
### if each row can score a max total of 1, the ratio must be divided by 100
accuracy += (
fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
)
Is this the best way to loop through a column in a dataframe and fuzzy match it against another? I want the score to be a ratio because later I will then output an excel file with the correct values and a background colour to indicate what values were wrong and changed.
I don't believe fuzzywuzzy has a method that allows you to pull the index, value and ration into one tuple - just value and ratio of match.

Hopefully the below code (with links to dummy data) helps show what is possible. I tried to use street addresses to mock up a similar situation so it is easier to compare with your dataset; obviously it is no where near as big.
You can pull the csv text from the links in the comments and run it and see what could work on your larger sample.
For five addresses in the reference frame and 100 contacts in the other its execution timings are:
CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms
The below code should be quicker than .iteritems() etc.
Code:
# %%time
import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib
# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')
# create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')
# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
max_score = -1
max_add = ''
for x in list_add:
score = fuzz.ratio(add, x)
if (score > min_score) & (score > max_score):
max_add = x
max_score = score
return (max_add, max_score)
# given current row of ref_df (via Apply) and series (df['address'])
# return the fuzzywuzzy score
def scoringMatches(x, s):
o = process.extractOne(x, s, score_cutoff = 60)
if o != None:
return o[1]
# creating two lists from address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())
# via fuzzywuzzy matching and using scoringMatches() above
# return a dictionary of addresses where there is a match
# the keys are the address from ref_df and the associated value is from df (i.e., 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}
names = []
for x in ref_addresses:
match = match_addresses(x, contacts_addresses, 75)
if match[1] >= 75:
name = (str(x), str(match[0]))
names.append(name)
name_dict = dict(names)
# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
# add fuzzywuzzy scoring to original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)
# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()
# add difflib scoring for a bit of interest.
# a random thought passed through my head maybe this is interesting?
compare_df['difflib_score'] = compare_df.apply(lambda x : difflib.SequenceMatcher\
(None, x['ref_address'], x['matched_address']).ratio(),axis=1)
# clean up column ordering ('correct_address' and 'ref_address' are basically
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',\
'fuzzywuzzy_score', 'difflib_score']]
# see what we've got
print(compare_df)
# remember: correct_address and ref_address are copies
# so just pick one to compare to matched_address
correct_address ref_address matched_address \
0 86 Nw 66th Street #8673 86 Nw 66th Street #8673 86 Nw 66th St #8673
1 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230 2737 Pistorio Rd #9230
2 6649 N Blue Gum St 6649 N Blue Gum St 6649 N Blue Gum St
3 59 n Groesbeck Hwy 59 n Groesbeck Hwy 59 N Groesbeck Hwy
4 1 Central Avenue 1 Central Avenue 1 Central Ave
fuzzywuzzy_score difflib_score
0 90 0.904762
1 100 1.000000
2 100 1.000000
3 100 0.944444
4 90 0.896552

Python - Matching and extracting data from excel with pandas

I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():

Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
countryCode
Carrier
1246
LIME
1
AT&T
81
Softbank
52
Telmex
966
Zain
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
Then if you want to parse phone numbers, don't reinvent the wheel, just use this phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'

Let's assume the following input:
>>> df1
Number
0 8155555555
1 12465555555
2 12135555555
3 96655555555
4 525555555555
>>> df2
countryCode Carrier
0 1246 LIME
1 1 AT&T
2 81 Softbank
3 52 Telmex
4 966 Zain
First we need to rework a bit df2 to sort the countryCode in descending order, make it as string and set it to index.
The trick for later is to sort countryCode in descending order. This will ensure that a longer country codes, such as "1246" is matched before a shorter one like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
Carrier
countryCode
1246 LIME
966 Zain
81 Softbank
52 Telmex
1 AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
)
output:
0 Softbank
1 LIME
2 AT&T
3 Zain
4 Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
).to_clipboard(0)
output:
Number carrier
0 8155555555 Softbank
1 12465555555 LIME
2 12135555555 AT&T
3 96655555555 Zain
4 525555555555 Telmex

If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
Extract first characters of a Number column. Hint: The nbr_of_chars variable should be based on the maximum character length of the column countryCode in the carriers.xlsx
nbr_of_chars = 4
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].str[:nbr_of_chars]
Then the matching should be fairly easy with dataframe joins.

I can think only of an inefficient solution.
First, sort the data frame of carriers in the reverse alphabetical order of country codes. That way, longer prefixes will be closer to the beginning.
codes = xl_2.sort_values('countryCode', ascending=False)
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
def cc2carrier(num):
matches = codes['countryCode'].apply(lambda x: num.startswith(x))
if not matches.any(): #Not found
return np.nan
return codes.loc[matches.idxmax()]['Carrier']
Now, apply the function to the numbers dataframe:
xl_1['Number'].apply(cc2carrier)
#1 Softbank
#2 LIME
#3 AT&T
#4 Zain
#5 Telmex
#Name: Number, dtype: object

How to read space delimited data, two row types, no fixed width and plenty of missing values?

There's lots of good information out there on how to read space-delimited data with missing values if the data is fixed-width.
http://jonathansoma.com/lede/foundations-2017/pandas/opening-fixed-width-files/
Reading space delimited file in Python/Pandas with missing values
ASCII table with consecutive white-spaces as separators and missing data python pandas
I'm currently trying to read Japan's Meteorological Agency typhoon history data which is supposed to have this format, but doesn't actually:
# Header rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII
# Data rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P
It is very similar to NOAA's hurricane best track data, except that it comma delimited, and missing values were given -999 or NaN, which simplified reading the data. Additionally, Japan's data doesn't actually follow the advertised format. For example, column FFFF in the data rows don't always have width 4. Sometimes it has width 3.
I must say that I'm at a complete loss as how to process this data into a dataframe. I've investigated the pd.read_fwf method, and it initially looked promising until I discovered the malformed columns and the two different row types.
My question:
How can I approach cleaning this data and getting it into a dataframe? I'd just find a different dataset, but honestly I can't find any comprehensive typhoon data anywhere else.

I went a little deep for you here, because I'm assuming you're doing this in the name of science and if I can help someone trying to understand climate change then its a good cause.
After looking the data over I've noticed the issue is relating to the data being stored in a de-normalized structure. There are 2 ways you can approach this issue off the top of my head. Re-Writing the file to another file to load into pandas or dask is what I'll show, since thats probably the easiest way to think about it (but certainly not the most efficient for those that will inevitably roast me in the comments)
Think of this like its Two Separate Tables, with a 1-to-Many relationship. 1 table for Typhoons and another for the data belonging to a given typhoon.
A decent, but not really efficient way would be to rewrite it to a better nested structure, like JSON. Then load the data in using that. Note the 2 distinct types of columns.
Step 1: map the data out
There are really 2 tables in one table here. Each typhoon is going to show up as a row that appears like this:
66666 9119 150 0045 9119 0 6 MIRREILE 19920701
While the records for that typhoon are going to follow that row (think of this as a separate row:
20080100 002 3 178 1107 994 035 00000 0000 30600 0200
Load the File in, reading it as raw lines. By using the .readlines() method, we can read each individual line in as an item in a list.
# load the file as raw input
with open('./test.txt') as f:
lines = f.readlines()
Now that we have that read in, we're going to need to perform some logic to separate some lines from others. It appears the every time there is a Typhoon record, the line is preceded with a '66666', so lets key off that. So, given we look at each individual line in a horribly inefficient loop, we can write some if/else logic to have a look:
if row[:5] == '66666':
# do stuff
else:
# do other stuff
Thats going to be a pretty solid way to separate that logic for now, which will be useful to guide splitting that up. Now, we need to write a loop that will check that for each row:
# initialize list of dicts
collection = []
def write_typhoon(row: str, collection: Dict) -> Dict:
if row[:5] == '66666':
# do stuff
else:
# do other stuff
# read through lines list from the .readlines(), looping sequentially
for line in lines:
write_typhoon(line, collection)
Lastly, we're going to need to write some logic to now extract that data out in some manner within the if/then loop inside the write_typhoon() function. I didn't care to do a whole lot of thinking here, and opted for the simplest I could make it: defining the fwf metadata myself. because "yolo":
def write_typhoon(row: str, collection: Dict) -> Dict:
if row[:5] == '66666':
typhoon = {
"AA":row[:5],
"BB":row[6:11],
"CC":row[12:15],
"DD":row[16:20],
"EE":row[21:25],
"FF":row[26:27],
"GG":row[28:29],
"HH":row[30:50],
"II":row[51:],
"data":[]
}
# clean that whitespace
for key, value in typhoon.items():
if key != 'data':
typhoon[key] = value.strip()
collection.append(typhoon)
else:
sub_data = {
"A":row[:9],
"B":row[9:12],
"C":row[13:14],
"D":row[15:18],
"E":row[19:23],
"F":row[24:32],
"G":row[33:40],
"H":row[41:42],
"I":row[42:46],
"J":row[47:51],
"K":row[52:53],
"L":row[54:57],
"M":row[58:70],
"P":row[71:]
}
# clean that whitespace
for key, value in sub_data.items():
sub_data[key] = value.strip()
collection[-1]['data'].append(sub_data)
return collection
Okay that took me longer than I'm willing to admit. I wont lie. Gave me PTSD flashbacks from writing COBOL programs...
Anyway, now we have a nice, nested data structure in native python types. The fun can begin!
Step 2: Load this into a usable format
To analyze it, I'm assuming you'll want it in pandas (or maybe Dask if its too big). Here is what I was able to come up with along that front:
import pandas as pd
df = pd.json_normalize(
collection,
record_path='data',
meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"]
)
A great reference for that can be found in the answers for this question (particularly the second one, not the selected one)
Put it all together now:
from typing import Dict
import pandas as pd
# load the file as raw input
with open('./test.txt') as f:
lines = f.readlines()
# initialize list of dicts
collection = []
def write_typhoon(row: str, collection: Dict) -> Dict:
if row[:5] == '66666':
typhoon = {
"AA":row[:5],
"BB":row[6:11],
"CC":row[12:15],
"DD":row[16:20],
"EE":row[21:25],
"FF":row[26:27],
"GG":row[28:29],
"HH":row[30:50],
"II":row[51:],
"data":[]
}
for key, value in typhoon.items():
if key != 'data':
typhoon[key] = value.strip()
collection.append(typhoon)
else:
sub_data = {
"A":row[:9],
"B":row[9:12],
"C":row[13:14],
"D":row[15:18],
"E":row[19:23],
"F":row[24:32],
"G":row[33:40],
"H":row[41:42],
"I":row[42:46],
"J":row[47:51],
"K":row[52:53],
"L":row[54:57],
"M":row[58:70],
"P":row[71:]
}
for key, value in sub_data.items():
sub_data[key] = value.strip()
collection[-1]['data'].append(sub_data)
return collection
# read through file sequentially
for line in lines:
write_typhoon(line, collection)
# load to pandas df using json_normalize
df = pd.json_normalize(
collection,
record_path='data',
meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"]
)
print(df.head(20)) # lets see what we've got!

There's someone who might have had the same problem and created a library for it, you can check it out here:
https://github.com/miniufo/besttracks
It also includes a quickstart notebook with loading the same dataset.

Here is how I ended up doing it. The key was realizing there are two types of rows in the data, but within each type the columns are fixed width:
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
So, here's how it went. I wrote these two functions to help me reformat the text file int CSV format:
def get_idxs(string, char):
idxs = []
for i in range(len(string)):
if string[i - 1].isalpha() and string[i] == char:
idxs.append(i)
return idxs
def replace(string, idx, replacement):
string = list(string)
try:
for i in idx: string[i] = replacement
except TypeError:
string[idx] = replacement
return ''.join(string)
# test it out
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
header_idxs = get_idxs(header_fmt, ' ')
track_idxs = get_idxs(track_fmt, ' ')
print(replace(header_fmt, header_idxs, ','))
print(replace(track_fmt, track_idxs, ','))
Testing the function on the format strings, we see commas were put in the appropriate places:
AAAAA,BBBB, CCC,DDDD,EEEE,F,G,HHHHHHHHHHHHHHHHHHHH, IIIIIIII
AAAAAAAA,BBB,C,DDD,EEEE,FFFF, GGG, HIIII,JJJJ,KLLLL,MMMM, P
So next apply those functions to the .txt and create a .csv file with the output:
from contextlib import ExitStack
from tqdm.notebook import tqdm
with ExitStack() as stack:
read_file = stack.enter_context(open('data/bst_all.txt', 'r'))
write_file = stack.enter_context(open('data/bst_all_clean.txt', 'a'))
for line in tqdm(read_file.readlines()):
if ' ' in line[:8]: # line is header data
write_file.write(replace(line, header_idxs, ',') + '\n')
else: # line is track data
write_file.write(replace(line, track_idxs, ',') + '\n')
The next task is to add the header data to ALL rows, so that all rows have the same format:
header_cols = ['indicator', 'international_id', 'n_tracks', 'cyclone_id', 'international_id_dup',
'final_flag', 'delta_t_fin', 'name', 'last_revision']
track_cols = ['date', 'indicator', 'grade', 'latitude', 'longitude', 'pressure', 'max_wind_speed',
'dir_long50', 'long50', 'short50', 'dir_long30', 'long30', 'short30', 'jp_landfall']
data = pd.read_csv('data/bst_all_clean.txt', names=track_cols, skipinitialspace=True)
data.date = data.date.astype('string')
# Get headers. Header rows have variable 'indicator' which is 5 characters long.
headers = data[data.date.apply(len) <= 5]
data[['storm_id', 'records', 'name']] = headers.iloc[:, [1, 2, 7]]
# Rearrange columns; bring identifiers to the first three columns.
cols = list(data.columns[-3:]) + list(data.columns[:-3])
data = data[cols]
# front fill NaN's for header data
data[['storm_id', 'records', 'name']] = data[['storm_id', 'records', 'name']].fillna(method='pad')
# delete now extraneous header rows
data = data.drop(headers.index)
And that yields some nicely formatted data, like this:
storm_id records name date indicator grade latitude longitude
15 5102.0 37.0 GEORGIA 51031900 2 2 67.0 1614
16 5102.0 37.0 GEORGIA 51031906 2 2 70.0 1625
17 5102.0 37.0 GEORGIA 51031912 2 2 73.0 1635

How to extract specific codes from string in separate columns?

I have data in the following format.
Data
Data Sample Excel
I want to extract the codes from the column "DIAGNOSIS" and paste each code in a separate column after the "DIAGNOSIS" column. I Know the regular expression to be used to match this which is
[A-TV-Z][0-9][0-9AB].?[0-9A-TV-Z]{0,4}
source: https://www.johndcook.com/blog/2019/05/05/regex_icd_codes/
These are called ICD10 codes represented like Z01.2, E11, etc. The Above expression is meant to match all ICD10 codes.
But I am not sure how to use this expression in python code to do the above task.
The problem that I am trying to solve is?
Count the Total number of Codes assigned for all patients?
Count Total number of UNIQUE code assigned (since multiple patients might have same code assigned)
Generate data Code wise - i.e if I select code Z01.2, I want to extract Patient data (maybe PATID, MOBILE NUMBER OR ANY OTHER COLUMN OR ALL) who have been assigned this code.
Thanks in advance.

Using Python Pandas as follows.
Code
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t')
pattern = '([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['Length'] = df['CODES'].str.len()
print(f"Total Codes: {df['Length'].sum()}")
all_codes = df['CODES'].sum()#.set()
unique_codes = set(all_codes)
print(f'all codes {all_codes}\nCount: {len(all_codes)}')
print(f'unique codes {unique_codes}\nCount: {len(unique_codes)}')
# Select patients with code Z01.2
patients=df[df['CODES'].apply(', '.join).str.contains('Z01.2')]
# Show selected columns
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Explanation
Imported data as tab-delimited CSV
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t'
Resulting DataFrame df
PATID PATIENT_NAME MOBILE_NUMBER EMAIL_ADDRESS GENDER PATIENT_AGE \
0 11 Mac 98765 ab1#gmail.com F 51 Y
1 22 Sac 98766 ab1#gmail.com F 24 Y
2 33 Tac 98767 ab1#gmail.com M 43 Y
3 44 Lac 98768 ab1#gmail.com M 54 Y
DISTRICT CLINIC DIAGNOSIS
0 Mars Clinic1 Z01.2 - Dental examinationC50 - Malignant neop...
1 Moon Clinic2 S83.6 - Sprain and strain of other and unspeci...
2 Earth Clinic3 K60.1 - Chronic anal fissureZ20.9 - Contact wi...
3 Saturn Clinic4 E11 - Type 2 diabetes mellitusE78.5 - Hyperlip...
Extract from DIAGNOSIS column using the specified pattern
Add an escape character before . otherwise, it would be a wildcard and match any character (no difference on data supplied).
pattern = '([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['CODES'] each row in the column is a list of codes
0 [Z01.2, C50 , Z10.0]
1 [S83.6, L05.0, Z20.9]
2 [K60.1, Z20.9, J06.9, C50 ]
3 [E11 , E78.5, I10 , E55 , E79.0, Z24.0, Z01.2]
Name: CODES, dtype: object
Add length column to df DataFrame
df['Length'] = df['CODES'].str.len()
df['Length']--correspond to length of each code list
0 3
1 3
2 4
3 7
Name: Length, dtype: int64
Total Codes Used--sum over the length of codes
df['Length'].sum()
Total Codes: 17
All Codes Used--concatenating all the code lists
all_codes = df['CODES'].sum()
['Z01.2', 'C50 ', 'Z10.0', 'S83.6', 'L05.0', 'Z20.9', 'K60.1', 'Z20.9', 'J06.9', 'C50
', 'E11 ', 'E78.5', 'I10 ', 'E55 ', 'E79.0', 'Z24.0', 'Z01.2']
Count: 17
Unique Codes Used--take the set() of the list of all codes
unique_codes = set(all_codes)
{'L05.0', 'S83.6', 'E79.0', 'Z01.2', 'I10 ', 'J06.9', 'K60.1', 'E11 ', 'Z24.0', 'Z
10.0', 'E55 ', 'E78.5', 'Z20.9', 'C50 '}
Count: 14
Select patients by code (i.e. Z01.2)
patients=df[df['CODES'].apply(', '.join).str.contains('Z01.2')]
Show PATIE, PATIENT_NAME and MOBILE_NUMBER for these patients
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Result
PATID PATIENT_NAME MOBILE_NUMBER
0 11 Mac 98765
3 44 Lac 98768

How to fill rows automatically in pandas, from the content found in a column?

In Python3 and pandas have a dataframe with dozens of columns and lines about food characteristics. Below is a summary:
alimentos = pd.read_csv("alimentos.csv",sep=',',encoding = 'utf-8')
alimentos.reset_index()
index alimento calorias
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
The column "alimento" (food) has the lines "iogurte", "sardinha", "manteiga", "maçã" and "milho", which are food names.
I need to create a new column in this dataframe, which will tell what kind of food is. I gave the name "classificacao"
alimentos['classificacao'] = ""
alimentos.reset_index()
index alimento calorias classificacao
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
Depending on the content found in the "alimento" column I want to automatically fill the rows of the "classificacao" column
For example, when finding "iogurte" fill -> "laticinio". When find "sardinha" -> "peixe". By finding "manteiga" -> "gordura animal". When finding "maçã" -> "fruta". And by finding "milho" -> "cereal"
Please, is there a way to automatically fill the rows when I find these strings?

If you have a mapping of all the possible values in the "alimento" column, you can just create a dictionary and use .map(d), as shown below:
df = pd.DataFrame({'alimento': ['iogurte','sardinha', 'manteiga', 'maçã', 'milho'],
'calorias':range(10,60,10)})
d = {"iogurte":"laticinio", "sardinha":"peixe", "manteiga":"gordura animal", "maçã":"fruta", "milho": "cereal"}
df['classificacao'] = df['alimento'].map(d)
However, in real life often we can't map everything in a dict (because of outliers that occur once in a blue moon, faulty inputs, etc.), and in which case the above would return NaN in the "classificacao" column. This could cause some issues, so think about setting a default value, like "Other" or "Unknown". To to that, just append .fillna("Other") after map(d).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.