trim string to first space python - python

I have a dataframe of this style:
id patient_full_name
7805 TOMAS FRANCONI
7810 Camila Gualtieri
7821 Lola Borrego
7823 XIMENA ALVAREZ LANUS
7824 MONICA VIVIANA RODRIGUEZ DE MARENGO
I need to save the first name from the values in the second column. I want to trim each value down to the first space and I don't know how.
I would like it to stay in a structure like this:
patients_names = ["TOMAS","CAMILA","LOLA","XIMENA","MONICA",...."N-NAME"]
All of this done in pandas with Python.

You can use the split function in a list comprehension to do this:
import pandas as pd

df = pd.DataFrame([
    {"id": 7805, "patient_full_name": "TOMAS FRANCONI"},
    {"id": 7810, "patient_full_name": "Camila Gualtieri"},
    {"id": 7821, "patient_full_name": "Lola Borrego"}
])
df["first_name"] = [n.split(" ")[0] for n in df["patient_full_name"]]
That adds a column (first_name) with the output you wanted, which you can then pull off as a list or series if you want:
first_name_as_series = df["first_name"]
first_name_as_list = list(df["first_name"])
In your question, you show the desired output in all upper case. That's easy to get with a simple tweak to the list comprehension:
df["first_name"] = [n.split(" ")[0].upper() for n in df["patient_full_name"]]

You can do it with str.extract as well, which does not rely on a Python loop (note the column is patient_full_name, and the pattern grabs everything up to the first space):
(df
 .assign(first_name=lambda x: x.patient_full_name.str.extract(r"([^ ]+)", expand=False))
)
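For completeness, a minimal fully vectorized sketch (assuming the same df and column name as above) that also upper-cases the result and pulls it out as the list shown in the question:
patients_names = (
    df["patient_full_name"]
    .str.split(" ")   # split each full name on spaces
    .str[0]           # keep only the first token
    .str.upper()      # match the upper-case output in the question
    .tolist()
)
print(patients_names)  # ['TOMAS', 'CAMILA', 'LOLA']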

Related

How do you remove similar (not duplicated) rows in pandas dataframe using text similarity?

I have thousands of rows of data that may or may not be similar to each other. Using pandas' default drop_duplicates() doesn't really help, since it only detects exact duplicates. For example, what if my data contains something like these:
Hey, good morning!
Hey, good morning.
Python wouldn't detect them as duplicates. There are so many variations of this that simply cleaning the text wouldn't suffice, so I opted for text similarity.
I have tried the following code,
import textdistance
from tqdm import tqdm

tqdm.pandas()

all_sims = []
for id1, text1 in tqdm(enumerate(df1['cleaned'])):
    for id2, text2 in enumerate(df1['cleaned'].iloc[id1:]):
        if id1 == id2:
            continue
        sim = textdistance.jaro_winkler(text1, text2)
        if sim >= 0.9:
            # print("similarity value: ", sim)
            # print("text 1 >> ", text1)
            # print("text 2 >> ", text2)
            # print("====><====")
            all_sims.append(id1)
Basically I tried to iterate over all the rows in the column and compare each one against the others. If the Jaro-Winkler value turns out to be >= 0.9, the index is saved to a list.
I will then remove all these similar indices with the following code.
df1[~df1.index.isin(all_sims)]
But my code is really slow and inefficient, and I am not sure if it's the right approach. Do you have any ideas to improve this?
You could try this:
import pandas as pd
import textdistance

# Toy dataframe
df = pd.DataFrame(
    {
        "name": [
            "Mulligan Nick",
            "Hitt S C",
            "Breda Joy Mulligen",
            "David James Tsan",
            "Mulligan Nick",
            "Htti S C ",
            "Brenda Joy Mulligan",
            "Dave James Tsan",
        ],
    }
)

# Calculate similarities between rows
# and save corresponding indexes in a new column "match"
df["match"] = df["name"].map(
    lambda x: [
        i
        for i, text in enumerate(df["name"])
        if textdistance.jaro_winkler(x, text) >= 0.9
    ]
)

# Iterate to remove similar rows (keeping only the first one)
indices = []
for i, row in df.iterrows():
    indices.append(i)
    df = df.drop(
        index=[item for item in row["match"] if item not in indices], errors="ignore"
    )

# Clean up
df = df.drop(columns="match")
print(df)
# Outputs
name
0 Mulligan Nick
1 Hitt S C
2 Breda Joy Mulligen
3 David James Tsan

Order file based on numbers in name

I have a bunch of file with names as follows:
tif_files = av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
However, they are not guaranteed to be in any sort of order.
I want to sort them based on names such that files from the same year are sorted together. When I do this
sorted(tif_files, key=lambda x:x.split('_')[-1][:-4])
I get the following result:
av_v5_1983_001.tif, av_v5_1984_001.tif, av_v5_1985_001.tif...av_v5_2021_001.tif
but I want this:
av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
Take the last two fields using [2:], which gives, for example, ['1984', '001.tif']:
tif_files = 'av_v5_1983_001.tif', 'av_v5_1983_002.tif', 'av_v5_1983_003.tif',\
'av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v5_2021_001.tif', 'av_v5_2021_002.tif'
sorted(tif_files, key=lambda x: x.split('_')[2:])
# ['av_v5_1983_001.tif',
# 'av_v5_1983_002.tif',
# 'av_v5_1983_003.tif',
# 'av_v5_1984_001.tif',
# 'av_v5_1984_002.tif',
# 'av_v5_2021_001.tif',
# 'av_v5_2021_002.tif']
If you have v1 or v2 or ... v5 or ... in the names, you also need to take the version number into account, like below:
tif_files = ['av_v1_1983_001.tif', 'av_v5_1983_002.tif', 'av_v6_1983_002.tif','av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v4_2021_001.tif','av_v5_2021_001.tif', 'av_v5_2021_002.tif', 'av_v4_1984_002.tif']
sorted(tif_files, key=lambda x: [x.split('_')[2:], x.split('_')[1]])
Output:
['av_v1_1983_001.tif',
'av_v5_1983_002.tif',
'av_v6_1983_002.tif',
'av_v5_1984_001.tif',
'av_v4_1984_002.tif',
'av_v5_1984_002.tif',
'av_v4_2021_001.tif',
'av_v5_2021_001.tif',
'av_v5_2021_002.tif']
What you did was sort only by the 00x index, since x.split('_')[-1][:-4] produces 001 and so on. Because Python's sort is stable, sort by the index first, then sort again by the year:
tif_files = sorted(tif_files, key=lambda x: x.split('_')[-1][:-4])
tif_files = sorted(tif_files, key=lambda x: x.split('_')[2])
As long as your naming convention remains consistent, you should be able to just sort them alphanumerically. As such, the code below should work:
sorted(tif_files)
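For instance, with a few of the zero-padded names from the question, a plain sort already groups files by year and then by index (a minimal check):
tif_files = ['av_v5_1984_002.tif', 'av_v5_1983_002.tif', 'av_v5_1983_001.tif']
print(sorted(tif_files))
# ['av_v5_1983_001.tif', 'av_v5_1983_002.tif', 'av_v5_1984_002.tif']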
If you instead wanted to sort by the last two numbers in the file name while ignoring the prefix, you would need something a bit more dramatic that would break those numbers out and let you order by them. You could use something like the below:
import pandas as pd
tif_files_list = [[xx, int(xx.split("_")[2]), int(xx.split("_")[3])] for xx in tif_files]
tif_files_frame = pd.DataFrame(tif_files_list, columns=["Name", "Primary Index", "Secondary Index"])
tif_files_frame_ordered = tif_files_frame.sort_values(["Primary Index", "Secondary Index"], axis=0)
tif_files_ordered = tif_files_frame_ordered["Name"].tolist()
This breaks the numbers in the names out into separate columns of a Pandas Dataframe, then sorts your entries by those broken out columns, at which point you can extract the ordered name column on its own.
If key returns a tuple of 2 values, the sort function will try to sort based on the first value then the second value.
please refer to: https://stackoverflow.com/a/5292332/9532450
tif_files = [
    "hea_der_1983_002.tif",
    "hea_der_1983_001.tif",
    "hea_der_1984_002.tif",
    "hea_der_1984_001.tif",
]

def parse(filename: str) -> tuple[str, str]:
    split = filename.split("_")
    return split[2], split[3]

sort = sorted(tif_files, key=parse)
print(sort)
output
['hea_der_1983_001.tif', 'hea_der_1983_002.tif', 'hea_der_1984_001.tif', 'hea_der_1984_002.tif']
Right-click your folder and click Sort by >> Name.

Normalizing JSON in Python

I have a JSON response which is:
{"Neighborhoods":[
{"Name":"Project A",
"Balcony":false,
"Sauna":false,
"ProjectIds":["f94d25e2-3709-42bc-a4a2-bf8e073e9790","b106b4f1-32b9-4fc2-b2b3-55a7e5348c24"],
"NextViewing":null,
"Location":{"Lat":52.484295,"Lon":13.5058143},
"SalesStatus":"ForSale",
"TypeOfContract":7},
{"Name"
I then use pd.json_normalize(Response,'Neighborhoods') for normalizing.
The Location part is then flattened out as I want, as two columns "Location.Lat" and "Location.Lon". My issue is "ProjectIds" which I get in one column as
['f94d25e2-3709-42bc-a4a2-bf8e073e9790', 'b106b4f1-32b9-4fc2-b2b3-55a7e5348c24']
But I would like to have it without the brackets, quotes, and the space in the middle, so that the output would be
f94d25e2-3709-42bc-a4a2-bf8e073e9790,b106b4f1-32b9-4fc2-b2b3-55a7e5348c24
You can use .str.join() to convert the list of strings into a comma-separated string, as follows:
df['ProjectIds'] = df['ProjectIds'].str.join(',')
Demo
import pandas as pd

Response = {"Neighborhoods": [
    {"Name": "Project A",
     "Balcony": 'false',
     "Sauna": 'false',
     "ProjectIds": ["f94d25e2-3709-42bc-a4a2-bf8e073e9790", "b106b4f1-32b9-4fc2-b2b3-55a7e5348c24"],
     "NextViewing": 'null',
     "Location": {"Lat": 52.484295, "Lon": 13.5058143},
     "SalesStatus": "ForSale",
     "TypeOfContract": 7}]}

df = pd.json_normalize(Response, 'Neighborhoods')
df['ProjectIds'] = df['ProjectIds'].str.join(',')
print(df)
Name Balcony Sauna ProjectIds NextViewing SalesStatus TypeOfContract Location.Lat Location.Lon
0 Project A false false f94d25e2-3709-42bc-a4a2-bf8e073e9790,b106b4f1-32b9-4fc2-b2b3-55a7e5348c24 null ForSale 7 52.484295 13.505814
Use ",".join() on the projectIds to convert them to string from list before you pass it to json_nornalize
The way you can solve this is by using ','.join() on the ProjectIds column:
import pandas as pd

data = {"Neighborhoods": [
    {"Name": "Project A",
     "Balcony": 'false',
     "Sauna": 'false',
     "ProjectIds": ["f94d25e2-3709-42bc-a4a2-bf8e073e9790", "b106b4f1-32b9-4fc2-b2b3-55a7e5348c24"],
     "NextViewing": 'null',
     "Location": {"Lat": 52.484295, "Lon": 13.5058143},
     "SalesStatus": "ForSale",
     "TypeOfContract": 7}]}

df = pd.json_normalize(data['Neighborhoods'])
df['ProjectIds'] = df['ProjectIds'].apply(lambda x: ','.join(x))

Mining for Term that is "Included In" Entry Rather than "Equal To"

I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My code looks like this:
import pandas as pd

with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]

drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))

drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code pulls out only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL" etc.? I can't simply list all the varieties, as there is an effectively unlimited number of variations, so I need something that captures anything with the term "OLMESARTAN" in it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
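Applied to the dataframe from the question, you would replace the exact-match .loc filter with something like this (a sketch; olmesartan20Q4 is just an illustrative name):
# Keep every row whose Drug Name contains the term, not just exact matches
olmesartan20Q4 = drugNameDataFrame20Q4[
    drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN', na=False)
]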

How to replace string values in pandas dataframe to integers?

I have a Pandas DataFrame that contains several string values.
I want to replace them with integer values in order to calculate similarities.
For example:
stores[['CNPJ_Store_Code','region','total_facings']].head()
Out[24]:
CNPJ_Store_Code region total_facings
1 93209765046613 Geo RS/SC 1.471690
16 93209765046290 Geo RS/SC 1.385636
19 93209765044084 Geo PR/SPI 0.217054
21 93209765044831 Geo RS/SC 0.804633
23 93209765045218 Geo PR/SPI 0.708165
and I want to replace region == 'Geo RS/SC' ==> 1, region == 'Geo PR/SPI'==> 2 etc.
Clarification: I want to do the replacement automatically, without creating a dictionary first, since I don't know in advance what my regions will be.
Any ideas? I am trying to use DictVectorizer, with no success.
I'm sure there's a way to do it in intelligent way, but I just can't find it.
Anyone familiar with a solution?
You can use the .apply() function and a dictionary to map all known string values to their corresponding integer values:
region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI' : 2, .... }
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])
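Since the clarification asks to avoid writing the dictionary by hand, one option is to build it automatically from the unique values in the column (a sketch; the numbering simply follows the order in which regions first appear):
region_dictionary = {region: i + 1 for i, region in enumerate(stores['region'].unique())}
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])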
It looks to me like you really want pandas categoricals:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
I think you just need to change the dtype of your text column to "category" and you are done.
stores['region'] = stores["region"].astype('category')
You can do:
import pandas as pd

df = pd.read_csv(filename, index_col=0)  # Assuming it's a csv file.

def region_to_numeric(a):
    if a == 'Geo RS/SC':
        return 1
    if a == 'Geo PR/SPI':
        return 2

df['region_num'] = df['region'].apply(region_to_numeric)
