I have a dataframe of this style:
id patient_full_name
7805 TOMAS FRANCONI
7810 Camila Gualtieri
7821 Lola Borrego
7823 XIMENA ALVAREZ LANUS
7824 MONICA VIVIANA RODRIGUEZ DE MARENGO
I need to save the first name from the values in the second column. I want to trim each value down to the first space, and I don't know how.
I would like it to stay in a structure like this:
patients_names = ["TOMAS", "CAMILA", "LOLA", "XIMENA", "MONICA", ..., "N-NAME"]
All of this done with Pandas in Python.
You can use the split function in a list comprehension to do this:
import pandas as pd

df = pd.DataFrame([
    {"id": 7805, "patient_full_name": "TOMAS FRANCONI"},
    {"id": 7810, "patient_full_name": "Camila Gualtieri"},
    {"id": 7821, "patient_full_name": "Lola Borrego"}
])
df["first_name"] = [n.split(" ")[0] for n in df["patient_full_name"]]
That adds a column (first_name) with the output you wanted, which you can then pull off as a list or series if you want:
first_name_as_series = df["first_name"]
first_name_as_list = list(df["first_name"])
In your question, you show the desired output in all upper case. That's easy to get with a simple tweak to the list comprehension:
df["first_name"] = [n.split(" ")[0].upper() for n in df["patient_full_name"]]
You can do it by using str.extract as well, which does not rely on a Python-level loop. Note that the question's column is patient_full_name, and the pattern should stop at the first space; a greedy (.*) followed by a space would capture everything up to the last space for multi-word names:
(df
    .assign(first_name=lambda x: x.patient_full_name.str.extract(r"^(\S+)", expand=False))
)
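If you only want the list from the question rather than a new column, pandas' vectorized string methods can produce it directly; a minimal sketch:
# split() with no argument splits on any whitespace; .str[0] takes the first
# token and .str.upper() matches the all-caps output shown in the question
patients_names = df["patient_full_name"].str.split().str[0].str.upper().tolist()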
I have thousands of rows of data that may or may not be similar to each other. Pandas' built-in drop_duplicates() doesn't really help, since it only detects exact duplicates. For example, what if my data contains something like these:
Hey, good morning!
Hey, good morning.
Pandas wouldn't detect them as duplicates. There are so many variations of this that simply cleaning the text wouldn't suffice, so I opted for text similarity.
I have tried the following code:
import textdistance
from tqdm import tqdm

tqdm.pandas()

all_sims = []
for id1, text1 in tqdm(enumerate(df1['cleaned'])):
    for id2, text2 in enumerate(df1['cleaned'].iloc[id1:]):
        if id1 == id2:
            continue
        sim = textdistance.jaro_winkler(text1, text2)
        if sim >= 0.9:
            # print("similarity value: ", sim)
            # print("text 1 >> ", text1)
            # print("text 2 >> ", text2)
            # print("====><====")
            all_sims.append(id1)
Basically, I iterate over all the rows in the column and check each one against the others. If the detected Jaro-Winkler value turns out to be >= 0.9, the index is saved to a list.
I then remove all these similar indices with the following code:
df1[~df1.index.isin(all_sims)]
But my code is really slow and inefficient, and I am not sure if it's the right approach. Do you have any ideas for improving it?
You could try this:
import pandas as pd
import textdistance

# Toy dataframe
df = pd.DataFrame(
    {
        "name": [
            "Mulligan Nick",
            "Hitt S C",
            "Breda Joy Mulligen",
            "David James Tsan",
            "Mulligan Nick",
            "Htti S C ",
            "Brenda Joy Mulligan",
            "Dave James Tsan",
        ],
    }
)

# Calculate similarities between rows
# and save corresponding indexes in a new column "match"
df["match"] = df["name"].map(
    lambda x: [
        i
        for i, text in enumerate(df["name"])
        if textdistance.jaro_winkler(x, text) >= 0.9
    ]
)

# Iterate to remove similar rows (keeping only the first one)
indices = []
for i, row in df.iterrows():
    indices.append(i)
    df = df.drop(
        index=[item for item in row["match"] if item not in indices], errors="ignore"
    )

# Clean up
df = df.drop(columns="match")

print(df)
# Outputs
name
0 Mulligan Nick
1 Hitt S C
2 Breda Joy Mulligen
3 David James Tsan
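The map above still runs over the full column once per row, so every pair is compared twice. A sketch of an alternative, applied to the original toy dataframe (before any rows were dropped), that uses itertools.combinations so each unordered pair is compared only once:
import itertools

names = df["name"].tolist()
to_drop = set()
for i, j in itertools.combinations(range(len(names)), 2):
    if i in to_drop:
        continue  # row i was itself dropped; don't use it as a keeper
    if textdistance.jaro_winkler(names[i], names[j]) >= 0.9:
        to_drop.add(j)  # keep the first occurrence, drop the later one
deduped = df.drop(index=list(to_drop))  # positions equal labels on the default RangeIndex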
I have a bunch of files with names as follows:
tif_files = av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
However, they are not guaranteed to be in any sort of order.
I want to sort them based on names such that files from the same year are sorted together. When I do this
sorted(tif_files, key=lambda x:x.split('_')[-1][:-4])
I get the following result:
av_v5_1983_001.tif, av_v5_1984_001.tif, av_v5_1985_001.tif...av_v5_2021_001.tif
but I want this:
av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
Take the last two components using [2:], which gives, for example, ['1984', '001.tif']:
tif_files = 'av_v5_1983_001.tif', 'av_v5_1983_002.tif', 'av_v5_1983_003.tif',\
'av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v5_2021_001.tif', 'av_v5_2021_002.tif'
sorted(tif_files, key=lambda x: x.split('_')[2:])
# ['av_v5_1983_001.tif',
# 'av_v5_1983_002.tif',
# 'av_v5_1983_003.tif',
# 'av_v5_1984_001.tif',
# 'av_v5_1984_002.tif',
# 'av_v5_2021_001.tif',
# 'av_v5_2021_002.tif']
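This works because the year and the index are zero-padded, so string order matches numeric order. If the padding were ever inconsistent, comparing as integers would be safer; a sketch:
sorted(tif_files, key=lambda x: (int(x.split('_')[2]), int(x.split('_')[3][:-4])))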
If you have v1 or v2 or ... v5 or ..., you also need to take the version number into account, like below:
tif_files = ['av_v1_1983_001.tif', 'av_v5_1983_002.tif', 'av_v6_1983_002.tif','av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v4_2021_001.tif','av_v5_2021_001.tif', 'av_v5_2021_002.tif', 'av_v4_1984_002.tif']
sorted(tif_files, key=lambda x: [x.split('_')[2:], x.split('_')[1]])
Output:
['av_v1_1983_001.tif',
'av_v5_1983_002.tif',
'av_v6_1983_002.tif',
'av_v5_1984_001.tif',
'av_v4_1984_002.tif',
'av_v5_1984_002.tif',
'av_v4_2021_001.tif',
'av_v5_2021_001.tif',
'av_v5_2021_002.tif']
What you did was sort by the 00x index first and only then by the year, since x.split('_')[-1][:-4] produces '001' etc. Sort by the index first, then re-sort by the year; Python's sort is stable, so files keep their index order within each year:
tif_files = sorted(tif_files, key=lambda x: x.split('_')[-1][:-4])
tif_files = sorted(tif_files, key=lambda x: x.split('_')[2])
As long as your naming convention remains consistent, you should be able to just sort them alphanumerically. As such, the below code should work:
sorted(tif_files)
If you instead wanted to sort by the last two numbers in the file name while ignoring the prefix, you would need something a bit more involved that breaks those numbers out and lets you order by them. You could use something like the below:
import pandas as pd
tif_files_list = [[xx, int(xx.split("_")[2]), int(xx.split("_")[3])] for xx in tif_files]
tif_files_frame = pd.DataFrame(tif_files_list, columns=["Name", "Primary Index", "Secondary Index"])
tif_files_frame_ordered = tif_files_frame.sort_values(["Primary Index", "Secondary Index"], axis=0)
tif_files_ordered = tif_files_frame_ordered["Name"].tolist()
This breaks the numbers in the names out into separate columns of a Pandas Dataframe, then sorts your entries by those broken out columns, at which point you can extract the ordered name column on its own.
If key returns a tuple of two values, the sort compares by the first value and then by the second.
Please refer to: https://stackoverflow.com/a/5292332/9532450
tif_files = [
    "hea_der_1983_002.tif",
    "hea_der_1983_001.tif",
    "hea_der_1984_002.tif",
    "hea_der_1984_001.tif",
]
def parse(filename: str) -> tuple[str, str]:
    split = filename.split("_")
    return split[2], split[3]
sort = sorted(tif_files, key=parse)
print(sort)
Output:
['hea_der_1983_001.tif', 'hea_der_1983_002.tif', 'hea_der_1984_001.tif', 'hea_der_1984_002.tif']
Right-click your folder and click Sort by >> Name.
I have a JSON response which is:
{"Neighborhoods":[
{"Name":"Project A",
"Balcony":false,
"Sauna":false,
"ProjectIds":["f94d25e2-3709-42bc-a4a2-bf8e073e9790","b106b4f1-32b9-4fc2-b2b3-55a7e5348c24"],
"NextViewing":null,
"Location":{"Lat":52.484295,"Lon":13.5058143},
"SalesStatus":"ForSale",
"TypeOfContract":7},
{"Name"
I then use pd.json_normalize(Response,'Neighborhoods') for normalizing.
The Location part is then flattened out as I want, into two columns "Location.Lat" and "Location.Lon". My issue is "ProjectIds", which I get in one column as:
['f94d25e2-3709-42bc-a4a2-bf8e073e9790', 'b106b4f1-32b9-4fc2-b2b3-55a7e5348c24']
But I would like to have it without the quotes, brackets, and the space in the middle, so that the output would be:
f94d25e2-3709-42bc-a4a2-bf8e073e9790,b106b4f1-32b9-4fc2-b2b3-55a7e5348c24
You can use .str.join() to convert the list of strings into a comma-separated string, as follows:
df['ProjectIds'] = df['ProjectIds'].str.join(',')
Demo
import pandas as pd

Response = {"Neighborhoods": [
    {"Name": "Project A",
     "Balcony": 'false',
     "Sauna": 'false',
     "ProjectIds": ["f94d25e2-3709-42bc-a4a2-bf8e073e9790", "b106b4f1-32b9-4fc2-b2b3-55a7e5348c24"],
     "NextViewing": 'null',
     "Location": {"Lat": 52.484295, "Lon": 13.5058143},
     "SalesStatus": "ForSale",
     "TypeOfContract": 7}]}

df = pd.json_normalize(Response, 'Neighborhoods')
df['ProjectIds'] = df['ProjectIds'].str.join(',')

print(df)
Name Balcony Sauna ProjectIds NextViewing SalesStatus TypeOfContract Location.Lat Location.Lon
0 Project A false false f94d25e2-3709-42bc-a4a2-bf8e073e9790,b106b4f1-32b9-4fc2-b2b3-55a7e5348c24 null ForSale 7 52.484295 13.505814
Use ",".join() on the projectIds to convert them to string from list before you pass it to json_nornalize
The way you can solve this is by using ','.join() on the ProjectIds column:
data = {"Neighborhoods": [
    {"Name": "Project A",
     "Balcony": 'false',
     "Sauna": 'false',
     "ProjectIds": ["f94d25e2-3709-42bc-a4a2-bf8e073e9790", "b106b4f1-32b9-4fc2-b2b3-55a7e5348c24"],
     "NextViewing": 'null',
     "Location": {"Lat": 52.484295, "Lon": 13.5058143},
     "SalesStatus": "ForSale",
     "TypeOfContract": 7}]}

df = pd.json_normalize(data['Neighborhoods'])
df['ProjectIds'] = df['ProjectIds'].apply(lambda x: ','.join(x))
I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My code looks like this:
with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]

drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))

drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code pulls out only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL", etc.? I can't simply list all the varieties, as there's an effectively unlimited number of variations, so I need something that captures anything with the term "OLMESARTAN" within it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using some string I found in the documentation):
import pandas as pd
df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
test
4 Index
22 Index.
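Applied to the question's code, the exact-match .loc filter could be replaced along these lines (a sketch reusing the question's variable names; olmesartanDataFrame20Q4 is a new, hypothetical name, and na=False guards against missing values):
olmesartanDataFrame20Q4 = drugNameDataFrame20Q4[
    drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN', na=False)
]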
I have a Pandas DataFrame that contains several string values.
I want to replace them with integer values in order to calculate similarities.
For example:
stores[['CNPJ_Store_Code','region','total_facings']].head()
Out[24]:
CNPJ_Store_Code region total_facings
1 93209765046613 Geo RS/SC 1.471690
16 93209765046290 Geo RS/SC 1.385636
19 93209765044084 Geo PR/SPI 0.217054
21 93209765044831 Geo RS/SC 0.804633
23 93209765045218 Geo PR/SPI 0.708165
and I want to replace region == 'Geo RS/SC' ==> 1, region == 'Geo PR/SPI' ==> 2, etc.
Clarification: I want to do the replacement automatically, without creating a dictionary first, since I don't know in advance what my regions will be.
Any ideas? I am trying to use DictVectorizer, with no success.
I'm sure there's a way to do it in an intelligent way, but I just can't find it.
Anyone familiar with a solution?
You can use the .apply() function and a dictionary to map all known string values to their corresponding integer values:
region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI' : 2, .... }
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])
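Equivalently, pandas' .map does the same dictionary lookup directly:
stores['region'] = stores['region'].map(region_dictionary)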
It looks to me like you really want pandas categories:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
I think you just need to change the dtype of your text column to "category" and you are done.
stores['region'] = stores["region"].astype('category')
You can do:
df = pd.read_csv(filename, index_col=0)  # Assuming it's a csv file.

def region_to_numeric(a):
    if a == 'Geo RS/SC':
        return 1
    if a == 'Geo PR/SPI':
        return 2

df['region_num'] = df['region'].apply(region_to_numeric)
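Since the clarification asks for an automatic mapping with no hand-built dictionary or if-chain, pd.factorize may be the closest fit; a sketch:
codes, uniques = pd.factorize(stores['region'])
stores['region_num'] = codes + 1  # factorize numbers each distinct region 0, 1, ...; +1 starts the numbering at 1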