Parsing a string based on order of occurrence - Python

I have data like the SampleDf below, and I'm trying to write code that picks off the first 'Avg', 'Sum', or 'Count' it runs into in each string and puts that in a new column 'Agg'. The code I have below almost does it, but it has a hierarchy: if Count comes before Sum, it still puts Sum in the 'Agg' column. The OutputDf below shows what I'm hoping to get.
Sample Data:
SampleDf=pd.DataFrame([['tom',"Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)"],['bob',"isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0)"]],columns=['ReportField','OtherField'])
Sample Output:
OutputDf=pd.DataFrame([['tom',"Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)",'Avg'],['bob',"isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0)",'Sum']],columns=['ReportField','OtherField','Agg'])
Code:
import numpy as np
SampleDf['Agg'] = np.where(SampleDf.OtherField.str.contains("Sum"), "Sum",
                  np.where(SampleDf.OtherField.str.contains("Count"), "Count",
                  np.where(SampleDf.OtherField.str.contains("Avg"), "Avg", "Nothing")))

A quick and dirty attempt at this problem is a function that returns:
- the first occurrence of any term of interest, i.e. ['Avg','Sum','Count'], if one is present in the string
- or None, if there is none:
import re

terms = ['Avg', 'Sum', 'Count']

def extractTerms(s, t=terms):
    # strip punctuation and digits, then split into words
    s_clean = re.sub(r"[^\w]|[\d]", " ", s).split()
    s_array = [w for w in s_clean if w in t]
    try:
        return s_array[0]
    except IndexError:
        return None
Proof when terms are in the string:
SampleDf['Agg'] = SampleDf['OtherField'].apply(lambda s: extractTerms(s))
SampleDf
ReportField OtherField Agg
0 tom Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end) Avg
1 bob isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0) Sum
Proof when terms are not in the string:
SampleDf['Agg'] = SampleDf['OtherField'].apply(lambda s: extractTerms(s))
SampleDf
ReportField OtherField Agg
0 tom foo None
1 bob isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0) Sum
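As an aside, a vectorized alternative (my sketch, not from the original post) is Series.str.extract: a regex alternation matches at the leftmost position in the string, so the first aggregate encountered wins and no hierarchy arises. The \b word boundaries guard against matching inside longer identifiers, and expand=False keeps the result a Series:

```python
import pandas as pd

SampleDf = pd.DataFrame(
    [['tom', "Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)"],
     ['bob', "isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0)"]],
    columns=['ReportField', 'OtherField'])

# the leftmost occurrence of Avg, Sum, or Count is captured per row;
# rows with no match get NaN
SampleDf['Agg'] = SampleDf['OtherField'].str.extract(r'\b(Avg|Sum|Count)\b', expand=False)
```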


Python: extract values from multiple substrings

I have a dataframe named df with a column named "text"; each row is a string like the one below (this is the MARC data format):
d20s 22 i2as¶001VNINDEA455133910000005¶008180529c 1996 frmmm wz 7b ¶009se z 1 m mm c¶008a ¶008at ¶008ap ¶008a ¶0441 $a2609-2565$c2609-2565¶0410 $afre$aeng$apor ¶0569 $a2758-8965$c4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$atdi$bc¶110 $317737535$w20..b.....$astock market situation¶3330 $aimport and export agency ABC¶7146 $q1$uwwww.abc.org$ma1¶7146 $q9$uAgency XYZ¶8799 $q1$uAgency ABC$fHTML$
Here I want to extract the information contained in zone ¶7146 after $u, and in zone ¶0441 after $c.
The result table will be like this:
¶7146$u        ¶0441$c
wwww.abc.org   2609-2565
Agency XYZ     2609-2565
Here is the code I made:
import os
import pandas as pd
import numpy as np
import requests
df = pd.read_csv('dataset.csv')
def extract(text, start_pattern, sc):
    ist = text.find(start_pattern)
    if ist < 0:
        return ""
    ist = text.find(sc, ist)
    if ist < 0:
        return ""
    im = text.find("$", ist + len(sc))
    iz = text.find("¶", ist + len(sc))
    if im >= 0:
        if iz >= 0:
            ie = min(im, iz)
        else:
            ie = im
    else:
        ie = iz
    if ie < 0:
        return ""
    return text[ist + len(sc): ie]

def extract_text(row, list_in_zones):
    text = row["text"]
    if pd.isna(text):
        return [""] * len(list_in_zones)
    patterns = [("¶" + p, "$" + c) for p, c in [zone.split("$") for zone in list_in_zones]]
    return [extract(text, pattern, sc) for pattern, sc in patterns]

list_in_zones = ["7146$u", "0441$u", "200$y"]
df[list_in_zones] = df.apply(lambda row: extract_text(row, list_in_zones),
                             axis=1,
                             result_type="expand")
df.to_excel("extract.xlsx", index=False)
For zone ¶7146 after $u, my code only extracted "wwww.abc.org"; it cannot extract the duplicate with value "Agency XYZ". What's wrong here?
Additional logical structure: each zone starts with the character ¶ (e.g. ¶7146, ¶0441, ...), and the fields inside a zone start with $ (e.g. $u, $c); a field ends at either the next $ or the next ¶. I want to extract the information in these $ fields.
You could try splitting and then cleaning up the strings as follows:
import pandas as pd
text = ('d20s 22 i2as¶001VNINDEA455133910000005¶008180529c 1996 frmmm wz 7b ¶009se z 1 m mm c¶008a ¶008at ¶008ap ¶008a ¶0441 $a2609-2565$c2609-2565¶0410 $afre$aeng$apor ¶0569 $a2758-8965$c4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$atdi$bc¶110 $317737535$w20..b.....$astock market situation¶3330 $aimport and export agency ABC¶7146 $q1$uwwww.abc.org$ma1¶7146 $q9$uAgency XYZ¶8799 $q1$uAgency ABC$fHTML$')
u = text.split('$u')[1:3] # taking just the second and third elements because they match your desired output
c = text.split('$c')[1:3]
pd.DataFrame([u,c]).T
OUTPUT
0 1
0 wwww.abc.org$ma1¶7146 $q9 2609-2565¶0410 $afre$aeng$apor ¶0569 $a2758-8965
1 Agency XYZ¶8799 $q1 4578-7854¶0300 $a789$987$754 ¶051 $atxt$asti$a...
From here you can try to clean up the strings until they match the desired output.
It would be easier to give a more helpful answer if we could understand the logic behind this data structure - when do certain fields start and end?
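Since str.find only ever returns the first occurrence, a pattern-based sketch (my illustration, not from the thread, shown on a shortened sample of the string) that collects every occurrence of a zone/field pair could look like this. It assumes the field value always ends at the next $ or ¶ within the same zone, per the logical structure described above:

```python
import re

# shortened sample of the MARC-like string from the question
text = ('d20s 22 i2as¶001VNINDEA455133910000005'
        '¶0441 $a2609-2565$c2609-2565'
        '¶7146 $q1$uwwww.abc.org$ma1'
        '¶7146 $q9$uAgency XYZ'
        '¶8799 $q1$uAgency ABC$fHTML$')

def extract_all(text, zone, field):
    # one capture per occurrence of the zone; the captured value
    # runs until the next $ or ¶, and [^¶]*? keeps the search
    # for the field inside the current zone
    pattern = re.escape('¶' + zone) + r'[^¶]*?\$' + re.escape(field) + r'([^$¶]*)'
    return re.findall(pattern, text)

u_values = extract_all(text, '7146', 'u')   # every ¶7146 ... $u value
c_values = extract_all(text, '0441', 'c')   # every ¶0441 ... $c value
```

Unlike the find-based approach, re.findall keeps scanning after the first hit, so the duplicate ¶7146 zone is picked up as well.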

How do I force a blank for rows in a dataframe that have any str or character apart from numerics?

I have a dataframe
>temp
Age Rank PhoneNumber State City
10 1 99-22344-1 Ga abc
15 12 No Ma xyz
For the column PhoneNumber, I want to strip characters like '-' so only full phone numbers remain, and if the value is 'No' or any other non-numeric word, I want it to become blank. How can I do this?
My attempt handles special characters but not words like 'No':
temp['PhoneNumber'] = temp['PhoneNumber'].str.replace(r'[^\d]+', '', regex=True)
Desired Output df -
>temp
Age Rank PhoneNumber State City
10 1 99223441 Ga abc
15 12 Ma xyz
This does the job.
import pandas as pd
import re

data = [
    [10, 1, '99-223344-1', 'GA', 'Abc'],
    [15, 12, 'No', 'MA', 'Xyz']
]
df = pd.DataFrame(data, columns=['Age Rank PhoneNumber State City'.split()])
print(df)

def valphone(p):
    p = p['PhoneNumber']
    if re.match(r'[\d-]+$', p):
        return p
    else:
        return ""

print(df['PhoneNumber'])
df['PhoneNumber'] = df['PhoneNumber'].apply(valphone, axis=1)
print(df)
Output:
Age Rank PhoneNumber State City
0 10 1 99-223344-1 GA Abc
1 15 12 No MA Xyz
Age Rank PhoneNumber State City
0 10 1 99-223344-1 GA Abc
1 15 12 MA Xyz
I do have to admit to a bit of frustration with this. I EXPECTED to be able to do
df['PhoneNumber'] = df['PhoneNumber'].apply(valphone)
because df['PhoneNumber'] should return a Series, and Series.apply should pass one value at a time. However, that's not what happens here: df['PhoneNumber'] returns a DataFrame instead of a Series, so I have to use the column reference inside the function. The likely culprit is the extra pair of brackets in columns=['Age Rank PhoneNumber State City'.split()], which turns the column labels into a MultiIndex.
Thus, YOU may need to do some experimentation. If df['PhoneNumber'] returns a Series for you, then you don't need the axis=1, and you don't need the p = p['PhoneNumber'] line in the function.
Followup
OK, assuming the presence of a "phone number validation" module, as mentioned in the comments, this becomes:
import phonenumbers
...
def valphone(p):
    p = p['PhoneNumber']  # may not be required
    n = phonenumbers.parse(p)
    if phonenumbers.is_possible_number(n):
        return p
    else:
        return ''
...
temp['PhoneNumber'] = temp['PhoneNumber'].apply(str).str.findall(r'\d').str.join('')
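A vectorized variant of the same idea, as a sketch (column name assumed from the question): replacing every non-digit character both strips the dashes and empties words like 'No' in one pass:

```python
import pandas as pd

temp = pd.DataFrame({'PhoneNumber': ['99-22344-1', 'No']})

# \D matches any non-digit character, so 'No' collapses to an empty string
temp['PhoneNumber'] = temp['PhoneNumber'].str.replace(r'\D', '', regex=True)
```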

Pandas finding a text in row and assign a dummy variable value based on this

I have a data frame containing a text column, df["input"].
I would like to create new variables that check whether df["input"] contains any of the words in a given list, assigning a value of 1 only for the first list that matches (the logic is: 1) create a dummy variable equal to zero; 2) set it to one if the string contains any word in the given list and no earlier list has already matched).
# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
'medical',
'film biotechnology dentist']})
which looks like:
input
amazon listing subtitle
medical
film biotechnology dentist
final dataset should look like:
input listings scripting medical
amazon listing subtitle 1 0 0
medical 0 0 1
film biotechnology dentist 0 1 0
One possible implementation is to use str.contains in a loop to create the 3 columns, then use idxmax to get the column name (or the list name) of the first match, then create a dummy variable from these matches:
import numpy as np

d = {'listings': listings, 'scripting': scripting, 'medical': medical}
for k, v in d.items():
    df[k] = df['input'].str.contains('|'.join(v))

arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)
But in this case, it might be more efficient to use a nested for-loop:
import re

def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match;
        # if there's a match, break the loop and pad 0s since we don't care
        # if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out
lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))
Output:
input listings medical scripting
0 amazon listing subtitle 1 0 0
1 medical 0 1 0
2 film biotechnology dentist 0 0 1
Here is a simpler, more pandas-vector-style solution:
patterns = {}  # <-- dictionary
patterns["listings"] = ["amazon listing", "ecommerce", "products"]
patterns["scripting"] = ["subtitle", "film", "dubbing"]
patterns["medical"] = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})
#---------------------------------------------------------------#
# step 1: for each column create a reg-expression
for col, items in patterns.items():
    # create a regex pattern (word1|word2|word3)
    pattern = f"({'|'.join(items)})"
    # find the pattern in the input column
    df[col] = df['input'].str.contains(pattern, regex=True).astype(int)

# step 2: if the value to the left is 1, change the current value to 0
## 2.1 create a mask
## shift the rows to the right
## --> True if the left column contains the same value as the current column
mask = (df == df.shift(axis=1)).values
# subtract the mask from the df
## and clip the result --> negative values become 0
df.iloc[:, 1:] = np.clip(df.iloc[:, 1:] - mask[:, 1:], 0, 1)
print(df)
Result
input listings scripting medical
0 amazon listing subtitle 1 0 0
1 medical 0 0 1
2 film biotechnology dentist 0 1 0
Great question and good answers (I somehow missed it yesterday)! Here's another variation with .str.extractall():
search = {"listings": listings, "scripting": scripting, "medical": medical, "dummy": []}
pattern = "|".join(
f"(?P<{column}>" + "|".join(r"\b" + s + r"\b" for s in strings) + ")"
for column, strings in search.items()
)
result = (
df["input"].str.extractall(pattern).assign(dummy=True).groupby(level=0).any()
.idxmax(axis=1).str.get_dummies().drop(columns="dummy")
)
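For comparison, the idxmax idea from the first answer can be condensed into a short sketch (mine, not from the thread). It assumes every row matches at least one list; a row with no match at all would be wrongly assigned to the first column:

```python
import pandas as pd

listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
d = {'listings': listings, 'scripting': scripting, 'medical': medical}

df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})

# one boolean column per list; idxmax(axis=1) returns the first True column per row,
# which implements the list-order priority
matched = pd.DataFrame({k: df['input'].str.contains('|'.join(v)) for k, v in d.items()})
dummies = pd.get_dummies(matched.idxmax(axis=1)).reindex(columns=list(d), fill_value=0)
out = df.join(dummies.astype(int))
```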

How to extract specific codes from string in separate columns?

I have data in the following format.
Data
Data Sample Excel
I want to extract the codes from the column "DIAGNOSIS" and paste each code in a separate column after the "DIAGNOSIS" column. I know the regular expression to match them:
[A-TV-Z][0-9][0-9AB].?[0-9A-TV-Z]{0,4}
source: https://www.johndcook.com/blog/2019/05/05/regex_icd_codes/
These are called ICD10 codes, represented like Z01.2, E11, etc. The above expression is meant to match all ICD10 codes.
But I am not sure how to use this expression in python code to do the above task.
The problems I am trying to solve are:
1) Count the total number of codes assigned for all patients.
2) Count the total number of UNIQUE codes assigned (since multiple patients might have the same code assigned).
3) Generate data code-wise, i.e. if I select code Z01.2, I want to extract the data (maybe PATID, MOBILE NUMBER, ANY OTHER COLUMN, or ALL) of patients who have been assigned this code.
Thanks in advance.
Using Python Pandas as follows.
Code
import pandas as pd
import re

df = pd.read_csv("data.csv", delimiter='\t')
pattern = r'([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['Length'] = df['CODES'].str.len()
print(f"Total Codes: {df['Length'].sum()}")
all_codes = df['CODES'].sum()
unique_codes = set(all_codes)
print(f'all codes {all_codes}\nCount: {len(all_codes)}')
print(f'unique codes {unique_codes}\nCount: {len(unique_codes)}')
# Select patients with code Z01.2
patients = df[df['CODES'].apply(', '.join).str.contains('Z01.2', regex=False)]
# Show selected columns
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Explanation
Imported data as tab-delimited CSV
import pandas as pd
import re
df = pd.read_csv("data.csv", delimiter='\t')
Resulting DataFrame df
PATID PATIENT_NAME MOBILE_NUMBER EMAIL_ADDRESS GENDER PATIENT_AGE \
0 11 Mac 98765 ab1#gmail.com F 51 Y
1 22 Sac 98766 ab1#gmail.com F 24 Y
2 33 Tac 98767 ab1#gmail.com M 43 Y
3 44 Lac 98768 ab1#gmail.com M 54 Y
DISTRICT CLINIC DIAGNOSIS
0 Mars Clinic1 Z01.2 - Dental examinationC50 - Malignant neop...
1 Moon Clinic2 S83.6 - Sprain and strain of other and unspeci...
2 Earth Clinic3 K60.1 - Chronic anal fissureZ20.9 - Contact wi...
3 Saturn Clinic4 E11 - Type 2 diabetes mellitusE78.5 - Hyperlip...
Extract from DIAGNOSIS column using the specified pattern
Add an escape character before the . (and use a raw string), otherwise it would be a wildcard and match any character (no difference on the data supplied).
pattern = r'([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)
df['CODES'] each row in the column is a list of codes
0 [Z01.2, C50 , Z10.0]
1 [S83.6, L05.0, Z20.9]
2 [K60.1, Z20.9, J06.9, C50 ]
3 [E11 , E78.5, I10 , E55 , E79.0, Z24.0, Z01.2]
Name: CODES, dtype: object
Add length column to df DataFrame
df['Length'] = df['CODES'].str.len()
df['Length'] corresponds to the length of each code list
0 3
1 3
2 4
3 7
Name: Length, dtype: int64
Total Codes Used--sum over the length of codes
df['Length'].sum()
Total Codes: 17
All Codes Used--concatenating all the code lists
all_codes = df['CODES'].sum()
['Z01.2', 'C50 ', 'Z10.0', 'S83.6', 'L05.0', 'Z20.9', 'K60.1', 'Z20.9', 'J06.9', 'C50
', 'E11 ', 'E78.5', 'I10 ', 'E55 ', 'E79.0', 'Z24.0', 'Z01.2']
Count: 17
Unique Codes Used--take the set() of the list of all codes
unique_codes = set(all_codes)
{'L05.0', 'S83.6', 'E79.0', 'Z01.2', 'I10 ', 'J06.9', 'K60.1', 'E11 ', 'Z24.0', 'Z
10.0', 'E55 ', 'E78.5', 'Z20.9', 'C50 '}
Count: 14
Select patients by code (i.e. Z01.2)
patients = df[df['CODES'].apply(', '.join).str.contains('Z01.2', regex=False)]
Show PATID, PATIENT_NAME and MOBILE_NUMBER for these patients
print(patients.loc[:, ['PATID', 'PATIENT_NAME', 'MOBILE_NUMBER']])
Result
PATID PATIENT_NAME MOBILE_NUMBER
0 11 Mac 98765
3 44 Lac 98768
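An alternative sketch for task 3 (mine, on a small hypothetical sample with made-up DIAGNOSIS strings): exploding the code lists gives one row per (patient, code) pair, so patients can be selected by exact code match instead of substring containment:

```python
import re
import pandas as pd

# hypothetical sample in the same shape as the question's data
df = pd.DataFrame({'PATID': [11, 44],
                   'DIAGNOSIS': ['Z01.2 - Dental examinationC50 - Malignant',
                                 'E11 - Type 2 diabetesZ01.2 - Dental']})

pattern = r'([A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4})'
df['CODES'] = df['DIAGNOSIS'].str.findall(pattern)

# one row per (patient, code) pair; stray whitespace stripped
long = df[['PATID']].join(df['CODES']).explode('CODES')
long['CODES'] = long['CODES'].str.strip()

# all patients carrying a given code, with an exact match instead of contains
z012_patients = long.loc[long['CODES'] == 'Z01.2', 'PATID'].tolist()
```

The exact match avoids the subtlety that '.' in str.contains('Z01.2') is a regex wildcard.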

Pandas to modify values in csv file based on function

I have a CSV file that looks like below; this is the same as my last question, but this time using Pandas.
Group Sam Dan Bori Son John Mave
A 0.00258844 0.983322 1.61479 1.2785 1.96963 10.6945
B 0.0026034 0.983305 1.61198 1.26239 1.9742 10.6838
C 0.0026174 0.983294 1.60913 1.24543 1.97877 10.6729
D 0.00263062 0.983289 1.60624 1.22758 1.98334 10.6618
E 0.00264304 0.98329 1.60332 1.20885 1.98791 10.6505
I have a function like below
def getnewno(value):
    value = value + 30
    if value > 40:
        value = value - 20
    else:
        value = value
    return value
I want to send all these values to the getnewno function, get new values back, and update the CSV file. How can this be accomplished with Pandas?
Expected output:
Group Sam Dan Bori Son John Mave
A 30.00258844 30.983322 31.61479 31.2785 31.96963 20.6945
B 30.0026034 30.983305 31.61198 31.26239 31.9742 20.6838
C 30.0026174 30.983294 31.60913 31.24543 31.97877 20.6729
D 30.00263062 30.983289 31.60624 31.22758 31.98334 20.6618
E 30.00264304 30.98329 31.60332 31.20885 31.98791 20.6505
The following should give you what you desire.
Applying a function
Your function can be simplified and here expressed as a lambda function.
It's then a matter of applying your function to all of the columns. There are a number of ways to do so. The first idea that comes to mind is to loop over df.columns. However, we can do better by using the applymap method (renamed to DataFrame.map in pandas 2.1) or transform:
import pandas as pd

# Read in the data from file
df = pd.read_csv('data.csv',
                 sep=r'\s+',
                 index_col=0)

# Simplified function with which to transform data
getnewno = lambda value: value + 10 if value > 10 else value + 30

# Looping over columns
#for col in df.columns:
#    df[col] = df[col].apply(getnewno)

# Apply to all columns without loop
df = df.applymap(getnewno)

# Write out updated data
df.to_csv('data_updated.csv')
Using broadcasting
You can achieve your result using broadcasting and a little boolean logic. This avoids looping over any columns, and should ultimately prove faster and less memory intensive (although if your dataset is small any speed-up would be negligible):
import pandas as pd

df = pd.read_csv('data.csv',
                 sep=r'\s+',
                 index_col=0)

df += 30
make_smaller = df > 40
df[make_smaller] -= 20
First of all, your getnewno function looks more complicated than it needs to be; it can be simplified to e.g.:
def getnewno(value):
    if value + 30 > 40:
        return value + 10
    else:
        return value + 30
you can even change value + 30 > 40 to value > 10.
Or even a one-liner if you want:
getnewno = lambda value: value + 10 if value > 10 else value + 30
Having the function, you can apply it to specific values/columns. For example, if you want to create a column Mark_updated based on a Mark column (hypothetical column names), it would look like this (I assume your pandas DataFrame is called df):
df['Mark_updated'] = df['Mark'].apply(getnewno)
Use the mask function for an if-else solution, before writing the data to csv:
res = (df
       .select_dtypes('number')
       .add(30)
       # the if-else comes in here:
       # if an entry in the dataframe is greater than 40, subtract 20 from it,
       # else leave it as is
       .mask(lambda x: x > 40, lambda x: x.sub(20))
       )

# insert the Group column back
res.insert(0, 'Group', df.Group.array)
write to csv
res.to_csv(filename)
Group Sam Dan Bori Son John Mave
0 A 30.002588 30.983322 31.61479 31.27850 31.96963 20.6945
1 B 30.002603 30.983305 31.61198 31.26239 31.97420 20.6838
2 C 30.002617 30.983294 31.60913 31.24543 31.97877 20.6729
3 D 30.002631 30.983289 31.60624 31.22758 31.98334 20.6618
4 E 30.002643 30.983290 31.60332 31.20885 31.98791 20.6505
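As a quick sanity check (my sketch, using only a few values from row A of the sample), the add-then-mask broadcasting approach reproduces the expected output above:

```python
import pandas as pd

df = pd.DataFrame({'Sam': [0.00258844], 'Dan': [0.983322], 'Mave': [10.6945]},
                  index=['A'])

df += 30           # every value gains 30
df[df > 40] -= 20  # values that overshoot 40 lose 20 again

print(df)
```

Sam and Dan stay at value + 30, while Mave (10.6945 + 30 = 40.6945 > 40) ends up at 20.6945, matching the expected output table.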
