How to remove unique character based on the same index via regex - python

While working through one of the regex questions on SO, where regex is used to extract values, I started wondering: how can a regex remove the characters that are identical in every row at the same index position?
Below is the DataFrame:
print(df)
   column1
0  [b,e,c]
1  [e,a,c]
2  [a,b,c]
Regex:
df.column1.str.extract(r'(\w,\w)')
print(df)
  column1
0     b,e
1     e,a
2     a,b
The regex above extracts the characters I need, but I also want to preserve the surrounding brackets [].

You can use
df['column2'] = df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
df['column2'] = '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
In the .str.replace approach, (?s).*?\[(\w,\w).* matches any chars as few as possible (the (?s) flag lets . match newlines, too), then a [, then captures a word char + comma + word char into Group 1 (\1), and then the rest of the string; the whole match is replaced with [ + the Group 1 value + ].
In the second approach, [ and ] are simply concatenated to the result of the extraction; this solution is the better fit for toy examples like yours here.
Here is a Pandas test:
>>> import pandas as pd
>>> df = pd.DataFrame({'column1':['[b,e,c]']})
>>> df['column1'].str.replace(r'(?s).*?\[(\w,\w).*', r'[\1]', regex=True)
0 [b,e]
Name: column1, dtype: object
>>> '[' + df['column1'].str.extract(r'(\w,\w)') + ']'
0
0 [b,e]
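For completeness, here is a quick sketch of the second approach run against all three rows from the question:

```python
import pandas as pd

# The three rows from the question
df = pd.DataFrame({'column1': ['[b,e,c]', '[e,a,c]', '[a,b,c]']})

# Extract the first two comma-separated characters and re-wrap them in brackets;
# str.extract returns a DataFrame, so select column 0 before concatenating
df['column2'] = '[' + df['column1'].str.extract(r'(\w,\w)')[0] + ']'
print(df['column2'].tolist())  # ['[b,e]', '[e,a]', '[a,b]']
```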

Related

Extracting a string from between two strings in a dataframe

I'm trying to extract a value from my data frame.
I have a column ['Desc'] that contains sentences in the following format:
_000it_ZZZ$$$-
_0780it_ZBZT$$$-
_011it_BB$$$-
_000it_CCCC$$$-
I want to extract the string between 'it_' and '$$$'
I have tried this code, but it does not seem to work:
# initializing substrings
sub1 = "it_"
sub2 = "$$$"
# getting index of substrings
idx1 = df['DESC'].find(sub1)
idx2 = df['DESC'].find(sub2)
# length of substring 1 is added to
# get string from next character
df['results'] = df['DESC'][idx1 + len(sub1) + 1: idx2]
I would appreciate your help
You can use str.extract to get the desired output in your new column.
import pandas as pd
import re
df = pd.DataFrame({
'DESC' : ["_000it_ZZZ$$$-", "_0780it_ZBZT$$$-", "_011it_BB$$$-", "_000it_CCCC$$$-", "_000it_123$$$-"]
})
pat = r"(?<=it_)(.+)(?=[\$]{3}-)"
df['results'] = df['DESC'].str.extract(pat)
print(df)
               DESC results
0    _000it_ZZZ$$$-     ZZZ
1  _0780it_ZBZT$$$-    ZBZT
2     _011it_BB$$$-      BB
3   _000it_CCCC$$$-    CCCC
4    _000it_123$$$-     123
You can see the regex pattern on Regex101 for more details.
You could try using a regex pattern. It matches the cases you listed here, but I can't guarantee that it will generalize to all possible patterns.
import re
string = "_000it_ZZZ$$$-"
p = re.compile(r"(?<=it_)(.*)(?<!\W)")
m = p.findall(string)
print(m) # ['ZZZ']
The pattern looks for it_ in the string, grabs everything after it, and then backtracks until the match ends on a word character, so the trailing $$$- is dropped.

Python Trimming a few column names but not all in a dataframe

I have a dataframe of many columns. Now I am trimming a few column names to reduce the text length.
Code:
xdf = pd.DataFrame({'Column1':[10,25],'Column2':[10,25],'Fix_col':[10,25]})
## Rename `Column1` to `C1` (and `Column2` to `C2`)
req_cols = ['Column1','Column2']
xdf[req_cols].columns = [x[0]+y for name in xdf[req_cols].columns.str.findall(r'([A-Za-z]+)(\d+)') for x,y in name]
Present solution:
print([x[0]+y for name in xdf[req_cols].columns.str.findall(r'([A-Za-z]+)(\d+)') for x,y in name])
['C1','C2']
print(xdf[req_cols].columns)
['Column1','Column2']
The column names did not change, and I don't know why.
Expected Answer:
xdf.columns = ['C1','C2','Fix_col']
Assigning to xdf[req_cols].columns has no effect because xdf[req_cols] returns a copy, so you only rename the columns of that temporary object. Rename the original frame instead. You can use
import pandas as pd
import re
xdf = pd.DataFrame({'Column1':[10,25],'Column2':[10,25],'Fix_col':[10,25]})
req_cols = ['Column1','Column2']
xdf.rename(columns=lambda x : x if x not in req_cols else re.sub(r'^(\D?)\D*(\d*)', r'\1\2', x), inplace=True)
Output of xdf.columns:
Index(['C1', 'C2', 'Fix_col'], dtype='object')
See the regex demo. Details:
^ - start of string
(\D?) - Group 1 (\1): an optional non-digit char
\D* - zero or more non-digit chars
(\d*) - Group 2 (\2): zero or more digits.
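The substitution can be checked on single strings with the re module (a quick sketch of the pattern above):

```python
import re

pat = r'^(\D?)\D*(\d*)'
# Group 1 keeps the first non-digit char, Group 2 keeps the trailing digits
print(re.sub(pat, r'\1\2', 'Column1'))  # C1
print(re.sub(pat, r'\1\2', 'Column2'))  # C2
# The req_cols check in the lambda matters: on its own, the pattern
# would also shorten 'Fix_col' to 'F'
print(re.sub(pat, r'\1\2', 'Fix_col'))  # F
```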

Transforming letters into 0 in pandas

I am a complete beginner with Python Pandas.
I have a data set with wrongly typed postal codes: the last characters are random letters.
How can I transform these letters into 0?
I tried this, but obviously the whole postal code turns into a 0:
if data["CODE_POSTAL_PATIENT"].str.isalpha:
    df1 = data["CODE_POSTAL_PATIENT"].transform(lambda x: 0)
Thanks in advance !
Assuming you have zip codes like '12XY#' that should become '12000', use a regex to match the non-digits and replace them with "0" using str.replace:
df['CODE_POSTAL_CORRECTED'] = df['CODE_POSTAL'].str.replace(r'\D', '0', regex=True)
output:
  CODE_POSTAL CODE_POSTAL_CORRECTED
0       12345                 12345
1       12XY#                 12000
regex:
\D # match a non digit
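A quick check of the \D substitution on plain strings, outside pandas (same pattern as above):

```python
import re

# Every non-digit becomes '0'; digits pass through unchanged
print(re.sub(r'\D', '0', '12XY#'))  # 12000
print(re.sub(r'\D', '0', '12345'))  # 12345
```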
Use:
df = pd.DataFrame({'CODE_POSTAL_PATIENT': ['abcdr', 'efghr']})
df['new'] = df['CODE_POSTAL_PATIENT'].str[:-1] + '0'
Output:
  CODE_POSTAL_PATIENT    new
0               abcdr  abcd0
1               efghr  efgh0
Replace everything except digits:
df['CODE_POSTAL'].str.replace(r'[^\d]', '0', regex=True)

Regex replace first two letters within column in python

I have a dataframe such as
COL1
A_element_1_+_none
C_BLOCA_element
D_element_3
element_'
BasaA_bloc
B_basA_bloc
BbasA_bloc
and I would like to remove the first 2 characters of each row in COL1, but only if they are in this list:
the_list =['A_','B_','C_','D_']
Then I should get the following output:
COL1
element_1_+_none
BLOCA_element
element_3
element_'
BasaA_bloc
basA_bloc
BbasA_bloc
So far I tried the following :
df['COL1']=df['COL1'].str.replace("A_","")
df['COL1']=df['COL1'].str.replace("B_","")
df['COL1']=df['COL1'].str.replace("C_","")
df['COL1']=df['COL1'].str.replace("D_","")
But this also removes the pattern elsewhere in the string (e.g. the inner A_ in row 2) instead of removing only the first 2 characters...
If the values to replace in the_list always have that format, you could also consider using str.replace with a simple pattern that matches an uppercase A-D followed by an underscore at the start of the string: ^[A-D]_
import pandas as pd
strings = [
    "A_element_1_+_none",
    "C_BLOCA_element",
    "D_element_3",
    "element_'",
    "BasaA_bloc",
    "B_basA_bloc",
    "BbasA_bloc"
]
df = pd.DataFrame(strings, columns=["COL1"])
df['COL1'] = df['COL1'].str.replace(r"^[A-D]_", "", regex=True)
print(df)
Output
                COL1
0   element_1_+_none
1      BLOCA_element
2          element_3
3          element_'
4         BasaA_bloc
5          basA_bloc
6         BbasA_bloc
You can also use pandas' apply(): if the string starts with one of the listed prefixes, omit the first two characters, otherwise return the whole string.
df["COL1"] = df["COL1"].apply(lambda x: x[2:] if x.startswith(("A_","B_","C_","D_")) else x)
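Run against the question's data, the apply approach gives the expected output (a quick sketch; df is assumed to hold the question's COL1):

```python
import pandas as pd

df = pd.DataFrame({"COL1": ["A_element_1_+_none", "C_BLOCA_element", "D_element_3",
                            "element_'", "BasaA_bloc", "B_basA_bloc", "BbasA_bloc"]})
# Strip the first two characters only when the row starts with a listed prefix
df["COL1"] = df["COL1"].apply(lambda x: x[2:] if x.startswith(("A_", "B_", "C_", "D_")) else x)
print(df["COL1"].tolist())
# ['element_1_+_none', 'BLOCA_element', 'element_3', "element_'",
#  'BasaA_bloc', 'basA_bloc', 'BbasA_bloc']
```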

Swap the last 2 strings of column names separated by a delimiter

I have a dataframe. I want to swap the last 2 strings of the column names, separated by "_", if the 2nd-to-last string starts with "pi".
Dataframe has columns such as:
abc_rte abc_rte_log abc_rte_log_pi1 abc_rte_pi1_log xyz_pnct_pi2_log
Desired column names:
abc_rte abc_rte_log abc_rte_log_pi1 abc_rte_log_pi1 xyz_pnct_log_pi2
What i tried so far:
for i in range(0, len(df.columns)):
    if str(df.columns[i].split('_')[-2]) == 'pi':
        df.columns[i].split('_')[-2] = str(df.columns[i].split('_')[-1])
Use Index.str.replace:
df.columns = df.columns.str.replace(r'(pi\d*)_([^_]+)$', r'\2_\1')
>>> df.columns
Index(['abc_rte', 'abc_rte_log', 'abc_rte_log_pi1', 'abc_rte_log_pi1',
'xyz_pnct_log_pi2'],
dtype='object')
Regex details:
(pi\d*) : First capturing group
pi : Matches the characters pi literally
\d* : Matches a digit zero or more times
_ : Matches the character _
([^_]+) : Second capturing group
[^_]+ : Matches any character other than _, one or more times
$ : Asserts position at the end of the string
See the online regex demo
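The replacement can be checked on a single column name with re.sub (same pattern and groups as above):

```python
import re

pat = r'(pi\d*)_([^_]+)$'
# Swap the 'pi…' chunk with the final chunk
print(re.sub(pat, r'\2_\1', 'abc_rte_pi1_log'))  # abc_rte_log_pi1
# Names already in the right order are left untouched (no 'pi…_x' suffix to match)
print(re.sub(pat, r'\2_\1', 'abc_rte_log_pi1'))  # abc_rte_log_pi1
```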
Alternatively, build a rename mapping without regex. Note that the check needs startswith('pi') rather than == 'pi', since the chunk is pi1, pi2, etc.:
mapping = {col: col for col in df.columns}
for colname in df.columns:
    splits = colname.rsplit("_", 2)
    if len(splits) == 3 and splits[-2].startswith('pi'):
        mapping[colname] = "_".join((splits[0], splits[-1], splits[-2]))
df.rename(columns=mapping, inplace=True)
