I want to fix some string entries in a pandas Series, such that every value matching the pattern '0x.202' (the last digit of the year is missing) gets a zero appended at the end, so that it becomes a full date of the format 'mm.yyyy'. Here is the pattern I got:
pattern = r'\d*\.202(?:$|\W)'
It matches the digits separated by a point and exactly '202' at the end. Could you please help me with the way to replace the strings in the Series while preserving the original indexes?
My current way to do this is:
date = df['Calendar Year/Month'].astype('str')
pattern = re.compile('\d*\.202(?:$|\W)')
date.str.replace(pattern, pattern.pattern + '0', regex=True)
but i get an error:
error: bad escape \d at position 0
Edit: Sorry for the lack of details. I forgot to mention that the dates were misinterpreted by pandas as floats, which is why dates in year 2020 were not shown in full (the trailing zero of 5.2020 is dropped, leaving 5.202, for example). So this is the expression I used:
date = df['Year/Month'].astype('str')
date = date.apply(lambda _: _ if _[-1] == '1' or _[-1] == '9' else f'{_}0')
This way only 'xx.202' entries are edited, while dates like 'xx.2021' and 'xx.2019' are left unchanged. Thanks everyone for the help!
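The underlying float behaviour can be reproduced with a minimal sketch (made-up values):

```python
import pandas as pd

# 'mm.yyyy' values read as floats lose the trailing zero of the year
# when converted back to strings
s = pd.Series([5.2020, 12.2021])
print(s.astype(str).tolist())  # ['5.202', '12.2021']
```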
Do you have to use regex here? If not, this would work (append a 0 if the string is not already 7 characters long, i.e. not yet 'mm.yyyy'):
df["Calendar Year/Month"].apply(lambda _: _ if len(_)==7 else f'{_}0')
Or maybe this (append a 0 if the last character is '2'):
df["Calendar Year/Month"].apply(lambda _: f'{_}0' if _[-1] == '2' else _)
I would do a str.replace:
df = pd.DataFrame({'Year/Month':['10.202 abc', 'abc 1.202']})
df['Year/Month'].str.replace(r'(\d*\.202)\b', r'\g<1>0', regex=True)
Output:
0 10.2020 abc
1 abc 1.2020
Name: Year/Month, dtype: object
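As a quick check that the original index survives, here is a minimal sketch with made-up data and a non-default index:

```python
import pandas as pd

# Made-up series with a non-default index and truncated 2020 dates
date = pd.Series(['5.202', '12.2021', '7.202'], index=[10, 20, 30])

# \b is a word boundary, so '...202' at the end of a token matches
# while '...2021' does not
fixed = date.str.replace(r'(\d+\.202)\b', r'\g<1>0', regex=True)
print(fixed.tolist())        # ['5.2020', '12.2021', '7.2020']
print(fixed.index.tolist())  # [10, 20, 30]
```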
Related
I am looking for a way to write this code concisely. It's for replacing certain characters in a pandas DataFrame column.
df['age'] = ['[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)']
df['age'] = df['age'].str.replace('[', '')
df['age'] = df['age'].str.replace(')', '')
df['age'] = df['age'].str.replace('50-60', '50-59')
df['age'] = df['age'].str.replace('60-70', '60-69')
df['age'] = df['age'].str.replace('70-80', '70-79')
df['age'] = df['age'].str.replace('80-90', '80-89')
df['age'] = df['age'].str.replace('90-100', '90-99')
I tried this, but it didn't work, strings in df['age'] were not replaced:
chars_to_replace = {
    '[' : '',
    ')' : '',
    '50-60' : '50-59',
    '60-70' : '60-69',
    '70-80' : '70-79',
    '80-90' : '80-89',
    '90-100': '90-99'
}
for key in chars_to_replace.keys():
    df['age'] = df['age'].replace(key, chars_to_replace[key])
UPDATE
As pointed out in the comments, I did forget str before replace. Adding it solved my problem, thank you!
Also, thank you tdelaney for that answer; I gave it a try and it works just as well. I am not familiar with regex substitutions yet, so I wasn't comfortable with the other options.
Use two passes of regex substitution.
In the first pass, match each pair of numbers separated by -, and decrement the second number.
In the second pass, remove any occurrences of [ and ).
By the way, did you mean to have spaces between each pair of numbers? Because as it is now, implicit string concatenation puts them together without spaces.
import re
string = '[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)'
def repl(m: re.Match):
    age1 = m.group(1)
    age2 = int(m.group(2)) - 1
    return f"{age1}-{age2}"
string = re.sub(r'(\d+)-(\d+)', repl, string)
string = re.sub(r'\[|\)', '', string)
print(string) # 70-7950-5960-6940-4980-8990-99
The repl function above can be condensed into a lambda:
repl = lambda m: f"{m.group(1)}-{int(m.group(2))-1}"
Update: Actually, this can be done in one pass.
import re
string = '[70-80)' '[50-60)' '[60-70)' '[40-50)' '[80-90)' '[90-100)'
repl = lambda m: f"{m.group(1)}-{int(m.group(2))-1}"
string = re.sub(r'\[(\d+)-(\d+)\)', repl, string)
print(string) # 70-7950-5960-6940-4980-8990-99
Assuming these brackets are on all of the entries, you can slice them off and then replace the range strings. Per the docs for pandas.Series.replace, pandas will replace the values from the dict without the need for you to loop.
import pandas as pd
df = pd.DataFrame({
    "age": ['[70-80)', '[50-60)', '[60-70)', '[40-50)', '[80-90)', '[90-100)']})
ranges_to_replace = {
    '50-60' : '50-59',
    '60-70' : '60-69',
    '70-80' : '70-79',
    '80-90' : '80-89',
    '90-100': '90-99'}
df['age'] = df['age'].str.slice(1,-1).replace(ranges_to_replace)
print(df)
Output
age
0 70-79
1 50-59
2 60-69
3 40-50
4 80-89
5 90-99
In addition to the previous response, if you want to apply the regex substitution to your DataFrame, you can use the apply method from pandas. To do so, put the regex substitution into a function, then use apply:
def replace_chars(chars):
    string = re.sub(r'(\d+)-(\d+)', repl, chars)
    string = re.sub(r'\[|\)', ' ', string)
    return string
df['age'] = df['age'].apply(replace_chars)
print(df)
which gives the following output:
age
0 70-79 50-59 60-69 40-49 80-89 90-99
By the way, here I put spaces between the age intervals. Hope this helps.
Change the last part to this:
for i in range(len(df['age'])):
    for x in chars_to_replace:
        # .loc avoids the chained-assignment pitfall of df['age'].iloc[i] = ...
        df.loc[i, 'age'] = df.loc[i, 'age'].replace(x, chars_to_replace[x])
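The same replacements can also be sketched without any per-row loop, using Series.str.replace on the whole column once per key; the sample data here is made up:

```python
import pandas as pd

df = pd.DataFrame({'age': ['[70-80)', '[50-60)', '[90-100)']})

chars_to_replace = {
    '[': '',
    ')': '',
    '70-80': '70-79',
    '50-60': '50-59',
    '90-100': '90-99',
}

# Apply each replacement to the whole column at once; regex=False
# because the keys are plain substrings, not patterns
for old, new in chars_to_replace.items():
    df['age'] = df['age'].str.replace(old, new, regex=False)

print(df['age'].tolist())  # ['70-79', '50-59', '90-99']
```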
I have Python strings like the ones below:
"1234_4534_41247612_2462184_2131_GHI.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx"
"12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx"
"1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx"
I would like to do the following:
a) extract the characters that appear before and after the 1st dot
b) the keywords that I want are always found after the last _ symbol
For example: for the 2nd input string, I would like to get only PQRST.GHI as output. It comes after the last _ and before the 1st ., and we also keep the keyword after the 1st .
So, I tried the below:
for s in strings:
    after_part = s.split('.')[1]
    before_part = s.split('.')[0]
    before_part = before_part.split('_')[-1]
    expected_keyword = before_part + "." + after_part
    print(expected_keyword)
Though this works, it is definitely not a nice and elegant solution.
Is there a better way to write this, perhaps with a regex?
I expect my output to be like as below. As you can see that we get keywords before and after 1st dot character
GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
Try (regex101):
import re
strings = [
    "1234_4534_41247612_2462184_2131_ABCDEF.GHI.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41247612_2462184_2131_PQRST.GHI.xlsx",
    "12JSAF34_45aAF34__sfhaksj_DHJKhd_hJD_41247612_2f462184_2131_JKLMN.OPQ.xlsx",
    "1234_4534__sfhaksj_DHJKhd_hJD_41FA247612_2462184_2131_WXY.TUV.xlsx",
]
pat = re.compile(r"[^.]+_([^.]+\.[^.]+)")
for s in strings:
    print(pat.search(s).group(1))
Prints:
ABCDEF.GHI
PQRST.GHI
JKLMN.OPQ
WXY.TUV
You can also use str.extract:
df['text'].str.extract(r'_([^._]+\.[^.]+)', expand=False)
Output:
0 ABCDEF.GHI
1 PQRST.GHI
2 JKLMN.OPQ
3 WXY.TUV
Name: text, dtype: object
You can also do it with rsplit(). Specify maxsplit, so that you don't split more than you need to (for efficiency):
[s.rsplit('_', maxsplit=1)[1].rsplit('.', maxsplit=1)[0] for s in strings]
# ['GHI', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
If there are strings with less than 2 dots and each returned string should have one dot in it, then add a ternary operator that splits (or not) depending on the number of dots in the string.
[x.rsplit('.', maxsplit=1)[0] if x.count('.') > 1 else x
for s in strings
for x in [s.rsplit('_', maxsplit=1)[1]]]
# ['GHI.xlsx', 'PQRST.GHI', 'JKLMN.OPQ', 'WXY.TUV']
I have a dataset with lots of variation in format like this.
-0.002672945<120>
-0.077635566{600}
5.88365537e-005{500}
-0.116441565{1}
-4.549649974<29.448>
There is all kinds of variety at the end of the values, and I need to remove all those weird brackets. The problem is that sometimes they are 3 characters, sometimes 6, etc. I also cannot just take the first few characters, because there are scientific-notation numbers such as 8.645637e-007.
Is there a smart way to clean this kind of mess from the data?
The str.split function accepts a regex too:
df = pd.DataFrame({'Fruit': ['Banana', 'Banana', 'Carrot<x2>', 'Carrot{78}', 'Carrot<91'], 'Person': list('ABCDE')})
df.loc[:, 'Fruit'] = df['Fruit'].str.split(r'<|{', n=1, expand=True)[0]
>>> df = pd.DataFrame({"x": [
... "-0.002672945<120>",
... "-0.077635566{600}",
... "5.88365537e-005{500}",
... "-0.116441565{1}",
... "-4.549649974<29.448>",
... ]})
>>> df["x"].replace(r"[<{].+$", "", regex=True)
0 -0.002672945
1 -0.077635566
2 5.88365537e-005
3 -0.116441565
4 -4.549649974
Name: x, dtype: object
>>>
You can assign that result back into the df then.
Use a regular expression to clean those:
df[column].str.replace(r'[<\[{].+?[>\]}]$', '', regex=True)
Output:
0 -0.002672945
1 -0.077635566
2 5.88365537e-005
3 -0.116441565
4 -4.549649974
Name: column, dtype: object
Breakdown of the regex:
[<\[{] -- Character class; matches ONE of ANY of the characters between the `[` and `]` (the `\[` is just a literal `[`, escaped)
.+?    -- "." matches ANY single character (except newline), "+" means one or more of the preceding token, and "?" makes the "+" lazy, so it matches as few characters as possible
[>\]}] -- Character class matching a closing `>`, `]` or `}`
$      -- Only match this stuff if it occurs at the VERY END of the string
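The effect of the lazy ? is easiest to see without the $ anchor; a quick sketch on a made-up string with two bracket pairs:

```python
import re

s = 'a<12>keep<34>'
print(re.sub(r'<.+>', '', s))   # greedy: 'a' (matches from the first < to the last >)
print(re.sub(r'<.+?>', '', s))  # lazy: 'akeep' (matches each bracket pair separately)
```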
I have a column that has 5 numbers, then a dash, then another 5 numbers, for example 44004-23323. I would like to replace the dash in the middle with a 0, so the output looks like 44004023323.
I have tried the code below, but it's not working.
df['Lane'] = df['Lane'].apply(lambda x: "0" if x == "-" else x)
How about .str.replace()?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html
Pandas: replace substring in string
# see documentation for other parameters, such as regex and case
df['Lane'] = df['Lane'].str.replace('-', '0')
Try this
df['Lane'] = df['Lane'].apply(lambda x: str(x).replace('-','0'))
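Either way, a quick check on the example value from the question (plus a second made-up one), assuming the column holds strings:

```python
import pandas as pd

df = pd.DataFrame({'Lane': ['44004-23323', '51000-00017']})

# Replace the literal dash; regex=False since '-' is a plain character here
df['Lane'] = df['Lane'].str.replace('-', '0', regex=False)
print(df['Lane'].tolist())  # ['44004023323', '51000000017']
```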
I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with - dash. I'm currently using this method but nothing is changed.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced with value
So because the str values do not match, no replacement occurs, compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
If you only need to replace characters in one specific column, and somehow regex=True and inplace=True both failed, I think this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
The lambda here works like a function applied in a loop over the column: x represents each entry of the column in turn.
The only thing you need to do is change "column_name", "characters_need_to_replace" and "new_characters".
Replace all spaces with underscores in the column names:
data.columns = data.columns.str.replace(' ', '_', regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
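For example, applied to a couple of made-up strings:

```python
import re
import pandas as pd

# Build a character class that matches any of the unwanted characters
chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'

s = pd.Series(['(12.5)', 'a-b.c'])
print(s.str.replace(regular_expression, '', regex=True).tolist())  # ['125', 'abc']
```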
Almost similar to the answer by Nancy K, this works for me (note there is no .str inside apply, since x is already a plain string):
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
If you want to remove two or more characters from a string, for example the characters '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> ['100000', '1100000']