Replacing special characters in a string in Python

I'm trying to replace special characters in a data frame with unaccented or different ones.
I can replace one with
df['col_name'] = df.col_name.str.replace('?','j')
This turned the '?' into 'j', but I can't figure out how to change more than one.
I have a list of special characters that I want to change. I've tried using a dictionary, but it doesn't seem to work:
the_reps = {'?','j'}
df1 = df.replace(the_reps, regex = True)
This gave me the error "nothing to repeat at position 0".
EDIT:
This is what worked, although it is probably not very pretty:
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')
...
with one line for each character.

import re
# template: put the characters to remove inside the brackets
s = re.sub("[list of special characters]", "", your_string)
print(s)
An example:
import re
text = "Hello$#& Python3$"  # avoid naming a variable str, which shadows the built-in
s = re.sub("[$#&]", "", text)
print(s)
# Output: Hello Python3
Explanation:
s = re.sub("[$#&]", "", text)
Pattern to be replaced → "[$#&]"
[] indicates a set of characters
[$#&] → matches either $ or # or &
The replacement string is an empty string
If any of these characters are found in the string, they are replaced with an empty string, i.e. removed

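You can use Series.replace with a dictionary:
# d = {'actual character': 'replacement', ...}
# this line applies the mapping to the column names;
# for the values of one column, use df['col_name'].replace(d, regex=True)
df.columns = df.columns.to_series().replace(d, regex=True)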
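As a plain-Python illustration of the dictionary idea, the sketch below builds one regex from the dictionary keys and replaces every match in a single pass. The mapping and the helper name replace_all are made up for this example. re.escape is what keeps a character such as ? from being read as a regex quantifier, which raises an error when it is used as a pattern on its own. Note also that {'?','j'} in the question is a set; a dictionary would be written {'?': 'j'}.

```python
import re

# Hypothetical mapping of characters to replacements (not from the original post).
replacements = {"?": "j", "é": "e", "ß": "ss"}

# Escape each key so that regex metacharacters such as "?" are matched literally,
# then join the keys into a single alternation.
pattern = re.compile("|".join(re.escape(key) for key in replacements))


def replace_all(text):
    """Replace every key of `replacements` found in `text` with its value."""
    return pattern.sub(lambda m: replacements[m.group(0)], text)


print(replace_all("ma?or café Straße"))  # major cafe Strasse
```

On a data frame, something like df['col_name'].map(replace_all) should apply the helper to each string in a column, assuming the column holds only strings.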

Try this:
import re
my_str = "hello Fayzan-Bhatti Ho~!w"
# keep letters, digits, spaces, newlines and dots; remove everything else
my_new_string = re.sub(r'[^a-zA-Z0-9 \n.]', '', my_str)
print(my_new_string)
Output: hello FayzanBhatti How

Related

Delete specific duplicated punctuation from string

I have this string s = "(0|\\+33)[1-9]( *[0-9]{2}){4}" and I want to remove just one of the duplicated backslashes, so that the result looks like (0|\+33)[1-9]( *[0-9]{2}){4}.
When I used this code, all the duplicated characters were removed:
result = "".join(dict.fromkeys(s))
But in my case I want to remove only the duplicated backslash. Any help is highly appreciated.
A solution using the re module:
import re
s = r"(0|\\+33)[1-9]( *[0-9]{2}){4}"
s = re.sub(r"\\(?=\\)", "", s)
print(s)
I look for every backslash that is followed by another backslash and replace it with an empty string.
Output: (0|\+33)[1-9]( *[0-9]{2}){4}
The function you need is replace:
s = "(0|\\+33)[1-9]( *[0-9]{2}){4}"
result = s.replace("\\","")
EDIT
I see now that you want to remove just one \ and not both.
To do this, modify the call to replace like this:
result = s.replace("\\","",1) # the last argument is the number of occurrences to replace
or
result = s.replace("\\\\","\\") # replace each doubled backslash with a single one
EDIT of the EDIT
Backslashes are special in Python.
I'm using Python 3.10.5. If I do
x = "ab\c"
y = "ab\\c"
print(len(x)==len(y))
I get True.
That's because the backslash is used to escape special characters, which makes the backslash itself a special character :) Here "\\" is the escape sequence for a single backslash, while "\c" is not a recognized escape, so its backslash is kept as it is.
I suggest experimenting a bit with replace until you get what you need.
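The difference between a normal and a raw string literal can be checked directly. This is a small sketch with made-up variable names (single, double, collapsed). It shows that the literal from the question, written without the r prefix, already contains only one backslash, and that the doubled form only exists in a raw literal.

```python
# Normal literal: "\\" is an escape sequence for ONE backslash character.
single = "(0|\\+33)"

# Raw literal: both backslashes are kept exactly as typed.
double = r"(0|\\+33)"

print(len(single), len(double))  # 8 9
print(single)                    # (0|\+33)
print(double)                    # (0|\\+33)

# Collapse each doubled backslash into a single one.
collapsed = double.replace("\\\\", "\\")
print(collapsed == single)       # True
```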

Slice string at last digit in Python

I have strings with a date somewhere in the middle, like 111_Joe_Smith_2010_Assessment, and I want to truncate them so that they become something like 111_Joe_Smith_2010. The code that I thought would work is
reverseString = currentString[::-1]
stripper = re.search(r'\d', reverseString)
But for some reason this doesn't always give me the right result. Most of the time it does, but every now and then it outputs a string that looks like 111_Joe_Smith_2010_A.
If anyone knows what's wrong with this, it would be super helpful!
You can use re.sub with $ to match and remove letters and underscores at the end of the string:
import re
d = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
new_s = [re.sub('[a-zA-Z_]+$', '', i) for i in d]
Output:
['111_Joe_Smith_2010', '111_Bob_Smith_2010']
You could strip non-digit characters from the end of the string using re.sub like this:
>>> import re
>>> re.sub(r'\D+$', '', '111_Joe_Smith_2010_Assessment')
'111_Joe_Smith_2010'
For your input format you could also do it with a simple loop:
>>> s = '111_Joe_Smith_2010_Assessment'
>>> i = len(s) - 1
>>> while not s[i].isdigit():
...     i -= 1
...
>>> s[:i+1]
'111_Joe_Smith_2010'
You can use the following approach:
def clean_names():
    names = ['111_Joe_Smith_2010_Assessment', '111_Bob_Smith_2010_Test_assessment']
    for name in names:
        # drop characters from the end until the last one is a digit
        while not name[-1].isdigit():
            name = name[:-1]
        print(name)
Here is another solution using rstrip() to remove trailing letters and underscores, which I consider a pretty smart alternative to re.sub() as used in other answers:
import string
s = '111_Joe_Smith_2010_Assessment'
new_s = s.rstrip(f'{string.ascii_letters}_') # For Python 3.6+
new_s = s.rstrip(string.ascii_letters+'_') # For other Python versions
print(new_s) # 111_Joe_Smith_2010
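The loop-based answers above assume that the string contains at least one digit; without one they run off the start of the string and raise IndexError. A minimal sketch of a variant that leaves such strings unchanged is shown here. The helper name truncate_after_last_digit is made up for illustration, and the lookahead pattern is one possible way to locate the last digit, not something taken from the answers.

```python
import re


def truncate_after_last_digit(text):
    """Return text up to and including its last digit.

    If text contains no digit at all, it is returned unchanged.
    """
    # A digit that is followed only by non-digits up to the end of the string.
    match = re.search(r"\d(?=\D*$)", text)
    return text[:match.end()] if match else text


print(truncate_after_last_digit("111_Joe_Smith_2010_Assessment"))       # 111_Joe_Smith_2010
print(truncate_after_last_digit("111_Bob_Smith_2010_Test_assessment"))  # 111_Bob_Smith_2010
print(truncate_after_last_digit("no_digits_here"))                      # no_digits_here
```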

regex replace '...' at the end of the string

I have a string like:
text1 = 'python...is...fun...'
I want to replace the multiple '.'s with a single '.' only when they are at the end of the string. I want the output to be:
python...is...fun.
When there is only one '.' at the end of the string, nothing should be replaced:
text2 = 'python...is...fun.'
and the output is just the same as text2.
My regex is like this:
text = re.sub(r'(.*)\.{2,}$', r'\1.', text)
which I intended to match any string followed by two or more '.' at the end of the string, but the output is:
python...is...fun..
Any ideas how to do this?
Thanks in advance!
You are making it a bit complex. You can do it with the regex \.+$ and replace each match with a single . character.
>>> text1 = 'python...is...fun...'
>>> new_text = re.sub(r"\.+$", ".", text1)
>>> new_text
'python...is...fun.'
You may extend this regex further to handle cases such as an input of only ..., but the main point is that there is no need to count the number of . characters, as you did in your attempt.
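To see why the pattern from the question leaves two dots behind, it helps to look at what the group captures. In this sketch (variable names are illustrative) the greedy (.*) takes as much as it can and leaves only the minimum of two dots for \.{2,}, so one of the three trailing dots ends up inside the group. A lazy (.*?) or a pattern without the group avoids this.

```python
import re

text1 = "python...is...fun..."

# Greedy group: swallows one of the trailing dots.
greedy = re.match(r"(.*)\.{2,}$", text1)
print(greedy.group(1))  # python...is...fun.

# Lazy group: stops as early as possible, so all trailing dots go to \.{2,}.
lazy = re.sub(r"(.*?)\.{2,}$", r"\1.", text1)
print(lazy)             # python...is...fun.

# No group at all: only the trailing run of dots is matched and replaced.
fixed = re.sub(r"\.{2,}$", ".", text1)
print(fixed)            # python...is...fun.
```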
Just look for a string ending with two or more periods, and replace them with a single one.
import re
x = "foo...bar...quux..."
print(re.sub(r'\.{2,}$', '.', x))
# foo...bar...quux.
import re
print(re.sub(r'\.{2,}$', '.', 'I...love...python...'))
As simple as that. Note that you need to escape the . because otherwise it matches any character except \n.
"I want to replace the multiple '.'s to one '.' only when they are at the end of the string"
For such a simple case it's easier to substitute without importing the re module, by checking the last 3 characters (note that this only handles a run of exactly three trailing dots):
text1 = 'python...is...fun...'
text1 = text1[:-2] if text1[-3:] == '...' else text1
print(text1)
The output:
python...is...fun.

Splitting a string using re module of python

I have a string
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
# I have to capture only the field 'count_EVENT_GENRE'
field = re.split(r'[(==)(>=)(<=)(in)(like)]', s)[0].strip()
# output is 'cou'
# for s = 'sum_EVENT_GENRE in [1,2,3,4,5]' the output is 'sum_EVENT_GENRE',
# which is fine
My problem is that the string s is split at any single character from (in)(like), and I get the first slice (after "cou" it finds a matching character, namely n). This happens for any string that contains any character from (in)(like).
Example: 'percentage_AMOUNT' gives 'p',
because it finds the matching character 'e' after 'p'.
So I'd like some advice on how to treat (in)(like) as words rather than as individual characters when splitting.
Please suggest a syntax.
To answer your question: [(==)(>=)(<=)(in)(like)] is a character class, which matches any single character listed inside it. To match sequences of characters, you need to remove [ and ] and use alternation:
r'==?|>=?|<=?|\b(?:in|like)\b'
or better:
r'[=><]=?|\b(?:in|like)\b'
Your code would look like:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    field = re.split(r'[=><]=?|\b(?:in|like)\b', s)[0].strip()
    print(field)
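The difference between the character class and the alternation can be seen side by side on the string from the question. This is only a small comparison sketch; the variable names are made up.

```python
import re

s = "count_EVENT_GENRE in [1,2,3,4,5]"

# Character class: the string is split at the first single character from the
# set ( ) = > < i n l k e, which here is the "n" in "count".
by_class = re.split(r"[(==)(>=)(<=)(in)(like)]", s)[0].strip()
print(by_class)        # cou

# Alternation with word boundaries: the string is split only at an operator
# or at the whole word "in" or "like".
by_alternation = re.split(r"[=><]=?|\b(?:in|like)\b", s)[0].strip()
print(by_alternation)  # count_EVENT_GENRE
```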
However, there might be other (easier, or safer - depending on the actual specifications) ways to get what you want (splitting with space and getting the first item, use re.match with r'\w+' or r'[a-z]+(?:_[A-Z]+)+', etc.)
If your value is at the start of the string, starts with lowercase ASCII letters, and can then contain any number of sequences of _ followed by uppercase ASCII letters, use:
re.match(r'[a-z]+(?:_[A-Z]+)*', s)
Full demo code:
import re
ss = ['count_EVENT_GENRE in [1,2,3,4,5]','coint_EVENT_GENRE = "ROMANCE"']
for s in ss:
    fieldObj = re.match(r'[a-z]+(?:_[A-Z]+)*', s)
    if fieldObj:
        print(fieldObj.group())
If you want only the first word of your string, then this should do the job:
import re
s = 'count_EVENT_GENRE in [1,2,3,4,5]'
field = re.split(r'\W', s)[0]
# count_EVENT_GENRE
Is there anything wrong with using split?
>>> s = 'count_EVENT_GENRE in [1,2,3,4,5]'
>>> s.split(' ')[0]
'count_EVENT_GENRE'
>>> s = 'coint_EVENT_GENRE = "ROMANCE"'
>>> s.split(' ')[0]
'coint_EVENT_GENRE'

Python Regex Get String Between Two Substrings

First off, I know this may seem like a duplicate question, however, I could find no working solution to my problem.
I have a string that looks like the following:
string = "api('randomkey123xyz987', 'key', 'text')"
I need to extract randomkey123xyz987, which will always be between api(' and ',
I was planning on using regex for this; however, I seem to be having some trouble.
This is the only progress that I have made:
import re
string = "api('randomkey123xyz987', 'key', 'text')"
match = re.findall(r"\((.*?)\)", string)[0]
print(match)
The code above returns 'randomkey123xyz987', 'key', 'text'
I have tried to use [^'], but my guess is that I am not properly inserting it into the re.findall function.
Everything that I am trying is failing.
Update: My current workaround is using [2:-4], but I would still like to avoid using match[2:-4].
If the string contains only one instance, use re.search() instead:
>>> import re
>>> s = "api('randomkey123xyz987', 'key', 'text')"
>>> match = re.search(r"api\('([^']*)'", s).group(1)
>>> print(match)
randomkey123xyz987
You want the string between the ( and the , but you are capturing everything between the parentheses:
match = re.findall(r"api\((.*?),", string)
print(match)
["'randomkey123xyz987'"]
Or match between the quotes:
match = re.findall(r"api\('(.*?)'", string)
print(match)
['randomkey123xyz987']
If that is how your strings actually look, you can split:
string = "api('randomkey123xyz987', 'key', 'text')"
# [5:-1] drops the leading api(' and the closing quote
print(string.split(",",1)[0][5:-1])
You should use the following regex:
api\('(.*?)'
assuming that api( is a fixed prefix.
It matches api(' and then captures what follows, up to the next ' character.
>>> re.findall(r"api\('(.*?)'", "api('randomkey123xyz987', 'key', 'text')")
['randomkey123xyz987']
If you are certain that randomkey123xyz987 will always be between "api('" and "',", then the split() method can do it in one line, without any regex matching. It extracts the text between the starting delimiter "api('" and the ending delimiter "',".
>>> string = "api('randomkey123xyz987', 'key', 'text')"
>>> value = (string.split("api('")[1]).split("',")[0]
>>> print(value)
randomkey123xyz987
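The answers above call .group(1) or index into the result directly, which fails when the input does not contain the expected api(' prefix. A minimal sketch of a wrapper that returns None in that case is shown here. The names API_KEY_PATTERN and extract_api_key are hypothetical, and the pattern is the one from the re.search answer.

```python
import re

# Same pattern as in the re.search answer: capture everything between
# api(' and the next single quote.
API_KEY_PATTERN = re.compile(r"api\('([^']*)'")


def extract_api_key(text):
    """Return the first argument of api('...'), or None if there is no match."""
    match = API_KEY_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_api_key("api('randomkey123xyz987', 'key', 'text')"))  # randomkey123xyz987
print(extract_api_key("no call here"))                              # None
```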
