Replace only specified occurrences of string - Python Regex - python

I am playing around with the re module in Python, and I am stuck: I want to replace only one specified occurrence of a substring.
For example:
import re
string = "aabbaabbaabbaabbaabbaa"
# I want to replace only the 3rd occurrence of 'bb' in the string with a space
string = re.sub("bb", " ", string, 3)  # if I do this, all of the first 3 occurrences get replaced
print(string)
Output:
aa aa aa aabbaabbaa
Any idea how to replace only the 3rd occurrence, so that the output looks like this?
aabbaabbaa aabbaabbaa

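This may not be the perfect way, but it is a solution: replace the first three occurrences, then restore the first two.
string = re.sub('bb', ' ', string, 3)
string = re.sub(' ', 'bb', string, 2)
This is just an alternative I can think of; it only works if the string contains no spaces before the third 'bb'.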

Modify the regex so that it only matches the third occurrence:
re.sub(r'(.*?bb.*?bb.*?)bb', r'\1 ', string, 1)
This can be extended to a large number of repetitions, e.g. r'(.*?(bb.*?){9999})bb' for the 10,000th occurrence.
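A more general alternative, sketched here rather than taken from the answers above: re.sub accepts a function as the replacement, so a counter can leave every match untouched except the n-th. The helper name replace_nth is hypothetical and not part of the re module, and repl is inserted literally (backreferences such as \1 are not expanded).

```python
import re
from itertools import count

def replace_nth(pattern, repl, text, n):
    """Replace only the n-th (1-based) match of pattern in text."""
    counter = count(1)  # numbers the matches as re.sub visits them

    def _swap(match):
        # Substitute only on the n-th match; return every other match unchanged.
        return repl if next(counter) == n else match.group(0)

    return re.sub(pattern, _swap, text)

result = replace_nth("bb", " ", "aabbaabbaabbaabbaabbaa", 3)
print(result)  # aabbaabbaa aabbaabbaa
```

If the string has fewer than n matches, it is returned unchanged.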

Related

Multiple character replacement in a column

I am trying to replace several different characters with a single character. I can do it with multiple lines of code, but I was wondering whether something like this could do it in a single line:
df['Column'].str.replace(['_','-','/'], ' ')
I can write 3 lines of code with the normal str.replace() and change those characters one by one, but I don't think that would be efficient.
pandas Series.str.replace takes a string or a regex pattern as its first argument, so you can provide a regex that matches several patterns at once.
Code:
import pandas as pd
check_df = pd.DataFrame({"Column":["abc", "A_bC", "A_b-C/d"]})
# regex=True is required in pandas 2.0+, where the default became regex=False
check_df['Column'].str.replace("_|-|/", " ", regex=True)
Output:
0 abc
1 A bC
2 A b C d
Name: Column, dtype: object
You can use a regular expression with alternation:
df['Column'].str.replace(r"_|-|/", " ", regex=True)
| means "either of these".
Or you can use str.maketrans to build a translation table and pass it to .str.translate:
df['Column'].str.translate(str.maketrans(dict.fromkeys("_-/", " ")))
Note that this only works for translating single characters.
If the characters are produced dynamically, e.g. in a list chars, then "|".join(map(re.escape, chars)) can be used for the first way and "".join(chars) for the second. re.escape takes care of escaping regex special characters: for example, "$" is the end-of-string anchor in a regex, so to replace a literal "$" the pattern needs "\$", which re.escape produces. Each character has to be escaped before joining; escaping the joined string would also escape the "|" separators.
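As a rough illustration of the dynamic case, here is a sketch using only the standard library, so it runs without pandas. The list chars and the sample text are made up for the example. Each character is escaped individually so that the "|" separators stay unescaped.

```python
import re

chars = ["_", "-", "/", "$"]  # hypothetical, dynamically produced list

# First way: escape each character, then join with "|".
pattern = "|".join(re.escape(c) for c in chars)

# Second way: a translation table mapping every character to a space.
table = str.maketrans(dict.fromkeys("".join(chars), " "))

text = "A_b-C/d$e"
via_regex = re.sub(pattern, " ", text)
via_translate = text.translate(table)
print(via_regex)      # A b C d e
print(via_translate)  # A b C d e
```

With a DataFrame, the same pattern and table would be passed to .str.replace(pattern, " ", regex=True) and .str.translate(table) respectively.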
You could use a character class [/_-] listing the characters that you want to replace.
Note that if you have multiple consecutive characters and you replace them with a space, you will get space gaps. If you don't want that, you can repeat the character class with a + to match 1 or more characters and replace that match with a single space.
If you don't want the leading and trailing spaces, you can use .str.strip()
Example
import pandas as pd
df = pd.DataFrame({"Column":[" a//b_c__-d", "a//////b "]})
df['Column'] = df['Column'].str.replace(r"[/_-]", ' ', regex=True)
print(df)
print("\n---------v2---------\n")
df_v2 = pd.DataFrame({"Column":[" a//b_c__-d", "a//////b "]})
df_v2['Column'] = df_v2['Column'].str.replace(r"[/_-]+", ' ', regex=True).str.strip()
print(df_v2)
Output
Column
0 a b c d
1 a b
---------v2---------
Column
0 a b c d
1 a b

Matching a String in Python using regex

I have a string say like this:
ARAN22 SKY BYT and TRO_PAN
In the above string, the first letter can be A, S, T or N, and the two characters after RAN can be any two digits. The rest is always the same, and the string always ends with _PAN.
So a few possibilities for the string are:
SRAN22 SK BYT and TRO_PAN
TRAN25 SK BYT and TRO_PAN
NRAN25 SK BYT and TRO_PAN
So I was trying to extract the string every time in python using regex as follows:
import re
pattern = "([ASTN])RAN" + "\w+\s+" +"_PAN"
pat_check = re.compile(pattern, flags=re.IGNORECASE)
sample_test_string = 'NRAN28 SK BYT and TRO_PAN'
re.match(pat_check, sample_test_string)
Here the string can be any of the examples I gave above.
But it's not working: I don't get the full string (the sample test string) back as the match, as I should. I'm not sure what I am doing wrong. Any help will be very much appreciated.
You are using \w+\s+, which matches one or more word characters (0-9, A-Z, a-z, _) followed by one or more whitespace characters. So it matches the two digits and the space after RAN, but nothing more. Since the next characters are not _PAN, the match fails. You need to use [\w\s]+ instead, written with raw strings so the backslashes reach the regex engine unchanged:
pattern = r"([ASTN])RAN" + r"[\w\s]+" + r"_PAN"
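A quick check of the [\w\s]+ pattern against the sample strings from the question might look like this. The "XRAN..." string is an invented counterexample, not from the question.

```python
import re

# Raw strings keep \w and \s intact for the regex engine.
pat_check = re.compile(r"([ASTN])RAN[\w\s]+_PAN", flags=re.IGNORECASE)

samples = [
    "ARAN22 SKY BYT and TRO_PAN",
    "SRAN22 SK BYT and TRO_PAN",
    "TRAN25 SK BYT and TRO_PAN",
    "NRAN28 SK BYT and TRO_PAN",
]

# group(0) is the whole matched text, group(1) the captured first letter.
matched = [pat_check.match(s).group(0) for s in samples]
first_letters = [pat_check.match(s).group(1) for s in samples]
rejected = pat_check.match("XRAN22 SK BYT and TRO_PAN")  # 'X' is not in [ASTN]

print(matched == samples)  # True
print(first_letters)       # ['A', 'S', 'T', 'N']
print(rejected)            # None
```

The greedy [\w\s]+ first runs to the end of the string and then backtracks until _PAN can match, which is why the whole string is returned.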

Change string for defined pattern (Python)

Learning Python, I came across a demanding beginner's exercise.
Let's say you have a string made up of "blocks" of characters separated by ';'. An example would be:
cdk;2(c)3(i)s;c
And you have to return a new string based on the old one but in accordance with a certain pattern (which is also a string), for example:
c?*
This pattern means that each block must start with a 'c', the '?' character must be replaced by some other letter, and finally '*' by an arbitrary number of letters.
So when the pattern is applied you return something like:
cdk;cciiis
Another example:
string: 2(a)bxaxb;ab
pattern: a?*b
result: aabxaxb
My very crude attempt resulted in this:
def switch(string,pattern):
    d = []
    for v in range(0,string):
        r = float("inf")
        for m in range (0,pattern):
            if pattern[m] == string[v]:
                d.append(pattern[m])
            elif string[m]==';':
                d.append(pattern[m])
            elif (pattern[m]=='?' & Character.isLetter(string.charAt(v))):
                d.append(pattern[m])
    return d
Tips?
To split a string you can use the split() method.
For pattern detection in strings you can use regular expressions (regex) with the re module.
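Putting those two tips together, one possible sketch follows. It rests on an interpretation of the exercise that the question does not spell out: "n(x)" is assumed to mean x repeated n times, '?' exactly one ASCII letter, '*' zero or more ASCII letters, and blocks that do not fit the pattern are assumed to be dropped. The helper names expand and pattern_to_regex are hypothetical.

```python
import re

def expand(block):
    # "2(c)3(i)s" -> "cciiis": repeat each parenthesised part n times.
    return re.sub(r"(\d+)\(([^)]*)\)",
                  lambda m: m.group(2) * int(m.group(1)), block)

def pattern_to_regex(pattern):
    # Translate the exercise's wildcards into an equivalent regex.
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z]")    # exactly one letter
        elif ch == "*":
            parts.append("[A-Za-z]*")   # any number of letters
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts))

def switch(string, pattern):
    regex = pattern_to_regex(pattern)
    blocks = [expand(b) for b in string.split(";")]
    # Keep only the blocks that match the whole pattern.
    return ";".join(b for b in blocks if regex.fullmatch(b))

print(switch("cdk;2(c)3(i)s;c", "c?*"))  # cdk;cciiis
print(switch("2(a)bxaxb;ab", "a?*b"))    # aabxaxb
```

Under these assumptions the sketch reproduces both examples from the question.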

Using strip() to remove only one element

I have a word enclosed in two opening and two closing parentheses, like this: ((word)).
I want to remove the first and the last parenthesis, so that they are not duplicated, in order to obtain something like this: (word).
I have tried using strip('()') on the variable that contains ((word)). However, it removes ALL parentheses at the beginning and at the end. Result: word.
Is there a way to specify that I only want the first and last one removed?
For this you could slice the string and only keep from the second character until the second to last character:
word = '((word))'
new_word = word[1:-1]
print(new_word)
Produces:
(word)
For a varying number of parentheses, you could first count how many there are and pass that to the slice (this leaves only one bracket on each side; if you want to remove only the first and last bracket, use the first suggestion). This assumes the brackets are all at the ends of the string, with the same number on each side and at least two of each:
word = '((((word))))'
quan = word.count('(')
new_word = word[quan-1:1-quan]
print(new_word)
Produces:
(word)
You can use regex.
import re
word = '((word))'
re.findall(r'(\(?\w+\)?)', word)[0]
This only keeps one pair of brackets.
Instead, use str.replace with a count: word.replace('(', '', 1).
Normally this would replace every '(' with '', but the third argument limits the replacement to the first n occurrences of the substring, so only the first '(' is replaced. Note that this handles only the opening parenthesis; the last ')' still has to be removed separately.
Read the documentation:
replace(...)
S.replace (old, new[, count]) -> string
Return a copy of string S with all occurrences of substring
old replaced by new. If the optional argument count is
given, only the first count occurrences are replaced.
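Since the count argument only limits replacements from the left, removing exactly one character from each end needs a little more. A minimal sketch, assuming a small helper (the name strip_one is hypothetical) is acceptable:

```python
def strip_one(text, left="(", right=")"):
    """Remove at most one leading `left` and one trailing `right`."""
    if text.startswith(left):
        text = text[len(left):]
    if text.endswith(right):
        text = text[:-len(right)]
    return text

print(strip_one("((word))"))  # (word)
print(strip_one("(word"))     # word
```

Unlike strip('()'), this never removes more than one character per side, and it leaves strings without the delimiters untouched.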
You can replace the double opening and double closing parentheses, setting the count parameter to 1 for both operations:
print('((word))'.replace('((','(',1).replace('))',')',1))
But this will not work if there are other occurrences of double closing parentheses earlier in the string.
Reversing the string before replacing the closing ones may help:
t = '((word))'
t = t.replace('((','(',1)
t = t[::-1] # see string reversion topic [https://stackoverflow.com/questions/931092/reverse-a-string-in-python]
t = t.replace('))',')',1)
t = t[::-1] # and reverse again
I used regular expressions for this, replacing each run of brackets with a single one using re.sub:
import re
s="((((((word)))))))))"
t=re.sub(r"\(+","(",s)
g=re.sub(r"\)+",")",t)
print(g)
Output
(word)
Try below:
>>> import re
>>> w = '((word))'
>>> re.sub(r'([()])\1+', r'\1', w)
'(word)'
>>> w = 'Hello My ((word)) into this world'
>>> re.sub(r'([()])\1+', r'\1', w)
'Hello My (word) into this world'
>>>
Try this one:
s = "((word))"
s = s[1:len(s)-1]  # assign the slice; slicing does not modify the string in place
print(s)
The output is (word).

Retrieve part of string, variable length

I'm trying to learn how to use Regular Expressions with Python. I want to retrieve an ID number (in parentheses) in the end from a string that looks like this:
"This is a string of variable length (561401)"
The ID number (561401 in this example) can be of variable length, as can the text.
"This is another string of variable length (99521199)"
My code fails:
import re
import selenium
# [Code omitted here, I use selenium to navigate a web page]
result = driver.find_element_by_class_name("class_name")
print(result.text) # [This correctly prints the whole string "This is a text of variable length (561401)"]
id = re.findall("??????", result.text) # [Not sure what to do here]
print(id)
This should work for your example:
(?<=\()[0-9]*
(?<=...) is a lookbehind: it requires that something precede the text you are looking for, but doesn't consume it. In this case I used \(; since ( is a special character, it has to be escaped with \. [0-9] matches any digit. The * means "match any number of the directly preceding element", so [0-9]* matches as many digits as there are.
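Applied with re.findall, the lookbehind could be used roughly as follows. This sketch swaps * for + so that an empty "()" produces no match rather than an empty string, and it adds an anchored variant that assumes the ID is the last thing in the string; both choices are illustrative assumptions, not part of the answer above.

```python
import re

text = "This is a string of variable length (561401)"

# Lookbehind version; + requires at least one digit after the "(".
ids = re.findall(r"(?<=\()[0-9]+", text)

# Anchored variant (assumption: the ID is the last thing in the string).
match = re.search(r"\((\d+)\)\s*$", text)
last_id = match.group(1) if match else None

print(ids)      # ['561401']
print(last_id)  # 561401
```

Both results are strings; wrap them in int() if a number is needed.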
Solved this thanks to Kaz's link, very useful:
http://regex101.com/
id = re.findall(r"(\d+)", result.text)
print(id[0])
You can use this simple solution:
>>> origin_string = "This is a string of variable length (561401)"
>>> str1 = origin_string.replace("(", " ")
>>> str1
'This is a string of variable length  561401)'
>>> str2 = str1.replace(")", " ")
>>> str2
'This is a string of variable length  561401 '
>>> [int(s) for s in str2.split() if s.isdigit()]
[561401]
First, I replace the parentheses with spaces, and then I search the new string for integers.
There is no real need for regular expressions here: if the ID is always at the end and always in parentheses, you can split the string, take the last element, and remove the parentheses by slicing ([1:-1]). Regexes are also relatively expensive in time compared with plain string methods.
line = "This is another string of variable length (99521199)"
print(line.split()[-1][1:-1])
If you did want to use regular expressions, I would do this:
import re
line = "This is another string of variable length (99521199)"
id_match = re.match(r'.*\((\d+)\)', line)
if id_match:
    print(id_match.group(1))
