How to replace repeated pattern of characters - python

I have a string in which pairs of random characters repeat 3 times, for example "abababwhatevercdcdcd", and I want to remove these pairs to get the rest of the string ("whatever" in this example). How do I do that?
I tried the following:
import re
re.sub(r'([a-z0-9]{2}){3}', r'', string)
but it does not work

You need backreferences here in order to repeat the match that was actually made, as opposed to trying to make a new match with the same pattern:
([a-z0-9]{2})\1\1
>>> import re
>>> re.sub(r'([a-z0-9]{2})\1\1', r'', "abababwhatevercdcdcd")
'whatever'
>>> re.sub(r'([a-z0-9]{2})\1\1', r'', "wabababhatevercdcdcd")
'whatever'

To match a repeated unit of two or more characters (of any kind) that occurs at least twice in a row, you can use:
(.{2,})\1+
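To make the difference concrete, here is a small sketch comparing the original attempt with the backreference versions. The variable names (loose, strict, general) are only illustrative. The original pattern re-applies [a-z0-9]{2} three times, so any six matching characters qualify, whereas \1 must repeat the exact text that group 1 captured.

```python
import re

s = "abababwhatevercdcdcd"

# Original attempt: the group is simply matched three times in a row,
# so ANY six characters from [a-z0-9] are removed, not only repeated pairs.
loose = re.sub(r'([a-z0-9]{2}){3}', '', s)

# Backreference: \1 must equal the text that group 1 actually captured.
strict = re.sub(r'([a-z0-9]{2})\1\1', '', s)

# General form: a unit of two or more characters repeated at least twice.
general = re.sub(r'(.{2,})\1+', '', s)

print(loose)    # cd
print(strict)   # whatever
print(general)  # whatever
```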

Related

How to remove a string between two words without removing those words?

I want to remove a substring between two words from a string with Python without removing the words that delimit this substring.
what I have as input : "abcde"
what I want as output : "abde"
The code I have:
import re
s = "abcde"
a = re.sub(r'b.*?d', "", s)
what I get as Output : "ae"
Edit:
another example to explain the case :
what I have as input : "c:/user/home/56_image.jpg"
what I want as output : "c:/user/home/image.jpg"
The code I have:
import re
s = "c:/user/home/56_image.jpg"
a = re.sub(r'/.*?image', "", s)
what I get as Output : "c:/user/home.jpg"
Note: the number before "image" changes, so I cannot use the replace() function; I want something generic.
You can do it like below:
''.join('abcde'.split('c'))
I would phrase the regex replacement as:
s = "abcde"
a = re.sub(r'b\w*d', "bd", s)
print(a) # abde
I am using \w* to match zero or more word characters in between b and d. This is to ensure that we don't accidentally match across words.
Your pattern also matches the parts you want to keep, and the whole match is replaced with an empty string; that is why they are missing from the result.
You can use capture groups and refer to the group in the replacement, or lookarounds, which are non-consuming.
For example, capture the b in group 1 and use \1 in the replacement:
(b)\w*?(?=d)
Or use a lookahead, with an empty string as the replacement:
\d+_(?=image)
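As a rough sketch, the two approaches above can be tried on both inputs from the question. The variable names are only for illustration; the patterns are the ones given in this answer.

```python
import re

# Capture group: the leading "b" is matched but put back via \1;
# the lookahead (?=d) checks for "d" without consuming it.
with_group = re.sub(r'(b)\w*?(?=d)', r'\1', "abcde")

# Lookahead only: remove the digits and underscore that sit right
# before "image", leaving the rest of the path untouched.
with_lookahead = re.sub(r'\d+_(?=image)', '', "c:/user/home/56_image.jpg")

print(with_group)      # abde
print(with_lookahead)  # c:/user/home/image.jpg
```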

regex. Find multiple occurrence of pattern

I have the following string
my_string = "this data is F56 F23 and G87"
And I would like to use regex to return the following output
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
I approached the problem with python and with this code
import re
re.findall(r'\b(F\d{2}|G\d{2})\b', my_string)
I was able to get all the occurrences
['F56', 'F23', 'G87']
But I would like to have the first two groups together since they are consecutive occurrences. Any ideas of how I can achieve that?
You can use this regex:
\b[FG]\d{2}(?:\s+[FG]\d{2})*\b
Non-capturing group (?:\s+[FG]\d{2})* will find zero or more of the following space separated F/G substrings.
Code:
>>> my_string = "this data is F56 F23 and G87"
>>> re.findall(r'\b[FG]\d{2}(?:\s+[FG]\d{2})*\b', my_string)
['F56 F23', 'G87']
You can do this with:
\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b
in case it is separated by at least one spacing character. If that is not a requirement, you can do this with:
\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b
Both the first and second regex generate:
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
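The two patterns give the same result on the question's string, but they differ when codes are written without a space between them. A small sketch, using a made-up input string for illustration:

```python
import re

# Hypothetical input where two codes are glued together.
text = "F56F23 and G87"

# \s+ requires whitespace between codes; "F56F23" then fails the
# trailing \b (no word boundary between "6" and "F"), so it is skipped.
with_plus = re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b', text)

# \s* also accepts codes with nothing between them.
with_star = re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b', text)

print(with_plus)  # ['G87']
print(with_star)  # ['F56F23', 'G87']
```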
print(list(map(lambda x: x[0].strip(), re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string))))
Change your regex to r'((\b(F\d{2}|G\d{2})\b\s*)+)': parentheses around the whole expression, \s* to join matches that are separated by whitespace, and a + after the last parenthesis to match more than one occurrence (greedy).
Now findall() returns a list of tuples, of which you need element 0 of each. You can use map and lambda for this. To remove the trailing blanks I used strip().

Python replace middle digits with commas thousand separator

I have a string like this:
123456789.123456789-123456789
Before and after the decimal/hyphen there can be any number of digits, what I need to do is remove everything before the decimal including the decimal and remove the hyphen and everything after the hyphen. Then with the middle group of digits (that I need to keep) I need to place a comma thousands separators. 
So here the output would be: 
123,456,789
I can use lookarounds to capture the digits in the middle, but then it won't replace the other digits, and I'm not sure how to place commas using lookarounds.
(?<=\.)\d+(?=-)
Then I figured I could use a capturing group like this, which works, but I'm not sure how to insert the commas:
\d+\.(\d+)-\d+
How could I insert commas using one of the above regexes?
Don't try to insert the thousands separators with a regex; just pick out that middle number and use a function to produce the replacement; re.sub() accepts a function as replacement pattern:
re.sub(r'\d+\.(\d+)-\d+', lambda m: format(int(m.group(1)), ','), inputtext)
The , format for integers when used in the format() function handles formatting a number to one with thousands separators:
>>> import re
>>> inputtext = '123456789.123456789-123456789'
>>> re.sub(r'\d+\.(\d+)-\d+', lambda m: format(int(m.group(1)), ','), inputtext)
'123,456,789'
This will of course still work in a larger body of text containing the number, dot, number, dash, number sequence.
The format() function is closely related to the str.format() method but doesn't require a full string template (so no {} placeholder or field names required).
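As an illustration of the "larger body of text" case, here is a minimal sketch that wraps the substitution in a helper. The function name middle_with_commas and the sample sentence are made up for this example. One thing to keep in mind: int() drops leading zeros, so a middle group such as 0012 would come out as 12.

```python
import re

def middle_with_commas(text):
    """Replace each digits.digits-digits run with its middle group,
    formatted with thousands separators (hypothetical helper)."""
    return re.sub(
        r'\d+\.(\d+)-\d+',
        lambda m: format(int(m.group(1)), ','),
        text,
    )

result = middle_with_commas("id 1.1234567-9 and 22.1000-3")
print(result)  # id 1,234,567 and 1,000
```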
You've asked for a full regular expression here, but it would probably be easier to split your string:
>>> import re
>>> s = '123456789.123456789-123456789'
>>> '{:,}'.format(int(re.split('[.-]', s)[1]))
'123,456,789'
If you prefer using regular expression, use a function call or lambda in the replacement:
>>> import re
>>> s = '123456789.123456789-123456789'
>>> re.sub(r'\d+\.(\d+)-\d+', lambda m: '{:,}'.format(int(m.group(1))), s)
'123,456,789'
You can take a look at the different format specifications.

extracting multiple instances regex python

I have a string:
This is #lame
Here I want to extract lame. But here is the issue, the above string can be
This is lame
Here I don't extract anything. And then this string can be:
This is #lame but that is #not
Here I extract lame and not.
So, output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in robust way in python?
Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches one or more word characters (the \w character class) only if those characters are preceded by a # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().
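An alternative to the lookbehind, sketched here under the same assumptions, is to match the # and capture only the word after it. When a pattern has exactly one capture group, findall() returns the group contents rather than the whole match. The name hashtag is just illustrative.

```python
import re

# One capture group: findall() returns only the text inside the group.
hashtag = re.compile(r'#(\w+)')

samples = [
    "This is #lame",
    "This is lame",
    "This is #lame but that is #not",
]
results = [hashtag.findall(line) for line in samples]

for found in results:
    print(found)
# ['lame']
# []
# ['lame', 'not']
```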
You should use the re library; here is an example:
import re
test_case = "This is #lame but that is #not"
regular = re.compile(r"#\w*")
lst = regular.findall(test_case)
# lst == ['#lame', '#not'] (each match includes the leading '#')
This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print(regex.findall('This is #lame'))
print(regex.findall('This is lame'))
print(regex.findall('This is #lame but that is #not'))

Look for multiple occurrences of a character using regex

I am using the pattern pat='dd|dddd', and I thought it would match either dd or dddd.
import re
re.search(pat,'ddd')
re.search(pat,'ddddd')
Any number of d's matches, for that matter. Why is that?
You'll need to anchor the regular expression somehow. A regular expression searches within strings to find a pattern. So "dd" will be found in "dddddddd" at offset 0,1,2,3,4,5,6.
If you want to match only entire strings try ^dd$. ^ matches the beginning of a string, $ matches the end. So ^(dd|dddd)$ will have the behavior you want.
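If you want it to match only dd or dddd but within a string, then you might want to use [^d](dd|dddd)[^d], which will match "anything that isn't d", then either two or four d's, then "anything that isn't d". Note that this requires an actual non-d character on each side, so it will not match at the very start or end of the string.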
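Both cases can be sketched in Python as follows. This is one possible approach rather than the only one: re.fullmatch() plays the role of the ^...$ anchors, and negative lookarounds stand in for [^d] so that runs at the start or end of the string are also found. The names whole and inside are only illustrative.

```python
import re

# Whole-string check: fullmatch() succeeds only if the entire string matches.
whole = re.compile(r'dd|dddd')
whole_results = {s: bool(whole.fullmatch(s)) for s in ["dd", "ddd", "dddd", "ddddd"]}

# Inside a longer string: (?<!d) and (?!d) assert "no d before/after"
# without consuming a character, so string boundaries are fine too.
inside = re.compile(r'(?<!d)(?:dd|dddd)(?!d)')
inside_results = inside.findall("dd xddx dddd ddd")

print(whole_results)   # {'dd': True, 'ddd': False, 'dddd': True, 'ddddd': False}
print(inside_results)  # ['dd', 'dd', 'dddd']
```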
As already pointed out by Charles Duffy, search isn't really the function that you should be using. Try using match or even findall.
>>> import re
>>> re.match('dd|dddd','dd').group()
'dd'
>>> re.findall('dd|dddd','dd')
['dd']
>>> re.match('dd|dddd','ddddd').group()
'dd'
>>> re.match('dddd|dd','ddddd').group()
'dddd'
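The last two results differ because Python tries alternatives from left to right and stops at the first one that lets the overall match succeed; it does not look for the longest alternative. A short sketch of that behaviour, with a greedy optional group shown as one way to prefer the longer match regardless of ordering (variable names are illustrative):

```python
import re

# Alternation is ordered: the first alternative that works wins.
short_first = re.match(r'dd|dddd', 'ddddd').group()
long_first = re.match(r'dddd|dd', 'ddddd').group()

# A greedy optional group takes the extra "dd" whenever it is available.
greedy = re.match(r'd{2}(?:d{2})?', 'ddddd').group()
greedy_short = re.match(r'd{2}(?:d{2})?', 'ddd').group()

print(short_first)   # dd
print(long_first)    # dddd
print(greedy)        # dddd
print(greedy_short)  # dd
```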
