Modify string with `re.sub` [duplicate] - python

Suppose a string:
s = 'F3·Compute·Introduction to Methematical Thinking.pdf'
I tried to replace F3·Compute· with '' using a regex:
In [23]: re.sub(r'F3?Compute?', '',s)
Out[23]: 'F3·Compute·Introduction to Methematical Thinking.pdf'
It did not work as I intended.
When I use the literal characters instead, it works:
In [21]: re.sub(r'F3·Compute·', '', 'F3·Compute·Introduction to Methematical Thinking.pdf')
Out[21]: 'Introduction to Methematical Thinking.pdf'
What's the problem with my regex pattern?

The question mark ? does not stand in for a single character in regular expressions. It means zero or one of the preceding character, which in your case were 3 and e. What you're looking for is the dot: . is a wildcard that stands for any single character (it has nothing to do with your middle-dot character; that is just a coincidence).
re.sub(r'F3.Compute.', '',s)
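To make the difference concrete, here is a minimal sketch, assuming Python 3 (where str is Unicode). The variable names are only illustrative; the string and patterns are the ones from the question and the answer above.

```python
import re

s = 'F3·Compute·Introduction to Methematical Thinking.pdf'

# '?' is a quantifier: '3?' means "zero or one 3", so nothing covers the '·'.
with_question_mark = re.sub(r'F3?Compute?', '', s)

# '.' matches any single character (except a newline by default).
with_dot = re.sub(r'F3.Compute.', '', s)

# Matching the separator literally is stricter than using the wildcard.
with_literal = re.sub(r'F3·Compute·', '', s)

print(with_question_mark == s)  # True: the string is unchanged
print(with_dot)
print(with_literal)
```

If the separator is always the middle dot, the literal version is arguably the safer choice, since the wildcard would also accept something like F3xComputey.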

Use a dot to match any single character (the code below is Python 2):
#coding: utf-8
import re
s = 'F3·Compute·Introduction to Methematical Thinking.pdf'
output = re.sub(r'F3.Compute.', '', unicode(s,"utf-8"), flags=re.U)
print output
Your original pattern, 'F3?Compute?', did not have the desired effect. It says to match F followed by an optional 3, and it also makes the final e of Compute optional. In neither place does it account for the separator characters.
Note also that in Python 2 we must match against the unicode version of the string, not the byte string. The middle dot is two bytes in UTF-8, so on a byte string a single dot matches only the first of them and the pattern fails. In Python 3, str is already Unicode and no conversion is needed.
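The byte-versus-character point can be reproduced in Python 3 by encoding the string by hand. This is only a sketch of the effect, not the code from the answer; the names raw, bytes_result and text_result are made up for illustration.

```python
import re

s = 'F3·Compute·Introduction to Methematical Thinking.pdf'
raw = s.encode('utf-8')  # '·' (U+00B7) becomes the two bytes C2 B7

# On bytes, '.' matches exactly one byte, so it cannot cover the
# two-byte separator and the pattern does not match at all.
bytes_result = re.sub(rb'F3.Compute.', b'', raw)

# On str, '.' matches one code point, so the same pattern works.
text_result = re.sub(r'F3.Compute.', '', s)

print(bytes_result == raw)  # True: nothing was replaced
print(text_result)
```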


python re string of length(3-4) max [duplicate]

I need to extract from a text any run of exactly 3 or 4 consecutive digits, no longer. Here's an example:
text = 'abc 123\n ab3245ss a234234234234\n 12'
I'm trying this:
re.findall(r'\d{3,4}', text)
What I'm expecting:
['123', '3245']
What I'm getting:
['123', '3245', '2342', '3423', '4234']
When using a positive lookahead (?=\D) or lookbehind (?<=\D), there has to be a character present at that position.
If you also want to match a string that consists only of 123, for example, assert instead that what is on the left and on the right is not a digit, using a negative lookbehind and lookahead:
(?<!\d)\d{3,4}(?!\d)
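A short sketch of the negative-lookaround pattern applied to the text from the question, plus the bare string '123' to show why the positive form is not enough. The variable names are illustrative only.

```python
import re

text = 'abc 123\n ab3245ss a234234234234\n 12'
pattern = re.compile(r'(?<!\d)\d{3,4}(?!\d)')

from_text = pattern.findall(text)
bare_number = pattern.findall('123')

# Positive lookarounds require a neighbouring non-digit character,
# so a number at the very start or end of the string is missed.
bare_positive = re.findall(r'(?<=\D)\d{3,4}(?=\D)', '123')

print(from_text)      # ['123', '3245']
print(bare_number)    # ['123']
print(bare_positive)  # []
```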
text = 'abc 123\n ab3245ss a234234234234\n 12'
re.findall(r'(?<=\D)(\d{3,4})(?=\D)',text)
As a number of answers are suggesting, you need to explicitly check that the characters before and after your string are not numbers.
But additionally, there might not be any character before or after the number. Let's handle that as well.
re.findall(r"(?:^|[^\d])(\d{3,4})(?:$|[^\d])", text)
#            ↑↑↑↑↑↑↑↑↑↑↑         ↑↑↑↑↑↑↑↑↑↑↑
#            handles the         handles the
#            leading character   trailing character
Please try the regex below. It matches numbers that are surrounded by non-digits.
(?<=\D)(\d{3,4})(?=\D)
Edit: Use re.findall(r"(?:\D|^)(\d{3,4})(?:\D|$)",text) to also match numbers that occur at the start or end of the string.
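One caveat with the patterns that consume the neighbouring character (this is an observation of mine, not something the answers state): when two numbers are separated by a single character, that character is used up by the first match and is no longer available as the leading character of the second. Lookarounds do not consume anything, so they avoid this. A small sketch with a made-up input:

```python
import re

text = '123 4567 89012'  # hypothetical input: two valid numbers, one too long

# The trailing (?:\D|$) consumes the space after '123', so '4567'
# has no leading non-digit left to match against.
consuming = re.findall(r'(?:\D|^)(\d{3,4})(?:\D|$)', text)

# Lookarounds only assert, they do not consume.
lookaround = re.findall(r'(?<!\d)\d{3,4}(?!\d)', text)

print(consuming)   # ['123']
print(lookaround)  # ['123', '4567']
```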

What is the difference between these regular expressions: '^From .*#([^ ]*)' & '^From .*#(\S+)'? [duplicate]

I am learning regex in Python. At one stage I came up with the first regex in the title, while my tutorial uses the second. Both produce the same result for the given string. What are the differences? For what input string would the two produce different results?
>>> f = 'From m.rubayet94#gmail.com sat Jan'
>>> y = re.findall('^From .*#(\S+)',f); print(y)
['gmail.com']
>>> y = re.findall('^From .*#([^ ]*)',f); print(y)
['gmail.com']
[^ ]* means zero or more non-space characters.
\S+ means one or more non-whitespace characters.
It looks like you're aiming to match a domain name which may be part of an email address, so the second regex (\S+) is the better choice of the two: domain names can't contain any whitespace, which includes tabs \t and newlines \n and not just spaces. (There are other characters domain names can't contain either, but that's beside the point.)
Here are some examples of the differences:
import re
p1 = re.compile(r'^From .*#([^ ]*)')
p2 = re.compile(r'^From .*#(\S+)')
for s in ['From eric#domain\nTo john#domain', 'From graham#']:
    print(p1.findall(s), p2.findall(s))
In the first case, whitespace isn't handled properly: ['domain\nTo'] ['domain']
In the second case, you get a null match where you shouldn't: [''] []
One of the regexes uses [^ ]* while the other uses \S+. I assume that at that point you're trying to match anything but whitespace.
The difference is that \S matches any character that is not whitespace (for ASCII text the whitespace characters are [ \t\n\r\f\v]), whereas [^ ] matches any character that is not a literal space (i.e. the character produced by pressing the spacebar), so it still matches tabs and newlines.
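As a quick illustration of the tab case, here is a sketch with a made-up, tab-separated line (the address and variable names are hypothetical):

```python
import re

line = 'From alice#example.org\tsat Jan'  # hypothetical input with a tab

# [^ ] only excludes the space character, so the tab is swallowed.
not_space = re.findall(r'^From .*#([^ ]*)', line)

# \S excludes every whitespace character, so the match stops at the tab.
not_whitespace = re.findall(r'^From .*#(\S+)', line)

print(not_space)       # ['example.org\tsat']
print(not_whitespace)  # ['example.org']
```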

Match repeated patterns in python [duplicate]

I am trying to find all substrings that follow a specific pattern in a Python string.
"\[\[Cats:.*\]\]"
However, if several occurrences of the pattern appear on the same line, it treats them as a single match instead of matching each one separately.
strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'
x = re.findall("\[\[Cats:.*\]\]", strng)
The output gotten is:
['[[Cats: Text1]] said I am in the [[Cats: Text2]]']
instead of
['[[Cats: Text1]]', '[[Cats: Text2]]']
that is, a list with one entry per match.
What regex do I use?
"\[\[Cats:.*?\]\]"
Your current regex is greedy: it's gobbling up EVERYTHING from the first opening bracket to the last closing bracket. Making it non-greedy should return all of your results.
The problem is that you are doing a greedy search. Add a ? after .* to make the match non-greedy.
The code follows:
import re
strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'
regex_template = re.compile(r'\[\[Cats:.*?\]\]')
matches = re.findall(regex_template, strng)
print(matches)
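Don't use .* here, because it is greedy and will run past the first closing ]] to the last one on the line. .* means any character, zero or more times, as many as possible. A negated character class such as [^\]]+ cannot cross a ] at all: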
import re
strng = '''[[Cats: lol, this is 100 % cringe]]
said I am in the [[Cats: lol, this is 100 % cringe]]
fhg is abnorn'''
x = re.findall(r"\[\[Cats: [^\]]+\]\]", strng)
print(x)
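The three approaches can be compared side by side. The sketch below uses the string from the question plus one made-up input (tricky) to show where the non-greedy and negated-class versions differ: the negated class rejects content that contains a single ], while the non-greedy version accepts it. Which behaviour is right depends on your data.

```python
import re

strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'

greedy = re.findall(r'\[\[Cats:.*\]\]', strng)       # one long match
lazy = re.findall(r'\[\[Cats:.*?\]\]', strng)        # two matches
negated = re.findall(r'\[\[Cats:[^\]]*\]\]', strng)  # two matches

# Hypothetical input whose first tag contains a lone ']'.
tricky = '[[Cats: a]b]] x [[Cats: c]]'
lazy_tricky = re.findall(r'\[\[Cats:.*?\]\]', tricky)
negated_tricky = re.findall(r'\[\[Cats:[^\]]*\]\]', tricky)

print(greedy)
print(lazy)
print(negated)
print(lazy_tricky)     # ['[[Cats: a]b]]', '[[Cats: c]]']
print(negated_tricky)  # ['[[Cats: c]]']
```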

Python re not extracting matched text based on regex while using boundary [duplicate]

I am extracting text with a regex. The pattern matches the required strings when I test it, but when I use Python's re module to extract the matched text, nothing is extracted.
Here is the code I am using:
import re
# PRICE is a normal (non-raw) string literal, as in the original question
PRICE = '\b(price|rs)?\s*(\d+[\s\d.]*\s*?(pkg|k|m|(?:la(?:c|kh|k)|crore|cr)s?|l)\b\.?)'
content = '''This should matchprice 5.6 lacincluding price(i.e price
5.6 lac) and rs 56 m. including rs (i.e rs 56 k rs 56 m) .
It will match normally if there is no price or rs written for example
or 56 k or 8.8 crs. are correct matching.
It should not match5.6 lac (Should not match eitherrs 6 lac asas
there is no spaces before 5.6'''
# pat.FLAG is not defined in the snippet shown
for m in re.finditer(PRICE, content, pat.FLAG):
    matched = m.group().strip()
    print("In matched " + matched)
The code above never enters the for loop. Any leads are highly appreciated. Thanks.
Use raw strings to define regexes:
PRICE = r'\b(price|rs)?\s*(\d+[\s\d.]*\s*?(pkg|k|m|(?:la(?:c|kh|k)|crore|cr)s?|l)\b\.?)'
Otherwise \b is interpreted as backspace:
>>> print '\b(price|rs)?\s*(\d+[\s\d.]*\s*?(pkg|k|m|(?:la(?:c|kh|k)|crore|cr)s?|l)\b\.?)'
(price|rs)?\s*(\d+[\s\d.]*\s*?(pkg|k|m|(?:la(?:c|kh|k)|crore|cr)s?|l\.?)
>>> print r'\b(price|rs)?\s*(\d+[\s\d.]*\s*?(pkg|k|m|(?:la(?:c|kh|k)|crore|cr)s?|l)\b\.?)'
\b(price|rs)?\s*(\d+[\s\d.]*\s*?(pkg|k|m|(?:la(?:c|kh|k)|crore|cr)s?|l)\b\.?)
Note how the first print output does not contain the initial \b. Keep in mind that the string literal is first interpreted by the Python compiler, which handles all the usual escapes such as \n for newline, \b for backspace, or \x42 for B. The resulting string is then passed to the re module, which interprets its own escapes. Hence in nearly all cases you want to stop the compiler from interpreting the escapes, and raw strings do just that.
The regex101 site assumes you are using raw string literals.
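A stripped-down sketch of the same effect, using a short made-up pattern instead of the full PRICE regex:

```python
import re

plain = '\bcat\b'   # '\b' here is the backspace character (0x08)
raw = r'\bcat\b'    # backslash + 'b', which re reads as a word boundary

plain_matches = re.findall(plain, 'a cat sat')
raw_matches = re.findall(raw, 'a cat sat')

print(len(plain), len(raw))  # 5 7
print(plain_matches)         # []
print(raw_matches)           # ['cat']
```

The non-raw pattern is looking for a literal backspace on either side of cat, which the text does not contain, so nothing matches.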

How to use a special sequence in substitute pattern in re.sub [duplicate]

I have a string, in which I'd like to substitute each [ by [[] and ] by []] (at the same time). I thought about doing it with re.sub:
re.sub(r'(\[|\])', '[\1]', 'asdfas[adsfasd]')
Out: 'asdfas[\x01]adsfasd[\x01]'
But I'm not getting the desired result -- how do I make re.sub treat \1 in the replacement as a reference to the first captured group?
You should use the r prefix for the replacement string as well; otherwise \1 is interpreted by Python as an octal escape (the character \x01) before re.sub ever sees it:
In [125]: re.sub(r'(\[|\])', r'[\1]', 'asdfas[adsfasd]')
Out[125]: 'asdfas[[]adsfasd[]]'
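For completeness, a sketch of a few equivalent ways to write the replacement, alongside the non-raw version from the question. The variable names are illustrative; \g<1> is the explicit group-reference syntax that re.sub accepts in replacement strings.

```python
import re

text = 'asdfas[adsfasd]'
pattern = r'(\[|\])'

raw_backref = re.sub(pattern, r'[\1]', text)      # raw replacement string
group_syntax = re.sub(pattern, r'[\g<1>]', text)  # explicit \g<1> form
escaped = re.sub(pattern, '[\\1]', text)          # doubled backslash, no r prefix
no_raw = re.sub(pattern, '[\1]', text)            # '\1' is already chr(1) here

print(raw_backref)   # asdfas[[]adsfasd[]]
print(group_syntax)
print(escaped)
print(repr(no_raw))  # 'asdfas[\x01]adsfasd[\x01]'
```

The \g<1> form is handy when the group reference is followed by a digit, where \1 would be ambiguous.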
