Match repeated patterns in python [duplicate]

This question already has answers here:
Python non-greedy regexes
(7 answers)
Closed 3 years ago.
I am trying to find all substrings that follow a specific pattern in a Python string.
"\[\[Cats:.*\]\]"
However, if several occurrences of the pattern appear on the same line of the string, it treats them as a single match instead of matching each one separately.
strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'
x = re.findall("\[\[Cats:.*\]\]", strng)
The output I get is:
['[[Cats: Text1]] said I am in the [[Cats: Text2]]']
instead of
['[[Cats: Text1]]', '[[Cats: Text2]]']
which is a list of the separate matches.
What regex do I use?

"\[\[Cats:.*?\]\]"
Your current regex is greedy: it's gobbling up EVERYTHING from the first opening bracket to the last closing bracket. Making it non-greedy should return all of your results.

The problem is that you are doing a greedy search; add a ? after .* to get a non-greedy match.
The code follows:
import re
strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'
regex_template = re.compile(r'\[\[Cats:.*?\]\]')
matches = re.findall(regex_template, strng)
print(matches)

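Don't use .* here, because it is greedy: . matches any character (except a newline) and * allows any number of repetitions, including zero, so the match runs on to the last ]] on the line. A negated character class stops at the first ]: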
import re
strng = '''[[Cats: lol, this is 100 % cringe]]
said I am in the [[Cats: lol, this is 100 % cringe]]
fhg is abnorn'''
x = re.findall(r"\[\[Cats: [^\]]+\]\]", strng)
print(x)
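To see how the three approaches from the answers above behave side by side, here is a minimal sketch. The first string is the one from the question; the `multiline` string is a made-up input added only to show where the lazy and negated-class versions differ.

```python
import re

text = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'

# Greedy: .* runs to the LAST "]]" on the line, so the two tags merge.
greedy = re.findall(r'\[\[Cats:.*\]\]', text)

# Lazy: .*? stops at the FIRST "]]" that lets the match succeed.
lazy = re.findall(r'\[\[Cats:.*?\]\]', text)

# Negated class: [^\]]* can never step over a "]".
negated = re.findall(r'\[\[Cats:[^\]]*\]\]', text)

print(greedy)
print(lazy)
print(negated)

# Hypothetical tag that spans two lines.
multiline = '[[Cats: first\nsecond]]'
lazy_multiline = re.findall(r'\[\[Cats:.*?\]\]', multiline)        # "." skips "\n" by default
negated_multiline = re.findall(r'\[\[Cats:[^\]]*\]\]', multiline)  # "[^\]]" accepts "\n"
print(lazy_multiline)
print(negated_multiline)
```

The lazy version would need re.DOTALL to cross a newline, while the negated class would fail on a tag whose text contains a single ] character, so which one fits depends on the input you expect.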

Related

python finditer get start end of capture group [duplicate]

This question already has answers here:
Get start location of capturing group within regex pattern
(3 answers)
Python Regex - How to Get Positions and Values of Matches
(4 answers)
Closed 1 year ago.
I am trying to capture the start and end of a capture group for each group found using the finditer() method in re.
For example:
strng = 'move 12345-!'
matches = re.finditer('move ([0-9]+).*?', strng)
for each in matches:
    print(*each.groups())
    print(each.start(), each.end())
This will yield the start and end index position, but of the matched pattern and not specifically the captured group. I essentially want to always capture the number as this will change. The word move will always be an anchor, but I don't want to include that in the position, as I need to capture the actual position of the numbers found within the text document so that I can do slicing for each number found.
Full document might be like:
move 12345-!
move 57496-!
move 96038-!
move 00528-!
And I would capture 57496 as document[18:23], where the number starts at index 18 and ends at index 23 (assuming the lines are separated by a single newline character). The underlying positions are being used to train a model.

If you don't want move to be part of the match, you can turn it into a positive lookbehind to assert it to the left.
Then you can use each.group() to get the match.
Note that you can omit .*? at the end of the pattern, as it is a non greedy quantifier without anything after that part and will not match any characters.
import re
strng = 'move 12345-!'
matches = re.finditer('(?<=move )[0-9]+', strng)
for each in matches:
    print(each.group())
    print(each.start(), each.end())
Output
12345
5 10

>>> import re
>>> strng = "move 12345-!"
>>> matches = re.finditer('move ([0-9]+).*?', strng)
>>> for each in matches:
...     print(each.group(1))
...     print(each.start(1), each.end(1))
...
12345
5 10
>>>
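Applied to the full document from the question, the group-index approach might look like the sketch below. It assumes the four lines are joined by a single "\n" (the question does not say which line ending is used), and the names `document` and `spans` are only illustrative.

```python
import re

# Assumed document: the four lines from the question joined by "\n".
document = "move 12345-!\nmove 57496-!\nmove 96038-!\nmove 00528-!"

spans = []
for m in re.finditer(r'move ([0-9]+)', document):
    start, end = m.span(1)  # span of capture group 1 only, not of the whole match
    spans.append((start, end))
    # Slicing the document with the span gives back the captured number.
    assert document[start:end] == m.group(1)

print(spans)
```

m.span(1) is shorthand for (m.start(1), m.end(1)), so the slices can be stored directly as training positions.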

Regex group doesn't capture all of matched part of string [duplicate]

This question already has an answer here:
Why Does a Repeated Capture Group Return these Strings?
(1 answer)
Closed 1 year ago.
I have the following regex: '(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$'.
Given the string "/foo/bar/baz", I expect the first captured group to be "/foo/bar". However, I get the following:
>>> import re
>>> regex = re.compile(r'(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$')
>>> match = regex.match('/foo/bar/baz')
>>> match.group(1)
'/bar'
Why isn't the whole expected group being captured?
Edit: It's worth mentioning that the strings I'm trying to match are parts of URLs. To give you an idea, it's the part of the URL that would be returned from window.location.pathname in javascript, only without file extensions.

This will capture multiple repeated groups:
(/[a-zA-Z]+)*
However, as already discussed in another thread, quoting from @ByteCommander:
If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
That is why you only see the last match, "/bar". What you can do instead is take advantage of the greedy matching of .* up to the last / via the pattern (/.*)/
regex = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
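If you would rather keep the stricter [a-zA-Z]+ segments than switch to .*, another option (not from the answers here, just a common idiom) is to make the repeated group non-capturing and wrap the whole repetition in one capturing group. A minimal sketch:

```python
import re

path = '/foo/bar/baz'

# Repeated capturing group: group 1 is overwritten on every repetition,
# so only the last repetition that took part in the match is kept.
repeated = re.match(r'(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$', path)
print(repeated.group(1))

# Outer capturing group around a repeated non-capturing group:
# group 1 now covers everything the repetition matched.
wrapped = re.match(r'((?:/[a-zA-Z]+)*)/([a-zA-Z]+)\.?$', path)
print(wrapped.group(1), wrapped.group(2))
```

For a path with a single segment such as '/baz', group 1 of the wrapped pattern is an empty string rather than None.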

You don't need the * between the two expressions here; also, move the first / into the brackets:
>>> regex = re.compile('([/a-zA-Z]+)/([a-zA-Z]+)\.?$')
>>> regex.match('/foo/bar/baz').group(1)
'/foo/bar'
>>>

In this case, you may not need a regex.
You can simply use the split function.
text = "/foo/bar/baz"
"/".join(text.split("/", 3)[:3])
output:
/foo/bar
text.split("/", 3) splits your string at up to three occurrences of /, and then you can join the desired elements.
As suggested by Niel, you can use a negative index to extract everything but the last part of a URL (or a path).
In this case the generic approach would be:
text = "/foo/bar/baz/boo/bye"
"/".join(text.split("/", -1)[:-1])
Output:
/foo/bar/baz/boo
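The same "everything but the last part" result can also be had without building a list, using str.rpartition, which splits once at the last separator. This is a small sketch of that alternative, not something the answer above proposed:

```python
text = "/foo/bar/baz/boo/bye"

# rpartition splits at the LAST "/" and returns (before, separator, after).
head, sep, tail = text.rpartition("/")

print(head)
print(tail)
```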

Limiting phone numbers, regex starts with only a character "+" [duplicate]

This question already has answers here:
Checking whole string with a regex
(5 answers)
Closed 2 years ago.
I'm trying to limit an input of phone numbers to:
1-16 digits OR
A single "+" followed by 1-16 digits.
This is my code
txt = "+++854"
x = re.search(str("([\+]{0,1}[0-9]{3,16})|([0-9]{3,16})"), txt)
###^\+[0-9]{1,16}|[0-9]{1,16}", txt) #startswith +, followed by numbers.
if x:
    print("YES! We have a match!")
else:
    print("No match")
# That's a match
Yet it yields a match. I also tried "^+{0,1}[0-9]{1,16}|[0-9]{1,16}", but although it works at "https://regex101.com/r/aP0qH2/4", it doesn't work in my code the way I think it should.

re.search searches for "the first location where the regular expression pattern produces a match" and returns the resulting match object. In the string "+++854", the substring "+854" matches.
To require the match to begin at the start of the string, use re.match; to require the whole string to match, use re.fullmatch. The documentation has a section about the difference between re.match and re.search.
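The difference between the matching functions can be checked directly on the string from the question. This sketch keeps the question's pattern exactly as written and only varies the function that is called:

```python
import re

# The pattern from the question, kept as written.
pattern = r"([\+]{0,1}[0-9]{3,16})|([0-9]{3,16})"
txt = "+++854"

found = re.search(pattern, txt)     # scans forward for the first position that matches
at_start = re.match(pattern, txt)   # only tries position 0
whole = re.fullmatch(pattern, txt)  # starts at position 0 and must consume the entire string

print(found.group(0) if found else None)
print(at_start)
print(whole)
```

Note that re.match alone would still accept a string such as "854abc", because it does not require the match to reach the end of the string.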

pattern = r"\+?[0-9]{1,16}"
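Put together as a validator for the rule stated in the question (an optional single "+" followed by 1 to 16 digits), a minimal sketch could look like this. The names `PHONE_RE` and `is_valid_phone` are made up for illustration.

```python
import re

# Optional single "+", then 1 to 16 ASCII digits.
PHONE_RE = re.compile(r"\+?[0-9]{1,16}")

def is_valid_phone(text):
    """Return True only if the whole string fits the pattern."""
    return PHONE_RE.fullmatch(text) is not None

for candidate in ["+++854", "+854", "854", "", "1" * 17]:
    print(repr(candidate), is_valid_phone(candidate))
```

Using fullmatch instead of ^...$ also rejects input with a trailing newline, which $ would otherwise let through.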

What are the differences between these regular expressions: '^From .*#([^ ]*)' & '^From .*#(\S+)'? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I am learning regex in Python. At one stage I wrote the first regex, while my tutorial gives the second. Both produce the same result for the given string. What are the differences? For what string would these two patterns produce different results?
>>> f = 'From m.rubayet94#gmail.com sat Jan'
>>> y = re.findall('^From .*#(\S+)',f); print(y)
['gmail.com']
>>> y = re.findall('^From .*#([^ ]*)',f); print(y)
['gmail.com']

[^ ]* means zero or more non-space characters.
\S+ means one or more non-whitespace characters.
It looks like you're aiming to match a domain name which may be part of an email address, so the second regex is the better choice of the two: domain names can't contain any whitespace, including tabs \t and newlines \n, not just spaces. (Domain names can't contain some other characters either, but that's beside the point.)
Here are some examples of the differences:
import re
p1 = re.compile(r'^From .*#([^ ]*)')
p2 = re.compile(r'^From .*#(\S+)')
for s in ['From eric#domain\nTo john#domain', 'From graham#']:
    print(p1.findall(s), p2.findall(s))
In the first case, whitespace isn't handled properly: ['domain\nTo'] ['domain']
In the second case, you get a null match where you shouldn't: [''] []

One of the regexes uses [^ ]* while the other uses \S+. I assume that at that point you're trying to match anything but whitespace.
The difference between the two is that \S matches any character that is not whitespace (for ASCII, whitespace is [ \t\n\r\f\v]), while [^ ] matches any character that is not a space (the character produced by pressing the spacebar), so it still accepts tabs and newlines.
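A quick way to see the two character classes disagree is to run both over a string that mixes a tab, a space and a newline. The `sample` string below is invented for the purpose:

```python
import re

sample = "a\tb c\nd"

not_space = re.findall(r'[^ ]+', sample)     # breaks only at the space character
not_whitespace = re.findall(r'\S+', sample)  # breaks at the tab, the space and the newline

print(not_space)
print(not_whitespace)
```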

python regex: get end digits from a string

I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.

You can use re.match to capture only the trailing digits:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes as little as possible (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of matching digits. Note that the latter also matches digits in other writing systems, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means one or more repetitions (at least one digit at the end).
$ matches only the end of the input.
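The note about \d and other writing systems can be checked with a short sketch. It assumes Python 3 str patterns (where \d is Unicode-aware by default) and uses a made-up string ending in the first example digit from the list above.

```python
import re

# "\u0b6a" is ORIYA DIGIT FOUR, the first example character mentioned above.
s = "item-\u0b6a\u0b6a"

unicode_digits = re.search(r'\d+$', s)          # \d is Unicode-aware for str patterns
ascii_class = re.search(r'[0-9]+$', s)          # only the ten ASCII digits
ascii_flag = re.search(r'\d+$', s, re.ASCII)    # re.ASCII restricts \d to [0-9]

print(unicode_digits is not None)
print(ascii_class is not None)
print(ascii_flag is not None)
```

If the extracted text is later passed to code that expects ASCII digits only, [0-9] or the re.ASCII flag is the safer choice.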

Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print(re.findall('^.*-([0-9]+)$', s))
Output:
['767980716']
Regex explanation:
^ # Match the start of the string
.* # Followed by anything
- # Up to the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Up to the end of the string
Or, more simply, just match the digits at the end of the string: '([0-9]+)$'

Your Regex should be (\d+)$.
\d+ matches one or more digits.
$ matches at the end of the string.
So your code should be:
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.

Use the regex below:
\d+$
$ denotes the end of the string.
\d is a digit.
+ matches the preceding character one or more times.

Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'

I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
# Run the match once and reuse the result
m = re.match('.*?([0-9]+)$', W)
if m is None:
    last_digits = "None"
else:
    last_digits = m.group(1)
print("Last digits of " + W + " are " + last_digits)

Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.
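Wrapped in a function, the end-anchored search might look like the sketch below, which also covers the no-digits case raised in the previous answer. The function name `last_digits` and the choice to return None are assumptions for illustration; [0-9] is used instead of \d so that only ASCII digits count.

```python
import re
from typing import Optional

def last_digits(text: str) -> Optional[str]:
    """Return the run of ASCII digits at the end of text, or None if there is none."""
    m = re.search(r'[0-9]+$', text)
    return m.group(0) if m else None

print(last_digits("99-my-name-is-John-Smith-6376827-%^-1-2-767980716"))
print(last_digits("no digits at the end"))
```

re.search scans for the anchored run directly, so no leading .*? is needed.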
