Using regex assertion in python

I am experimenting with regex. I have read up on assertions a bit and seen examples, but for some reason I cannot get this to work. I am trying to get the word that follows a given pattern using a look-behind.
import re
s = '123abc456someword 0001abde19999anotherword'
re.findall(r'(?<=\d+[a-z]+\d+)[a-z]+', s, re.I)
The result should be someword and anotherword,
but I get the error: look-behind requires fixed-width pattern
Any help appreciated.

Python's re module only supports fixed-width patterns inside look-behind assertions. If you want to experiment with variable-length look-behinds, use the alternative third-party regex module:
>>> import regex
>>> s = '123abc456someword 0001abde19999anotherword'
>>> regex.findall(r'(?i)(?<=\d+[a-z]+\d+)[a-z]+', s)
['someword', 'anotherword']
Or simply avoid the look-behind altogether and use a capturing group ( ):
>>> import re
>>> s = '123abc456someword 0001abde19999anotherword'
>>> re.findall(r'\d+[a-z]+\d+([a-z]+)', s, re.I)
['someword', 'anotherword']
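To see both behaviours side by side, here is a minimal sketch that assumes only the standard-library re module. It records the compile error the question ran into, then falls back to the capturing-group pattern from the answer above. The variable names are illustrative.

```python
import re

s = '123abc456someword 0001abde19999anotherword'

# A variable-width look-behind is rejected by the standard re module
# when the pattern is compiled, before any matching happens.
lookbehind_error = None
try:
    re.compile(r'(?<=\d+[a-z]+\d+)[a-z]+', re.I)
except re.error as exc:
    lookbehind_error = str(exc)

# The capturing group consumes the prefix but returns only the word.
words = re.findall(r'\d+[a-z]+\d+([a-z]+)', s, re.I)

print(lookbehind_error)
print(words)
```

Because findall returns only the group when a pattern has exactly one capturing group, the digits-letters-digits prefix never shows up in the result.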

Convert the prefix to a non-capturing group and read the wanted word from group 1.
(?:\d+\w+\d+)(\w+\b)
If you are only interested in [a-z], change \w to [a-z] in the pattern above. The \b asserts that the match ends at a word boundary.
Sample code:
import re
p = re.compile(r'(?:\d+\w+\d+)(\w+\b)', re.IGNORECASE)
test_str = "123abc456someword 0001abde19999anotherword"
print(p.findall(test_str))  # ['someword', 'anotherword']

Another easy method uses a lookahead:
>>> import re
>>> s = '123abc456someword 0001abde19999anotherword'
>>> m = re.findall(r'[a-z]+(?= |$)', s, re.I)
>>> m
['someword', 'anotherword']
It matches one or more letters that must be followed by a space or the end of the string.
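One caveat with the lookahead approach is that it never checks the digits-letters-digits prefix. The sketch below uses a hypothetical second input to show this, so treat it as an illustration rather than a drop-in replacement for the look-behind.

```python
import re

# Same pattern as in the answer above.
pattern = re.compile(r'[a-z]+(?= |$)', re.I)

original = pattern.findall('123abc456someword 0001abde19999anotherword')

# Hypothetical input with a plain word in front: it is matched too,
# because the pattern never looks at what precedes the letters.
with_plain_word = pattern.findall('plain 123abc456someword')

print(original)
print(with_plain_word)
```

If the prefix matters for your data, the capturing-group version is the safer choice.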

Related

I want to slice out substrings using regex

import re
str_ = "8983605653Sudanshu452365423256Shinde"
print(re.findall(r"\d{10}\B|[A-Za-z]{8}|\d{12}|[A-Za-z]{6}",str_))
current output
['8983605653', 'Sudanshu', '4523654232', 'Shinde']
Desired output
['8983605653', 'Sudanshu', '452365423256', 'Shinde']

Calling re.findall with the pattern \d+|\D+ should work here:
import re
str_ = "8983605653Sudanshu452365423256Shinde"
matches = re.findall(r'\d+|\D+', str_)
print(matches) # ['8983605653', 'Sudanshu', '452365423256', 'Shinde']
The pattern alternately matches runs of digits and runs of non-digits.
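Note that \D matches any non-digit, including spaces and punctuation. A small sketch with a hypothetical input containing a space shows the difference from a letters-only alternative:

```python
import re

# Hypothetical input with a space between the letters and the second number.
text = "12ab 34"

# \D+ keeps the space as part of the non-digit run.
with_non_digits = re.findall(r'\d+|\D+', text)

# [A-Za-z]+ restricts the second alternative to ASCII letters only.
letters_only = re.findall(r'\d+|[A-Za-z]+', text)

print(with_non_digits)
print(letters_only)
```

For the string in the question both patterns give the same result, since it contains only digits and letters.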

Instead of an alternation |, you can match the whole string with capture groups and then print the group values.
import re
str_ = "8983605653Sudanshu452365423256Shinde"
m = re.match(r"(\d{10})([A-Za-z]{8})(\d{12})([A-Za-z]{6})",str_)
if m:
    print(list(m.groups()))
Output
['8983605653', 'Sudanshu', '452365423256', 'Shinde']

Extract string within parentheses - PYTHON

I have a string "Name(something)" and I am trying to extract the portion of the string within the parentheses.
I've tried the following, but I don't seem to be getting the results I'm looking for.
n.split('()')
name, something = n.split('()')

You can use a simple regex to capture everything between the parentheses:
>>> import re
>>> s = 'Name(something)'
>>> re.search(r'\(([^)]+)', s).group(1)
'something'
The regex matches the first "(", then captures everything up to the next ")":
\( matches the character "(" literally
the capturing group ([^)]+) greedily matches one or more characters that are not ")"
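Since re.search returns None when nothing matches, calling .group(1) directly raises AttributeError on strings without parentheses. A minimal sketch of a guarded version follows; the helper name text_in_parentheses is made up for this example.

```python
import re

def text_in_parentheses(s):
    """Return the text after the first "(" up to the next ")", or None."""
    m = re.search(r'\(([^)]+)', s)
    # re.search returns None when there is no match, so guard before .group().
    return m.group(1) if m else None

print(text_in_parentheses('Name(something)'))
print(text_in_parentheses('Name'))
```

Empty parentheses also yield None here, because [^)]+ needs at least one character.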

As an improvement on @Maroun Maroun's answer:
re.findall(r'\(([^)]+)', s)
This finds every string enclosed in parentheses, not just the first one.

You can use split as in your example, but in this way:
val = s.split('(', 1)[1].split(')')[0]
or using regex

You can use re.match:
>>> import re
>>> s = "name(something)"
>>> na, so = re.match(r"(.*)\((.*)\)" ,s).groups()
>>> na, so
('name', 'something')
This matches two (.*) groups, each meaning "anything", where the second one is enclosed in the literal parentheses \( and \).
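The greedy (.*) groups behave well with one pair of parentheses, but they stretch as far as possible when there are several. The sketch below uses a hypothetical input with two pairs to compare the greedy pattern with a lazy (.*?) variant:

```python
import re

# Hypothetical input with two parenthesized parts.
s = "name(first)(second)"

# Greedy: the first group grabs as much as it can, so the last pair wins.
greedy = re.match(r"(.*)\((.*)\)", s).groups()

# Lazy: each group stops as early as possible, so the first pair wins.
lazy = re.match(r"(.*?)\((.*?)\)", s).groups()

print(greedy)
print(lazy)
```

Which one is right depends on whether you want the first or the last parenthesized part.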

You can look for ( and ) (both must be escaped with a backslash in a regex) and capture every character between them with .* in a group.
Example:
import re
s = "name(something)"
regex = r'\((.*)\)'
# re.search is needed here: re.match only matches at the start of the string
text_inside_paranthesis = re.search(regex, s).group(1)
print(text_inside_paranthesis)
Outputs:
something
Without regex you can do the following:
text_inside_paranthesis = s[s.find('(')+1:s.find(')')]
Outputs:
something
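The slicing one-liner assumes both parentheses are present. str.find returns -1 when a character is missing, so the slice silently returns the wrong text instead of failing. A minimal guarded sketch, with a hypothetical helper name between_parens:

```python
def between_parens(s):
    """Return the text between the first "(" and the following ")", or None."""
    start = s.find('(')
    end = s.find(')', start + 1)
    # find() returns -1 when the character is not present.
    if start == -1 or end == -1:
        return None
    return s[start + 1:end]

s = 'name'
# Unguarded slice on a string without parentheses: s[0:-1]
unguarded = s[s.find('(') + 1:s.find(')')]

print(between_parens('name(something)'))
print(between_parens('name'))
print(unguarded)
```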

Python Regex matching already matched sub-string

I'm fairly new to Python Regex and I'm not able to understand the following:
I'm trying to find one small letter surrounded by three capital letters.
My first problem is that the below regex is giving only one match instead of the two matches that are present ['AbAD', 'DaDD']
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[A-Z][a-z][A-Z][A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print(regex.findall(str))
['AbAD']
I guess this is because the last D of the first match is no longer available for matching? Is there any way to turn off this behaviour?
The second issue is the following regex:
>>> import re
>>>
>>> # String
... str = 'AbADaDD'
>>>
>>> pat = '[^A-Z][A-Z][a-z][A-Z][A-Z][^A-Z]'
>>> regex = re.compile(pat)
>>>
>>> print(regex.findall(str))
[]
Basically what I want is that there shouldn't be more than three capital letters surrounding a small letter, and therefore I placed a negative match around them. But ['AbAD'] should be matched, but it is not getting matched. Any ideas?

This is mainly because the matches overlap. Put your regex inside a lookahead in order to handle overlapping matches.
(?=([A-Z][a-z][A-Z][A-Z]))
Code:
>>> s = 'AbADaDD'
>>> re.findall(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)
['AbAD', 'DaDD']
For the second one, use negative lookahead and lookbehind assertions, like below:
(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))
Code:
>>> re.findall(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))', s)
['AbAD']
The problem with your second regex is that [^A-Z] has to consume a character, and there is no character at all before the first A. The negative look-behind (?<![A-Z]) makes a similar check without consuming anything: it only asserts that the match is not preceded by an uppercase letter, which also holds at the start of the string. That is why your pattern finds no match.
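To make the overlap visible, the sketch below uses re.finditer to record where each match starts, with and without the lookahead wrapper. It is only an illustration on the string from the question.

```python
import re

s = 'AbADaDD'

# The lookahead consumes nothing, so the scan advances one position at a
# time and the capture group can report overlapping candidates.
overlapping = [(m.start(), m.group(1))
               for m in re.finditer(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)]

# Without the lookahead, the first match consumes indices 0-3 and the
# scan resumes at index 4, so the D at index 3 cannot start a new match.
consuming = [(m.start(), m.group())
             for m in re.finditer(r'[A-Z][a-z][A-Z][A-Z]', s)]

print(overlapping)
print(consuming)
```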

The problem with your regex is that it consumes the string as it progresses, leaving nothing for the second match. Use a lookahead so that no characters are consumed.
pat = '(?=([A-Z][a-z][A-Z][A-Z]))'
For your second regex, do the same:
print(re.findall(r"(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))", s))
For more insight into why the first attempt fails:
1) After the first match, the remaining string is aDD, because the first part has already been consumed.
2) aDD does not satisfy pat = '[A-Z][a-z][A-Z][A-Z]', so it cannot be part of a match.
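One difference worth knowing: the positive lookahead (?=[^A-Z]) requires that some non-uppercase character actually follows, while the negative lookahead (?![A-Z]) is also satisfied at the end of the string. The sketch below uses a hypothetical input where the candidate sits at the very end:

```python
import re

# Hypothetical input: the four-letter candidate ends the string.
s = 'xxAbAD'

# (?=[^A-Z]) needs a following character, so nothing matches here.
needs_following_char = re.findall(r'(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))', s)

# (?![A-Z]) only forbids a following uppercase letter, so the end of
# the string is acceptable.
allows_end_of_string = re.findall(r'(?=([A-Z][a-z][A-Z][A-Z](?![A-Z])))', s)

print(needs_following_char)
print(allows_end_of_string)
```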

For the first issue,
you could use this pattern:
r'([A-Z]{1}[a-z]{1}[A-Z]{1})'
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'([A-Z]{1}[a-z]{1}[A-Z]{1})', str)
['AbA', 'DaD']
For the second issue,
you could use:
(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))
Example
>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))', str)
['AbAD']

Regex pattern to extract substring

mystring = "q1)whatq2)whenq3)where"
I want something like ["q1)what", "q2)when", "q3)where"].
My approach is to find the q\d+\) pattern and then advance until I find the pattern again, and stop there. But I am not able to stop.
I tried req_list = re.compile("q\d+\)[*]\q\d+\)").split(mystring)
But this gives the whole string.
How can I do it?

You could try the code below, which uses re.findall:
>>> import re
>>> s = "q1)whatq2)whenq3)where"
>>> m = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)
>>> m
['q1)what', 'q2)when', 'q3)where']
Explanation:
q\d+\) matches q followed by one or more digits and then a ) symbol.
(?:(?!q\d+).)* matches any character, zero or more times, as long as a q\d+ sequence does not start at that character. The negative look-ahead is checked before each character is consumed.
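The look-ahead is what allows a plain letter q inside the text without cutting the item short. The sketch below checks this on a hypothetical input and compares it with a lazy-quantifier variant, which is an alternative way to write the pattern and not part of the answer above.

```python
import re

# Hypothetical input whose text parts themselves start with the letter q.
s = "q1)quickq2)quiet"

# Pattern from the answer: consume characters until a q\d+ marker starts.
tempered = re.findall(r'q\d+\)(?:(?!q\d+).)*', s)

# Alternative: lazily consume characters until the next marker or the end.
lazy = re.findall(r'q\d+\).*?(?=q\d+\)|$)', s)

print(tempered)
print(lazy)
```

Both variants rely on `.`, which does not match newlines unless re.DOTALL is passed.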

extracting multiple instances regex python

I have a string:
This is #lame
Here I want to extract lame. But here is the issue: the string can also be
This is lame
Here I don't extract anything. The string can also be:
This is #lame but that is #not
Here I extract lame and not.
So, output I am expecting in each case is:
[lame]
[]
[lame,not]
How do I extract these in a robust way in Python?

Use re.findall() to find multiple patterns; in this case for anything that is preceded by #, consisting of word characters:
re.findall(r'(?<=#)\w+', inputtext)
The (?<=..) construct is a positive lookbehind assertion; it only matches if the current position is preceded by a # character. So the above pattern matches 1 or more word characters (the \w character class) only if those characters are preceded by a # symbol.
Demo:
>>> import re
>>> re.findall(r'(?<=#)\w+', 'This is #lame')
['lame']
>>> re.findall(r'(?<=#)\w+', 'This is lame')
[]
>>> re.findall(r'(?<=#)\w+', 'This is #lame but that is #not')
['lame', 'not']
If you plan on reusing the pattern, do compile the expression first, then use the .findall() method on the compiled regular expression object:
at_words = re.compile(r'(?<=#)\w+')
at_words.findall(inputtext)
This saves you a cache lookup every time you call .findall().
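As a point of comparison, the same result can be had without a lookbehind by matching the # and capturing only the word. This is a minimal sketch; the names hash_words and extract_tags are made up for the example.

```python
import re

# Match the '#' but capture only the word characters that follow it.
hash_words = re.compile(r'#(\w+)')

def extract_tags(text):
    """Return every word that directly follows a '#' in text."""
    # With one capturing group, findall returns just the group contents.
    return hash_words.findall(text)

print(extract_tags('This is #lame'))
print(extract_tags('This is lame'))
print(extract_tags('This is #lame but that is #not'))
```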

You should use the re library; here is an example:
import re
test_case = "This is #lame but that is #not"
regular = re.compile(r"#\w*")
lst = regular.findall(test_case)  # each match includes the leading '#'

This will give the output you requested:
import re
regex = re.compile(r'(?<=#)\w+')
print(regex.findall('This is #lame'))
print(regex.findall('This is lame'))
print(regex.findall('This is #lame but that is #not'))
