Separating RegEx pattern matches that have the same potential starting characters - python

I would like to have a RegEx that matches several of the same character in a row, within a range of possible characters but does not return those pattern matches as one pattern. How can this be accomplished?
For clarification:
I want a pattern that starts with [a-c] and non-greedily returns any number of the same character, but not the other characters in the range. In the sequence 'aafaabbybcccc' it would find patterns for:
('aa', 'aa', 'bb', 'b', 'cccc')
but would exclude the following:
('f', 'aabb', 'y', 'bcccc')
I don't want to use multiple RegEx pattern searches because the order in which I find the patterns will determine the output of another function. This question is for the purposes of self study (python), not homework. (I'm also under 15 rep but will come back and upvote when I can.)

Good question. Use a regex like:
(?P<L>[a-c])(?P=L)+
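This is more robust: you're not limited to a-c, you can replace it with a-z if you like. It first captures any character within a-c as the named group L, then checks whether that same character occurs again one or more times. Because the pattern contains a capturing group, re.findall() would return only the group's contents (the single starting character), so iterate with re.finditer() and take m.group(0) to get each full run. To also match a lone character such as the single 'b' in your example, change the + to *.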

You can use the backreferences \1 - \9 to refer to the text matched by the 1st to 9th capture group.
/([a-c])(\1+)/
[a-c]: Matches one of the characters a, b or c.
\1+ : Matches one or more repetitions of the previously matched character.
Perl:
perl -e '@m = "ccccbbb" =~ /([a-c])(\1+)/; print $m[0], $m[1]'
cccc
Python:
>>> import re
>>> [m.group(0) for m in re.finditer(r"([a-c])\1+", 'aafaabbybcccc')]
['aa', 'aa', 'bb', 'cccc']
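The asker's expected output also includes the lone 'b', which a + quantifier cannot match. A minimal sketch, assuming that swapping + for * is acceptable, is shown here; it also shows what re.findall() returns when the pattern contains a capturing group (the variable names are illustrative only):

```python
import re

text = 'aafaabbybcccc'

# \1* allows zero or more repeats, so a single 'b' is also a match.
pattern = re.compile(r'([a-c])\1*')

# finditer yields match objects; group(0) is the whole run.
runs = [m.group(0) for m in pattern.finditer(text)]
print(runs)    # ['aa', 'aa', 'bb', 'b', 'cccc']

# findall returns only the captured group when the pattern has one.
firsts = pattern.findall(text)
print(firsts)  # ['a', 'a', 'b', 'b', 'c']
```

The matches come back in left-to-right order, so a single pass preserves the ordering the asker needs.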

Related

Python pattern matching with language-specific characters

From a list of strings, I want to extract all words and add them to a new list. I managed to do so using pattern matching in the form of:
import re
p = re.compile('[a-z]+', re.IGNORECASE)
p.findall("02_Sektion_München_Gruppe_Süd")
Unfortunately, the language contains language-specific characters, so strings like the given example yield:
['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
I want it to yield:
['Sektion', 'München', 'Gruppe', 'Süd']
I am grateful for suggestions on how to solve this problem.
You may use
import re
p = re.compile(r'[^\W\d_]+')
print(p.findall("02_Sektion_München_Gruppe_Süd"))
# => ['Sektion', 'München', 'Gruppe', 'Süd']
See the Python 3 demo.
The [^\W\d_]+ pattern matches one or more characters that are not non-word characters, digits or _, that is, letters only.
In Python 2.x you will have to add the re.UNICODE flag to make it match Unicode letters:
p = re.compile(r'[^\W\d_]+', re.U)
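To see why Unicode-aware matching matters, here is a small sketch for Python 3 that runs the same pattern with and without the re.ASCII flag (the variable names are illustrative only):

```python
import re

s = "02_Sektion_München_Gruppe_Süd"

# In Python 3, str patterns are Unicode-aware by default,
# so \W does not match letters such as 'ü'.
unicode_words = re.findall(r'[^\W\d_]+', s)
print(unicode_words)  # ['Sektion', 'München', 'Gruppe', 'Süd']

# re.ASCII restricts \w, \W and \d to ASCII, so 'ü' counts as a
# non-word character and splits the words again.
ascii_words = re.findall(r'[^\W\d_]+', s, re.ASCII)
print(ascii_words)    # ['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
```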

python3: regex, find all substrings that starts with and end with certain string

Let's say that I have a string that looks like this:
a = '1253abcd4567efgh8910ijkl'
I want to find all substrings that start with a digit and end with a letter.
I tried,
b = re.findall('\d.*\w',a)
but this gives me,
['1253abcd4567efgh8910ijkl']
I want to have something like,
['1253abcd','4567efgh','8910ijkl']
How can I do this? I'm pretty new to regex, and would really appreciate it if anyone could show how to do this with different regex approaches, and explain what's going on.
\w matches any word character: digits, letters and the underscore. You need to use [a-zA-Z] to match letters only. See this example.
import re
a = '1253abcd4567efgh8910ijkl'
b = re.findall(r'(\d+[A-Za-z]+)', a)
Output:
['1253abcd', '4567efgh', '8910ijkl']
\d matches a digit; \d+ matches one or more consecutive digits. For example:
>>> re.findall(r'(\d+)', a)
['1253', '4567', '8910']
Similarly, [a-zA-Z]+ matches one or more letters.
>>> re.findall(r'([a-zA-Z]+)', a)
['abcd', 'efgh', 'ijkl']
Now put them together to match exactly what you want.
The Python manual on regular expressions tells us that \w:
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
So you are actually capturing more than you need. Refine your regular expression a bit:
>>> re.findall(r'(\d+[a-z]+)', a, re.I)
['1253abcd', '4567efgh', '8910ijkl']
The re.I makes your expression case insensitive, so it will match upper and lower case letters as well:
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA')
['12124adbad']
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA', re.I)
['12124adbad', '13434AGDFDF', '434348888AAA']
\w matches any alphanumeric character, and . matches any character at all. Because you followed . with *, your pattern matches a string that starts with a digit, continues greedily through any characters, and ends at the last word character, which is why you got the whole string back.
Solution:
>>> b = re.findall(r'\d*[A-Za-z]*', a)
>>> b
['1253abcd', '4567efgh', '8910ijkl', '']
You will get '' (an empty string) at the end of the list, because this pattern can also match zero characters at the end of the input. You can remove it using
b.pop(-1)
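The difference between the greedy original and the letter-only fix can be checked side by side. This is a small sketch using the string from the question; the variable names are illustrative only:

```python
import re

a = '1253abcd4567efgh8910ijkl'

# '.*' is greedy: it runs to the end of the string and backs off
# only far enough for \w to match the final character.
greedy = re.findall(r'\d.*\w', a)
print(greedy)  # ['1253abcd4567efgh8910ijkl']

# Requiring digits followed by letters stops each match
# where the next run of digits begins.
chunks = re.findall(r'\d+[A-Za-z]+', a)
print(chunks)  # ['1253abcd', '4567efgh', '8910ijkl']
```

Using + rather than * for both parts also avoids the trailing empty match.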

Splitting a string into groups of either a number and a letter or just a single letter

I'm trying to decode a string that looks like this "2a3bc" into "aabbbc" in Python. So the first thing I need to do is to split it up into a list with groups that make sense. In other words: ['2a','3b','c'].
Essentially, match either (1) a number and a letter or (2) just a letter.
I've got this:
re.findall('\d+\S|\s', '2a3bc')
and it returns:
['2a', '3b']
So it's actually missing the c.
Perhaps my regex skills are lacking here; any help is appreciated.
Your current expression could work with a small bugfix: \S is non-whitespace, while \s is whitespace. You're looking for non-whitespace in both cases, so you shouldn't use \s anywhere:
>>> re.findall(r'\d+\S|\S', '2a3bc')
['2a', '3b', 'c']
However, this expression could be shorter: instead of using + for one or more digits, use * for zero or more, since the group might not be preceded by any digits, and you can then get rid of the alternation.
>>> re.findall(r'\d*\S', '2a3bc')
['2a', '3b', 'c']
Again, though, note that \S is simply non-whitespace - that includes letters, digits, and even punctuation. \D, non-digits, has a similar problem: it excludes digits, but includes punctuation. The shortest, clearest regex for this, then, would replace the \S with \w, which indicates alphanumeric characters:
>>> re.findall(r'\d*\w', '2a3bc')
['2a', '3b', 'c']
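Because \d* greedily consumes the leading digits first, this particular \w picks up the letter in inputs like this one. Note that \w also matches digits and the underscore, so if the input may contain a number with no letter after it, use [A-Za-z] in its place.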
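Going one step further to the decoding the asker describes ("2a3bc" into "aabbbc"), a minimal sketch could capture the count and the letter in separate groups. The function name decode is hypothetical, and the sketch assumes the input consists only of optional counts followed by ASCII letters:

```python
import re

def decode(s):
    """Expand run-length pairs such as '2a3bc' into 'aabbbc'."""
    # With two groups, findall returns (count, letter) tuples;
    # count is '' when no digits precede the letter.
    pairs = re.findall(r'(\d*)([A-Za-z])', s)
    # An empty count means the letter appears once.
    return ''.join(letter * int(count or 1) for count, letter in pairs)

print(decode('2a3bc'))  # aabbbc
```

Using [A-Za-z] instead of \w keeps digits and underscores from being treated as the repeated character.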

Removing variable length characters from a string in python

I have strings that are of the form below:
<p>The is a string.</p>
<em>This is another string.</em>
They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().
Now I have a set of words but the first word will be <p>The rather than The. Same for the other words that have <> next to them. I want to remove the <..> from the words.
I'd like to do this in one line. What I mean is I want to pass as a parameter something of the form <*> like I would on the command line. I was thinking of using the replace() function to try to do this, but I am not sure what the replace() function parameter would look like.
For example, how could I change <..> below in a way that it will mean that I want to include anything that is between < and >:
x = x.replace("<..>", "")
Unfortunately, str.replace does not support Regex patterns. You need to use re.sub for this:
>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'
>>>
[^>]* matches zero or more characters that are not >.
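Combined with the split() the asker already uses, this becomes a one-liner per line of input. A small sketch follows; the helper name words_in and the constant TAG are hypothetical:

```python
import re

# Precompile the tag pattern since it is applied to every line.
TAG = re.compile(r'<[^>]*>')

def words_in(line):
    """Strip <...> tags from a line, then split on whitespace."""
    return TAG.sub('', line).split()

print(words_in('<p>The is a string.</p>'))
# ['The', 'is', 'a', 'string.']
print(words_in('<em>This is another string.</em>'))
# ['This', 'is', 'another', 'string.']
```

Note that the trailing period stays attached to the last word, because split() only breaks on whitespace.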
No Need for a 2-Step Solution
You don't need to 1. Split then 2. Replace. The two solutions below show you how to do it with one single step.
Option 1: Match All Instead of Splitting
Match All and Split are Two Sides of the Same Coin, and in this case it is safer to match all:
<[^>]+>|(\w+)
The words will be in Group 1.
Use it like this:
import re
subject = '<p>The is a string.</p><em>This is another string.</em>'
regex = re.compile(r'<[^>]+>|(\w+)')
matches = [group for group in re.findall(regex, subject) if group]
print(matches)
Output
['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']
Discussion
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
Option 2: One Single Split
<[^>]+>|[ .]
On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.
Output
This
is
a
string
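The single-split option can be sketched with re.split(). This is an illustration under the assumption that empty pieces should be discarded; the sample line and variable names are chosen here and are not taken from the answer:

```python
import re

subject = '<em>This is another string.</em>'

# Split on complete tags, spaces and periods in one pass.
parts = re.split(r'<[^>]+>|[ .]', subject)

# Adjacent delimiters (and delimiters at either end) leave
# empty strings behind, so filter them out.
words = [p for p in parts if p]
print(words)  # ['This', 'is', 'another', 'string']
```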

python re.findall() with substring in alternations

If I have a substring (or 'subpattern') of another string or pattern in a regex alternation, like so:
r'abcd|bc'
What is the expected behaviour of re.compile(r'abcd|bc').findall('abcd bcd bc ab')?
Trying it out, I get (as expected)
['abcd', 'bc', 'bc']
so I thought re.compile(r'bc|abcd').findall('abcd bcd bc ab') might yield ['bc', 'bc', 'bc'] but instead it again returns
['abcd', 'bc', 'bc']
Can someone explain this? I was under the impression that findall would greedily return matches but apparently, it backtracks and tries to match alternate patterns that would yield longer tokens.
No backtracking takes place at all. Your pattern matches two different types of strings; | means or. Each pattern is tried out at each position.
So when the expression finds abcd at the start of your input, that text matches your pattern just fine, it fits the abcd part of the (bc or abcd) pattern you gave it.
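Ordering of the alternatives doesn't matter here, because bc and abcd can never match at the same starting position: at position 0 the engine tries bc, fails on the a, and then tries abcd, which matches. (When two alternatives can both match at the same position, the leftmost one in the pattern wins.) abcd is not disregarded just because bc might match later on in the string.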
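A quick check of this behaviour: in the asker's input the two alternatives never compete at the same starting position, so both orderings give the same result. The second half of this sketch uses a different pair of alternatives, chosen here for illustration, where they do compete and Python's re picks the leftmost alternative that matches:

```python
import re

text = 'abcd bcd bc ab'

# 'bc' cannot match at position 0 (the text starts with 'a'),
# so 'abcd' is found first regardless of ordering.
first = re.findall(r'abcd|bc', text)
second = re.findall(r'bc|abcd', text)
print(first, second)
# ['abcd', 'bc', 'bc'] ['abcd', 'bc', 'bc']

# When both alternatives can match at the same position,
# the one written first in the pattern wins.
short_first = re.findall(r'ab|abcd', 'abcd')
long_first = re.findall(r'abcd|ab', 'abcd')
print(short_first, long_first)
# ['ab'] ['abcd']
```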
