Finding and extracting multiple substrings in a string? - python

After looking at a few similar questions, I have not been able to successfully implement a substring split on my data. In my specific case, I have a bunch of strings, and each string has a substring I need to extract. The strings are grouped together in a list and my data is NBA positions. I need to pull out the positions (either 'PG', 'SG', 'SF', 'PF', or 'C') from each string. Some strings will have more than one position. Here is the data.
text = ['Chi\xa0SG, SF\xa0\xa0DTD','Cle\xa0PF']
The code should ideally look at the first string, 'Chi\xa0SG, SF\xa0\xa0DTD', and return the two positions, ['SG','SF']. The code should look at the second string and return ['PF'].

Leverage (zero width) lookarounds:
(?<!\w)(?:PG|SG|SF|PF|C)(?!\w)
(?<!\w) is a zero-width negative lookbehind, making sure the desired match is not preceded by a word character
(?:PG|SG|SF|PF|C) matches any of the desired positions; the non-capturing group is needed so that both lookarounds apply to every alternative, not only to PG and C
(?!\w) is a zero-width negative lookahead, making sure the match is not followed by a word character
Example:
In [6]: import re
In [7]: s = 'Chi\xa0SG, SF\xa0\xa0DTD'
In [8]: re.findall(r'(?<!\w)(?:PG|SG|SF|PF|C)(?!\w)', s)
Out[8]: ['SG', 'SF']
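As a rough sketch of how this could be applied to the whole list from the question, the snippet below runs the grouped pattern over both strings. The names POSITION, positions and the 'XSGX' test string are made up for illustration; the last two lines show why the alternatives need to be grouped.

```python
import re

# Grouping the alternatives makes both lookarounds apply to every position.
POSITION = re.compile(r'(?<!\w)(?:PG|SG|SF|PF|C)(?!\w)')

text = ['Chi\xa0SG, SF\xa0\xa0DTD', 'Cle\xa0PF']
positions = [POSITION.findall(s) for s in text]
print(positions)  # [['SG', 'SF'], ['PF']]

# Without the group, the lookbehind guards only PG and the lookahead only C,
# so 'SG' is found even in the middle of a word.
ungrouped = re.findall(r'(?<!\w)PG|SG|SF|PF|C(?!\w)', 'XSGX')
grouped = POSITION.findall('XSGX')
print(ungrouped, grouped)  # ['SG'] []
```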

heemayl's response is the most correct, but you could probably get away with splitting on commas and whitespace and keeping only the tokens that are valid positions. (Keeping just the last two characters of each comma-separated piece would not work here, because the second piece ends in 'DTD'.)
s = 'Chi\xa0SG, SF\xa0\xa0DTD'
positions = {'PG', 'SG', 'SF', 'PF', 'C'}
fin = [tok for part in s.split(',') for tok in part.split() if tok in positions]
Note that str.split() with no arguments also splits on the non-breaking space '\xa0'.

Related

Why does my regular expression return tuples for every character in a string? [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed last year.
I am making a simple project for my math class in which I want to verify if a given function body (string) only contains the allowed expressions (digits, basic trigonometry, +, -, *, /).
I am using regular expressions with the re.findall method.
My current code:
import re

def valid_expression(exp) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")
    # characters to search for
    chars = r"(cos)|(sin)|(tan)|[\d+/*x)(-]"
    z = re.findall(chars, exp)
    return "".join(z) == exp
However, when I test this on any expression, re.findall(chars, exp) returns a list of tuples with 3 empty strings, ('', '', ''), for every character in the string, unless there is a trig function, in which case it returns a tuple with the trig function and two empty strings.
Ex: cos(x) -> [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
I don't understand why it does this, I have tested the regular expression on regexr.com and it works fine. I get that it uses javascript but normally there should be no difference right ?
Thank you for any explanation and/or fix.
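Short answer: If the result you want is ['cos', '(', 'x', ')'], you need something like
'(cos|sin|tan|[-+/*x()]|\d+)':
>>> re.findall(r'(cos|sin|tan|[-+/*x()]|\d+)', "cos(x)")
['cos', '(', 'x', ')']
(The hyphen comes first inside the brackets so that it is a literal character rather than part of a range.)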
From the documentation for findall:
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
For 'cos(x)', you start with ('cos', '', '') because cos matched, but neither sin nor tan matched. For each of (, x, and ), none of the three capture groups matched, although the bracket expression did. Since it isn't inside a capture group, anything it matches isn't included in your output.
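To make the documented rule concrete, here is a small sketch comparing the shape of the findall result with zero, one and two capturing groups. The three patterns are simplified stand-ins chosen for illustration, not the pattern from the question.

```python
import re

s = "cos(x)"

# No groups: a list of whole matches.
no_groups = re.findall(r"cos|\(", s)
# One group: a list of that group's text ('' when the group did not take part).
one_group = re.findall(r"(cos)|\(", s)
# Two groups: a list of tuples, one slot per group.
two_groups = re.findall(r"(cos)|(\()", s)

print(no_groups)   # ['cos', '(']
print(one_group)   # ['cos', '']
print(two_groups)  # [('cos', ''), ('', '(')]
```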
As an aside, [\d+/*x)(-] doesn't match a multidigit integer as a single token. Inside [...], \d still means "any digit", but + is a literal plus sign rather than a quantifier. As a result, the bracket expression matches exactly one character: either a digit or one of the following seven characters:
+
/
*
x
)
(
-
You have three groups (expressions in parentheses) in your regex, so you get tuples with three items. You also get four results, one for each substring that matches your regex: the first for 'cos', the second for '(', the third for 'x', and the last for ')'. But the last part of your regex isn't marked as a group, so you don't get those matches in your tuples. If you change your regex to r"(cos)|(sin)|(tan)|([\d+/*x)(-])", you will get tuples with four items, and every tuple will have exactly one non-empty item.
Unfortunately, this fix doesn't help you verify that there are no prohibited lexemes; it's only meant to show what's going on.
I would suggest converting your regex to a negative form: instead of checking that you have some allowed lexemes, check that you have nothing except allowed ones. I guess this approach should work for simple cases, but I am afraid that for more sophisticated expressions you will have to use something other than regex.
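One possible reading of the "nothing except allowed lexemes" idea is to require that the whole string be a sequence of allowed lexemes, using re.fullmatch. This is only a sketch: the names ALLOWED and only_allowed_lexemes are invented here, and the check says nothing about whether the expression is syntactically valid.

```python
import re

# Zero or more allowed lexemes and nothing else (assumed lexeme set:
# cos, sin, tan, digits, + - * /, x and parentheses).
ALLOWED = re.compile(r"(?:cos|sin|tan|[\d+\-*/x()])*")

def only_allowed_lexemes(exp):
    exp = exp.replace(" ", "")
    # fullmatch succeeds only if the pattern covers the entire string
    return ALLOWED.fullmatch(exp) is not None

print(only_allowed_lexemes("cos(x) + 2*x"))  # True
print(only_allowed_lexemes("log(x)"))        # False
```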
findall returns tuples because your regular expression has capturing groups. To make a group non-capturing, add ?: after the opening parenthesis:
r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"
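As a sketch of how the non-capturing version could slot into the function from the question: with no capturing groups left, findall returns plain strings, so the join-and-compare check works as intended. The test inputs are made up for illustration.

```python
import re

def valid_expression(exp: str) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")
    # no capturing groups, so findall returns a list of strings
    chars = r"(?:cos)|(?:sin)|(?:tan)|[\d+/*x)(-]"
    z = re.findall(chars, exp)
    # anything not matched is missing from z, so the join comes out shorter
    return "".join(z) == exp

print(valid_expression("cos(x)"))  # True
print(valid_expression("exp(x)"))  # False
```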

python filtering meta characters while preserving the integrity of the words

Hello, I need to figure out how to count the number of words in a sentence, but now I am stuck. The problem with my current code is that it doesn't filter out meta characters, so a string like "..." gives the wrong result.
print(len(input().split()))
Another method I tried was to use a regex to filter out the meta characters, but this only resulted in the len function counting all the characters instead of the words present:
import re
print(len(re.sub('[^a-zA-Z]+',' ',input())))
You can use split to split according to a separator (in your case the default separator, whitespace, is enough) and then count the length of the list:
In [49]: my_str = 'A very valid, and nice example.'
In [50]: len(my_str.split())
Out[50]: 6
Edit: As you have punctuation characters in your example, you can first remove them:
In [59]: my_str
Out[59]: 'A very valid, and nice example.'
In [60]: len(re.sub(r'[^\w\s]', '', my_str).split())
Out[60]: 6
In [61]: len(re.sub(r'[^\w\s]', '', '...').split())
Out[61]: 0
So this will remove every character that is neither a word character nor whitespace.
The code below matches groups of word characters. I tried a couple of different expressions before, but symbol combinations such as '--' would get counted as words. Using only the \w character class, this finds all runs of word characters and adds them to a list. Should you wish to see the words instead, remove the len call. I tried this with as many examples as I could think of and it worked on all of them!
import re

def getWordCount(value):
    # \w+ matches each run of word characters
    words = re.findall(r'\w+', value)
    return len(words)

value = 'A very nice, and simple, example.'
print(getWordCount(value))
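The two approaches above agree on the examples shown, but they can differ on hyphenated words. The sketch below puts them side by side; the function names and the 'well-known fact' input are made up for illustration, and which count is "right" depends on what you want to call a word.

```python
import re

def count_after_stripping(value):
    # drop punctuation first, then split on whitespace
    return len(re.sub(r'[^\w\s]', '', value).split())

def count_word_runs(value):
    # count runs of word characters directly
    return len(re.findall(r'\w+', value))

for sample in ['...', 'A -- B', 'well-known fact']:
    print(repr(sample), count_after_stripping(sample), count_word_runs(sample))
# '...' 0 0
# 'A -- B' 2 2
# 'well-known fact' 2 3
```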

Python Regex results longer than original string

I have python code like this:
import re
a = 'xyxy123'
b = re.findall('x*', a)
print(b)
This is the result:
['x', '', 'x', '', '', '', '', '']
How come b has eight elements when a only has seven characters?
There are eight "spots" in the string:
|x|y|x|y|1|2|3|
Each of them is a location where a regex could start. Since your regex includes the empty string (because x* allows 0 copies of x), each spot generates one match, and that match gets appended to the list in b. The exceptions are the two spots that start a longer match, x; as in msalperen's answer,
Empty matches are included in the result unless they touch the beginning of another match,
so the empty matches at the first and third locations are not included.
According to python documentation (https://docs.python.org/2/library/re.html):
re.findall returns all non-overlapping matches of pattern in string,
as a list of strings. The string is scanned left-to-right, and matches
are returned in the order found. If one or more groups are present in
the pattern, return a list of groups; this will be a list of tuples if
the pattern has more than one group. Empty matches are included in the
result unless they touch the beginning of another match.
So it returns all the results that match x*, including the empty ones.
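To see where each of the eight matches sits, one option is to use re.finditer and print the span of every match. This is just an illustrative sketch; the variable names are arbitrary.

```python
import re

a = 'xyxy123'

# Each span is (start, end); start == end means an empty match.
spans = [m.span() for m in re.finditer(r'x*', a)]
print(spans)
# [(0, 1), (1, 1), (2, 3), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7)]

# A string of length 7 has 8 positions (0 through 7) where a match can start.
print(len(spans), len(a) + 1)  # 8 8
```

Two of the spans cover an actual 'x'; the other six are empty matches at positions where no 'x' follows.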

Removing variable length characters from a string in python

I have strings that are of the form below:
<p>The is a string.</p>
<em>This is another string.</em>
They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().
Now I have a set of words but the first word will be <p>The rather than The. Same for the other words that have <> next to them. I want to remove the <..> from the words.
I'd like to do this in one line. What I mean is that I want to pass as a parameter something of the form <*>, like I would on the command line. I was thinking of using the replace() function to do this, but I am not sure what the replace() function parameter would look like.
For example, how could I change <..> below in a way that it will mean that I want to include anything that is between < and >:
x = x.replace("<..>", "")
Unfortunately, str.replace does not support Regex patterns. You need to use re.sub for this:
>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'
>>>
[^>]* matches zero or more characters that are not >.
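Since the question goes on to split each line into words, the substitution can be combined with split() in one expression. A minimal sketch, with TAG and words_without_tags as made-up names; note that punctuation such as the trailing period stays attached to the last word.

```python
import re

TAG = re.compile(r"<[^>]*>")

def words_without_tags(line):
    # remove anything of the form <...>, then split on whitespace
    return TAG.sub("", line).split()

print(words_without_tags("<p>The is a string.</p>"))
# ['The', 'is', 'a', 'string.']
```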
No Need for a 2-Step Solution
You don't need to 1. Split then 2. Replace. The two solutions below show you how to do it with one single step.
Option 1: Match All Instead of Splitting
Match All and Split are Two Sides of the Same Coin, and in this case it is safer to match all:
<[^>]+>|(\w+)
The words will be in Group 1.
Use it like this:
import re
subject = '<p>The is a string.</p><em>This is another string.</em>'
regex = re.compile(r'<[^>]+>|(\w+)')
# tags match the left side and leave Group 1 empty, so filter those out
matches = [group for group in re.findall(regex, subject) if group]
print(matches)
Output
['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']
Discussion
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
Option 2: One Single Split
<[^>]+>|[ .]
On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.
Output
This
is
a
string
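No code is given for Option 2, so here is one way it might look with re.split, as a sketch only. The subject string is taken from the question, and the empty strings that re.split produces between adjacent delimiters are filtered out afterwards.

```python
import re

subject = '<em>This is another string.</em>'

# Split on complete tags, spaces and periods.
pieces = re.split(r'<[^>]+>|[ .]', subject)
print(pieces)  # ['', 'This', 'is', 'another', 'string', '', '']

# Adjacent delimiters leave empty strings behind; drop them.
words = [p for p in pieces if p]
print(words)   # ['This', 'is', 'another', 'string']
```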

Separating RegEx pattern matches that have the same potential starting characters

I would like to have a RegEx that matches several of the same character in a row, within a range of possible characters but does not return those pattern matches as one pattern. How can this be accomplished?
For clarification:
I want a pattern that starts with [a-c] and non-greedily returns any number of the same character, but not the other characters in the range. In the sequence 'aafaabbybcccc' it would find patterns for:
('aa', 'aa', 'bb', 'b', 'cccc')
but would exclude the following:
('f', 'aabb', 'y', 'bcccc')
I don't want to use multiple RegEx pattern searches because the order in which I find the patterns will determine the output of another function. This question is for the purposes of self study (python), not homework. (I'm also under 15 rep but will come back and upvote when I can.)
Good question. Use a regex like:
(?P<L>[a-c])(?P=L)+
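This is more robust - you're not limited to a-c, you can replace it with a-z if you like. It first captures any character within a-c as the named group L, then requires that same character to occur again one or more times. Since the pattern contains a group, re.findall() would return only the captured letter; run re.finditer() with this regex and take group(0) of each match to get the full run.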
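A short sketch of the named-group pattern applied to the string from the question. RUN, runs and letters are illustrative names; the comparison with findall shows why group(0) is used to get the full run.

```python
import re

RUN = re.compile(r'(?P<L>[a-c])(?P=L)+')
s = 'aafaabbybcccc'

# group(0) is the whole match, i.e. the full run of repeated characters
runs = [m.group(0) for m in RUN.finditer(s)]
print(runs)     # ['aa', 'aa', 'bb', 'cccc']

# findall returns only what the single capturing group matched
letters = RUN.findall(s)
print(letters)  # ['a', 'a', 'b', 'c']
```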
You can use the backreferences \1 to \9 to match again whatever the 1st to 9th groups captured.
/([a-c])(\1+)/
[a-c]: Matches one of the character.
\1+ : Matches subsequent one or more previously matched character.
Perl:
perl -e '@m = "ccccbbb" =~ /([a-c])(\1+)/; print $m[0], $m[1]'
cccc
Python:
>>> import re
>>> [m.group(0) for m in re.finditer(r"([a-c])\1+", 'aafaabbybcccc')]
['aa', 'aa', 'bb', 'cccc']
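The result above leaves out the single 'b' that the question lists among the expected matches, because \1+ requires at least one repeat. If single characters should count as well, one possible tweak (an assumption about what the asker wants) is to use \1* instead:

```python
import re

s = 'aafaabbybcccc'

# \1* allows zero repeats, so a lone 'b' also counts as a run
runs = [m.group(0) for m in re.finditer(r'([a-c])\1*', s)]
print(runs)  # ['aa', 'aa', 'bb', 'b', 'cccc']
```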
