Python regex error: missing ), unterminated subpattern at position 35 [closed] - python

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 3 years ago.
I have this regex pattern:
(?P<prefix>.*)(?<!\\)\((?P<words>.+)(?<!\\)\)(?P<postfix>.*)
This regex is supposed to match a string like this:
hello my (friend|enemy) nice to see you again
The prefix group should capture hello my.
The words group should capture friend|enemy.
The postfix group should capture nice to see you again
This regex also uses lookbehinds to check if ( and ) are escaped using \ in string. For example, these two samples should not be detected since there is a \ before ( and ):
hello my \(friend|enemy) nice to see you again
hello my (friend|enemy\) nice to see you again
This pattern works well when I test it on online regex sites, but when I run it in Python (I'm using Python 3.7), it throws the following error:
re.error: missing ), unterminated subpattern at position 35
What is the problem?
Edit:
Here is how I use it in python:
pattern = "(?P<prefix>.*)(?<!\\)\((?P<words>.+)(?<!\\)\)(?P<postfix>.*)"
match = re.search(pattern, line)

#Md Narimani
As @erhumoro suggested in the comments, instead of:
line = "hello my (friend|enemy) nice to see you again"
pattern = "(?P<prefix>.*)(?<!\\)\((?P<words>.+)(?<!\\)\)(?P<postfix>.*)"
match = re.search(pattern, line)
Do:
line = "hello my (friend|enemy) nice to see you again"
pattern = r"(?P<prefix>.*)(?<!\\)\((?P<words>.+)(?<!\\)\)(?P<postfix>.*)"
match = re.search(pattern, line)
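The problem is backslash escaping. In a normal string literal, Python turns \\ into a single backslash before the regex engine sees the pattern, so (?<!\\) becomes (?<!\), where \) is an escaped literal parenthesis and the lookbehind group is never closed. A raw string (the r prefix) passes \\ through unchanged.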
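A minimal sketch of the difference, using the sample lines from the question. The `broken` variable is an illustration only: it is built with `str.replace` to mimic what the non-raw literal hands to `re`, so that the example itself contains no invalid escape sequences.

```python
import re

# Raw string: the regex engine receives \\ (an escaped backslash) intact.
pattern = r"(?P<prefix>.*)(?<!\\)\((?P<words>.+)(?<!\\)\)(?P<postfix>.*)"

# Mimic the non-raw literal: every \\ has collapsed to a single backslash.
broken = pattern.replace("\\\\", "\\")

try:
    re.compile(broken)
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print(exc)  # the "unterminated subpattern" error from the question

match = re.search(pattern, "hello my (friend|enemy) nice to see you again")
print(match.group("prefix", "words", "postfix"))

# Lines with an escaped parenthesis are rejected by the lookbehinds.
escaped_open = re.search(pattern, r"hello my \(friend|enemy) nice to see you again")
escaped_close = re.search(pattern, r"hello my (friend|enemy\) nice to see you again")
print(escaped_open, escaped_close)
```

Note that the prefix and postfix groups keep the spaces next to the parentheses, since `.*` matches them too.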

Related

Python using grep to count word occurrences in directory [closed]

Closed 12 months ago.
I'm trying to run
word = "where's"
os.popen(f"grep -ow '{word.replace("'", """'"'"'""")}' /texts/*.txt | wc -l").read()
I get:
File "<stdin>", line 1
os.popen(f"grep -ow '{word.replace("'", """'"'"'""")}' /texts/*.txt | wc -l").read()
^
SyntaxError: invalid syntax
Why would this be happening and how can I fix it?
If I escape the asterisk \* I get "Unexpected error after line continuation character" with the arrow pointing at the end of the line (after read())
You're trying to do too much complicated quoting all at once.
Instead, do it in two steps so the quoting doesn't conflict.
word = "where's"
word = word.replace("'", """'"'"'""")
os.popen(f"grep -ow '{word}' /texts/*.txt | wc -l").read()
To explain your error, it's because pairs of quotes do not nest in Python strings.
For example, you can't do this:
s = "alpha 'beta "gamma" delta' omega"
The double quote preceding alpha does not match up with the one following omega. It matches up with the very next double quote it sees, the one preceding gamma. The single quotes do not "protect" the double quotes.
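As an alternative to writing the quote replacement by hand, the standard library's `shlex.quote` produces the same shell-safe form. This is a swapped-in technique, not what the answer above uses, and the sketch only builds the command string (it does not run `grep`, since `/texts/*.txt` is the asker's path and may not exist elsewhere).

```python
import shlex

word = "where's"

# shlex.quote wraps the value in single quotes and rewrites each embedded
# single quote as '"'"', the same trick the manual replace() performs.
quoted = shlex.quote(word)

# Only assemble the command; the quoting is done in a separate step,
# so nothing conflicts with the f-string's own quotes.
command = f"grep -ow {quoted} /texts/*.txt | wc -l"
print(command)
```

The resulting string could then be passed to `os.popen` as in the answer.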

Regex Python: cannot replace newlines with "$1" [closed]

Closed 1 year ago.
I have the regular expression \n([\d]), which matches the following text:
I then want to replace each match with the first group, $1, in Visual Studio Code. This is the result:
I want to do the same thing in Python, so I wrote this code:
import re
file = "out FCE.txt"
pattern = re.compile(".+")
for i, line in enumerate(open(file)):
    for match in re.finditer(pattern, line):
        print(re.sub(r"\n([\d])", r"\1", match.group()))
But that code does nothing: the result is still the same as in the first picture. The newlines before the lines that start with a number are not removed. I have already read this answer, which says that Python uses \1 rather than $1. And yes, I want to keep the whitespace in between (\t\t\t) so the output stays neat.
Sorry if my explanation is confusing; my English is not very good.
The problem here is that you are reading the file line by line. In each loop of for i, line in enumerate(open(file)):, re.sub accesses only one line, and therefore it cannot see whether the next line starts with a digit.
Try instead:
import re
file = "out FCE.txt"
with open(file, 'r') as f:
    text = f.read()
new_text = re.sub(r"\n([\d])", r"\1", text)
print(new_text)
In this code the file is read as a whole (into the variable text) so that re.sub now sees whether the subsequent line starts with a digit.
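The difference can be reproduced without a file. The sketch below uses a made-up sample string (the real contents of "out FCE.txt" are not shown in the question) and compares substitution on the whole text with substitution line by line.

```python
import re

# Hypothetical sample standing in for the file contents.
text = "alpha\t\t\t\n1 beta\ngamma\n2 delta\n"

# Whole text: each newline that is followed by a digit is dropped.
joined = re.sub(r"\n([\d])", r"\1", text)

# Line by line: every line ends with its own "\n" and nothing follows it,
# so the pattern never matches and the text comes back unchanged.
per_line = "".join(
    re.sub(r"\n([\d])", r"\1", line) for line in text.splitlines(keepends=True)
)

print(repr(joined))
print(repr(per_line))
```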

How to write a regex to capture letters separated by punctuation in Python 3? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 5 years ago.
I am new to regex and encountered a problem. I need to parse a list of last names and first names to use in a url and fetch an html page. In my last names or first names, if it's something like "John, Jr" then it should only return John but if it's something like "J.T.R", it should return "JTR" to make the url work. Here is the code I wrote but it doesn't capture "JTR".
import re
last_names_parsed=[]
for ln in last_names:
    L_name=re.match('\w+', ln)
    last_names_parsed.append(L_name[0])
However, this will not capture J.T.R properly. How should I modify the code to properly handle both?
You can add \. to the regular expression:
import re
final_data = [re.sub(r'\.', '', re.findall(r'(?<=^)[a-zA-Z\.]+', i)[0]) for i in last_names]
Regex explanation:
(?<=^): positive lookbehind; ensures that the ensuing regex only registers a match at the beginning of the string (here it has the same effect as a plain ^ anchor)
[a-zA-Z\.]: matches any alphabetical character, [a-zA-Z], or a period .
+: repeats the preceding character class ([a-zA-Z\.]) for as long as a period or alphabetic character is found. For instance, in "John, Jr", only John is matched, because the comma is not in the character class [a-zA-Z\.], which halts the match.
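The same idea can be written with `re.match`, which already anchors at the start of the string. This is a sketch rather than the answerer's code: `parse_name` and the sample list are hypothetical names introduced for illustration, and returning an empty string when nothing matches is an assumption about the desired behavior.

```python
import re


def parse_name(name):
    """Return the leading run of letters and periods, with the periods removed."""
    m = re.match(r"[A-Za-z.]+", name)
    # Assumption: names that do not start with a letter or period yield "".
    return m.group().replace(".", "") if m else ""


last_names = ["John, Jr", "J.T.R", "Smith"]  # hypothetical sample data
last_names_parsed = [parse_name(name) for name in last_names]
print(last_names_parsed)
```

Guarding against a failed match avoids the IndexError that indexing `[0]` into an empty `findall` result would raise.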

Capture group with python regex not capturing [closed]

Closed 6 years ago.
I'm trying to gain an understanding of capture groups using this example:
sentence = "the quick brown fox jumps over the lazy dog"
re.search(r'\S+\s+\S+',sentence)
<_sre.SRE_Match object; span=(0, 9), match='the quick'>
I can see this matches as follows:
re.search(r'\S+\s+\S+',sentence).group()
'the quick'
I want to add a match group for the word 'quick' so I try this:
re.search(r'\S+\s+\(S+)',sentence)
Which gives an error:
error: unbalanced parenthesis at position 10
What am I doing wrong here?
Looks like a typo, but I'll still provide an explanation.
You are escaping the opening parenthesis, which makes it match a literal (. That leaves the closing parenthesis at the end of the expression without an opening counterpart. Replace:
\S+\s+\(S+)
with:
\S+\s+(\S+)
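A short sketch of both patterns on the sentence from the question: the corrected one exposes the second word as group 1, and the mistyped one fails to compile.

```python
import re

sentence = "the quick brown fox jumps over the lazy dog"

# Unescaped parentheses create capture group 1 around the second word.
match = re.search(r"\S+\s+(\S+)", sentence)
print(match.group())   # the whole match
print(match.group(1))  # only the captured word

# With \( the opening parenthesis is a literal character, so the final ")"
# has nothing to close and compilation fails.
try:
    re.compile(r"\S+\s+\(S+)")
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)
```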

Parsing a string to extract a delimited unit having an alphabetic starting character and an unknown length [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I'm new to python regular expression so any help will be appreciated. Thanks in advance.
I have this
string = "Restaurant_Review-g503927-d3864736-Reviews"
I would like to extract 'g503927' and 'd3864736' from it.
I know you can use re.match(pattern, string, flags=0),
but I'm not sure how to write the regex for it. Please help.
Using re.findall:
>>> s = "Restaurant_Review-g503927-d3864736-Reviews"
>>> re.findall(r'[a-z]\d+', s)
['g503927', 'd3864736']
[a-z]\d+ matches a lowercase letter followed by one or more digits.
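The question mentions `re.match`, which only tries the pattern at the start of the string and so finds nothing here. The sketch below contrasts it with `re.findall`; the lookaround variant, which requires a hyphen on both sides of each unit, is an added assumption about the delimiter and not part of the answer above.

```python
import re

s = "Restaurant_Review-g503927-d3864736-Reviews"

# re.match anchors at position 0, where "R" is not a lowercase letter.
at_start = re.match(r"[a-z]\d+", s)

# re.findall scans the whole string for every non-overlapping match.
units = re.findall(r"[a-z]\d+", s)

# Assumption: units are always delimited by "-" on both sides.
delimited = re.findall(r"(?<=-)[a-z]\d+(?=-)", s)

print(at_start, units, delimited)
```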
This should work:
import re
pattern = re.compile("[a-z][0-9]+")
print(pattern.findall("Restaurant_Review-g503927-d3864736-Reviews"))
A non-regex solution is also possible, but it depends on what delimits the units; here I assume it is a -:
s = "Restaurant_Review-g503927-d3864736-Reviews"
outputs = [i for i in s.split('-') if i[0].isalpha() and i[1:].isdigit()]
There is no need to use a regex; use the split() method:
s = "Restaurant_Review-g503927-d3864736-Reviews"
print(s.split('-'))
print(s.split('-')[1])
print(s.split('-')[2])
more info here: http://docs.python.org/2/library/stdtypes.html#str.split
