This question already has answers here:
Reversing a regular expression in Python
(8 answers)
Closed 1 year ago.
I am having some difficulty learning regex in Python. I want to turn my Tornado web route configuration, together with its arguments, into a request path string without using the handler's request.path.
For example, I have route with patterns like:
/entities/([0-9]+)
/product/([0-9]+)/actions
The expected result, combined with an integer parameter (123), would be a string like:
/entities/123
/product/123/actions
How do I generate a string based on that pattern?
Thank you very much in advance!
This might be a possible duplicate to:
Reversing a regular expression in Python
Generate a String that matches a RegEx in Python
Using the answer provided by @bjmc, a solution works like this:
>>> import rstr
>>> intermediate = rstr.xeger(r'\d+')
>>> path = '/product/' + intermediate + '/actions'
Depending on how long you want the intermediate integer to be, you could restrict the regex, e.g. \d{1,3}
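Note that rstr.xeger produces a random matching string. If the aim is to substitute a known value such as 123, a minimal sketch could replace each capture group with the supplied argument. build_path and GROUP are hypothetical names, and the sketch assumes simple, non-nested groups with no escaped parentheses in the route:

```python
import re

# Matches one simple capture group such as "([0-9]+)".
# Assumption: groups are not nested and the route has no escaped parentheses.
GROUP = re.compile(r'\([^()]*\)')

def build_path(route, *args):
    """Fill each capture group in `route` with the next value from `args`."""
    values = iter(args)

    def fill(match):
        value = str(next(values))
        # Check that the value really satisfies the group it replaces.
        if not re.fullmatch(match.group(0), value):
            raise ValueError(f"{value!r} does not match {match.group(0)}")
        return value

    return GROUP.sub(fill, route)

print(build_path('/entities/([0-9]+)', 123))         # /entities/123
print(build_path('/product/([0-9]+)/actions', 123))  # /product/123/actions
```

Validating each value against its group catches mistakes such as passing a non-numeric ID to a numeric route.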
This question already has answers here:
Parsing HTML using Python
(7 answers)
Closed 3 years ago.
I have a string like this:
string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''
I am trying to get anything that appears as title (title="Anything here")
I have already tried this but it does not work correctly.
re.findall(r'title=\"(.*)\"',string)
I think your regex is too greedy. You can try something like this:
re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. See other SO answers for more context. There are a few common tools for this, like:
https://docs.python.org/3/library/html.parser.html
https://www.simplifiedpython.net/parsing-html-in-python/
https://github.com/psf/requests-html / Get html using Python requests?
If you would like to read more on performance testing of different python HTML parsers you can learn more here
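As a rough illustration of the parser route, here is a minimal sketch using the standard library's html.parser. The TitleCollector class and the sample markup are hypothetical, and the markup is assumed to be well formed:

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collects the value of every title attribute seen in start tags."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)

# Hypothetical sample markup, not the asker's exact string.
html = (
    '<img src="a.png" title="email example" width="500">'
    '<img src="b.png" title="second example title" width="600">'
)

parser = TitleCollector()
parser.feed(html)
parser.close()
print(parser.titles)  # ['email example', 'second example title']
```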
As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. I stand by this too, but if you want to get it done through regex, this may help:
c = re.finditer(r'title=[\"]([a-zA-Z0-9\s]+)[\" ]', string)
for i in c:
    print(i.group(1))
The problem is that .* is greedy: the next " symbol is treated as just another character and becomes part of the (.*) group, so the match runs on to the last " on the line. For your use case, you can restrict the group to letters, digits, and whitespace.
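To make the greedy behaviour concrete, the sketch below runs the original pattern on the asker's string next to two common alternatives: a lazy quantifier and a negated character class. The variable names are only illustrative:

```python
import re

string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''

# Greedy: .* runs to the last double quote on each line.
greedy = re.findall(r'title="(.*)"', string)

# Lazy: .*? stops at the first closing quote.
lazy = re.findall(r'title="(.*?)"', string)

# Negated class: match anything that is not a double quote.
negated = re.findall(r'title="([^"]*)"', string)

print(greedy[0])  # email example" width="500
print(lazy)       # ['email example', 'second example title', 'one more title']
print(negated)    # ['email example', 'second example title', 'one more title']
```

Unlike [a-zA-Z0-9\s]+, the negated class also accepts titles that contain punctuation.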
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I've been working on this file rename program for a few days now. I've learned a lot thanks to all of the "silly" questions those before me have asked on this site and the quality answers they have received. Well, on to my problem.
My filenames are in the following format:
ACP001.jpg, ACP002.jpg,... ACP010.jpg, ACP011.jpg, ACP012_x.jpg, ACP013.jpg, ACP014_x.jpg
pattern = r'(ACP0)(0*)(\d+)(\.jpg)'
replace = r'\3\4'
So that was working fine for most of them... but then there were some that had the "_x" just before the file extension. I amended the pattern and replacement pattern as follows:
pattern = r'(ACP0)(0*)(\d+)(_w)*(\.jpg)'
replace = r'\3.jpg'
I think I cheated by hardcoding the ".jpg" in the replace string. How would I handle these situations where the match object groups may be of varying sizes? I essentially want the last group and the third group in this example.
Make the _x term optional:
pattern = r'(ACP0)(0*)(\d+)(_x)?(\.jpg)'
I don't actually know why you have so many capture groups in your pattern. I would have written it this way:
pattern = r'ACP(\d{3})(_x)?\.jpg'
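One way to avoid depending on group positions at all is to name the groups you want to keep and make the rest non-capturing. The following is a minimal sketch under that assumption; the group names num and ext are hypothetical, and the pattern strips leading zeros as in the question:

```python
import re

filenames = ['ACP001.jpg', 'ACP010.jpg', 'ACP012_x.jpg']

# (?:_x)? is optional and non-capturing, so it never shifts group numbers.
pattern = re.compile(r'ACP0*(?P<num>\d+)(?:_x)?(?P<ext>\.jpg)')

# \g<name> refers to a named group in the replacement string.
renamed = [pattern.sub(r'\g<num>\g<ext>', name) for name in filenames]
print(renamed)  # ['1.jpg', '10.jpg', '12.jpg']
```

Because the replacement refers to groups by name, adding or removing optional parts of the pattern later does not require renumbering.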
You can use . to match any character except a newline. Since the OP wants to rename every file to its number only (ACP001.jpg -> 1.jpg), you can use the following pattern and replacement strings:
import re  # built-in package for regular expressions
li = ['ACP001.txt', 'ACP012.txt', 'ACP013_x.jpg']  # list of filenames
pattern = r'(ACP)(0*)(\d+)(.*)(\.\w+)'
replace = r'\3\5'
res = [re.sub(pattern, replace, st) for st in li]
print(res)
OUTPUT
['1.txt', '12.txt', '13.jpg']
This code works on all file extensions and removes the problem of multiple groups altogether.
This question already has answers here:
How to use a variable inside a regular expression?
(12 answers)
Closed 4 years ago.
Initially I had my date regex working as follows, to capture "February 12, 2018" for example
match = re.search(r'(January|February|March|April|May|June|July|August|September?|October?|November|December)\s+\d{1,2},\s+\d{4}', date).group()
But I want it to be more flexible and to insert a string variable into my regex. I can't get it to work, even after looking through many Stack Overflow threads about similar issues. I'm quite a novice, so I'm not sure what's going wrong. I'm aware that simply writing MONTHS inside the pattern won't work. Thank you.
MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
match = re.search(r'(MONTHS)\s+\d{1,2},\s+\d{4}', date).group()
print(match)
'NoneType' object has no attribute 'group'
You've got MONTHS as just part of the pattern string; Python doesn't know that it's supposed to refer to a variable that stores another string.
So instead, try:
match = re.search(r'(' + MONTHS + r')\s+\d{1,2},\s+\d{4}', date).group()
That will concatenate (stick together) three strings, the first bit, then the string stored in your MONTHS variable, and then the last bit.
If you want to substitute something into a string, you need to use either format strings (whether an f-string literal or the format or format_map methods on string objects) or printf-style formatting (or template strings, or a third-party library… but usually one of the first two).
Normally, format strings are the easiest solution, but they don't play nice with strings that need braces for other purposes. You don't want that {4} to be treated as "fill in the 4th argument", and escaping it as {{4}} makes things less readable (and when you're dealing with regular expressions, they're already unreadable enough…).
So, printf-style formatting is probably a better option here:
pattern = r'(%s)\s+\d{1,2},\s+\d{4}' % (MONTHS,)
… or:
pattern = r'(%(MONTHS)s)\s+\d{1,2},\s+\d{4}' % {'MONTHS': MONTHS}
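For comparison, here is a small runnable sketch that builds the pattern both ways and checks that they produce the same string. The sample date text is hypothetical, and the None check avoids the 'NoneType' error from the question when nothing matches:

```python
import re

MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"

# printf-style: braces in the regex need no escaping.
printf_pattern = r'(%s)\s+\d{1,2},\s+\d{4}' % (MONTHS,)

# f-string: literal braces must be doubled, which is harder to read.
fstring_pattern = rf'({MONTHS})\s+\d{{1,2}},\s+\d{{4}}'

assert printf_pattern == fstring_pattern

date = "Posted on February 12, 2018 by admin"  # hypothetical input
match = re.search(printf_pattern, date)
# re.search returns None when nothing matches, so check before calling .group().
result = match.group() if match else None
print(result)  # February 12, 2018
```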
This question already has answers here:
Escaping regex string
(4 answers)
Closed 6 years ago.
Is there a way to ignore special character meaning when creating a regular expression in python? In other words, take the string "as is".
I am writing code that internally uses the expect method of a Telnet object, which only accepts regular expressions. Therefore, the answer cannot be the obvious "use == instead of a regular expression".
I tried this
import re
SPECIAL_CHARACTERS = "\\.^$*+?{}[]|():"  # backslash must be placed first
def str_to_re(s):
    # Prefix every special character with a backslash so it is matched literally
    result = s
    for c in SPECIAL_CHARACTERS:
        result = result.replace(c, '\\' + c)
    return re.compile(result)
TEST = "Bob (laughing). Do you know 1/2 equals 2/4 [reference]?"
re_bad = re.compile(TEST)
re_good = str_to_re(TEST)
print(re_bad.match(TEST))   # None: the special characters are interpreted
print(re_good.match(TEST))  # a match object: the string is matched literally
It works, since the first one does not recognize the string, and the second one does. I looked at the options in the python documentation, and was not able to find a simpler way. Or are there any cases my solution does not cover (I used python docs to build SPECIAL_CHARACTERS)?
P.S. The problem can apply to other libraries. It does not apply to the pexpect library, because it provides the expect_exact method which solves this problem. However, someone could want to specify a mix of strings (as is) and regular expressions.
If 'reg' is the regex itself, you need to use a raw string, as follows:
pat = re.compile(r'reg')
If reg is a name bound to a str that should be matched literally, use:
reg = re.escape(reg)
pat = re.compile(reg)
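Applied to the string from the question, a minimal sketch looks like this. It also shows the mixed case mentioned in the P.S., where a literal part is escaped and then joined to a real regex. The prompt text and the names literal and prompt are hypothetical:

```python
import re

TEST = "Bob (laughing). Do you know 1/2 equals 2/4 [reference]?"

# re.escape backslash-escapes every character that is special in a regex.
literal = re.compile(re.escape(TEST))
print(literal.match(TEST) is not None)  # True

# Mixing: escape the literal part, then append a genuine regex.
# Assumption: the prompt ends in "$" or "#" followed by optional whitespace.
prompt = re.compile(re.escape("user@host (prod)") + r"[$#]\s*$")
print(prompt.search("user@host (prod)# ") is not None)  # True
```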
This question already has answers here:
How to input a regex in string.replace?
(7 answers)
Closed 5 years ago.
I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?
z.write(article.replace('</html>.+', '</html>'))
No. Regular expressions in Python are handled by the re module.
article = re.sub(r'(?is)</html>.+', '</html>', article)
In general:
str_output = re.sub(regex_search_term, regex_replacement, str_input)
To replace text using a regular expression, use the re.sub function:
re.sub(pattern, repl, string, count=0, flags=0)
It replaces non-overlapping occurrences of pattern in string with repl. If you need to analyze the match to extract information about specific group captures, for instance, you can pass a function as the repl argument. More info here.
Examples
>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'
>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
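The function form of repl can be sketched as follows. The double helper is hypothetical and simply doubles each number it finds; re.sub calls it once per match and uses the returned string as the replacement:

```python
import re

def double(match):
    # match.group() holds the digits matched by \d+
    return str(int(match.group()) * 2)

result = re.sub(r'\d+', double, '/andre/23/abobora/43435')
print(result)  # /andre/46/abobora/86870
```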
You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like
z.write(article[:article.index("</html>") + 7])
This is much cleaner, and should be much faster than a regex based solution.
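One caveat: str.index raises ValueError when the tag is missing. A minimal sketch that avoids the exception and the hard-coded 7 could use str.find instead; truncate_after is a hypothetical helper name:

```python
def truncate_after(text, marker="</html>"):
    """Return text up to and including the first marker, or unchanged if absent."""
    end = text.find(marker)  # -1 when the marker does not occur
    if end == -1:
        return text
    return text[:end + len(marker)]

print(truncate_after("<html>body</html>trailing junk"))  # <html>body</html>
print(truncate_after("no closing tag here"))             # no closing tag here
```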
For this particular case, if using the re module is overkill, how about using the split (or rsplit) method:
se='</html>'
z.write(article.split(se)[0]+se)
For example,
#!/usr/bin/python
article='''<html>Larala
Ponta Monta
</html>Kurimon
Waff Moff
'''
z=open('out.txt','w')
se='</html>'
z.write(article.split(se)[0]+se)
outputs out.txt as
<html>Larala
Ponta Monta
</html>