Python Regex Question - python

I have an end tag followed by a carriage return line feed (x0Dx0A) followd by one or more tabs (x09) followed by a new start tag .
Something like this:
</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>
What Python regex should I use to replace it with something like this:
</tag1><tag3>content</tag3><tag2>
Thanks in advance.

Here is code for something like what you say that you need:
>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>
Note that \r\n\t+ may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s* (zero or more whitespace characters).
Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.

Related

Extract values in name=value lines with regex

I'm really sorry for asking because there are some questions like this around. But can't get the answer fixed to make problem.
This are the input lines (e.g. from a config file)
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values behind the = sign with Python 3.9. regex.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as result; e.g. profile2.name=. But I want exactly the inverted opposite.
The expected results (what Pythons re.find_all() return) are
['share2', 'share8', 'shareSSH', 'share9']
Try pattern profile\d+\.name=(.*), look at Regex 101 example
import re
re.findall('profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex, split should work absolutely fine:
Try removing the ? quantifier. It will make your capture group match an empty st
regex101

How do I write a regex that excludes certain file suffixes?

I'm looking at the tutorial given here :-
https://docs.python.org/2/howto/regex.html#lookahead-assertions
I want to exclude files that end in .pqr.gz and I'm not quite sure how to do that.
e.g., the expected behaviour is :-
f1.gz => succeed
f1.abc.pqr => succeed
f1.pqr.gz => fail
f1.abc.gz => succeed
The best regex I could come up with was :-
r'.*[.](?=[^.]*[.][^.]*)(?!pqr[.]gz$)[^.]*[.][^.]*$'
This excludes files that end in .pqr.gz but doesn't for example allow files that are just f1.gz (i.e. first case I wrote above).
Any ideas on how this can be improved?
EDIT :- There are better ways to do this (e.g., using string.endswith), but I'm curious about how to do this with a regex purely as an exercise.
well, TBH, your use of regex seems overkill to me. You could simply do:
if not '.pqr.gz' in line:
print(line)
and done.
Actually, "simple" string manipulation can do a lot in just a few simple operations, like:
for line in lines:
file, result = line.split(' => ')
if file.endswith('.pqr.gz'):
print("Skipping file {}".format(file), file=sys.stderr)
continue
print(file)
# and you could do something if result == "success" there after!
as you insist on doing it with regexps:
here's your current regex representation
And here's a solution as inspired from #rawing suggestion:
.*(?<!\.pqr\.gz) =>
One thing to be aware of with Python's re module is that re.match implicitly anchors to the start of the string.
Also, you can match literal periods by escaping them (\.), which is probably easier to read (and potentially faster) than putting it in a character class.
For re.match the following regex should do the trick:
r'.*\.pqr\.gz$'
If using re.search instead, the regex can be shortened to just this:
r'\.pqr\.gz$'

Why my Python regular expression pattern run so slowly?

Please see my regular expression pattern code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
print 'Start'
str1 = 'abcdefgasdsdfswossdfasdaef'
m = re.match(r"([A-Za-z\-\s\:\.]+)+(\d+)\w+", str1) # Want to match something like 'Moto 360x'
print m # None is expected.
print 'Done'
It takes 49 seconds to finish, any problem with the pattern?
See Runaway Regular Expressions: Catastrophic Backtracking.
In brief, if there are extremely many combinations a substring can be split into the parts of the regex, the regex matcher may end up trying them all.
Constructs like (x+)+ and x+x+ practically guarantee this behaviour.
To detect and fix the problematic constructs, the following concept can be used:
At conceptual level, the presence of a problematic construct means that your regex is ambiguous - i.e. if you disregard greedy/lazy behaviour, there's no single "correct" split of some text into the parts of the regex (or, equivalently, a subexpression thereof). So, to avoid/fix the problems, you need to see and eliminate all ambiguities.
One way to do this is to
always split the text into its meaningful parts (=parts that have separate meanings for the task at hand), and
define the parts in such a way that they cannot be confused (=using the same characteristics that you yourself would use to tell which is which if you were parsing it by hand)
Just repost the answer and solution in comments from nhahtdh and Marc B:
([A-Za-z\-\s\:\.]+)+ --> [A-Za-z\-\s\:\.]+
Thanks so much to nhahtdh and Marc B!

RegEx in Python for WikiMarkup

I'm trying to create a re in python that will match this pattern in order to parse MediaWiki Markup:
<ref>*Any_Character_Could_Be_Here</ref>
But I'm totally lost when it comes to regex. Can someone help me, or point me to a tutorial or resource that might be of some help. Thanks!'
Assuming that svick is correct that MediaWiki Markup is not valid xml (or html), then you could use re in this circumstance (although I will certainly defer to better solutions):
>>> import re
>>> test_string = '''<ref>*Any_Character_Could_Be_Here</ref>
<ref>other characters could be here</ref>'''
>>> re.findall(r'<ref>.*?</ref>', test_string)
['<ref>*Any_Character_Could_Be_Here</ref>', '<ref>other characters could be here</ref>'] # a list of matching strings
In any case, you will want to familiarize yourself with the re module (whether or not you use a regex to solve this particular problem).
srhoades28, this will match your pattern.
if re.search(r"<ref>\*[^<]*</ref>", subject):
# Successful match
else:
# Match attempt failed
Note that from your post, it is assumed that the * after always occurs, and that the only variable part is the blue text, in your example "Any_Character_Could_Be_Here".
If this is not the case let me know and I will tweak the expression.

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
http://www.ptop.se
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
It's pretty easy to do:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
this should work, although there might be more elegant ways.
import re
url='http://www.ptop.se'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
this regex can help you, you should get the first group by \1 or whatever method you have in your language.
href="([^"]*)
example:
amgheziName
result:
http://www.amghezi.com
There's tonnes of them on regexlib
Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seems to work are extremely compliated and still don't cover all cases.
This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Oputput:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']

Categories