RegEx in Python for WikiMarkup - python

I'm trying to create a re in python that will match this pattern in order to parse MediaWiki Markup:
<ref>*Any_Character_Could_Be_Here</ref>
But I'm totally lost when it comes to regex. Can someone help me, or point me to a tutorial or resource that might be of some help. Thanks!'

Assuming that svick is correct that MediaWiki Markup is not valid xml (or html), then you could use re in this circumstance (although I will certainly defer to better solutions):
>>> import re
>>> test_string = '''<ref>*Any_Character_Could_Be_Here</ref>
<ref>other characters could be here</ref>'''
>>> re.findall(r'<ref>.*?</ref>', test_string)
['<ref>*Any_Character_Could_Be_Here</ref>', '<ref>other characters could be here</ref>'] # a list of matching strings
In any case, you will want to familiarize yourself with the re module (whether or not you use a regex to solve this particular problem).

srhoades28, this will match your pattern.
if re.search(r"<ref>\*[^<]*</ref>", subject):
# Successful match
else:
# Match attempt failed
Note that from your post, it is assumed that the * after always occurs, and that the only variable part is the blue text, in your example "Any_Character_Could_Be_Here".
If this is not the case let me know and I will tweak the expression.

Related

re.search() or 'in', re.match() or startswith()?

I am learning how to use the re library in Python and a question flashed through my mind. Please forgive me if this sounds stupid. I am new to this stuff. :)
Since according to this answer,
re.search - find something anywhere in the string
re.match - find something at the beginning of the string
Now I have this code:
from re import search
str = "Yay, I am on StackOverflow. I am overjoyed!"
if search('am',str): # not considering regex
print('True') # returns True
if 'am' in str:
print('True') # returns True
And this:
from re import match
str = "Yay, I am on Stack Overflow. I am overjoyed!"
if match('Yay',str): # not considering regex
print('True') # prints True
if str.startswith('Yay'):
print('True') # prints True
So now my question is, which one should I use when I am doing similar stuffs (not considering regular expressions) such as fetching contents from a webpage and finding in its contents. Should I use built-ins like above, or the standard re library? Which one will make the code more optimised/efficient?
Any help will be much appreciated. Thank you!
Regex is mostly used for complex match, search and replace operations, while built-in keyword such as 'in' is mostly used for simple operations like replacing a single word by another. Normally 'in' keyword is preferred. In terms of performance 'in' keyword usage is faster but when you face a situation where you could use 'in' keyword but Regex offers much more elegant solution rather than typing a lot of 'if' statements use Regex.
When you are fetching contents from a webpage and finding stuff in the contents the codex above also applies.
Hope this helps.

find multiple things in a string using regex in python

My input string contains various entities like this:
conn_type://host:port/schema#login#password
I want to find out all of them using regex in python.
As of now, I am able to find them one by one, like
conn_type=re.search(r'[a-zA-Z]+',test_string)
if (conn_type):
print "conn_type:", conn_type.group()
next_substr_len = conn_type.end()
host=re.search(r'[^:/]+',test_string[next_substr_len:])
and so on.
Is there a way to do it without if and else?
I expect there to be some way, but not able to find it. Please note that every entity regex is different.
Please help, I don't want to write a boring code.
Why don't you use re.findall?
Here is an example:
import re;
s = 'conn_type://host:port/schema#login#password asldasldasldasdasdwawwda conn_type://host:port/schema#login#email';
def get_all_matches(s):
matches = re.findall('[a-zA-Z]+_[a-zA-Z]+:\/+[a-zA-Z]+:+[a-zA-Z]+\/+[a-zA-Z]+#+[a-zA-Z]+#[a-zA-Z]+',s);
return matches;
print get_all_matches(s);
this will return a list full of matches to your current regex as seen in this example which in this case would be:
['conn_type://host:port/schema#login#password', 'conn_type://host:port/schema#login#email']
If you need help making regex patterns in Python I would recommend using the following website:
A pretty neat online regex tester
Also check the re module's documentation for more on re.findall
Documentation for re.findall
Hope this helps!
>>>import re
>>>uri = "conn_type://host:port/schema#login#password"
>>>res = re.findall(r'(\w+)://(.*?):([A-z0-9]+)/(\w+)#(\w+)#(\w+)', uri)
>>>res
[('conn_type', 'host', 'port', 'schema', 'login', 'password')]
No need for ifs. Use findall or finditer to search through your collection of connection types. Filter the list of tuples, as need be.
If you like it DIY, consider creating a tokenizer. This is very elegant "python way" solution.
Or use a standard lib: https://docs.python.org/3/library/urllib.parse.html but note, that your sample URL is not fully valid: there is no schema 'conn_type' and you have two anchors in the query string, so urlparse wouldn't work as expected. But for real-life URLs I highly recommend this approach.

Why my Python regular expression pattern run so slowly?

Please see my regular expression pattern code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
print 'Start'
str1 = 'abcdefgasdsdfswossdfasdaef'
m = re.match(r"([A-Za-z\-\s\:\.]+)+(\d+)\w+", str1) # Want to match something like 'Moto 360x'
print m # None is expected.
print 'Done'
It takes 49 seconds to finish, any problem with the pattern?
See Runaway Regular Expressions: Catastrophic Backtracking.
In brief, if there are extremely many combinations a substring can be split into the parts of the regex, the regex matcher may end up trying them all.
Constructs like (x+)+ and x+x+ practically guarantee this behaviour.
To detect and fix the problematic constructs, the following concept can be used:
At conceptual level, the presence of a problematic construct means that your regex is ambiguous - i.e. if you disregard greedy/lazy behaviour, there's no single "correct" split of some text into the parts of the regex (or, equivalently, a subexpression thereof). So, to avoid/fix the problems, you need to see and eliminate all ambiguities.
One way to do this is to
always split the text into its meaningful parts (=parts that have separate meanings for the task at hand), and
define the parts in such a way that they cannot be confused (=using the same characteristics that you yourself would use to tell which is which if you were parsing it by hand)
Just repost the answer and solution in comments from nhahtdh and Marc B:
([A-Za-z\-\s\:\.]+)+ --> [A-Za-z\-\s\:\.]+
Thanks so much to nhahtdh and Marc B!

Python Regex Question

I have an end tag followed by a carriage return line feed (x0Dx0A) followd by one or more tabs (x09) followed by a new start tag .
Something like this:
</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>
What Python regex should I use to replace it with something like this:
</tag1><tag3>content</tag3><tag2>
Thanks in advance.
Here is code for something like what you say that you need:
>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>
Note that \r\n\t+ may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s* (zero or more whitespace characters).
Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
http://www.ptop.se
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
It's pretty easy to do:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
this should work, although there might be more elegant ways.
import re
url='http://www.ptop.se'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
this regex can help you, you should get the first group by \1 or whatever method you have in your language.
href="([^"]*)
example:
amgheziName
result:
http://www.amghezi.com
There's tonnes of them on regexlib
Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seems to work are extremely compliated and still don't cover all cases.
This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Oputput:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']

Categories