Get only URL from string - Python

I am scraping a page with Python and the BeautifulSoup library.
I need to get only the URL from the string below. It comes from the href attribute of an a tag. I have scraped it but cannot find a way to extract the URL from it:
javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');

You can write a straightforward regex to extract the URL.
>>> import re
>>> href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
>>> re.findall(r"'(.*?)'", href)
['/Sheraton-Tucson-Hotel-177/tnc/150/24795/en', 'TC_POPUP', 'width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no']
>>> _[0]
'/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'
The regex in question here is
'(.*?)'
which reads "find a single quote, followed by anything (and capture that anything), followed by another single quote, matching non-greedily because of the ? operator". This extracts the arguments of window.open; then just pick the first one to get the URL.
You shouldn't have any nested ' in your href, since those should be escaped to %27. If you do, though, this will not work, and you may need a solution that doesn't use regexes.

I did it this way:
terms = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
terms.split("('")[1].split("','")[0]
outputs
/Sheraton-Tucson-Hotel-177/tnc/150/24795/en

Instead of a regex, you could just partition it twice on a delimiter (e.g. '):
s.partition("'")[2].partition("'")[0]
# /Sheraton-Tucson-Hotel-177/tnc/150/24795/en

Here's a quick and ugly answer:
href.split("'")[1]

Related

Python RegEx for this HTML String

I've got a string like this:
<span class=\"market_listing_price market_listing_price_with_fee\">\r
\t\t\t\t\t$92.53 USD\t\t\t\t<\/span>
I need to find this string via RegEx. My try:
(^<span class=\\"market_listing_price market_listing_price_with_fee\\">\\r\\t\\t\\t\\t\\t&)
But my problem is that the count of "\t" and "\r" may vary. And of course this is not the regular expression for the whole string, only for a part of it.
So, what's the correct and full RegEx for this string?
Answering your question about the Regex:
"market_listing_price market_listing_price_with_fee\\">[\\r]*[\\t]*&
This will catch the string you need, even if you add more \t's or \r's.
If you need to edit this RegEx, I advise you to test and modify it on a regex-testing website. That will also help you understand how regular expressions work, so you can build your own complete RegEx.
Since this is an HTML string, I would suggest using an HTML Parser like BeautifulSoup.
Here is an example approach finding the element by class attribute value using a CSS selector:
from bs4 import BeautifulSoup
data = "my HTML data"
soup = BeautifulSoup(data, "html.parser")
result = soup.select("span.market_listing_price.market_listing_price_with_fee")
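For instance, once the span is located, get_text can strip away the surrounding \r and \t padding. A sketch, with sample markup modeled on the question's string (exact whitespace counts are approximate):

```python
from bs4 import BeautifulSoup

# Markup modeled on the question's string (whitespace counts approximate)
data = ('<span class="market_listing_price market_listing_price_with_fee">'
        '\r\n\t\t\t\t\t$92.53 USD\t\t\t\t</span>')
soup = BeautifulSoup(data, "html.parser")
span = soup.select_one("span.market_listing_price.market_listing_price_with_fee")
print(span.get_text(strip=True))  # -> $92.53 USD
```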
See also:
RegEx match open tags except XHTML self-contained tags

Extracting URLs in Python from XML

I read this thread about extracting URLs from a string: https://stackoverflow.com/a/840014/326905
Really nice. I got all URLs from an XML document containing http://www.blabla.com with:
>>> s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']
But I can't figure out how to customize the regex to omit the double quote at the end of the URL.
First I thought this was the clue:
re.findall(r'(https?://\S+\")', s)
or this
re.findall(r'(https?://\S+\Z")', s)
but it isn't.
Can somebody help me out and tell me how to omit the double quote at the end?
Btw, the question mark after the "s" of "https" means the "s" may or may not occur. Am I right?
>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
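Since the input here is XML, the standard library's xml.etree.ElementTree can also pull out the href attributes without any regexes. A sketch, assuming the two link elements are wrapped in a root element so they form one well-formed document:

```python
import xml.etree.ElementTree as ET

# The snippet wrapped in a root element so it parses as a single XML document
s = ('<root><link href="http://www.blabla.com/blah" />'
     '<link href="http://www.blabla.com" /></root>')
root = ET.fromstring(s)
urls = [link.get("href") for link in root.iter("link")]
print(urls)  # -> ['http://www.blabla.com/blah', 'http://www.blabla.com']
```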
You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit, that way you don't need a lookahead. Simply add the quote as part of the character class:
re.findall(r'(https?://[^\s"]+)', s)
This still says "one or more characters not a whitespace," but has the addition of not including double quotes either. So the overall expression is "one or more character not a whitespace and not a double quote."
You want the double quotes to appear as a look-ahead:
re.findall(r'(https?://\S+)(?=\")', s)
This way they won't appear as part of the match. Also, yes the ? means the character is optional.
See example here: http://regexr.com?347nk
I used to extract URLs from text through this piece of code:
url_rgx = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# convert string to lower case
text = text.lower()
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missed
urls = ['http://%s'%url[0] if not url[0].startswith('http') else url[0] for url in matches]
print urls
It works great!
Thanks. I just read https://stackoverflow.com/a/13057368/326905
and checked out this, which also works:
re.findall(r'"(https?://\S+)"', urls)

Python: store many regex matches in tuple?

I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is trying to get my regex search query to find all the possible matches, then store them in a tuple.
Let's say I have a page with the following stored in the variable HTMLtext:
<ul>
<li class="active"><a href="/blog/home"><b>Back to the index</b></a></li>
<li><a href="/blog/about"><b>About Me!</b></a></li>
<li><a href="/blog/music"><b>Audio Production</b></a></li>
<li><a href="/blog/photos"><b>Gallery</b></a></li>
<li><a href="/blog/stuff"><b>Misc</b></a></li>
<li><a href="/blog/contact"><b>Shoot me an email</b></a></li>
</ul>
I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:
pages = ["home", "about", "music", "photos", "stuff", "contact"]
So far, I'm able to use regex to search for one result:
pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]
Running this expression makes pages = ['home'].
How can I get the regex search to continue for the whole text, appending the matched text to this tuple?
(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)
Your pattern won’t work on all inputs, including yours. The .* is going to be too greedy (technically, it finds a maximal match), causing it to match from the first href all the way to the last corresponding close. The two simplest ways to fix this are to use either a minimal match or a negated character class.
# minimal match approach
pages = re.findall(r'<a\s+href="/blog/(.+?)">',
                   full_html_text, re.I | re.S)
# negated charclass approach
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
                   full_html_text, re.I)
Obligatory Warning
For simple and reasonably well-constrained text, regexes are just fine; after all, that’s why we use regex search-and-replace in our text editors when editing HTML! However, it gets more and more complicated the less you know about the input, such as
if there’s some other field intervening between the <a and the href, like <a title="foo" href="bar">
casing issues like <A HREF='foo'>
whitespace issues
alternate quotes like href='/foo/bar' instead of href="/foo/bar"
embedded HTML comments
That’s not an exhaustive list of concerns; there are others. Using regexes on HTML is thus possible, but whether it’s expedient depends on too many other factors to judge.
However, from the little example you’ve shown, it looks perfectly ok for your own case. You just have to spiff up your pattern and call the right method.
Use the findall function of the re module:
pages = re.findall('<a href="/blog/([^"]*)">',HTMLtext)
print(pages)
Output:
['home', 'about', 'music', 'photos', 'stuff', 'contact']
Use findall instead of search:
>>> pages = re.compile('<a href="/blog/(.*)">').findall(HTMLtext)
>>> pages
['home', 'about', 'music', 'photos', 'stuff', 'contact']
The re.findall() function and the re.finditer() function are used to find multiple matches.
To find all results, use findall(). Also, you only need to compile the regex once; then you can reuse it.
href_re = re.compile('<a href="/blog/(.*)">')  # Compile the regexp once
pages = href_re.findall(HTMLtext)              # Find all matches - ["home", "about", ...]
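If you ever need the match positions as well, re.finditer yields match objects lazily instead of building the whole list of strings. On a trimmed-down version of the question's HTML it produces the same page names:

```python
import re

# A trimmed-down version of the question's HTML
HTMLtext = ('<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>\n'
            '<li><b><a href="/blog/about">About Me!</a></b></li>')
href_re = re.compile(r'<a href="/blog/([^"]*)">')
pages = [m.group(1) for m in href_re.finditer(HTMLtext)]
print(pages)  # -> ['home', 'about']
```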

How can I make a regular expression to extract all anchor tags or links from a string?

I've seen other questions which will parse either all plain links, or all anchor tags from a string, but nothing that does both.
Ideally, the regular expression will be able to parse a string like this (I'm using Python):
>>> import re
>>> content = '''
... http://www.google.com Some other text.
... And even more text! http://stackoverflow.com
... '''
>>> links = re.findall('some-regular-expression', content)
>>> print links
[u'http://www.google.com', u'http://stackoverflow.com']
Is it possible to produce a regular expression which would not result in duplicate links being returned? Is there a better way to do this?
No matter what you do, it's going to be messy. Nevertheless, a 90% solution might resemble:
r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Since that pattern has two groups, it will return a list of 2-tuples; to join them, you could use a list comprehension or even a map:
map(''.join, re.findall(pattern, content))
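To see the 2-tuple shape concretely, here is the pattern run on a small sample string of my own (a list comprehension stands in for map, so it also prints a plain list on Python 3):

```python
import re

pattern = r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
content = '<a href="/x">link text</a> and http://example.com'
# Each match fills exactly one of the two groups; joining merges them back
results = [''.join(t) for t in re.findall(pattern, content)]
print(results)  # -> ['link text', 'http://example.com']
```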
If you want the href attribute of the anchor instead of the link text, the pattern gets even messier:
r'<a\s[^>]*href=[\'"]([^"\']*)[\'"][^>]*>[^<]*</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Alternatively, you can just let the second half of the pattern pick up the href value, which also alleviates the need for the string join:
r'\b\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()]'
Once you have this much in place, you can replace any found links with something that doesn't look like a link, search for '://', and update the pattern to collect what it missed. You may also have to clean up false positives, particularly garbage at the end. (This pattern had to find links that included spaces, in plain text, so it's particularly prone to excess greediness.)
Warning: Do not rely on this for future user input, particularly when security is on the line. It is best used only for manually collecting links from existing data.
Usually you should never parse HTML with regular expressions, since HTML isn't a regular language. Here it seems you only want to get all the http links, whether they are in an A element or in plain text. How about getting them all and then removing the duplicates?
Try something like
set(re.findall("(http:\/\/.*?)[\"' <]", content))
and see if it serves your purpose.
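For example (the sample text is mine; note that the trailing character class means a URL at the very end of the string, with nothing after it, would be missed):

```python
import re

content = "see http://www.google.com and http://stackoverflow.com or http://www.google.com again"
# The set drops the duplicate google.com match
links = set(re.findall(r"(http:\/\/.*?)[\"' <]", content))
print(sorted(links))  # -> ['http://stackoverflow.com', 'http://www.google.com']
```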
Writing a regex pattern that matches every valid URL is tricky business.
If all you're looking for is to detect simple http/https URLs within an arbitrary string, I could offer you this solution:
>>> import re
>>> content = 'http://www.google.com Some other text. And even more text! http://www.google.com and http://stackoverflow.com'
>>> re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)
['http://www.google.com', 'http://www.google.com', 'http://stackoverflow.com']
That looks for strings that start with http:// or https:// followed by one or more valid chars.
To avoid duplicate entries, use set():
>>> list(set(re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)))
['http://www.google.com', 'http://stackoverflow.com']
You should not use regular expressions to extract things from HTML. You should use an HTML parser.
If you also want to extract things from the text of the page then you should do that separately.
Here's how you would do it with lxml:
# -*- coding: utf8 -*-
import lxml.html as lh
import re
html = """
<a href="http://is.gd/test">is.gd/test</a> http://www.google.com Some other text.
And even more text! http://stackoverflow.com
here's a url bit.ly/test
"""
tree = lh.fromstring(html)
urls = set([])
for a in tree.xpath('//a'):
    urls.add(a.text)
for text in tree.xpath('//text()'):
    for url in re.findall(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', text):
        urls.add(url[0])
print urls
Result:
set(['http://www.google.com', 'bit.ly/test', 'http://stackoverflow.com', 'is.gd/test'])
URL matching regex from here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
No, a regex will not be able to parse a string like this. Regexes are capable of simple matching, and you can't handle parsing a grammar as complicated as HTML with just one or two of them.

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
<a href="http://www.ptop.se">Link name</a>
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
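Put together on a small hypothetical sample covering double quotes, single quotes, and a bare (unquoted) href:

```python
import re

# Hypothetical sample with both quote styles and an unquoted attribute
s = ('<a href="http://www.ptop.se">one</a> '
     "<a href='http://example.com'>two</a> "
     '<a href=/relative/path>three</a>')
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(urls)  # -> ['http://www.ptop.se', 'http://example.com', '/relative/path']
```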
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
It's pretty easy to do:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
    print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
This should work, although there might be more elegant ways:
import re
s = '<a href="http://www.ptop.se">Link name</a>'
r = re.compile('(?<=href=").*?(?=")')
r.findall(s)
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
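As a stdlib-only illustration of the parser route, Python 3's html.parser can collect href attributes with a small subclass (the class name here is my own):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attribute values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<a href="http://www.ptop.se">Link name</a>')
print(parser.links)  # -> ['http://www.ptop.se']
```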
This regex can help you; you should get the first group by \1 or whatever method your language provides.
href="([^"]*)
Example:
<a href="http://www.amghezi.com">amgheziName</a>
Result:
http://www.amghezi.com
There are tonnes of them on regexlib.
Yes, there are tons of them on regexlib. That only proves that REs should not be used for this. Use SGMLParser or BeautifulSoup, or write a parser - but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.
This works pretty well using optional matches (it prints what comes after href=) and gets only the link. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Output:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']
