I have used the code below to extract URLs from an HTML page using Daring Fireball's regex (http://daringfireball.net/2010/07/improved_regex_for_matching_urls), i.e.
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
The regex works amazingly well, but re.findall() takes almost forever. Is there any way I can get all the URLs in the HTML quickly?
import urllib, re
seed = "http://web.archive.org/web/20100412111652/http://app.singaporeedu.gov.sg/asp/index.asp"
page = urllib.urlopen(seed).read().decode('utf8')
#print page
pattern = r'''(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
match = re.search(pattern,page)
print match.group(0)
matches = re.findall(pattern,page) # this line takes more than 3 mins on my i3 laptop
print matches
Yes. By not using regex at all. Use an HTML parser such as BeautifulSoup. That's what they're for.
>>> from bs4 import BeautifulSoup as BS
>>> import urllib2
>>> seed = "http://web.archive.org/web/20100412111652/http://app.singaporeedu.gov.sg/asp/index.asp"
>>> soup = BS(urllib2.urlopen(seed))
>>> print soup.find_all('a')
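If you want just the URL strings rather than the whole tags, pulling each anchor's href attribute is enough; a minimal sketch in the same style (the if filters out anchors that have no href):
>>> print [a.get('href') for a in soup.find_all('a') if a.get('href')]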
Do you just want all the URLs from a page? Wouldn't a simple regex like this be sufficient?
<a[^>]*href="([^"]+)"[^>]*>
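For what it's worth, a sketch of how that pattern might be used (assuming Python 2's urllib, as in the question; note it misses single-quoted and unquoted href values):
import re
import urllib

seed = "http://web.archive.org/web/20100412111652/http://app.singaporeedu.gov.sg/asp/index.asp"
page = urllib.urlopen(seed).read()
# a simple attribute grab runs in roughly linear time, unlike the
# backtracking-heavy URL regex above
links = re.findall(r'<a[^>]*href="([^"]+)"[^>]*>', page, re.IGNORECASE)
print links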
>>> data
'<a href="/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=0&hidelinks=1">পূর্ববর্তী ৫০টি</a>) (<a href="/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=950505&hidelinks=1&back=776017">পরবর্তী ৫০টি</a>'
>>> next = re.findall('<a href="(.*?)".*?>পরবর্তী ৫০টি</a>',data)
>>> next
['/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=0&hidelinks=1']
I am trying to find what is inside the second anchor tag, but why am I getting data from the first tag?
Any reason why you're using regex to parse HTML? If the answer is "why not?", read this.
You should be using BeautifulSoup instead. For example:
from bs4 import BeautifulSoup
a = '<a href="/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=0&hidelinks=1">পূর্ববর্তী ৫০টি</a>) (<a href="/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=950505&hidelinks=1&back=776017">পরবর্তী ৫০টি</a>'
print(BeautifulSoup(a, "html.parser").find_all("a"))
This gets you just the second anchor:
from bs4 import BeautifulSoup
a = '<a href="/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=0&hidelinks=1">পূর্ববর্তী ৫০টি</a>) (<a href="/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=950505&hidelinks=1&back=776017">পরবর্তী ৫০টি</a>'
print([i.get("href") for i in BeautifulSoup(a, "html.parser").find_all("a") if i.text == "পরবর্তী ৫০টি"])
Output:
/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:%E0%A6%B8%E0%A6%82%E0%A6%AF%E0%A7%8B%E0%A6%97%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80_%E0%A6%AA%E0%A7%83%E0%A6%B7%E0%A7%8D%E0%A6%A0%E0%A6%BE%E0%A6%B8%E0%A6%AE%E0%A7%82%E0%A6%B9/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7%8D%E0%A6%B2%E0%A7%87%E0%A6%9F:%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80%E0%A6%B9%E0%A7%80%E0%A6%A8&from=950505&hidelinks=1&back=776017
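As for why the original pattern grabbed the first tag: the regex engine anchors at the earliest possible match, i.e. the first <a href=", and the lazy .*?> is then free to stretch across the intervening </a>) (<a ... until the text পরবর্তী ৫০টি finally appears. Forbidding quotes and angle brackets keeps the match within a single tag; a sketch, reusing data from above:
import re
# [^"]* cannot cross the closing quote and [^>]* cannot cross the end of
# the tag, so the captured href and the anchor text must belong to the
# same <a> element
print(re.findall('<a href="([^"]*)"[^>]*>পরবর্তী ৫০টি</a>', data))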
I'm new to Python, and for my second attempt at a project I wanted to extract a substring, specifically an identifying number, from a hyperlink in a url.
For example, this url is the result of my search query, giving the hyperlink http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to a base url so I can navigate to http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809. But I am currently stuck a few steps before that, because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
import re
y = str(soup)
x = re.findall("gid=[0-9]+",y)
print x
z = re.sub("gid=", "", x(1)) #At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to index into it to get an element. By the way, you don't need BeautifulSoup here: urllib2.urlopen(url).read() gives you the content as a string. The re.sub is also not needed; a single pattern with a capture group, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
You don't need regex here at all. A CSS selector along with a little string manipulation will lead you in the right direction. Try the script below:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809
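From there, building the game-page url the question mentions is just string concatenation (the .pgn filename itself is not derivable from the gid alone, so that last step still needs the game page):
game_url = 'http://www.chessgames.com/perl/chessgame?gid=' + item_num
print(game_url)
# http://www.chessgames.com/perl/chessgame?gid=1012809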
So I have a task which requires me to extract data from a website to form a 'top 10 list'. I have chosen the IMDB Top 250 page, http://www.imdb.com/chart/top.
In other words, I need a little help using regex to isolate the names of the films and then store them in a list. I already have the HTML stored in a variable as a string (if this is the wrong way of approaching it, let me know).
Also, I am limited to using the modules urlopen, re, and HTMLParser.
import HTMLParser
from urllib import urlopen
import re
site = urlopen("http://www.imdb.com/chart/top?tt0468569")
content = site.read()
print content
You really shouldn't use regex to parse HTML, but you stated in your comment that you have to, so here it is with regex:
import re
import requests

respText = requests.get("http://www.imdb.com/chart/top").text
for title in re.findall(r'<td class="titleColumn">.+?>(.+?)<', respText, re.DOTALL):
    print(title)
In BeautifulSoup (which you can't use):
from bs4 import BeautifulSoup

soup = BeautifulSoup(respText, "html.parser")
for item in soup.find_all("td", {"class": "titleColumn"}):
    print(item.find("a").text)
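Since the task is restricted to urlopen and re, the same regex also works without requests; a sketch using only the allowed modules (Python 2):
import re
from urllib import urlopen

respText = urlopen("http://www.imdb.com/chart/top").read()
# re.DOTALL lets .+? run across the newlines between the <td> and its <a>
titles = re.findall(r'<td class="titleColumn">.+?>(.+?)<', respText, re.DOTALL)
print(titles[:10])  # the task only needs a top-10 list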
So when I run this code I keep getting empty brackets instead of the actual data. I am trying to figure out why, since I don't receive any error messages.
import urllib
import re

symbolslist = ["aapl", "spy", "goog", "nflx"]
for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&ql=1" % (symbol)
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_184_%s">(.+?)</span>' % (symbol.lower())
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print price
The empty brackets come up because the id in your regex is wrong: it's not 184, it's l84. That's an L, not a one.
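So the fix is one character in the pattern (assuming the page still serves these span ids; Yahoo has since changed its markup):
regex = '<span id="yfs_l84_%s">(.+?)</span>' % (symbol.lower())  # lowercase L in l84, not the digit one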
There are a number of libraries around that can help you scrape sites. Take a look at Scrapy or Beautiful Soup; they should support both Python 2 and 3, as far as I know.
I am pulling an HTML page with requests and trying to extract a link from it using regex, but I keep getting TypeError: expected string or buffer.
Code:
import re
import requests

r = requests.get('https://reddit.com/r/spacex')
subreddit = r.text
match = re.search(r'(<li class="first"><a href=")(.+)(")', subreddit)
if match is not None:
    print(match.group(2))
However, if I take a chunk of the HTML and hardcode it as a string, then my code works:
subreddit = '<li class="first"><a href="http://www.reddit.com/r/spacex/comments/3115xw/latest_update_pic_of_slc4_pad_shows_ln2_tankage/"'
r = requests.get('https://reddit.com/r/spacex')
match = re.search(r'(<li class="first"><a href=")(.+)(")', subreddit)
if match is not None:
    print(match.group(2))
I also tried doing
match=re.search(r'(<li class="first"><a href=")(.+)(")', str(subreddit))
as suggested around here, but that didn't work. I did not receive any errors, but match.group(2) never printed the link.
I expect you have multiple lines in subreddit, separated by '\n', when you do subreddit = r.text, and .+ does not match across a '\n' by default, so the pattern fails if the tag and the link end up on different lines.
Try adding the re.DOTALL option (re.MULTILINE only changes how ^ and $ behave, so it won't help here), or search each line with for line in subreddit.split('\n').
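A per-line search might look like this (a sketch):
for line in subreddit.split('\n'):
    match = re.search(r'(<li class="first"><a href=")(.+)(")', line)
    if match is not None:
        print(match.group(2))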
r = requests.get('https://reddit.com/r/spacex')
subreddit=r.text
print('subreddit:' + subreddit)
subreddit.split('\n')
If the code above raises AttributeError: 'NoneType' object has no attribute 'split', then there is something wrong with your requests.get(): it's not returning anything. Maybe a proxy?
Post the output of this code anyway.
If you use BeautifulSoup, it will be a lot easier for you:
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> soup = BeautifulSoup(urllib2.urlopen('https://reddit.com/r/spacex').read())
>>> for x in soup.find_all("li", "first"):
...     print x.a['href']
Or you can simply do:
>>> soup.select('a[href="https://www.reddit.com/r/spacex/comments/3115xw/latest_update_pic_of_slc4_pad_shows_ln2_tankage/"]')
[<a class="comments may-blank" href="https://www.reddit.com/r/spacex/comments/3115xw/latest_update_pic_of_slc4_pad_shows_ln2_tankage/">10 comments</a>]
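A less brittle variant selects on the li class instead of hard-coding the full href (a sketch):
>>> [a['href'] for a in soup.select('li.first a')]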
I'm running Python 3.4, and this code, which uses regex, works for me.
>>> import re
>>> import requests
>>> r = requests.get('https://reddit.com/r/spacex')
>>> re.search(r'(<li class="first"><a href=")(.+?)(")', r.text).group(2)
'https://www.reddit.com/r/spacex/comments/31p51d/tory_bruno_posts_infographic_of_ula_vs_spacex/'
I know this isn't using re like you asked about, but in a similar vein to the BeautifulSoup answer above: can you use PyQuery along with requests?
Are these the links you're looking for?
import requests
from pyquery import PyQuery

r = requests.get('https://reddit.com/r/spacex')
subreddit = r.text
pyq_to_parse = PyQuery(subreddit)
result = pyq_to_parse(".first").find("a")
print result
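To print just the href values rather than the matched elements, PyQuery's .items() iterates the selection (a sketch):
for a in result.items():
    print a.attr('href')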