Web Scraper Not Producing Results Using Python

I am a young grasshopper in need of your help. I've done a lot of research and can't seem to find the solution. I've written the code below, but when run it doesn't pull any of the titles. I believe my regular expressions are correct, so I'm not sure what the problem is. It's probably obvious to a seasoned sensei. Thanks in advance.
from urllib import urlopen
import re
url = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()
'''
<a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a>
'''
A = '<a href.*pdf">(expression to pull everything)</a>'
B = re.compile(A)
C = re.findall(B,url)
print C

This comes up pretty often here on SO. Rather than using regular expressions, you should use an HTML parser that lets you search/traverse the document tree.
I would use BeautifulSoup:
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
>>> from bs4 import BeautifulSoup
>>> html = ... # insert your raw HTML here
>>> soup = BeautifulSoup(html, "html.parser")
>>> a_tags = soup.find_all("a")
>>> for anchor in a_tags:
...     print(anchor.contents)
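Applied to the question's page, a minimal sketch of the full task might look like this (assuming the titles you want are the link texts of the .pdf links, as in your example HTML; Python 2, to match your code):
from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()
soup = BeautifulSoup(html, 'html.parser')

# Print the text of every link whose href ends in .pdf
for anchor in soup.find_all('a', href=True):
    if anchor['href'].endswith('.pdf'):
        print(anchor.get_text())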

I'll echo the other comment about not using regex for parsing HTML, but sometimes it is quick and easy. It looks like the HTML in your example is not quite correct, but I'd try something like:
re.findall(r'href.*?pdf">(.+?)</a>', url)
Note that the search has to run against the page source (url in your code), not against the pattern string A.

Related

beautiful soup parse url from messy output

I have beautiful soup code that looks like:
for item in beautifulSoupObj.find_all('cite'):
pagelink.append(item.get_text())
the problem is, the html code I'm trying to parse looks like:
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
My current selector above would get everything, including strong tags in it.
Thus, how can I parse only:
https://www.websiteurl.com/id=6
Note <cite> appears multiple times throughout the page, and I want to extract, and print everything.
Thank you.
Extracting only the text portion is as easy as accessing .text on the element.
We can use basic BeautifulSoup methods to traverse the tree hierarchy.
Helpful explanation on how to do that: HERE
from bs4 import BeautifulSoup
html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.cite.text)
# is the same as soup.find('cite').text
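Since <cite> appears multiple times throughout the page, the same idea extends to all of them with find_all; a minimal sketch (the second <cite> is invented for illustration):
from bs4 import BeautifulSoup

html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
<cite>https://www.<strong>websiteurl.com/id=7</strong></cite>'''  # second cite is hypothetical
soup = BeautifulSoup(html, 'html.parser')

# .text concatenates the text of all descendants, so the <strong> split disappears
for cite in soup.find_all('cite'):
    print(cite.text)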

Matching url in HTML using regex

It's been a while since I've used regex, and I feel like this should be simple to figure out.
I have a web page full of links that look like the string_to_match in the code below. I want to grab just the numbers in the links, like the number "58" in string_to_match. For the life of me I can't figure it out.
import re
string_to_match = '<a href="https://www.example.com/roster?teamId=58">Roster</a>'
re.findall('Roster', string_to_match)
Instead of using regular expressions, you can use a combination of HTML parsing (using BeautifulSoup parser) to locate the desired link and extract the href attribute value and URL parsing, which in this case, we'll use regular expressions for:
import re
from bs4 import BeautifulSoup
data = """
<body>
Roster
</body>
"""
soup = BeautifulSoup(data, "html.parser")
link = soup.find("a", text="Roster")["href"]
print(re.search(r"teamId=(\d+)", link).group(1))
Prints 58.
I would recommend using BeautifulSoup or lxml, it's worth the learning curve.
...But if you still want to use a regexp:
re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)

Going through HTML DOM in Python

I'm looking to write a Python script (using 3.4.3) that grabs a HTML page from a URL and can go through the DOM to try to find a specific element.
I currently have this:
#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)
When I print content it does print out the entire HTML page, which is close to what I want... although ideally I'd like to be able to navigate through the DOM rather than treating it as a giant string.
I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.
There are many different modules you could use. For example, lxml or BeautifulSoup.
Here's an lxml example:
import urllib.request
import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)
description = lxml_mysite.xpath("//meta[@name='description']")[0]  # meta tag description
text = description.get('content')  # content attribute of the tag
>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
And a BeautifulSoup example:
from bs4 import BeautifulSoup
import urllib.request

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite, "html.parser")
description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute
>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
Under Python 2, BeautifulSoup returned a unicode string (printed as u"...") while lxml did not; in Python 3, which the question targets, both return str. This can be useful/hurtful depending on what is needed.
Check out the BeautifulSoup module.
from bs4 import BeautifulSoup
import urllib.request

soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
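If you're after one specific element rather than every link, find also accepts attribute filters such as id; a minimal sketch (the id value is hypothetical, just to show the pattern):
from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen('http://www.google.com').read()
soup = BeautifulSoup(html, 'html.parser')

# find(id=...) returns the first element with that id, or None if absent
element = soup.find(id='searchform')  # 'searchform' is a hypothetical id
if element is not None:
    print(element.name, element.attrs)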

Python Regex scraping data from a webpage

My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to do a findall on the Groupon page (this one: http://www.groupon.de/alle-deals/muenchen/restaurant-296) to find data like this:
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get the 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I tried the whole night but I'm unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$',Page_Web):
print m
But it doesn't print anything.
To extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html, 'html.parser')
scriptResults = soup('script', {'type': 'text/javascript'})  # calling soup is shorthand for find_all
js_block = scriptResults[12]  # the 13th script block holds the deal data
Starting from this you can parse it with a regex if you want, or try to interpret the JavaScript (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use the Groupon API...
P.S.
The block that you are parsing can easily be parsed as a dictionary; if you look closely, it is already a list of dictionaries...
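Since the block is essentially a list of dictionaries, here is a hedged sketch of the regex-plus-json route (assuming the script assigns a JSON array literal of deal objects; the exact shape of Groupon's markup may differ):
import json
import re

# js_block comes from the snippet above; .string may be None for empty tags
source = js_block.string or ''

# Assumption: the script contains a JSON array literal of deal objects
match = re.search(r'\[\s*\{.*\}\s*\]', source, re.DOTALL)
if match:
    deals = json.loads(match.group(0))
    for deal in deals:
        print(deal.get('dealPermaLink'))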
How about changing RESATAURANT1 to RESTAURANT1, for starters?

Python copy or print Hyperlink

Hey, I would like to copy or print the hyperlink behind a word.
For example: Gift Cards
With which code is this possible?
Can I use urllib2?
If somebody speaks German it would be simpler :)
You'll want to use BeautifulSoup:
from bs4 import BeautifulSoup
import urllib2

htmlfile = urllib2.urlopen(url).read()  # url is the page you want to scrape
soup = BeautifulSoup(htmlfile, 'html.parser')
a_tag = soup.find('a')  # This finds the first occurrence of an <a> tag.
print(a_tag['href'])  # Prints the link the <a> tag points to (the hyperlink).
Of course this is pretty basic code. If you look at the documentation of the module, you'll see many other methods you can use to achieve your results :)
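For the concrete example in the question, you can also search by the link's visible text; a minimal sketch (the markup is hypothetical, since the question's HTML was not included):
from bs4 import BeautifulSoup

html = '<a href="https://www.example.com/gift-cards">Gift Cards</a>'  # hypothetical markup
soup = BeautifulSoup(html, 'html.parser')

# Find the <a> whose visible text is exactly "Gift Cards"
a_tag = soup.find('a', text='Gift Cards')
if a_tag is not None:
    print(a_tag['href'])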
