Python - cannot access a specific div [Urllib, BeautifulSoup, maybe Mechanize?]

I have been breaking my head against this wall for a couple days now, so I thought I would ask the SO community. I want a python script that, among other things, can hit 'accept' buttons on forms on websites in order to download files. To that end, though, I need to get access to the form.
This is an example of the kind of file I want to download. I know that within it there is an unnamed form with an action that accepts the terms and downloads the file. I also know that the form can be found inside the main-content div.
However, whenever I parse the webpage with BeautifulSoup, I cannot get the main-content div. The closest I've managed to get is the main_content link right before it, which does not provide me any information through BeautifulSoup's object.
Here's a bit of code from my script:
web_soup = soup(urllib2.urlopen(url))
parsed = list(urlparse(url))
ext = extr[1:]
for downloadable in web_soup.findAll("a"):
    encode = unicodedata.normalize('NFKD', downloadable.text).encode('UTF-8', 'ignore')
    if ext in str.lower(encode):
        if downloadable['href'] in url:
            return ("http://%s%s" % (parsed[1], downloadable['href']))
for div in web_soup.findAll("div"):
    if div.has_key('class'):
        print(div['class'])
        if div['class'] == "main-content":
            print("Yep")
return False
url is the URL I am looking at (so the URL I posted earlier). extr is the type of file I am hoping to download, given in the form .extension, but that is not really relevant to my question. The relevant code is the second for loop, the one where I am attempting to loop through the divs. The first bit of code (the first for loop) goes through and grabs download links in another case (when the URL the script is given is a 'download link' marked by a file extension such as .zip with a content type of text/html), so feel free to ignore it. I added it in just for context.
I hope I provided enough detail, though I am sure I did not. Let me know if you need any more information on what I am doing and I will be happy to oblige. Thanks, Stack.

Here's the code for getting main-content div and form action:
import re
import urllib2
from bs4 import BeautifulSoup as soup
url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
web_soup = soup(urllib2.urlopen(url))
# get main-content div
main_div = web_soup.find(name="div", attrs={'class': 'main-content'})
print main_div
# get form action
form = web_soup.find(name="form", attrs={'action': re.compile('.*\.zip.*')})
print form['action']
Though, if you need, I can provide examples for lxml, mechanize or selenium.
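For example, a rough lxml version of the same lookups might look like this (same URL as above; the class and action matching mirror the bs4 calls, and indexing with [0] assumes the elements are actually found):
import urllib2
from lxml import html

url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
tree = html.fromstring(urllib2.urlopen(url).read())
# get the main-content div
main_div = tree.xpath('//div[@class="main-content"]')[0]
print html.tostring(main_div)
# get the action of any form whose action contains ".zip"
form_action = tree.xpath('//form[contains(@action, ".zip")]/@action')[0]
print form_action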
Hope that helps.

Related

How can I fetch content from a particular html div from a list of URLs in a dataframe using beautifulsoup?

I am analyzing text from a specific div in several URLs.
All examples I've found take a single URL as input, but in my case I am working in bulk.
Any suggestions?
Let's split this problem in parts.
First we want to fetch a single URL and return its corresponding HTML document.
Doing this separately also allows us to handle errors and timeouts in a transparent way.
import requests

def get_raw_content(url):
    # return the HTML body for a single URL, or None for non-200 responses
    tmp = requests.get(url, timeout=10)
    return tmp.content if tmp.status_code == 200 else None
Next comes the interesting bit. Given a single HTML document, we now want to fetch the content for a particular div. This is where your original code should be.
You could also use XPath for this, but BeautifulSoup does not support XPath.
I've written a module that provides a simple XPath interpreter for bs4, though. If you need that, just let me know in the comments.
def get_div_content(url):
    # first fetch the content for this URL
    html_text = get_raw_content(url)
    if html_text is None:
        return None
    # work with beautiful soup to fetch the content you need
    # TODO : insert your code for 1 URL here
    return None
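For illustration only, the body might look something like this, assuming the div you want carries a class of 'main-content' (substitute your own selector and extraction logic):
from bs4 import BeautifulSoup

def get_div_content(url):
    html_text = get_raw_content(url)
    if html_text is None:
        return None
    # parse the page and pull the text out of the div we care about
    soup = BeautifulSoup(html_text, 'html.parser')
    div = soup.find('div', attrs={'class': 'main-content'})  # assumed class name
    return div.get_text(strip=True) if div else None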
Now, as indicated by other comments, we simply iterate over all URLs we have, and execute the code for a single URL on each one in turn.
def fetch_all(urls):
    for url in urls:
        txt = get_div_content(url)
        print('{} {}'.format(url, txt))
Lastly, we need an entry point for the Python script, so I've provided this main block.
if __name__ == '__main__':
    fetch_all(['http://www.google.com', 'http://www.bing.com'])
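Since your URLs live in a dataframe, you can pass the relevant column straight in; for example, assuming the column is called 'url' (the column name here is just a guess):
fetch_all(df['url'].tolist())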

Python scraping deep nested divs whose classes change

I'm somewhat new to Python, and I'm working on the first part of a project where I need to get the link(s) on a FanDuel page. I've been spinning my tires trying to get the 'href'.
Here's what Inspect Element shows (screenshot omitted): the href I'm after is highlighted, buried in a deeply nested set of divs.
I can see which element appears to be the parent, but as you go down the tree, the lettered classes (e.g. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need has a constant "data-test-id" that does not change, so I was trying to use that as my way of finding what I need, but it does not seem to be working.
I'm not sure how much farther, or whether, I need to drill down to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup
url = "https://www.fanduel.com/contests/mlb/96"
#authentication might not be necessary, it was a test, still getting the same results
site = requests.get(url, cookies={'X-Auth-Token':'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})
# If I use this, I get an error
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Page Source and see whether the HTML is already constructed there (that is the HTML you get when you do requests.get()). I've already checked this, and it is indeed the case: the element isn't in the raw source. To resolve this, you have to use Selenium to render the JavaScript on the page; you can then get the page source from Selenium after it has constructed the elements in the DOM.
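A minimal sketch of that approach (it assumes Chrome plus a matching chromedriver are installed, and reuses the data-test-id selector from the question; you may still need to handle authentication separately):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.fanduel.com/contests/mlb/96")
# Selenium renders the JavaScript; hand the generated HTML to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
links = [a.get('href') for a in soup.find_all('a', {'data-test-id': 'ContestCardEnterLink'})]
print(links)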

Getting the XPath from an HTML document

https://next.newsimpact.com/NewsWidget/Live
I am trying to code a Python script that will grab a value from an HTML table at the link above. The link above is the site I am trying to grab from, and this is the code I have written. I think my XPath might be incorrect, because it has been doing fine on other elements, but the path I'm using here is not returning/printing anything.
from lxml import html
import requests
page = requests.get('https://next.newsimpact.com/NewsWidget/Live')
tree = html.fromstring(page.content)
# XPath to the table cell I want (returns a list of matching text nodes)
value = tree.xpath('//*[@id="table9521"]/tr[1]/td[4]/text()')
print('Value: ', value)
What is strange is that when I view the page's source code, I can't find the table I am trying to pull from.
Thank you for your help!
The required data is absent from the initial page source - it comes from an XHR request. You can get it as below:
import requests
response = requests.get('https://next.newsimpact.com/NewsWidget/GetNextEvents?offset=-120').json()
first_previous = response['Items'][0]['Previous'] # Current output - "2.632"
second_previous = response['Items'][1]['Previous'] # Currently - "0.2"
first_forecast = response['Items'][0]['Forecast'] # ""
second_forecast = response['Items'][1]['Forecast'] # "0.3"
You can parse the response as a simple Python dict and get all the required data.
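If you need every row rather than just the first two, here is a small sketch of walking the whole feed (only the 'Items', 'Previous' and 'Forecast' keys are taken from the response shown above; any other key would be an assumption about the API):
import requests

response = requests.get('https://next.newsimpact.com/NewsWidget/GetNextEvents?offset=-120').json()
# iterate over every event in the feed and print the two known fields
for item in response['Items']:
    print(item['Previous'], item['Forecast'])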
Your problem is simple: requests doesn't handle JavaScript at all. The values are generated by JS!
If you really need to run this XPath, you need to use a module capable of understanding JS, like spynner.
You can test whether you need JS by first using curl, or by disabling JS in your browser. With Firefox: type about:config in the navigation bar, search for javascript.enabled, then double-click it to toggle between true and false.
In Chrome, open the Chrome dev tools; the option is in there somewhere.
Check https://github.com/makinacorpus/spynner
Another (possible) problem: use tree = html.fromstring(page.text), not tree = html.fromstring(page.content).

web scraping in python

I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')
results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")
    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass
    # Find the next page to scrape
    prev = soup.find("prev_signature")
    # Check if another page of results exists - if not, break from the loop
    if prev == None:
        break
    # Get the previous URL
    url = prev.contents[0]
    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))
# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data there to go through, and I have no idea if this is against the website's terms of service, so you would need to check that.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on a site in a short amount of time, slowing down legitimate users' requests. Not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer (see the sketch after this list)
Scrape smartly
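For the first point, a minimal sketch of spacing requests with a timer (the URLs and the delay are purely illustrative):
import time
import urllib2

urls = ['http://example.com/page1', 'http://example.com/page2']  # illustrative URLs
for u in urls:
    page = urllib2.urlopen(u).read()
    # ... parse the page here ...
    time.sleep(5)  # pause between requests so you don't hammer the server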
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data, because it will be in a nice format.
Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
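You can check that programmatically too; here is a small sketch using Python 2's robotparser module against the signatures feed used above:
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.thepetitionsite.com/robots.txt')
rp.read()
# False means robots.txt forbids fetching this path
print rp.can_fetch('*', 'http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml')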
What do you mean by "not working" - an empty list, or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.

Click an image which has a specific name

How do I click an image like below using Python mechanize?
<img name="next" id="next" src="...">
I know the name and id of the image I want to click. I need to somehow identify the parent link and click it. How can I?
Bonus Question: How can I check if there is such an image or not?
Rather than using mechanize, it's very simple to do with bs4 (beautifulsoup 4).
from bs4 import BeautifulSoup
import urllib2

text = urllib2.urlopen("http://yourwebpage.com/").read()
soup = BeautifulSoup(text)
img = soup.find_all('img', {'id': 'next'})
if img:
    a_tag = img[0].parent
    href = a_tag.get('href')
    print href
Retrieving the parent tag is very easy with bs4: it's nothing more than .parent after finding the tag (with the find_all function, of course). Since find_all returns a list, it's best to guard with if img: so the code doesn't fail when no match is found; whether an empty result can actually happen depends on your website, but the check is safe to keep. See below.
EDIT: I have changed the code to include the "Bonus question", which is what I described above as an alternative.
For your bonus question - I would say you can use BeautifulSoup to check whether or not the img element is present. You can also use urllib to see if the image itself exists (at least, whether or not the server will hand it to you - otherwise you'll get an error back).
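A small sketch of that urllib-based check (the image URL is made up for illustration; urllib2 raises an HTTPError when the server won't hand the file over):
import urllib2

img_url = "http://yourwebpage.com/images/next.png"  # hypothetical image URL
try:
    urllib2.urlopen(img_url)
    print "image exists"
except urllib2.HTTPError:
    print "image missing or not served"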
You can also check out this thread, answered by someone more intelligent than I, which seems to discuss a library called SpiderMonkey and the inability of mechanize to click a button.
Well, I don't know how to do it using Mechanize; however, I know how to do it using lxml.
Let's assume that our webpage has this code:
<img name="bla bla" id="next" src="Cat.jpg">. Using lxml, we would write this code:
from lxml import html
import urllib2

page = urllib2.urlopen('http://example.com')
tree = html.fromstring(page.read())
# find the <img> with id="next", then grab the href of the <a> that encloses it
link = tree.xpath('//img[@id="next"]/ancestor::a/attribute::href')
Most of the magic happens in the tree.xpath function: you define the image you're looking for with //img[@id="next"], then specify that you want the a tag enclosing it (/ancestor::a), and finally that you want its href attribute (/attribute::href). The link variable will now contain a list of strings matching that query - in this case link[0] will be page2.html - which you can urlopen(), thus effectively clicking it.
For the //img[@id="next"] part, you can use another attribute instead, for example //img[@name="bla bla"], and it's going to work perfectly fine. You just need to think about which attribute is better for this situation.
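To actually 'click' the result, you would open whatever the query returned; note that the href may be relative, so joining it onto the page URL first is safer (a small sketch):
import urlparse

if link:
    next_url = urlparse.urljoin('http://example.com', link[0])  # handles relative hrefs
    next_page = urllib2.urlopen(next_url).read()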
I know this answer doesn't use Mechanize, however I hope it's a helpful pointer. Good luck!
