Spynner: get html of second page after submitting form - python

I have just started using Spynner to scrape webpages and am not finding any good tutorials out there. I have here a simple example where I type a word into Google and then I want to see the resulting page.
But how do I go from clicking the button to actually getting the new page?
import spynner

def content_ready(browser):
    # 'gbqfba' is the id of the search button
    if 'gbqfba' in browser.html:
        return True

b = spynner.Browser()
b.show()
b.load("http://www.google.com", wait_callback=content_ready)
b.wk_fill('input[name=q]', 'soup')
# b.browse()  # Shows the word soup in the input box
with open("test.html", "w") as hf:  # writes the initial page to a file
    hf.write(b.html.encode("utf-8"))
b.wk_click("#gbqfba")  # Clicks the Google search button (or so I think)
But now what? I'm not even sure that I have clicked the Google search button, although it does have id=gbqfba. I have also tried just b.click("#gbqfba"). How do I get the search results?
I have tried just doing:
with open("test.html", "w") as hf:  # writes the initial page to a file
    hf.write(b.html.encode("utf-8"))
but that still prints the initial page.

I solved this by sending Enter to the input and waiting two seconds. Not ideal, but it works:
import spynner
import codecs
from PyQt4.QtCore import Qt
b = spynner.Browser()
b.show()
b.load("http://www.google.com")
b.wk_fill('input[name=q]', 'soup')
# b.browse() # Shows the word soup in the input box
b.sendKeys("input[name=q]",[Qt.Key_Enter])
b.wait(2)
codecs.open("out.html","w","utf-8").write(b.html)

The recommended method is to wait for the new page to load:
b.wait_load()
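Putting the pieces together, a minimal sketch of the question's flow with wait_load added after the click might look like this (untested; whether wait_load fires reliably on Google's AJAX-driven results page is an assumption):

import spynner

b = spynner.Browser()
b.show()
b.load("http://www.google.com")
b.wk_fill('input[name=q]', 'soup')
b.wk_click("#gbqfba")   # click the search button...
b.wait_load()           # ...and block until the next page has loaded
with open("results.html", "w") as hf:
    hf.write(b.html.encode("utf-8"))  # b.html should now hold the results page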

Related

Python Request Module - Displaying More Results

I'm currently working on a learning project for web scraping.
I've picked my site:
https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers#Page0
On this page there is a button at the bottom that loads the next 10 products. Without clicking this button the next batch of products is not displayed, and the URL does not change when the button is clicked.
I wanted to ask how I can solve this using the requests module.
My code is below:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers")
c = r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("div", {"class": "product"})
for item in all:
    print(item.find({"h2": "productInfo"}).text.replace('\h2', '').replace(" ", ""))
    print(item.find("span", {"class": "condition"}).text + " " + item.find("span", {"class": "value"}).text)
    try:
        print(item.find_all("span", {"class": "condition"})[1].text + " " + item.find_all("span", {"class": "value"})[1].text)
    except:
        print("No Preowned")
    print(" ")
Try this code to get all the items available on that page. You can use Chrome dev tools to find this URL, which takes a page-number parameter that you can increment.
from bs4 import BeautifulSoup
import requests

page_link = "https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers&pageNumber={}&pageMode=true"
page_no = 0
while True:
    page_no += 1
    res = requests.get(page_link.format(page_no))
    soup = BeautifulSoup(res.text, 'lxml')
    container = soup.select(".productInfo h2")
    if len(container) <= 1:
        break
    for content in container:
        print(content.text)
Output of the last few titles:
ARK Survival Evolved
Kingdom Come Deliverance Special Edition
Halo 5 Guardians
Sonic Forces
The Elder Scrolls Online: Summerset - Digital
You need a scraper that supports JavaScript execution, e.g. Selenium, which drives a real browser.
The problem you're facing is that the content you are trying to access is created dynamically via JavaScript when the mentioned button is clicked.
When you request the page, the additional HTML elements you want to read have not been created yet, so BeautifulSoup can't find them.
Using Selenium you can click buttons, fill out forms and much more. You can also wait for the server to create the content you want to access.
The Selenium documentation should be self-explanatory...
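For example, a rough sketch of that approach (untested, using the same older Selenium API as the other answers here; the ".showMore" selector is a placeholder you would need to replace after inspecting the real button in dev tools):

from time import sleep
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.game.co.uk/en/m/games/best-selling-games/best-selling-xbox-one-games/?merchname=MobileTopNav-_-XboxOne_Games-_-BestSellers")

while True:
    try:
        # hypothetical selector -- inspect the real "show more" button and adjust
        button = driver.find_element_by_css_selector(".showMore")
    except Exception:
        break  # button gone, all batches should be loaded
    button.click()
    sleep(2)  # crude wait for the next batch of products to render

for title in driver.find_elements_by_css_selector(".productInfo h2"):
    print(title.text)

driver.quit()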

Scrape with BeautifulSoup from site that uses AJAX pagination using Python

I'm fairly new to coding and Python so I apologize if this is a silly question. I'd like a script that goes through all 19,000 search results pages and scrapes each page for all of the urls. I've got all of the scraping working but can't figure out how to deal with the fact that the page uses AJAX to paginate. Usually I'd just make a loop with the url to capture each search result but that's not possible. Here's the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report
This is the script I have so far:
import io
import urllib2
from bs4 import BeautifulSoup

with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib2.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page)
    snippet = soup.find_all('a', attrs={'item-title'})
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")

print "Done collecting urls"
Obviously, it scrapes the first page of results and nothing more.
And I have looked at a few related questions but none seem to use Python or at least not in a way that I can understand. Thank you in advance for your help.
For the sake of completeness: you could try to reproduce the POST request that fetches the next page, as I suggested in my comment, but if an alternative is acceptable, Selenium makes it quite easy to achieve what you want.
Here is a simple solution using Selenium for your question:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

# uncomment if using Firefox web browser
driver = webdriver.Firefox()
# uncomment if using Phantomjs
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1

with open('heritageURLs.txt', 'w') as f:
    while True:
        try:
            # sleep here to allow time for page load
            sleep(5)
            # grab the Next button if it exists
            btn_next = driver.find_element_by_class_name('next')
            # find all item-title a href and write to file
            links = driver.find_elements_by_class_name('item-title')
            print "Page: {} -- {} urls to write...".format(pages, len(links))
            for link in links:
                f.write(link.get_attribute('href') + '\n')
            # Exit if no more Next button is found, ie. last page
            if btn_next is None:
                print "crawling completed."
                exit(-1)
            # otherwise click the Next button and repeat crawling the urls
            pages += 1
            btn_next.send_keys(Keys.RETURN)
        # you should specify the exception here
        except:
            print "Error found, crawling stopped"
            exit(-1)
Hope this helps.
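As the comment in the loop says, the bare except should name the exception. In Selenium, a missing "Next" button raises NoSuchElementException, so the main loop above could be reworked along these lines (a sketch, not tested against the site; it reuses driver, pages, sleep and Keys from the script above):

from selenium.common.exceptions import NoSuchElementException

with open('heritageURLs.txt', 'w') as f:
    while True:
        sleep(5)  # allow time for the page to load
        links = driver.find_elements_by_class_name('item-title')
        print "Page: {} -- {} urls to write...".format(pages, len(links))
        for link in links:
            f.write(link.get_attribute('href') + '\n')
        try:
            btn_next = driver.find_element_by_class_name('next')
        except NoSuchElementException:
            # no "Next" button on this page, i.e. this was the last page
            print "crawling completed."
            break
        pages += 1
        btn_next.send_keys(Keys.RETURN)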

Any advice for sending a request to a website from Python?

def align_sequences(IDs):
    import webbrowser
    import urllib, urllib2
    url = 'http://www.uniprot.org/align/'
    params = {'query': IDs}
    data = urllib.urlencode(params)
    request = urllib2.Request(url, data)
    response = urllib2.urlopen(request)
    job_url = response.geturl()
    webbrowser.open(job_url)

align_sequences('Q4PRD1 Q7LZ61')
With this function I want to open 'http://www.uniprot.org/align/', request the protein sequences with IDs Q4PRD1 and Q7LZ61 to be aligned, and then open the website in my browser.
Initially it seems to be working fine - running the script will open the website and show the alignment job to being run. However, it will keep going forever and never actually finish, even if I refresh the page. If I input the IDs in the browser and hit 'align' it works just fine, taking about 8 seconds to align.
I am not familiar with the differences between running something directly from a browser and running it from Python. Do any of you have an idea of what might be going wrong?
Thank you :-)
~Max
You have to click the Align button. You can't do this with webbrowser, though. One option is to use Selenium:
from selenium import webdriver
url = 'http://www.uniprot.org/align/'
ids = 'Q4PRD1 Q7LZ61'
driver = webdriver.Firefox()
driver.get(url)
q = driver.find_element_by_id('alignQuery')
q.send_keys(ids)
btn = driver.find_element_by_id("sequence-align-submit")
btn.click()
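If you also want the job URL so you can open it in a normal browser afterwards (as the original function did with webbrowser), you could wait for the page to navigate away from the form. A minimal sketch continuing from the code above (the 30-second timeout is an arbitrary choice):

import webbrowser
from selenium.webdriver.support.ui import WebDriverWait

# block until the browser has navigated away from the align form
WebDriverWait(driver, 30).until(lambda d: d.current_url != url)
job_url = driver.current_url
webbrowser.open(job_url)
driver.quit()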
I think this is done in JavaScript. If you look at the HTML code of the Align button you can see
onclick="UniProt.analytics('AlignmentSubmissionPage', 'click', 'Submit align'); submitAlignForm();"
UniProt.analytics() and submitAlignForm() are some JavaScript magic. This magic lives in the js-compr.js2013_11 file.
You can view that file using http://jsbeautifier.org/ and then do in Python what the JavaScript does.

Getting all visible text from a webpage using Selenium

I've been googling this all day without finding the answer, so apologies in advance if this is already answered.
I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:
from selenium import webdriver
import codecs

filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []
for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)
filen.close()
driver.quit()
The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times - it does, however, only work as planned on some webpages (it also makes the script A LOT slower).
I'm guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.
Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I'm out of ideas for this one.
Edit: the reason I used Selenium and not Mechanize and Beautiful Soup is because I wanted JavaScript-rendered text
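One thing worth knowing before the fuller answers below: a WebElement's .text in Selenium already returns only the visible, rendered text of that element and everything nested inside it, so reading it once from the body element avoids the duplication entirely. A minimal sketch (it loses the per-element structure, which may or may not matter for categorizing sites):

import codecs
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
# .text gives the visible text of <body> and all its descendants, each piece once
visible_text = driver.find_element_by_tag_name('body').text
with codecs.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(visible_text)
driver.quit()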
Using lxml, you might try something like this:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url = "http://www.yahoo.com"
ignore_tags = ('script', 'noscript', 'style')

with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    with open('/tmp/source.html', 'w') as f:
        f.write(content.encode('utf-8'))
    doc = LH.fromstring(content)
    with open('/tmp/result.txt', 'w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags:
                continue
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text, tail)).strip()
            if words:
                words = words.encode('utf-8')
                f.write(words + '\n')
This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
Here's a variation on @unutbu's answer:
#!/usr/bin/env python
import sys
from contextlib import closing

import lxml.html as html  # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox  # pip install selenium
from werkzeug.contrib.cache import FileSystemCache  # pip install werkzeug

cache = FileSystemCache('.cachedir', threshold=100000)

url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"

# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7)  # week in seconds

# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>, <style>, etc.
Cleaner(kill_tags=['noscript'], style=True)(root)  # lxml >= 2.3.1
print root.text_content()  # extract text
I've separated your task in two:
get page (including elements generated by javascript)
extract text
The code is connected only through the cache. You can fetch pages in one process and extract text in another process or defer to do it later using a different algorithm.

Scraping Data Off National Vulnerability Database: can't figure out clicking on a button (Mechanize + Python)

I am trying to scrape some data off the National Vulnerability Database (http://web.nvd.nist.gov). What I want to do is enter a search term, which brings up the first 20 results, and scrape that data. Then I want to click "Next 20" until I have traversed all the results.
I am able to successfully submit search terms, but clicking "Next 20" is not working at all.
Tools I am using: Python + Mechanize
Here is my code:
import mechanize

# Browser
b = mechanize.Browser()

# The URL to this service
URL = 'http://web.nvd.nist.gov/view/vuln/search'

Search = ['Linux', 'Mac OS X', 'Windows']

def searchDB():
    SearchCounter = 0
    for i in Search:
        # Load the page
        read = b.open(URL)
        # Select the form
        b.select_form(nr=0)
        # Fill out the search form
        b['vulnSearchForm:text'] = Search[int(SearchCounter)]
        b.submit('vulnSearchForm:j_id120')
        result = b.response().read()
        file = open(Search[SearchCounter] + ".txt", "w")
        file.write(result)
        '''Here is where the problem is. vulnResultsForm:j_id116 is the value of the "Next 20" button'''
        b.select_form(nr=0)
        b.form.click('vulnResultsForm:j_id116')
        result = b.response().read()

if __name__ == '__main__':
    searchDB()
From the docstring of b.form.click:
Return request that would result from clicking on a control.
The request object is a urllib2.Request instance, which you can pass to urllib2.urlopen (or ClientCookie.urlopen).
So:
request = b.form.click('vulnResultsForm:j_id116')
b.open(request)
result = b.response().read()
I haven't used Mechanize outside of zope.testbrowser, which is based on Mechanize, so there may be differences, but here goes:
You click on the form... try to get the button and click on that instead.
Something like this, I think:
form.find_control("j_id120").click()
Also:
b['vulnSearchForm:text'] = Search[int(SearchCounter)]
Can be replaced with
b['vulnSearchForm:text'] = i
as i will contain the value. Python is not JavaScript; loop variables are not numbers (unless you want them to be).
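Putting both suggestions together, a rough, untested sketch of the "Next 20" loop might look like this (the control names come from the question itself; whether every results page exposes the button under the same name is an assumption):

import mechanize

b = mechanize.Browser()
URL = 'http://web.nvd.nist.gov/view/vuln/search'
Search = ['Linux', 'Mac OS X', 'Windows']

def searchDB():
    for term in Search:
        b.open(URL)
        b.select_form(nr=0)
        b['vulnSearchForm:text'] = term
        b.submit('vulnSearchForm:j_id120')
        with open(term + ".txt", "w") as f:
            while True:
                f.write(b.response().read())
                b.select_form(nr=0)
                try:
                    # click() only builds the request for the "Next 20" button;
                    # open() actually performs it
                    request = b.form.click('vulnResultsForm:j_id116')
                except Exception:
                    break  # no "Next 20" control left, i.e. last page reached
                b.open(request)

if __name__ == '__main__':
    searchDB()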
