How to scrape aspx pages with python

How to scrape aspx pages with python - python

I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click the "Log In as Guest" to get to the search form. If I search for a Party 1 term like say "Andrew" the results have pagination and also, the request type is POST so the URL does not change and also the sessions time out very quickly. So quickly that if i wait ten minutes and refresh the search url page it gives me a timeout error.
I got started with scraping recently, so I have mostly been doing GET posts where I can decipher the URL. So so far I have realized that I will have to look at the DOM. Using Chrome Tools, I have found the headers. From the Network Tabs, I have also found out the following as the form data that is passed on from the search page to the results page
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg...(i have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (i have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search
All the ones in caps are hidden. I have also managed to figure out the results structure.
My script thus far is really pathetic as I am completely blank on what to do next. I am still to do the form submission, analyze the pagination and scrape the result but i have absolutely no idea how to proceed.
import re
import urlparse
import mechanize
from bs4 import BeautifulSoup
class DocumentFinderScraper(object):
def __init__(self):
self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
self.br = mechanize.Browser()
self.br.addheaders = [('User-agent',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]
##TO DO
##submit form
#get return URL
#scrape results
#analyze pagination
if __name__ == '__main__':
scraper = DocumentFinderScraper()
scraper.scrape()
Any help would be dearly appreciated

I disabled Javascript and visited https://www.searchiqs.com/nybro/ and the form looks like this:
As you can see the Log In and Log In as Guest buttons are disabled. This will make it impossible for Mechanize to work because it can not process Javascript and you won't be able to submit the form.
For this kind of problems you can use Selenium, that will simulate a full Browser with the disadvantage of being slower than Mechanize.
This code should log you in using Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
usr = ""
pwd = ""
driver = webdriver.Firefox()
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)

Related

div not showing up in html from url using requests library and bs4

I have a simple script where I want to scrape a menu from a url:
https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822
When I inspect the page using dev tools, I identify that the menu contained in the menu section <div class="menu-area" id="section_1026228">
So my script is fairly simple as follows:
import requests
from bs4 import BeautifulSoup
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
menu = soup.find('div', {'class': 'menu-area'})
print(menu.text)
I have tried this on a locally saved page of the url and it works. But when I do it to the full url using the requests library, it does not work. It cannot find the div. It throws this error:
print(menu.text)
AttributeError: 'NoneType' object has no attribute 'text'
which basically means it cannot find the div. Does anyone know why this is happening and how to fix it?

I just logged out from my browser and it showed me a different page. However, my script has no login part at all. Not even sure how that would work
[It doesn't work with all sites, but it seems to be enough for this site so far.] You can login with request.Session.
# import requests
sess = requests.Session()
headers = {'user-agent': 'Mozilla/5.0'}
data = {'username': 'YOUR_EMAIL/USERNAME', 'password': 'YOUR_PASSWORD'}
loginResp = sess.post('https://untappd.com/login', headers=headers, data=data)
print(loginResp.status_code, loginResp.reason, 'from', loginResp.url) ## should print 200 OK...
response = sess.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
## CAN CONTINUE AS BEFORE ##
I've edited my solution to one of your previous questions about this site to include cookies so that the site will treat you as logged in. For example:
# venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
gloryMenu = scrape_untappd_menu(venue_url, cookies=sess.cookies)
will collect the following data:
Note: They have a captcha when logging in so I was worried it would be too hard to automate; if it becomes an issue, you can [probably] still login on your browser before going to the page and then paste the request from your network log to curlconverter to get the cookies as a dictionary. Ofc the process is then no longer fully automated since you'll have to repeat this manual login every time the cookies expire (which could be as fast as a few hours). If you wanted to automate the login at that point, you might have to use some kind of browser automation like with selenium.

parsing site with beautifulsoup

i'm trying to learn how to parse html with python
and i`m currently stuck with soup.findAll return me an empty array,therefore there are elements which could be found
Here is my code:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
headers = {"User-Agent":'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
url = 'https://www.oddsportal.com/matches/tennis/20191114/'
responce = requests.get(url,headers=headers)
soup = BeautifulSoup(responce.text, 'html.parser')
info = soup.findAll('tr', {'class':'odd deactivate'})
print(info)
i`ll appreciate any help,thanks in advance

i'm trying to learn how to parse html with python
You happened to pick a webpage which isn't very beginner-friendly when it comes to webscraping. Broadly speaking, most webpages use one or both of these two common methods for loading / displaying data:
The user makes a request to a server (visits a page, for example).
The server gets the necessary data from a database. The server
generates an HTML response using a templating engine, and returns the
response for the user's browser to render.
The user makes a request to a server. The server returns an
HTML-skeleton response which gets populated with data dynamically by
making other requests / using APIs etc.
The webpage you picked is of the second type. Just because you can see the <tr> elements in the "Elements" tab of Chrome's Dev Tools doesn't mean that that's what the server sent you. By looking at the network tab of Chrome's Dev Tools you can see that a request is made to these two resources:
https://fb.oddsportal.com/ajax-next-games/2/0/1/20191114/yje3d.dat?=1574007087150
https://fb.oddsportal.com/ajax-next-games-odds/2/0/X0/20191114/1/yje3d.dat?=1574007087151
(The Query String parameters will not be the same for you. Visiting those urls also won't be very interesting unless you provide the right payload.)
The first resource seems to be a jQuery script which makes a request, the response of which contains HTML (this is your table). It looks something like this:
You can see that they seem to have assigned unique IDs to each of the matches. Giron Marcos vs. Holt Brandon in this case has an ID of ATM9GmXG.
The second resource is similar. It's also a jQuery script which seems to be making a request to their main API. The response this time is JSON, which is always desirable for webscraping. Here's what part of that looks like (notice the same ID):

Apparently, the page only loades the "odds" parts once it is called in a browser. So you could use Selenium and Chrome driver.
Note that you need to download the Chrome driver and place the driver in your .../python/ directory. Make sure you choose a matching driver version, meaning a version of Chrome driver that matches the version of the Chrome browser you have installed.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests, time, traceback, random, csv, codecs, re, os
# Webdriver
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument('log-level=3')
browser = webdriver.Chrome(chrome_options=options)
url = 'https://www.oddsportal.com/matches/tennis/20191114/'
browser.get(url)
soup = BeautifulSoup(browser.page_source, "html.parser")
info = soup.findAll('tr', {'class':'odd deactivate'})
print(info)

Web Scraping w/Mechanize and/or BeautifulSoup4 on ASP pages

Hello I am a programming n00b and am desperately trying to get some code to work.
I can't really find any good tutorials on ASP scraping/filling in fields and submitting then working with the content.
Here is my code so far:
import mechanize
import re
url = 'http://www.cic.gc.ca/english/work/iec/index.asp'
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open(url)
response = br.response().read()
What I am trying to do:
Load that URL
Fill out the 2 form fields
Submit
Print the div id stats into python
Run a loop on a 15 minute timer
Play a loud sound if anything in div stats changes when it loops
Please advise me the best/fastest way of doing this with minimal programming experience.

It's not that difficult once you understand why the second form isn't shown when page is loaded. There's nothing to do with ASP itself, it's rather due to CSS style set to "display: none;" and will only be active once the first Option Select has been filled.
Here is a full sample of what you're after, implemented in Selenium, it's similar to machanize I believe.
The basic flow should be:
load the web browser and load the page
find the first Option Select and fill it in
trigger a change event (I chose to send a TAB key) and the second Option Select is shown
fill in the second Select and find the submit button and click it
assign a name to the content you get from the div id=stats text
compared if the text has changed from last fetch
a) if YES, do your BEEP and close the driver page etc.
b) if NO, set a scheduler (I use Python's Event's Scheduler, and run the crawling function again...
That's it! Easy, ok code time -- I used United Kingdom + Working Holiday pair for test:
import selenium.webdriver
from selenium.webdriver.common.keys import Keys
import sched, time
driver = selenium.webdriver.Firefox()
url = 'http://www.cic.gc.ca/english/work/iec/index.asp'
driver.get(url)
html_content = ''
# construct a scheduler
s = sched.scheduler(time.time, time.sleep)
def crawl_me():
global html_content
driver.refresh()
time.sleep(5) # wait 5s for page to be loaded
country_name = driver.find_element_by_name('country-name')
country_name.send_keys('United Kingdom')
# trick's here to send a TAB key to trigger the change event
country_name.send_keys(Keys.TAB)
# make sure the second Option Select is active (none not in style)
assert "none" not in driver.find_element_by_id('category_dropdown').get_attribute('style')
cateogory_name = driver.find_element_by_name('category-name')
cateogory_name.send_keys('Working Holiday')
btn_go = driver.find_element_by_id('submit')
btn_go.send_keys(Keys.RETURN)
# again, check if the content has been loaded
assert "United Kingdom - Working Holiday" not in driver.page_source
compared_content = driver.find_element_by_id('stats').text
# here we will end this script if content has changed already
if html_content != '' and html_content != compared_content:
# do whatever you want to play the beep sound
# at the end exit the loop
driver.close()
exit(-1)
# if no changes are found, trigger the schedule_crawl() function, like recursively
html_content = compared_content
print html_content
return schedule_crawl()
def schedule_crawl():
# set your time interval here, 15*60 = 15 minutes
s.enter(15*60, 1, crawl_me, ())
s.run() # and run it of course
crawl_me()
To be honest, this is quite easy and straight forward however it does require you fully understand how html/css/javascript (not javascript in this case, but you do need to know the basic) and all their elements how they work together.
You do need to learn from the basic read => digest => code => experience => do it in cycles, Programming doesn't have a shortcut or the fastest way.
Hope this helps (and I really hope you do not just copy & paste mine, but learn and implement your own in mechanize by the way).
Good Luck!

Item visible in browser not collected by scraper

I'm trying to collect data from the SumofUs website; specifically the number of signatures on the petition. The datum is presented like this: <div class="percent">256,485 </div> (this is the only item of this class on the Page.)
So I tried this:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Mozilla/5.0'}
url = 'http://action.sumofus.org/a/nhs-patient-corporations/'
raw = requests.get(url, headers = user_agent)
html = BeautifulSoup(raw.text)
# get the item we're seeking
number = html.find("div", class_="percent")
print number
It seems that the number isn't rendered (I've tried a couple of user agent strings.) What else could be causing this? How can I work around this in future?

In the general case you should use a headless browser. Ghost.py is written in python so its probably a good choice to try first.
In this specific case a little research reveals that there's a much simpler method. By using the network tab in chrome you can see that the site makes an ajax call to populate the value. So you can just get it directly:
url = "http://action.sumofus.org/api/ak_action_count_by_action/?action=nhs-patient-corporations&additional="
number = int(requests.get(url).text)

You could use Selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'http://action.sumofus.org/a/nhs-patient-corporations/'
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5) # wait to load
# then load BeautifulSoup with browsers content
html = BeautifulSoup(driver.page_source)
...

Open a page programmatically in python

Can you extract the VIN number from this webpage?
I tried urllib2.build_opener, requests, and mechanize. I provided user-agent as well, but none of them could see the VIN.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent',('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) ' 'AppleWebKit/535.1 (KHTML, like Gecko) ' 'Chrome/13.0.782.13 Safari/535.1'))]
page = opener.open(link)
soup = BeautifulSoup(page)
table = soup.find('dd', attrs = {'class': 'tip_vehicleStats'})
vin = table.contents[0]
print vin

That page has much of the information loaded and displayed with Javascript (probably through Ajax calls), most likely as a direct protection against scraping. To scrape this you therefore either need to use a browser that runs Javascript, and control it remotely, or write the scraper itself in javascript, or you need to deconstruct the site and figure out exactly what it loads with Javascript and how, and see if you can duplicate these calls.

You can use browser automation tools for the purpose.
For example this simple selenium script can do your work.
from selenium import webdriver
from bs4 import BeautifulSoup
link = "https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0"
browser = webdriver.Firefox()
browser.get(link)
page = browser.page_source
soup = BeautifulSoup(page)
table = soup.find('dd', attrs = {'class': 'tip_vehicleStats'})
vin = table.contents.span.contents[0]
print vin
BTW, table.contents[0] prints the entire span, including the span tags.
table.contents.span.contents[0] prints only the VIN no.

You could use selenium, which calls a browser. This works for me :
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
# See: http://stackoverflow.com/questions/20242794/open-a-page-programatically-in-python
browser = webdriver.Firefox() # Get local session of firefox
browser.get("https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0") # Load page
time.sleep(0.5) # Let the page load
# Search for a tag "span" with an attribute "id" which contains "ctl00_ContentPlaceHolder1_VINc_VINLabel"
e=browser.find_element_by_xpath("//span[contains(#id,'ctl00_ContentPlaceHolder1_VINc_VINLabel')]")
e.text
# Works for me : u'4JGBF7BE9BA648275'
browser.close()

You do not have to use Selenium.
Just make an additional get request:
import requests
stock_number = '123456789' # located at VEHICLE INFORMATION
url = 'https://www.clearvin.com/ads/iaai/check?stockNumber={}&vin='.format(stock_number)
vin = requests.get(url).json()['car']['vin']

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.