Web Scraping w/Mechanize and/or BeautifulSoup4 on ASP pages - python

Hello, I am a programming n00b and am desperately trying to get some code to work.
I can't really find any good tutorials on scraping ASP pages: filling in fields, submitting the form, and then working with the returned content.
Here is my code so far:
import mechanize
import re
url = 'http://www.cic.gc.ca/english/work/iec/index.asp'
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open(url)
response = br.response().read()
What I am trying to do:
Load that URL
Fill out the 2 form fields
Submit the form
Print the contents of the div with id="stats" in Python
Run it in a loop on a 15-minute timer
Play a loud sound if anything in the stats div has changed since the last loop
Please advise me on the best/fastest way of doing this with minimal programming experience.

It's not that difficult once you understand why the second form field isn't shown when the page is loaded. It has nothing to do with ASP itself: the second select is hidden with the CSS style "display: none;" and only becomes active once the first option select has been filled in.
Here is a full sample of what you're after, implemented in Selenium; the flow is similar in mechanize, I believe.
The basic flow should be:
load the web browser and load the page
find the first option select and fill it in
trigger a change event (I chose to send a TAB key) so that the second option select is shown
fill in the second select, then find the submit button and click it
store the text of the div with id="stats"
compare it with the text from the last fetch
a) if it HAS changed, play your beep, close the driver, and exit
b) if it has NOT changed, set a scheduler (I use Python's event scheduler) and run the crawling function again...
That's it! Easy enough. OK, code time; I used the United Kingdom + Working Holiday pair for the test:
import selenium.webdriver
from selenium.webdriver.common.keys import Keys
import sched, time
driver = selenium.webdriver.Firefox()
url = 'http://www.cic.gc.ca/english/work/iec/index.asp'
driver.get(url)
html_content = ''
# construct a scheduler
s = sched.scheduler(time.time, time.sleep)
def crawl_me():
    global html_content
    driver.refresh()
    time.sleep(5)  # wait 5 s for the page to load
    country_name = driver.find_element_by_name('country-name')
    country_name.send_keys('United Kingdom')
    # the trick here is to send a TAB key to trigger the change event
    country_name.send_keys(Keys.TAB)
    # make sure the second option select is active ("none" no longer in its style)
    assert "none" not in driver.find_element_by_id('category_dropdown').get_attribute('style')
    category_name = driver.find_element_by_name('category-name')
    category_name.send_keys('Working Holiday')
    btn_go = driver.find_element_by_id('submit')
    btn_go.send_keys(Keys.RETURN)
    # again, check that the result content has been loaded
    assert "United Kingdom - Working Holiday" in driver.page_source
    compared_content = driver.find_element_by_id('stats').text
    # end the script here if the content has already changed
    if html_content != '' and html_content != compared_content:
        # do whatever you want to play the beep sound
        # at the end, exit the loop
        driver.close()
        exit(-1)
    # if no changes are found, trigger schedule_crawl() again, recursively
    html_content = compared_content
    print(html_content)
    return schedule_crawl()

def schedule_crawl():
    # set your time interval here, 15*60 = 15 minutes
    s.enter(15 * 60, 1, crawl_me, ())
    s.run()  # and run it, of course

crawl_me()
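The beep itself is left up to you above; a minimal cross-platform sketch (winsound only exists on Windows, and whether the '\a' terminal bell actually makes a sound depends on your terminal settings):

import sys

def play_beep():
    if sys.platform.startswith('win'):
        import winsound
        winsound.Beep(1000, 500)  # 1000 Hz tone for 500 ms
    else:
        sys.stdout.write('\a')    # fall back to the terminal bell character
        sys.stdout.flush()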
To be honest, this is quite easy and straightforward, but it does require that you understand how HTML/CSS/JavaScript elements work together (you don't need to write JavaScript in this case, but you do need to know the basics).
You do need to learn from the basics: read => digest => code => gain experience, and repeat the cycle. Programming doesn't have a shortcut or a fastest way.
Hope this helps (and I really hope you don't just copy and paste mine, but learn from it and implement your own in mechanize).
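If you do later try the same flow in mechanize, a rough sketch might look like the one below. The field names (country-name, category-name) are taken from the Selenium code above; the form index, and whether the select options use the visible labels as their values, are assumptions you would need to verify (mechanize ignores CSS, so the hidden second select is still submittable):

import mechanize

br = mechanize.Browser(factory=mechanize.RobustFactory())
br.set_handle_robots(False)
br.open('http://www.cic.gc.ca/english/work/iec/index.asp')
br.select_form(nr=0)  # assumption: the search form is the first form on the page
# select controls take a list of option *values*; if the values differ from the
# labels, inspect br.form.find_control('country-name').items to find the right ones
br['country-name'] = ['United Kingdom']
br['category-name'] = ['Working Holiday']
response = br.submit()
html = response.read()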
Good Luck!

Related

Scraping multiple pages with an unchanging URL using BeautifulSoup

I am using Beautiful Soup to extract data from a non-English website. Right now my code only extracts the first ten results from the keyword search. The website is designed so that additional results are accessed through the ‘more’ button (sort of like an infinite scroll, except you have to keep clicking ‘more’ to get the next set of results). When I click ‘more’ the URL doesn’t change, so I cannot just iterate over a different URL each time.
I would really like some help with two things.
Modifying the code below so that I can get data from all of the pages and not just the first 10 results
Insert a timer function so that the server doesn’t block me
I’m adding a photo of what the ‘more’ button looks like because it’s not in English. It’s in blue text at the end of the page.
import requests, csv, os
from bs4 import BeautifulSoup
from time import strftime, sleep
# make a GET request (requests.get("URL")) and store the response in a response object (req)
responsePA = requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')
# read the content of the server’s response
rawPagePA = responsePA.text
soupPA = BeautifulSoup(rawPagePA, 'html.parser')  # specify a parser to avoid the "no parser" warning
# take a look
print (soupPA.prettify())
urlsPA = []  # create an empty list to store the URLs
for item in soupPA.find_all("div", class_="customStoryCard9-m__story-data__2qgWb"):  # select each story card div
    aTag = item.find("a")  # extract the element containing the 'a' tag
    urlsPA.append(aTag.attrs["href"])
print(urlsPA)

# Below I'm getting the data from each of the URLs and storing it in a list
PAlist = []
for link in urlsPA:
    specificpagePA = requests.get(link)  # make a GET request and store the response in an object
    rawAddPagePA = specificpagePA.text  # read the content of the server's response
    PASoup2 = BeautifulSoup(rawAddPagePA, 'html.parser')  # parse the response into an HTML tree
    PAcontent = PASoup2.find_all(class_=["story-element story-element-text", "time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX", "headline headline-type-9 story-headline bn-story-headline headline-m__headline__3vaq9 headline-m__headline-type-9__3gT8S", "contributor-name contributor-m__contributor-name__1-593"])
    # print(PAcontent)
    PAlist.append(PAcontent)
You don't actually need Selenium.
The button sends the following GET request:
https://www.prothomalo.com/api/v1/advanced-search?fields=headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards&offset=10&limit=6&q=ধর্ষণ
The important part is the "offset=10&limit=6" at the end; subsequent clicks on the button only increase that offset by 6.
Getting "data from all of the pages" won't work, because there seem to be quite a lot of them and I don't see a way to determine how many. So you had better pick a number and keep requesting until you have that many links.
As this request returns JSON, you might also be better off just parsing that instead of feeding the HTML to BeautifulSoup.
Have a look at that:
import requests
import json
s = requests.Session()
term = 'ধর্ষণ'
count = 20
# Make GET-Request
r = s.get(
    'https://www.prothomalo.com/api/v1/advanced-search',
    params={
        'offset': 0,
        'limit': count,
        'q': term
    }
)
# Read response text (a JSON document)
info = json.loads(r.text)
# Loop over items and collect the article URLs
urls = [item['url'] for item in info['items']]
print(urls)
This returns the following list:
['https://www.prothomalo.com/world/asia/পাকিস্তানে-সন্তানদের-সামনে-মাকে-ধর্ষণের-মামলায়-দুজনের-মৃত্যুদণ্ড', 'https://www.prothomalo.com/bangladesh/district/খাবার-দেওয়ার-কথা-বদলে-ধর্ষণ-অবসরপ্রাপ্ত-শিক্ষকের-বিরুদ্ধে-মামলা', 'https://www.prothomalo.com/bangladesh/district/জয়পুরহাটে-অপহরণ-ও-ধর্ষণ-মামলায়-যুবকের-যাবজ্জীবন-কারাদণ্ড', 'https://www.prothomalo.com/bangladesh/district/কিশোরীকে-ধর্ষণ-মামলায়-যুবক-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/সুবর্ণচরে-এত-ধর্ষণ-কেন', 'https://www.prothomalo.com/bangladesh/district/১২-বছরের-ছেলেকে-ধর্ষণ-মামলায়-একজন-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/district/ভালো-পাত্রের-সঙ্গে-বিয়ে-দেওয়ার-কথা-বলে-কিশোরীকে-ধর্ষণ-গ্রেপ্তার-১', 'https://www.prothomalo.com/bangladesh/district/সখীপুরে-দুই-শিশুকে-ধর্ষণ-মামলার-আসামিকে-গ্রেপ্তারের-দাবিতে-মানববন্ধন', 'https://www.prothomalo.com/bangladesh/district/বগুড়ায়-ছাত্রী-ধর্ষণ-মামলায়-তুফান-সরকারের-জামিন-বাতিল', 'https://www.prothomalo.com/world/india/ধর্ষণ-নিয়ে-মন্তব্যের-জের-ভারতের-প্রধান-বিচারপতির-পদত্যাগ-দাবি', 'https://www.prothomalo.com/bangladesh/district/ফুলগাজীতে-ধর্ষণ-মামলায়-অভিযুক্ত-ইউপি-চেয়ারম্যান-বরখাস্ত', 'https://www.prothomalo.com/bangladesh/district/ধুনটে-ধর্ষণ-মামলায়-ছাত্রলীগ-নেতা-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/district/নোয়াখালীতে-কিশোরীকে-ধর্ষণ-ভিডিও-ধারণ-ও-অপহরণের-অভিযোগে-গ্রেপ্তার-২', 'https://www.prothomalo.com/bangladesh/district/বাবার-সঙ্গে-দেখা-করানোর-কথা-বলে-স্কুলছাত্রীকে-ধর্ষণ', 'https://www.prothomalo.com/opinion/column/ধর্ষণ-ঠেকাতে-প্রযুক্তির-ব্যবহার', 'https://www.prothomalo.com/world/asia/পার্লামেন্টের-মধ্যে-ধর্ষণ-প্রধানমন্ত্রীর-ক্ষমা-প্রার্থনা', 'https://www.prothomalo.com/bangladesh/district/তাবিজ-দেওয়ার-কথা-বলে-গৃহবধূকে-ধর্ষণ-কবিরাজ-আটক', 'https://www.prothomalo.com/bangladesh/district/আদালত-প্রাঙ্গণে-বিয়ে-করে-জামিন-পেলেন-ধর্ষণ-মামলার-আসামি', 'https://www.prothomalo.com/bangladesh/district/কিশোরীকে-দল-বেঁধে-ধর্ষণ-ও-ভিডিও-ধারণ-গ্রেপ্তার-৩', 'https://www.prothomalo.com/bangladesh/district/ধর্ষণ-মামলায়-সহকারী-স্টেশনমাস্টার-গ্রেপ্তার']
By adjusting count you can set the number of URLs (articles) to retrieve; term is the search term.
The requests.Session object is used to keep cookies consistent across requests.
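Since the fields parameter in the request already asks for things like headline and published-at, you can pull those straight out of the same JSON instead of fetching every article page; a small sketch continuing from the info dictionary above (assuming those keys are present on each item):

articles = []
for item in info['items']:
    articles.append({
        'headline': item.get('headline'),
        'published-at': item.get('published-at'),
        'url': item.get('url'),
    })
for a in articles:
    print(a['headline'], a['url'])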
If you have any questions, feel free to ask.
Edit:
Just in case you are wondering how I found out which GET request was being sent by clicking the button: I went to the Network Analysis tab in my browser's developer tools (Firefox), clicked the button, observed which requests were being sent, and copied that URL.
Another explanation of the params parameter of the .get function: it contains (as a Python dictionary) all the parameters that would normally be appended to the URL after the question mark. So
requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')
can be written as
requests.get('https://www.prothomalo.com/search', params={'q': 'ধর্ষণ'})
which makes it a lot nicer to look at, and you can actually see what you are searching for, because it's written in Unicode and not already URL-encoded.
Edit:
If the script starts returning an empty JSON-file and thus no URLs, you probably have to set a User-Agent like so (I used the one for Firefox, but any browser should be fine):
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
                  'Gecko/20100101 Firefox/87.0'
})
Just put that code below the line where the session-object is initialized (the s = ... line).
A User-Agent tells the site what kind of program is accessing their data.
Always keep in mind that the server has other stuff to do as well and that the webpage has other priorities than sending thousands of search-results to a single person, so try to keep the traffic as low as possible. Scraping 5000 URLs is a lot and if you really have to do it multiple times, put a sleep(...) of at least a few seconds anywhere before you make the next request (not just to prevent getting blocked, but rather to be nice to the people who provide you with the information you request).
Where you put the sleep does not really matter, as the only time you're actually making contact with the server is the s.get(...) line.
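Putting the pagination and the delay together, here is a sketch of how you could walk the offset until you have as many links as you want (the step of 6 mirrors what the button sends; whether the API accepts a larger limit per request is an assumption you can test):

import time
import requests

s = requests.Session()
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
                  'Gecko/20100101 Firefox/87.0'
})
term = 'ধর্ষণ'
wanted = 50   # total number of article URLs you want
urls = []
offset = 0
while len(urls) < wanted:
    r = s.get('https://www.prothomalo.com/api/v1/advanced-search',
              params={'offset': offset, 'limit': 6, 'q': term})
    items = r.json().get('items', [])
    if not items:          # no more results
        break
    urls.extend(item['url'] for item in items)
    offset += 6
    time.sleep(3)          # be nice to the server between requests
print(urls[:wanted])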
This is where you would add Selenium alongside bs4: use it to click the button so the site loads more results, then grab the page content.
You can download the geckodriver from the link.
The mock code will look like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3"
driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)
# You need to iterate over this in a loop, depending on how many times you want to click 'more';
# do remember, if it takes time to fetch the data, add time.sleep() to wait for the page to load
driver.find_element_by_css_selector('{class-name}').click()
# Then you just get the page content
soup = BeautifulSoup(driver.page_source, 'html.parser')
# now you have the content loaded in BeautifulSoup and can manipulate it as you were doing previously
{YOUR CODE}
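If you do go this route, the single click above has to be repeated in a loop; a rough sketch (the CSS selector for the 'more' button is still a placeholder you have to fill in, and the number of clicks is up to you):

import time

clicks = 10   # each click loads roughly six more results on this site
for _ in range(clicks):
    try:
        driver.find_element_by_css_selector('{class-name}').click()  # replace with the real selector
    except Exception:
        break          # button no longer present, no more results
    time.sleep(3)      # give the new results time to load
soup = BeautifulSoup(driver.page_source, 'html.parser')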

How to scrape aspx pages with python

I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term like, say, "Andrew", the results have pagination; the request type is POST, so the URL does not change, and the sessions time out very quickly. So quickly that if I wait ten minutes and refresh the search URL it gives me a timeout error.
I got started with scraping recently, so I have mostly been doing GET requests where I can decipher the URL. So far I have realized that I will have to look at the DOM. Using Chrome's developer tools, I have found the headers, and from the Network tab I have also found the following form data that is passed from the search page to the results page:
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg...(i have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (i have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search
All the ones in caps are hidden. I have also managed to figure out the results structure.
My script thus far is really pathetic, as I am completely blank on what to do next. I still have to do the form submission, analyze the pagination, and scrape the results, but I have absolutely no idea how to proceed.
import re
import urlparse
import mechanize
from bs4 import BeautifulSoup
class DocumentFinderScraper(object):
    def __init__(self):
        self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent',
                               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

    ##TO DO
    ##submit form
    #get return URL
    #scrape results
    #analyze pagination

if __name__ == '__main__':
    scraper = DocumentFinderScraper()
    scraper.scrape()
Any help would be dearly appreciated
I disabled JavaScript and visited https://www.searchiqs.com/nybro/ to see what the form looks like without it.
In that state the Log In and Log In as Guest buttons are disabled. This makes it impossible for Mechanize to work, because it cannot process JavaScript and you won't be able to submit the form.
For this kind of problem you can use Selenium, which simulates a full browser, with the disadvantage of being slower than Mechanize.
This code should log you in using Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
usr = ""
pwd = ""
driver = webdriver.Firefox()
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
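Once Selenium has you logged in, you can fill the search form using the same control names that show up in the POST data listed in the question; here is a sketch under the assumption that crude sleeps are enough for the pages to load (an explicit WebDriverWait would be more robust):

import time

time.sleep(3)  # crude wait for the search page to appear after login
driver.find_element_by_name('ctl00$ContentPlaceHolder1$txtName').send_keys('david')
driver.find_element_by_name('ctl00$ContentPlaceHolder1$cmdSearch').click()
time.sleep(3)  # wait for the results grid
results_html = driver.page_source  # hand this to BeautifulSoup to scrape the rows and pagination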

How can I speed up this script?

I am required to retrieve 8000 answers from a website for research purposes (auto-filling a form and submitting it 8000 times). I wrote the script below, but when I run it, Python stops working after about 20 submissions and I'm unable to get what I need. Could you please help me find the problem with my script?
from mechanize import ParseResponse, urlopen, urljoin
import urllib2
from urllib2 import Request, urlopen, URLError
import mechanize
import time

URL = "url of the website"

br = mechanize.Browser()     # Creates a browser
br.set_handle_robots(False)  # ignore robots
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

def fetch(val):
    br.open(URL)          # Open the login page
    br.select_form(nr=0)  # Find the login form
    br['subject'] = 'question'
    br['value'] = val
    br.set_all_readonly(False)
    resp = br.submit()
    data = resp.read()
    br.reload()
    x = data.find("the answer is:")
    if x != -1:
        ur = data[x:x+100]
        print ur

val_list = val_list  # This list is available and contains 8000 different values

for i in range(0, 8000):
    fetch(val_list[i])
Having used mechanize in the past for a similar data-scraping task, I'd say you're almost certainly getting limited by the website, as Erbureth mentioned. Websites usually have a way to monitor connections and filter out exactly the type of thing you're attempting, and for good reason.
Putting aside for a moment whatever the purpose of your script may be and moving to your question of why it doesn't work: at the very least, I would put some delays in there so you're not trying to access the site repeatedly in such a short time span. Put a few seconds of pause between calls, and maybe it will work. (Although then you'll have to let it run for hours.)
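Concretely, the pause can go straight into your loop; a minimal sketch (the three seconds are a guess, tune it to whatever the site tolerates):

import time

for i in range(0, 8000):
    fetch(val_list[i])
    time.sleep(3)  # pause between submissions so the site doesn't cut you off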

Mechanize (Python) - Trouble with form submission

I'm trying to do something very simple using Python's Mechanize library. I want to go to http://careers.force.com/jobs/ts2__JobSearch, select Dublin, Ireland from the drop-down list, and then hit enter.
I've written a very short Python script for this, but for some reason when I run it, it returns the HTML for the default search page rather than the search page that is produced after selecting the location (Dublin, Ireland) and hitting enter. I have no idea what is going wrong:
import mechanize
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
br.open(link)
br.select_form('j_id0:j_id1:atsForm' )
br.form['j_id0:j_id1:atsForm:j_id38:1:searchCtrl'] = ["Ireland - Dublin"]
response = br.submit()
newsite = response.read()
This is in case you're still having this problem, or, if not, in case anyone else runs into it in the future...
I looked at the post data that your browser sends when you manually select something, and wrote a function for you that gets you to the page you want by manually performing a POST with urllib.urlencode'd data. Cheers.
import re, mechanize, cookielib, urllib

def get_search(html, controls):
    #viewstate
    s = re.search('ViewState" value="', html).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    state = html[s:e]
    #viewstateversion
    s = re.search('ViewStateVersion', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    version = html[s:e]
    #viewstatemac
    s = re.search('ViewStateMAC', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    mac = html[s:e]
    return {controls[0]: controls[0], controls[1]: '', controls[2]: 'Ireland - Dublin', controls[3]: 'Search',
            'com.salesforce.visualforce.ViewState': state,
            'com.salesforce.visualforce.ViewStateVersion': version,
            'com.salesforce.visualforce.ViewStateMAC': mac}
#Define variables and create a mechanize browser
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
cj=cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open(link)
#get the html data
html=br.response().read()
#get the control names from the correct form
br.select_form(nr=1)
controls=[control.name for control in br.form.controls]
#run function with html and control names list as parameters and run urllib.urlencode on what gets returned
postdata=urllib.urlencode(get_search(br.response().read(), controls))
#go to the webpage again but this time also submit the encoded data
br.open(link, postdata)
#There Ya Go
print br.response().read()

Any advice for sending a request to a website from Python?

def align_sequences(IDs):
    import webbrowser
    import urllib, urllib2
    url = 'http://www.uniprot.org/align/'
    params = {'query': IDs}
    data = urllib.urlencode(params)
    request = urllib2.Request(url, data)
    response = urllib2.urlopen(request)
    job_url = response.geturl()
    webbrowser.open(job_url)

align_sequences('Q4PRD1 Q7LZ61')
With this function I want to open 'http://www.uniprot.org/align/', request the protein sequences with IDs Q4PRD1 and Q7LZ61 to be aligned, and then open the website in my browser.
Initially it seems to be working fine - running the script will open the website and show the alignment job to being run. However, it will keep going forever and never actually finish, even if I refresh the page. If I input the IDs in the browser and hit 'align' it works just fine, taking about 8 seconds to align.
I am not familiar with the differences between running something directly from a browser and running it from Python. Do any of you have an idea of what might be going wrong?
Thank you :-)
~Max
You have to click the Align button. You can't do this with webbrowser, though. One option is to use Selenium:
from selenium import webdriver
url = 'http://www.uniprot.org/align/'
ids = 'Q4PRD1 Q7LZ61'
driver = webdriver.Firefox()
driver.get(url)
q = driver.find_element_by_id('alignQuery')
q.send_keys(ids)
btn = driver.find_element_by_id("sequence-align-submit")
btn.click()
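If you still want the job URL (to open it with webbrowser, as in your original function), you can read it off the driver after the click; a sketch with a fixed wait (the alignment took about eight seconds in your tests, so give it a bit more):

import time
import webbrowser

time.sleep(15)                 # rough wait for the alignment job page to load
job_url = driver.current_url   # the URL you would otherwise get from response.geturl()
print(job_url)
webbrowser.open(job_url)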
I think this is handled in JavaScript. If you look at the HTML of the Align button you can see
onclick="UniProt.analytics('AlignmentSubmissionPage', 'click', 'Submit align'); submitAlignForm();"
UniProt.analytics() and submitAlignForm() are JavaScript functions; the magic lives in the js-compr.js2013_11 file.
You can view that file using http://jsbeautifier.org/ and then reproduce in Python what the JavaScript does.
