Retrieving form results with requests - Python

I want to submit a multipart/form-data form that sets the input for a simulation on TRILEGAL, and then download the file made available on a redirected page.
I studied the documentation of requests, urllib, Grab, mechanize, etc., and it seems that with mechanize my code would be:
from mechanize import Browser
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = ["277.981111"]
browser['eq_delta'] = ["-19.0833"]
response = browser.submit()
content = response.read()
However, I could not test it because mechanize is not available in Python 3.
So I tried requests:
import requests
url = 'http://stev.oapd.inaf.it/cgi-bin/trilegal'
values = {'gal_coord': "2",
          'eq_alpha': "277.981111",
          'eq_delta': "-19.0833",
          'field': " 0.047117",
          }
r = requests.post(url, files = values)
but I can't figure out how to get to the results page. If I do
r.content
it displays the content of the form I just submitted, whereas if you open the actual website and click 'Submit', you see a new window (following the method="post" action="./trilegal_1.6").
How can I get to that new window with requests (i.e. follow the page that opens when I click the submit button), and then follow the link on the results page to retrieve the results file ("The results will be available after about 2 minutes at THIS LINK.")?
If you can point me to any other tool that could do the job I would be really grateful - I spent hours looking through SO for something that could help solve this problem.
Thank you!
Chris

Here is a working solution for Python 2.7:
from mechanize import Browser
from urllib import urlretrieve  # for downloading the results file
from bs4 import BeautifulSoup
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = ["277.981111"]
browser['eq_delta'] = ["-19.0833"]
response = browser.submit()
content = response.read()
soup = BeautifulSoup(content, 'html.parser')
base_url = 'http://stev.oapd.inaf.it'
# fetch the relative url from the page source and append it to the base url
link = soup.findAll('a')[0]['href'].split('..')[1]
url = base_url + str(link)
filename = 'test.dat'
# now download the file
urlretrieve(url, filename)
Your file will be downloaded as test.dat. You can open it with the appropriate program.

I'm posting a separate answer because the comments would get too cluttered. Thanks to @ksai, this works in Python 2.7:
import re
import time
from mechanize import Browser
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
#set appropriate form contents
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = "277.981111"
browser['eq_delta'] = "-19.0833"
browser['field'] = " 0.047117"
browser['photsys_file'] = ["tab_mag_odfnew/tab_mag_lsst.dat"]
browser["icm_lim"] = "3"
browser["mag_lim"] = "24.5"
response = browser.submit()
# wait 1 min while results are prepared
time.sleep(60)
# select the appropriate url
url = 'http://stev.oapd.inaf.it/' + str(browser.links()[0].url[3:])
# download the results file
browser.retrieve(url, 'test1.dat')
Thank you very much!
Chris
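Since the question asked for a Python 3 approach, here is a rough sketch of the same flow with requests and BeautifulSoup. It is untested: the field names are copied from the form values above, the action "./trilegal_1.6" is taken from the question, the results link is assumed to be the first anchor on the response page (as in the first answer), and the real form almost certainly needs more fields set than shown here.
import time
import requests
from bs4 import BeautifulSoup
base_url = 'http://stev.oapd.inaf.it'
# action="./trilegal_1.6" resolved relative to /cgi-bin/trilegal
form_action = base_url + '/cgi-bin/trilegal_1.6'
values = {'gal_coord': "2",
          'eq_alpha': "277.981111",
          'eq_delta': "-19.0833",
          'field': " 0.047117",
          }
# send the form fields as POST data (data=, not files=)
r = requests.post(form_action, data=values)
# the response page should contain a link to the results file
soup = BeautifulSoup(r.text, 'html.parser')
link = soup.find_all('a')[0]['href'].split('..')[1]
# wait while the results are prepared, then download the file
time.sleep(60)
result = requests.get(base_url + link)
with open('test.dat', 'wb') as f:
    f.write(result.content)
If the CGI script really requires multipart/form-data, the same fields can be sent with requests' files= parameter as {'gal_coord': (None, "2"), ...} instead of data=.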

Python requests-html not returning the page content

I'm new to Python and would like your advice on an issue I've encountered recently. I'm doing a small project where I'm trying to scrape a comic website to download a chapter (pictures). However, when printing out the page content for testing (because I tried to use BeautifulSoup's select() and got no result), it only showed a single line of HTML:
'document.cookie="VinaHost-Shield=a7a00919549a80aa44d5e1df8a26ae20"+"; path=/";window.location.reload(true);'
Any help would be really appreciated.
from requests_html import HTMLSession
session = HTMLSession()
res = session.get("https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html")
res.html.render()
print(res.content)
I also tried this, but the result was the same.
import requests, bs4
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
res = requests.get(url, headers={"User-Agent": "Requests"})
res.raise_for_status()
# soup = bs4.BeautifulSoup(res.text, "html.parser")
# onePiece = soup.select(".page-chapter")
print(res.content)
Update: I installed Docker and Splash (on Windows 11) and it worked. I've included the updated code. Thanks Franz and others for your help.
import os
import requests, bs4
os.makedirs("OnePiece", exist_ok=True)
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
res = requests.get("http://localhost:8050/render.html", params={"url": url, "wait": 5})
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
onePiece = soup.find_all("img", class_="lazy")
for element in onePiece:
    imageLink = "https:" + element["data-cdn"]
    res = requests.get(imageLink)
    imageFile = open(os.path.join("OnePiece", os.path.basename(imageLink)), "wb")
    for chunk in res.iter_content(100000):
        imageFile.write(chunk)
    imageFile.close()
import urllib.request
request_url = urllib.request.urlopen('https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html')
print(request_url.read())
It will return the HTML code of the page.
By the way, that HTML loads several images. You need to use a regex to track down those img URLs and download them.
This response means that the page needs a JavaScript renderer that reloads it using this cookie. To get the content, some workaround must be added.
I commonly use Splash, Scrapinghub's render engine; putting a wait in the page render is enough to get all the content. Other tools that render the page in the same way are Selenium for Python or Puppeteer in JS.
Links for Splash and Puppeteer
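For illustration, a minimal Selenium sketch along those lines, untested and assuming chromedriver is available on your PATH; it reuses the img.lazy selector from the update above:
import time
import bs4
from selenium import webdriver
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get(url)
time.sleep(5)  # give the cookie-setting JavaScript time to reload the page
# the rendered HTML now contains the real content
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
images = soup.find_all("img", class_="lazy")
print(len(images))
driver.quit()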

Login to aspx website using python requests

I'm trying to use requests to log into my school website, which uses ASPX, in order to scrape some data. My problem is similar to the one described here:
Log in to ASP website using Python's Requests module
However, my form also requires SubmitButton.x and SubmitButton.y, and I don't know where to get them from. I tried to pass in values that worked in a manual login, but it didn't work.
Here's the page
form data from successful manual login
from bs4 import BeautifulSoup
import requests
login_url = 'https://adfslight.resman.pl/LoginPage.aspx?ReturnUrl=%2f%3fwa%3dwsignin1.0%26wtrealm%3dhttps%253a%252f%252fcufs.resman.pl%253a443%252frzeszow%252fAccount%252fLogOn%26wctx%3drm%253d0%2526id%253dADFS%2526ru%253d%25252frzeszow%25252fFS%25252fLS%25253fwa%25253dwsignin1.0%252526wtrealm%25253dhttps%2525253a%2525252f%2525252fuonetplus.resman.pl%2525252frzeszow%2525252fLoginEndpoint.aspx%252526wctx%25253dhttps%2525253a%2525252f%2525252fuonetplus.resman.pl%2525252frzeszow%2525252fLoginEndpoint.aspx%26wct%3d2018-02-04T18%253a08%253a18Z&wa=wsignin1.0&wtrealm=https%3a%2f%2fcufs.resman.pl%3a443%2frzeszow%2fAccount%2fLogOn&wctx=rm%3d0%26id%3dADFS%26ru%3d%252frzeszow%252fFS%252fLS%253fwa%253dwsignin1.0%2526wtrealm%253dhttps%25253a%25252f%25252fuonetplus.resman.pl%25252frzeszow%25252fLoginEndpoint.aspx%2526wctx%253dhttps%25253a%25252f%25252fuonetplus.resman.pl%25252frzeszow%25252fLoginEndpoint.aspx&wct=2018-02-04T18%3a08%3a18Z'
data = {}
with requests.Session() as s:
    page = s.get(login_url).content
    soup = BeautifulSoup(page, "lxml")
    data["__EVENTTARGET"] = ""
    data["__EVENTARGUMENT"] = ""
    data["___VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
    data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
    data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
    data["UsernameTextBox"] = "myusername"
    data["PasswordTextBox"] = "mypassword"
    data["SubmitButton.x"] = "49"
    data["SubmitButton.y"] = "1"
    s.post(login_url, data=data)
    open_page = s.get("https://uonetplus.resman.pl/rzeszow/Start.mvc/Index")
    print(open_page.text)

Using "requests" (in ipython) to download pdf files

I want to download all the pdf docs corresponding to a list of "API#" values from http://imaging.occeweb.com/imaging/UIC1012_1075.aspx
So far I have managed to post the "API#" request, but I'm not sure what to do next.
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
API = '15335187'
payload = {'txtIndex7':'1','txtIndex2': API}
session = requests.Session()
res = session.post(url,headers=headers,data=payload)
It is a bit more complicated than that: there are some additional event validation hidden input fields that you need to take into account. For that, you first need to get the page, collect all the hidden values, set the value for the API, make a POST request, and then parse the HTML response.
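As a rough illustration of that manual approach with plain requests (an untested sketch: the hidden ASP.NET fields are collected generically rather than hard-coded, txtIndex7/txtIndex2 are the field names from the question, and the 'Button1' value sent here is only an assumption):
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
API = '15335187'
with requests.Session() as s:
    page = s.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    # copy every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...) into the payload
    payload = {inp['name']: inp.get('value', '')
               for inp in soup.select('input[type=hidden]') if inp.get('name')}
    # set the search fields and the submit button
    payload['txtIndex7'] = '1012'
    payload['txtIndex2'] = API
    payload['Button1'] = 'Search'  # assumption: the submit button's value attribute
    res = s.post(url, headers=headers, data=payload)
    # the result rows can then be parsed out of res.text with BeautifulSoup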
Fortunately, there is a tool called MechanicalSoup that can auto-fill these hidden fields in your form submission request. Here is a complete solution, including sample code for parsing the resulting table:
import mechanicalsoup
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
API = '15335187'
browser = mechanicalsoup.StatefulBrowser(
    user_agent='Mozilla/5.0'
)
browser.open(url)
# Fill-in the search form
browser.select_form('form#Form1')
browser["txtIndex2"] = API
browser.submit_selected("Button1")
# Display the results
for tr in browser.get_current_page().select('table#DataGrid1 tr'):
    print([td.get_text() for td in tr.find_all("td")])
import mechanicalsoup
import urllib
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
Form = '1012'
API = '15335187'
browser = mechanicalsoup.StatefulBrowser(
    user_agent='Mozilla/5.0'
)
browser.open(url)
# Fill-in the search form
browser.select_form('form#Form1')
browser["txtIndex7"] = Form
browser["txtIndex2"] = API
browser.submit_selected("Button1")
# Display the results
for tr in browser.get_current_page().select('table#DataGrid1 tr')[2:]:
    try:
        pdf_url = tr.select('td')[0].find('a').get('href')
    except:
        print('Pdf not found')
    else:
        pdf_id = tr.select('td')[0].text
        response = urllib.urlopen(pdf_url)  # for Python 2.7; for Python 3 use urllib.request.urlopen()
        pdf_str = "C:\\Data\\" + pdf_id + ".pdf"
        file = open(pdf_str, 'wb')
        file.write(response.read())
        file.close()
        print('Pdf ' + pdf_id + ' saved')
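As the comment in the loop notes, the download call differs on Python 3. A sketch of just that step with urllib.request, assuming the same pdf_url, pdf_id and output directory as above:
import urllib.request
response = urllib.request.urlopen(pdf_url)
with open("C:\\Data\\" + pdf_id + ".pdf", "wb") as f:
    f.write(response.read())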

How to scrape Ajax webpage using python

I am learning Python scraping techniques, but I am stuck on the problem of scraping an Ajax page like this one.
I want to scrape all the medicine names and details that appear on the page. I have read most of the answers on Stack Overflow, but I am not getting the right data after scraping. I also tried to scrape using Selenium or by sending a forged POST request, but it failed.
So please help me with this Ajax scraping topic, especially this page, because the Ajax request is triggered by selecting an option from the dropdown.
Also, please point me to some resources for Ajax page scraping.
# using selenium
from selenium import webdriver
import bs4 as bs
import lxml
import requests
path_to_chrome = '/home/brutal/Desktop/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chrome)
url = 'https://www.gianteagle.com/Pharmacy/Savings/4-10-Dollar-Drug-Program/Generic-Drug-Program/'
browser.get(url)
browser.find_element_by_xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option[contains(text(), "Ohio")]').click()
new_url = browser.current_url
r = requests.get(new_url)
print(r.content)
You can download ChromeDriver here.
normalize-space is used in order to remove junk from the web text, such as x0 characters.
from time import sleep
from selenium import webdriver
from lxml.html import fromstring
data = {}
driver = webdriver.Chrome('PATH TO YOUR DRIVER/chromedriver') # i.e '/home/superman/www/myproject/chromedriver'
driver.get('https://www.gianteagle.com/Pharmacy/Savings/4-10-Dollar-Drug-Program/Generic-Drug-Program/')
# Loop states
for i in range(2, 7):
    dropdown_state = driver.find_element(by='id', value='ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList')
    # open dropdown
    dropdown_state.click()
    # click state
    driver.find_element_by_xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option['+str(i)+']').click()
    # let download the page
    sleep(3)
    # prepare HTML
    page_content = driver.page_source
    tree = fromstring(page_content)
    state = tree.xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option['+str(i)+']/text()')[0]
    data[state] = []
    # Loop products inside the state
    for line in tree.xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_gridSearchResults"]/tbody/tr[@style]'):
        med_type = line.xpath('normalize-space(.//td[@class="medication-type"])')
        generic_name = line.xpath('normalize-space(.//td[@class="generic-name"])')
        brand_name = line.xpath('normalize-space(.//td[@class="brand-name hidden-xs"])')
        strength = line.xpath('normalize-space(.//td[@class="strength"])')
        form = line.xpath('normalize-space(.//td[@class="form"])')
        qty_30_day = line.xpath('normalize-space(.//td[@class="30-qty"])')
        price_30_day = line.xpath('normalize-space(.//td[@class="30-price"])')
        qty_90_day = line.xpath('normalize-space(.//td[@class="90-qty hidden-xs"])')
        price_90_day = line.xpath('normalize-space(.//td[@class="90-price hidden-xs"])')
        data[state].append(dict(med_type=med_type,
                                generic_name=generic_name,
                                brand_name=brand_name,
                                strength=strength,
                                form=form,
                                qty_30_day=qty_30_day,
                                price_30_day=price_30_day,
                                qty_90_day=qty_90_day,
                                price_90_day=price_90_day))
print('data:', data)
driver.quit()

Performing a get request in Python

Please tell me why these similar pieces of code get different results.
The first one (yandex.ru) gets the requested page, while the other one (moyareklama.ru) gets the main page of the site.
import urllib
base = "http://www.moyareklama.ru/single_ad_new.php?"
data = {"id":"201623465"}
url = base + urllib.urlencode(data)
print url
page = urllib.urlopen(url).read()
f = open ("1.html", "w")
f.write(page)
f.close()
print page
##base = "http://yandex.ru/yandsearch?"
##data = (("text","python"),("lr","192"))
##url = base + urllib.urlencode(data)
##print url
##page = urllib.urlopen(url).read()
##f = open ("1.html", "w")
##f.write(page)
##f.close()
##print page
Most likely the reason you get something different with urllib.urlopen than in your browser is that your browser can be redirected by JavaScript and meta refresh tags as well as by standard HTTP 301/302 responses. I'm pretty sure the urllib module will only follow HTTP 301/302 redirects.
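If the moyareklama.ru page is redirecting with a meta refresh tag, one rough way to follow it by hand is sketched below (untested, written in the question's Python 2 style; the regex is only illustrative and assumes the refresh URL is absolute):
import re
import urllib
base = "http://www.moyareklama.ru/single_ad_new.php?"
url = base + urllib.urlencode({"id": "201623465"})
page = urllib.urlopen(url).read()
# look for a <meta http-equiv="refresh" content="...; url=..."> tag and follow it manually
match = re.search(r'http-equiv=["\']refresh["\'][^>]*url=([^"\'>]+)', page, re.IGNORECASE)
if match:
    page = urllib.urlopen(match.group(1)).read()
print page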
