This might be a bit too direct a question; I am new to Python.
I am trying to parse/scrape a video link from a video website (Putlocker), e.g. http://www.putlocker.com/file/A189D40E3E612C50.
The page initially comes up with code like the following:
<form method="post">
    <input type="hidden" value="3d0865fbb040e670" name="hash">
    <input name="confirm" type="submit" value="Continue as Free User"
           disabled="disabled" id="submitButton" class="confirm_button" style="width:190px;">
</form>
value="3d0865fbb040e670" Changes everytime...
import urllib
import urllib2

url = 'http://www.putlocker.com/file/A189D40E3E612C50'
response = urllib2.urlopen(url)
page = response.read()
From here I find the value of the hash field, then:
url = 'http://www.putlocker.com/file/A189D40E3E612C50'
values = {'hash': '3d0865fbb040e670'}
data = urllib.urlencode(values)
response = urllib2.urlopen(url)
page = response.read()
But I end up on the same page again. Do I need to post value="Continue as Free User" as well? How do I go about posting both values? Working code would be helpful; I am trying hard but to no avail so far.
OK, after suggestions from a few programmers, I tried code like the following:
import re
import urllib
import urllib2

url = 'http://www.putlocker.com/file/A189D40E3E612C50'
response = urllib2.urlopen(url)
html = response.read()

# extract the per-session hash from the hidden form field
r = re.search('value="([0-9a-f]+?)" name="hash"', html)
session_hash = r.group(1)
print session_hash

# post the hash and the confirm button value back to the same URL
form_values = {}
form_values['hash'] = session_hash
form_values['confirm'] = 'Continue as Free User'
data = urllib.urlencode(form_values)
response = urllib2.urlopen(url, data=data)
html = response.read()
print html
So I am returned to the same page again. What am I doing wrong here? I have seen something called pycurl, but I want to use something simpler. Any clue?
You do need to pass your encoded values to the urlopen call:
response = urllib2.urlopen(url, data)
Otherwise you will send another GET request instead of a POST.
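Putting the pieces together, a minimal sketch of the whole fetch, parse, re-post flow. The regex and form field names are taken from the HTML shown above; the site itself may still refuse the request for other reasons (e.g. a wait timer enforced in JavaScript), so treat this as a sketch rather than guaranteed working code:

import re
import urllib
import urllib2
import cookielib

url = 'http://www.putlocker.com/file/A189D40E3E612C50'

# keep cookies between the GET and the POST, in case the site ties the hash to a session
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

html = opener.open(url).read()
session_hash = re.search('value="([0-9a-f]+?)" name="hash"', html).group(1)

data = urllib.urlencode({'hash': session_hash,
                         'confirm': 'Continue as Free User'})
html = opener.open(url, data).read()   # passing data makes this a POST
print html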
Related
I've looked at quite a few suggestions for clicking on buttons on web pages using python, but don't fully understand what the examples are doing and can't get them to work (particularly the ordering and combination of values).
I'm trying to download a PDF from a web site.
The first time you click on the PDF to download it, it takes you to a page where you have to click on "Agree and Proceed". Once you do that the browser stores the cookie (so you never need to agree again) and then opens the PDF in the browser (which is what I want to download).
Here is the link to the accept page: https://www.asx.com.au/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753
I've used Chrome Developer Tools to get this:
<form name="showAnnouncementPDFForm" method="post" action="announcementTerms.do">
<input value="Decline" onclick="window.close();return false;" type="submit">
<input value="Agree and proceed" type="submit">
<input name="pdfURL" value="/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf" type="hidden">
</form>
and this is the final page you get to: https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf
I then tried to use it like this:
import requests
values = {}
values['showAnnouncementRDFForm'] = 'announcementTerms.do'
values['pdfURL'] = '/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf'
req = requests.post('https://asx.com.au/', data=values)
print(req.text)
I tried a variety of URLs and changed which values I provide, but I don't think it's working correctly.
The print at the end gives me what looks like the HTML of a web page. I'm not sure exactly what it is, as I'm doing this from the command line of a server I'm SSH'd into (a Pi), but I'm confident it's not the PDF I'm after.
As a final solution, I'd like the Python code to take the PDF link, automatically agree and proceed, store the cookie for next time to avoid future approvals, and then download the PDF.
Hope that made sense and thanks for taking the time to read my question.
Markus
If you want to download the file directly and you know the URL you can access it without using a cookie:
import requests
response = requests.get("https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf")
with open('./test1.pdf', 'wb') as f:
f.write(response.content)
If you don't know the URL you can read it from the form then access it directly without a cookie:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.asx.com.au"
response = requests.get(f"{base_url}/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753")
soup = BeautifulSoup(response.text, 'html.parser')
pdf_url = soup.find('input', {'name': 'pdfURL'}).get('value')
response = requests.get(f'{base_url}{pdf_url}')
with open('./test2.pdf', 'wb') as f:
f.write(response.content)
If you want to set the cookie:
import requests
cookies = {'companntc': 'tc'}
response = requests.get("https://www.asx.com.au/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf", cookies=cookies)
with open('./test3.pdf', 'wb') as f:
f.write(response.content)
If you really want to use POST:
import requests
payload = {'pdfURL': '/asxpdf/20200506/pdf/44hlvnb8k3n3f8.pdf'}
response = requests.post('https://www.asx.com.au/asx/statistics/announcementTerms.do', params=payload)
with open('./test4.pdf', 'wb') as f:
f.write(response.content)
Or read the pdfURL from the form and do a POST:
import requests
from bs4 import BeautifulSoup
base_url = "https://www.asx.com.au"
response = requests.get(f"{base_url}/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753")
soup = BeautifulSoup(response.text, 'html.parser')
pdf_url = soup.find('input', {'name': 'pdfURL'}).get('value')
payload = {'pdfURL': pdf_url}
response = requests.post(f"{base_url}/asx/statistics/announcementTerms.do", params=payload)
with open('./test5.pdf', 'wb') as f:
f.write(response.content)
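If you also want the script to behave like the browser flow described in the question (accept once, keep the cookie, then download), a requests.Session carries whatever cookies the terms page sets across requests. A rough sketch, assuming the form fields and the announcementTerms.do action shown above; here the form values go in the request body (data=), mirroring the HTML form, unlike the params= variants above:

import requests
from bs4 import BeautifulSoup

base_url = "https://www.asx.com.au"

with requests.Session() as session:
    # load the accept page and pull the hidden pdfURL out of the form
    response = session.get(f"{base_url}/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02232753")
    soup = BeautifulSoup(response.text, 'html.parser')
    pdf_url = soup.find('input', {'name': 'pdfURL'}).get('value')

    # submit the "Agree and proceed" form; any cookie it sets stays on the session
    session.post(f"{base_url}/asx/statistics/announcementTerms.do", data={'pdfURL': pdf_url})

    # the PDF itself can now be fetched on the same session
    response = session.get(f"{base_url}{pdf_url}")
    with open('./test6.pdf', 'wb') as f:
        f.write(response.content)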
I have been reading a lot about how to submit a form with Python and then read and scrape the resulting page. However, I cannot manage it with the specific form I am filling in; my code returns the HTML of the form page. Here is my code:
import requests
values = {}
values['archive'] = "1"
values['descripteur[]'] = ["mc82", "mc84"]
values['typeavis[]'] = ["10","6","7","8","9"]
values['dateparutionmin'] = "01/01/2015"
values['dateparutionmax'] = "31/12/2015"
req = requests.post('https://www.boamp.fr/avis/archives', data=values)
print req.text
Any suggestions appreciated.
req.text is just the HTML of the original form page again.
You may be posting the data to the wrong page. I visited the URL and submitted the form myself, and found that the POST data is sent to https://www.boamp.fr/avis/liste (a tool like Fiddler is sometimes useful to figure out the process).
So your code should be:
req = requests.post('https://www.boamp.fr/avis/liste', data=values)
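For completeness, here is the whole snippet with only the URL changed. The field names are the ones from the question; whether the server also needs extra parameters or headers is not verified here:

import requests

# same form values as before, posted to the endpoint the site actually submits to
values = {}
values['archive'] = "1"
values['descripteur[]'] = ["mc82", "mc84"]
values['typeavis[]'] = ["10", "6", "7", "8", "9"]
values['dateparutionmin'] = "01/01/2015"
values['dateparutionmax'] = "31/12/2015"

req = requests.post('https://www.boamp.fr/avis/liste', data=values)
print req.text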
I'm trying to bypass this page http://casesearch.courts.state.md.us/casesearch/ by mimicking the POST data when you click "I Agree". I've been able to do it with PHP but for some reason I can't recreate the request in Python. The response is always the page with the disclaimer despite sending the correct POST data. The title of the page I want to land on is "Maryland Judiciary Case Search Criteria". I'm using Python 3.
url = "http://casesearch.courts.state.md.us/casesearch/"
postdata = parse.urlencode({'disclaimer':'Y','action':'Continue'})
postdata = postdata.encode("utf-8")
header = {"Content-Type":"application/x-www-form-urlencoded"}
req = request.Request(url,data=postdata,headers=header)
res = request.urlopen(req)
print(res.read())
It looks like the URL you want to post to is actually http://casesearch.courts.state.md.us/casesearch/processDisclaimer.jis. Try this:
url = "http://casesearch.courts.state.md.us/casesearch/processDisclaimer.jis"
postdata = parse.urlencode({'disclaimer':'Y','action':'Continue'})
postdata = postdata.encode("utf-8")
header = {"Content-Type":"application/x-www-form-urlencoded"}
req = request.Request(url,data=postdata,headers=header)
res = request.urlopen(req)
print(res.read())
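If later requests (for example, the actual search) need the server to remember that you accepted the disclaimer, you can attach a cookie jar to an opener so the session cookie set by this POST is reused. A minimal sketch along those lines, not verified against the site:

from http import cookiejar
from urllib import parse, request

jar = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(jar))

url = "http://casesearch.courts.state.md.us/casesearch/processDisclaimer.jis"
postdata = parse.urlencode({'disclaimer': 'Y', 'action': 'Continue'}).encode("utf-8")
header = {"Content-Type": "application/x-www-form-urlencoded"}

# accept the disclaimer once; the cookie jar keeps whatever session cookie is set
res = opener.open(request.Request(url, data=postdata, headers=header))
print(res.read())

# subsequent requests made through the same opener send the stored cookies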
I've been trying to learn to use the urllib2 package in Python. I tried to log in as a student (the left form) on a signup page for maths students: http://reg.maths.lth.se/. I have inspected the code (using Firebug) and the left form should obviously be submitted with POST and a key called pnr whose value should be a string 10 characters long (the last part can perhaps not be seen from the HTML code, but it is basically my social security number, so I know how long it should be). Note that the action in the header for the appropriate POST method is another URL, namely http://reg.maths.lth.se/login/student.
I tried (with a fake pnr in the example below, but I used my real number in my own code).
import urllib
import urllib2
url = 'http://reg.maths.lth.se/'
values = dict(pnr='0000000000')
data = urllib.urlencode(values)
req = urllib2.Request(url,data)
resp = urllib2.urlopen(req)
page = resp.read()
print page
While this executes, what is printed is the source code of the original page http://reg.maths.lth.se/, so it doesn't seem like I logged in. Also, I can add any key/value pairs to the values dictionary without producing any error, which seems strange to me.
Also, if I go to the page http://reg.maths.lth.se/login/student, there is clearly no POST method for submitting data.
Any suggestions?
If you inspect the request that is sent to the server when you enter the number and submit the form, you will notice that it is a POST request with pnr and _token parameters.
You are missing the _token parameter which you need to extract from the HTML source of the page. It is a hidden input element:
<input name="_token" type="hidden" value="WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m">
I suggest looking into tools like Mechanize, MechanicalSoup or RoboBrowser, which would ease the form submission. You may also parse the HTML yourself with an HTML parser like BeautifulSoup, extract the token, and send the request via urllib2 or requests:
import requests
from bs4 import BeautifulSoup
PNR = "00000000"
url = "http://reg.maths.lth.se/"
login_url = "http://reg.maths.lth.se/login/student"
with requests.Session() as session:
    # extract token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("input", {"name": "_token"})["value"]

    # submit form
    session.post(login_url, data={
        "_token": token,
        "pnr": PNR
    })

    # navigate to the main page again (should be logged in)
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title)
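Since the question uses urllib2, here is a rough equivalent of the same flow with urllib2 and cookielib. The token is pulled out with a regex for brevity; a real HTML parser as above is more robust:

import re
import urllib
import urllib2
import cookielib

url = "http://reg.maths.lth.se/"
login_url = "http://reg.maths.lth.se/login/student"

# a cookie jar so the session cookie from the first GET is reused for the POST
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

html = opener.open(url).read()
token = re.search(r'name="_token"[^>]*value="([^"]+)"', html).group(1)

data = urllib.urlencode({"_token": token, "pnr": "0000000000"})
opener.open(login_url, data)            # passing data makes this a POST

print opener.open(url).read()           # the main page again, now as a logged-in session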
I'm working on using the Python requests module to log in to a web page. I get the csrf_token by doing a GET request on the URL and parsing it with BeautifulSoup, which works great, but when I do a POST and send the csrf_token from the GET, the csrf_token has changed and I can't log in with an invalid csrf_token.
I understand the CSRF token changes from one GET to the next, but it shouldn't change between the GET and the following POST.
How do I get the csrf_token to not change? I do have access to the source too, but I didn't write it.
When I step into the code with pdb.set_trace(), I can change the csrf_token to what I got from the GET and continue, and then everything works.
Here is the requests code I have:
import sys
import requests
from bs4 import BeautifulSoup
#~ URL = 'https://portal.bitcasa.com/login'
URL = 'http://0.0.0.0:12080/login'
EMAIL = 'foo@foo.com'
PASSWORD = 'abc123'
CLIENT_ID = 12345
client = requests.session(config={'verbose': sys.stderr})
# Retrieve the CSRF token first
soup = BeautifulSoup(client.get(URL).content)
csrftoken = soup.find('input', dict(name='csrf_token'))['value']
print csrftoken
# Parameters to pass
data = dict(email=EMAIL, password=PASSWORD, csrf_token=csrftoken)
headers = dict(referer=URL)
params = dict(client_id=CLIENT_ID)
r = client.post(URL, data=data, headers=headers, params=params)
print r
print r.text
I can log in to other web pages with this method.
What other information can I provide to help you help me?
Thanks
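For reference, a sketch of the same GET-then-POST flow written against the current requests API (requests.session() no longer accepts a config argument; a Session object is used instead). The field names are the ones from the code above; whether this resolves the changing-token problem depends on the server, so treat it only as a sketch:

import requests
from bs4 import BeautifulSoup

URL = 'http://0.0.0.0:12080/login'
EMAIL = 'foo@foo.com'
PASSWORD = 'abc123'
CLIENT_ID = 12345

with requests.Session() as client:
    # GET the login page and pull out the current csrf_token
    soup = BeautifulSoup(client.get(URL).content, 'html.parser')
    csrftoken = soup.find('input', {'name': 'csrf_token'})['value']

    # POST the login form on the same session, so the cookie the token
    # is typically bound to is sent back along with it
    r = client.post(URL,
                    data={'email': EMAIL, 'password': PASSWORD, 'csrf_token': csrftoken},
                    headers={'referer': URL},
                    params={'client_id': CLIENT_ID})
    print(r.status_code)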