How to use Python to reach an external web page

I have an external web page like www.3thpart.com/folder/test.jsp?page=IND
This page uses a "TOKEN" to authenticate. There is no login (authentication), but for each different request the web site creates a new token.
Users in my company use this page many times a day. To make this easier and save time, what I need to do is collect the result in the backend and then display it in an HTML page in front of the users.
I tried Python with Visual Studio, but I am confused. Could you advise me on the approach I should follow? I am trying to understand the logic of this situation.
Thanks in advance
C.A.
UPDATE:
import requests

url = 'https://subdomain.domain.com.tr/folder/main.jsp?page=ID'
payload = {
    'Requested': 'fieldValue',
    'Cmdd ': 'fieldValue',
    'TOKEN': '',
    'LANG': 'tr',
}

s = requests.Session()
response = s.post(url, data=payload)
response.encoding = 'UTF-8'  # set the encoding before reading response.text

# write the response body to a file, closing it automatically
with open("/Projects/Gib/test.txt", "w", encoding="utf-8") as outfile:
    outfile.write(response.text)
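Since the question is about a token that is re-issued for each request, the usual logic is: GET the page first, parse the fresh token out of the HTML, then send it back in the POST. A minimal sketch follows; the hidden-input markup and the field name 'TOKEN' are assumptions, so check the real page source or the browser's Network tab for the actual names.

import requests
from bs4 import BeautifulSoup

url = 'https://subdomain.domain.com.tr/folder/main.jsp?page=ID'

s = requests.Session()  # keeps cookies between the GET and the POST
page = s.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# assumes the token is exposed as <input type="hidden" name="TOKEN" value="...">;
# adjust the selector to whatever the real page uses
token = soup.find('input', attrs={'name': 'TOKEN'})['value']

payload = {
    'Requested': 'fieldValue',
    'TOKEN': token,  # send the freshly issued token back
    'LANG': 'tr',
}
response = s.post(url, data=payload)
print(response.text)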

Related

How to access a website with a non-inspectable login pop-up

I am trying to write a program to scrape data from this website: https://www.dmr.nd.gov/oilgas/feeservices/getscoutticket.asp
The issue is that when you go to that link, a login pop-up appears that is non-inspectable. There is only a blank HTML page behind it, and I can't figure out how to get the username and password (which I have) into the login box. I have tried various POST requests, but since the pop-up doesn't have a separate URL and doesn't seem to be accessible from my current code, none of them work.
from bs4 import BeautifulSoup
import requests

url = 'https://www.dmr.nd.gov/oilgas/feeservices/getscoutticket.asp'
s = requests.Session()
payload = {
    'dbconnect': 'y',
    'entryPoint': 1001,
    'nomblogon': 0,
    'Username': 'username',
    'Password': 'password',
}
s.post(url, data=payload)
site = s.get(url)
soup = BeautifulSoup(site.content, "html.parser")
Has anyone dealt with this type of login pop-up before? Is there any way to get past it in code?
I am currently writing in Python, but Java answers would also be appreciated.
That non-inspectable pop-up is Basic HTTP Authentication, and you handle it with something like this in requests:
import requests
from requests.auth import HTTPBasicAuth

basic = HTTPBasicAuth('user', 'pass')
requests.get('https://www.dmr.nd.gov/oilgas/feeservices/getscoutticket.asp', auth=basic)
I suggest reading the requests documentation, which can be found at
https://requests.readthedocs.io/en/latest/user/quickstart/
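To tie this back to the scraping code in the question, the credentials can also be attached to a Session so every request carries them. A minimal sketch (the parsing step is just the question's own BeautifulSoup call):

import requests
from bs4 import BeautifulSoup

s = requests.Session()
s.auth = ('username', 'password')  # sent as Basic auth on every request

site = s.get('https://www.dmr.nd.gov/oilgas/feeservices/getscoutticket.asp')
site.raise_for_status()  # fail loudly if the credentials were rejected (401)
soup = BeautifulSoup(site.content, 'html.parser')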

Login to a website (kicktipp.de) with the Python requests library

I am trying to use Python to log in to a website (kicktipp.de). This is the code I came up with by studying other Stack Overflow questions. Unfortunately, for reasons I don't understand yet, it is not working. Can you help me? Thanks in advance!
import requests

payload = {
    'kennung': 'name',
    'passwort': 'pw'
}

with requests.Session() as s:
    p = s.post('https://www.kicktipp.de/alternativlos/profil/login',
               data=payload)
    # print(p.text)
    r = s.get('https://www.kicktipp.de/alternativlos/tippuebersicht')
    print(r.text)
Before posting, you should GET the page once, so that you receive the session cookies and can re-transmit them.
Also, you post to login, but you should post to loginaction:
https://www.kicktipp.de/alternativlos/profil/loginaction
And finally, you are missing the _charset_ POST parameter.
As a word of advice, when you try to do this, open the browser developer tools on the "Network" tab.
Check "Preserve log" (in Chrome), then log in as you normally would.
In the console you'll see every request being made.
The first one is the POST that you are trying to reproduce; copy it as closely as possible.
See:
import requests

payload = {
    'kennung': 'name',
    'passwort': 'pw',
    '_charset_': 'UTF-8'
}

with requests.Session() as s:
    s.get('https://www.kicktipp.de/alternativlos/profil/login')  # get session cookie
    p = s.post('https://www.kicktipp.de/alternativlos/profil/loginaction',
               data=payload)  # log in
    # print(p.text)
    r = s.get('https://www.kicktipp.de/alternativlos/tippuebersicht')
    print(r.text)
The URL to post to, when you check the form action in the browser inspector, is /alternativlos/profil/loginaction.
...
p = s.post('https://www.kicktipp.de/alternativlos/profil/loginaction',
...
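As a follow-up, it can help to verify that the login actually succeeded before scraping further. A small, hypothetical check (the redirect-back-to-login behaviour is an assumption about how kicktipp.de reacts to unauthenticated requests):

r = s.get('https://www.kicktipp.de/alternativlos/tippuebersicht')
# many sites bounce unauthenticated sessions back to the login form
if 'profil/login' in r.url:
    raise RuntimeError('login failed: redirected back to the login page')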

Trigger data response from .aspx page

from bs4 import BeautifulSoup
from pprint import pprint
import requests

url = 'http://estadistico.ut.com.sv/OperacionDiaria.aspx'

s = requests.Session()
pagereq = s.get(url)
soup = BeautifulSoup(pagereq.content, 'lxml')

# scrape the ASP.NET hidden form fields from the page
viewstategenerator = soup.find("input", attrs={'id': '__VIEWSTATEGENERATOR'})['value']
viewstate = soup.find("input", attrs={'id': '__VIEWSTATE'})['value']
eventvalidation = soup.find("input", attrs={'id': '__EVENTVALIDATION'})['value']

# these values are copied from the recorded POST request
eventtarget = 'ASPxDashboardViewer1'
DXCss = '1_33,1_4,1_9,1_5,15_2,15_4'
DXScript = '1_232,1_134,1_225,1_169,1_187,15_1,1_183,1_182,1_140,1_147,1_148,1_142,1_141,1_143,1_144,1_145,1_146,15_0,15_6,15_7'
eventargument = {"Task":"Export","ExportInfo":{"Mode":"SingleItem","GroupName":"pivotDashboardItem1","FileName":"Generación+por+tipo+de+tecnología+(MWh)","ClientState":{"clientSize":{"width":509,"height":385},"titleHeight":48,"itemsState":[{"name":"pivotDashboardItem1","headerHeight":34,"position":{"left":11,"top":146},"width":227,"height":108,"virtualSize":'null',"scroll":{"horizontal":'true',"vertical":'true'}}]},"Format":"Excel","DocumentOptions":{"paperKind":"Letter","pageLayout":"Portrait","scaleMode":"AutoFitWithinOnePage","scaleFactor":1,"autoFitPageCount":1,"showTitle":'true',"title":"Operación+Diaria","imageFormatOptions":{"format":"Png","resolution":96},"excelFormatOptions":{"format":"Csv","csvValueSeparator":","},"commonOptions":{"filterStatePresentation":"None","includeCaption":'true',"caption":"Generación+por+tipo+de+tecnología+(MWh)"},"pivotOptions":{"printHeadersOnEveryPage":'true'},"gridOptions":{"fitToPageWidth":'true',"printHeadersOnEveryPage":'true'},"chartOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"pieOptions":{"autoArrangeContent":'true'},"gaugeOptions":{"autoArrangeContent":'true'},"cardOptions":{"autoArrangeContent":'true'},"mapOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"rangeFilterOptions":{"automaticPageLayout":'true',"sizeMode":"Stretch"},"imageOptions":{},"fileName":"Generación+por+tipo+de+tecnología+(MWh)"},"ItemType":"PIVOT"},"Context":"BwAHAAIkY2NkNWRiYzItYzIwNS00MDIyLTkzZjUtYWQ0NzVhYTM5Y2E3Ag9PcGVyYWNpb25EaWFyaWECAAIAAAAAAMByQA==","RequestMarker":1,"ClientState":{}}

postdata = {
    '__EVENTTARGET': eventtarget,
    '__EVENTARGUMENT': eventargument,
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewstategenerator,
    '__EVENTVALIDATION': eventvalidation,
    'DXScript': DXScript,
    'DXCss': DXCss
}

datareq = s.post(url, data=postdata)
print(datareq.text)
I'm trying to scrape data from this .aspx web page. The page loads the data dynamically via JavaScript, so scraping directly with requests/BeautifulSoup won't work.
By looking at the network traffic I can see that when you click the export ("Exportar a") button for an element, select an export type (Excel, CSV), and then confirm, a POST request is made to the page. It returns a base64-encoded string of the data I need. As far as I can tell there is no way to make a GET request for the file directly, as it is only generated on request.
What I'm trying to do is copy the POST request which triggers the CSV response. So first I scrape __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION. __EVENTTARGET, DXCss and DXScript look to be fixed. __EVENTARGUMENT is copied directly from the POST request.
My code returns a server application error. I'm thinking the problem is either a) a wrong __EVENTARGUMENT (maybe partly dynamic rather than fixed?), b) not really understanding how .aspx pages work, or c) what I'm trying to do isn't possible with these tools.
I did look at using Selenium to trigger the data export, but I couldn't see a way to capture the server response.
I was able to get help from someone who knows more about .aspx pages than I do.
Here is a link to the GitHub gist that provides the solution:
https://gist.github.com/jarek/d73c672d8dd4ddb48d80bffc4d8038ba
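One detail worth double-checking in code like the question's (this is an observation about requests, not part of the linked solution): when a dict such as eventargument is passed as a value inside data, requests does not serialize it as JSON, so the server receives something very different from what the browser sent. If the recorded request carried __EVENTARGUMENT as a single JSON string, a plausible fix is:

import json

# send the event argument as one JSON string, the way the browser did;
# note the quoted 'null'/'true' values above would also need to become
# real None/True for json.dumps to produce faithful JSON
postdata['__EVENTARGUMENT'] = json.dumps(eventargument)
datareq = s.post(url, data=postdata)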

Enabling cookies in a Python HTTP POST request

So I am trying to write a script that submits a form containing two fields, a username and a password, in a POST request, but the site responds with:
"This system requires the use of HTTP cookies to verify authorization information. Our system has detected that your browser has disabled HTTP cookies, or does not support them."
EDIT: With the modified code below I believe I can successfully log in to the page. The only thing is that when I print the page's HTML text to the terminal, it only displays an html element and a head element containing the URL of the page; however, I've inspected the actual HTML of the page when I log in and there is a lot missing. Does anyone know why this might be?
import requests

url = "https://someurl"
payload = {
    'username': 'myname',
    'password': '1234'
}
headers = {
    'User-Agent': 'Mozilla/5.0'
}

session = requests.Session()
session.headers.update(headers)  # actually send the custom User-Agent
page = session.post(url, data=payload)
Without the precise URL it is very hard to give you an answer.
Many web pages are built dynamically through JavaScript calls. Executing the JavaScript creates the DOM that is rendered. If that is the case for the site you are looking at, Python will only get the raw HTML response, not the rendered DOM. You need something that actually executes the JS to get the final DOM, for example SlimerJS.
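Another option along the same lines (not part of the original answer; Selenium is mentioned elsewhere on this page) is to drive a real browser and read the rendered DOM. A minimal sketch, assuming Selenium 4+ with a Chrome driver available:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://someurl")  # the placeholder URL from the question
html = driver.page_source      # the DOM after JavaScript has run
driver.quit()
print(html)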

Relogin to Scraped Website on Resuming a Scrapy Job

Is there a way to have a Scrapy spider log in to a website on resuming a previously paused scraping job?
EDIT: To clarify, my question is really about Scrapy spiders rather than cookies in general. Perhaps a better question is whether there's any method which is called when a Scrapy spider is revived after being frozen in a job directory.
Yes, you can.
You should be clearer about the exact workflow of your scraper.
Anyway, I assume you log in when you scrape for the first time, and want to reuse the same cookie when you resume the scraping.
You could use the httplib2 library to do something like this. Here is a code sample from their examples page; I have added comments for clarity.
from urllib.parse import urlencode

import httplib2

http = httplib2.Http()

url = 'http://www.example.com/login'
body = {'USERNAME': 'foo', 'PASSWORD': 'bar'}
headers = {'Content-type': 'application/x-www-form-urlencoded'}

# submit the form data to log in to the website
response, content = http.request(url, 'POST', headers=headers, body=urlencode(body))

# now the 'response' object contains the cookie the website sent,
# which can be used for visiting the website again

# set the cookie in the new headers
headers_2 = {'Cookie': response['set-cookie']}
url = 'http://www.example.com/home'

# use 'headers_2' to visit the website again
response, content = http.request(url, 'GET', headers=headers_2)
In case you are not clear on how cookies work, do a quick search. Put briefly, cookies are a client-side mechanism that helps servers maintain sessions.
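Since the question is specifically about Scrapy rather than raw HTTP, the standard Scrapy login pattern is worth sketching too. This is not from the original answer; it assumes start_requests runs again when a job is resumed from JOBDIR, and it uses the hypothetical URLs and field names from the httplib2 example above:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # log in first on every start, including a resume from JOBDIR;
        # dont_filter keeps this request from being dropped as a duplicate
        yield scrapy.FormRequest(
            'http://www.example.com/login',
            formdata={'USERNAME': 'foo', 'PASSWORD': 'bar'},
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        # the session cookie is now held by Scrapy's cookie middleware
        yield scrapy.Request('http://www.example.com/home', callback=self.parse)

    def parse(self, response):
        pass  # actual scraping goes here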
