Logging into website and scraping data - python

The website I am trying to log in to is https://realitysportsonline.com/RSOLanding.aspx. I can't seem to get the login to work, since the process is a little different from a typical site with a dedicated login page. I don't get any errors, but the login action doesn't work, which then causes the request for the main page to redirect to the homepage.
import requests

url = "https://realitysportsonline.com/RSOLanding.aspx"
main = "https://realitysportsonline.com/SetLineup_Contracts.aspx?leagueId=3000&viewingTeam=1"
data = {"username": "", "password": "", "vc_btn3 vc_btn3-size-md vc_btn3-shape-rounded vc_btn3-style-3d vc_btn3-color-danger": "Log In"}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Referer': 'https://realitysportsonline.com/RSOLanding.aspx',
    'Host': 'realitysportsonline.com',
    'Connection': 'keep-alive',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
}
s = requests.session()
s.get(url)
r = s.post(url, data, headers=header)
page = requests.get(main)

First of all, you create a session, but then (assuming your POST request worked) you request an authorised page without using the session you just created. You need to make that request with the s object, like so:
page = s.get(main)
However, there were also a few issues with your POST request: you were posting to the home page instead of the /Login route, and you were missing the Content-Type header.
import requests

url = "https://realitysportsonline.com/Services/AccountService.svc/Login"
main = "https://realitysportsonline.com/LeagueSetup.aspx?create=true"
payload = {"username": "", "password": ""}
headers = {
    'Content-Type': "text/json",
    'Cache-Control': "no-cache"
}
s = requests.session()
response = s.post(url, json=payload, headers=headers)
page = s.get(main)
P.S. your main request URL redirects to the homepage even with a valid session (at least for me).
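If you want to confirm whether the login actually took, one option (a sketch, reusing the s session and main URL from above) is to disable redirect-following and look at where the server tries to send you:

# Sketch: with redirects disabled, a bounce back to the landing page
# usually means the session isn't authenticated.
page = s.get(main, allow_redirects=False)
print(page.status_code)                    # 302 indicates a redirect
print(page.headers.get('Location'))        # where the server is sending you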

Related

403 error returned from python get requests, but auth works in postman?

I'm trying to return a GET request from an API using HTTPBasicAuth.
I've tested the following in Postman and received the correct response:
URL: "https://someapi.data.io"
username: "username"
password: "password"
This returns the data I expect, and all is well.
When I try this in Python, however, I get kicked back a 403 error, along with:
"error_type":"ACCESS DENIED","message":"Please confirm api-key, api-secret, and permission is correct."
Below is my code:
import requests
from requests.auth import HTTPBasicAuth

URL = 'https://someapi.data.io'
authBasic = HTTPBasicAuth(username='username', password='password')
r = requests.get(URL, auth=authBasic)
print(r)
I honestly can't tell why this isn't working, since the same username and password work in Postman using HTTPBasicAuth.
You have not passed all the required parameters; Postman adds them automatically for you. To make this work with Python requests, specify all the required headers yourself:
headers = {
    'Host': 'sub.example.com',
    'User-Agent': 'Chrome v22.2 Linux Ubuntu',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'X-Requested-With': 'XMLHttpRequest'
}
url = 'https://sub.example.com'
response = requests.get(url, headers=headers)
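If it still fails, it can help to inspect what requests actually sent and compare it against what Postman sends; a quick sketch using the objects defined above:

# Sketch: r.request is the PreparedRequest that was actually sent,
# including the Authorization header generated by HTTPBasicAuth.
r = requests.get(URL, auth=authBasic, headers=headers)
print(r.request.headers)
print(r.status_code, r.reason)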
It could also be that the User-Agent is not defined. Try the following:
import requests
from requests.auth import HTTPBasicAuth

URL = 'https://someapi.data.io'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
authBasic = HTTPBasicAuth(username='username', password='password')
r = requests.get(URL, auth=authBasic, headers=headers)
print(r)

Invalid URL when using Python Requests

I am trying to access the API that returns program data on this page as you scroll down and new tiles are loaded onto the screen. Looking in Chrome DevTools I found the API call being made and put together the following requests script:
import requests
import cloudscraper

session = requests.session()
url = 'https://ie.api.atom.nowtv.com/adapter-atlas/v3/query/node?slug=/entertainment/collections/all-entertainment&represent=(items[take=60](items(items[select_list=iceberg])))'
session.headers = {
    'Host': 'https://www.nowtv.com',
    'Connection': 'keep-alive',
    'Accept': 'application/json, text/javascript, */*',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.nowtv.com',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
}
scraper = cloudscraper.create_scraper(sess=session)
r = scraper.get(url)
data = r.content
print(data)
session.close()
This is returning the following only:
b'<HTML><HEAD>\n<TITLE>Invalid URL</TITLE>\n</HEAD><BODY>\n<H1>Invalid URL</H1>\nThe requested URL "[no URL]", is invalid.<p>\nReference #9.3c0f0317.1608324989.5902cff\n</BODY></HTML>\n'
I assume the issue is the part at the end of the URL that is in curly brackets. I am not sure however how to handle these in a Requests call. Can anyone provide the correct syntax?
Thanks
The issue is the Host session header value; don't set it. That should be enough, but I've done a few additional things as well:
add the X-* headers:
session.headers.update(**{
    'X-SkyOTT-Proposition': 'NOWTV',
    'X-SkyOTT-Language': 'en',
    'X-SkyOTT-Platform': 'PC',
    'X-SkyOTT-Territory': 'GB',
    'X-SkyOTT-Device': 'COMPUTER'
})
visit the main page first, without the XHR header set and with a broader Accept header value:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
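That priming request could look something like this (a sketch; the Accept value is the one quoted above):

# Sketch: visit the main page once so the session picks up cookies,
# with a browser-like Accept header and without X-Requested-With.
session.get('https://www.nowtv.com/', headers={
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
})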
I've also used params for the GET parameters - you don't have to, I think; it's just cleaner:
In [33]: url = 'https://ie.api.atom.nowtv.com/adapter-atlas/v3/query/node'
In [34]: response = session.get(url, params={
    ...:     'slug': '/entertainment/collections/all-entertainment',
    ...:     'represent': '(items[take=60,skip=2340](items(items[select_list=iceberg])))'
    ...: }, headers={
    ...:     'Accept': 'application/json, text/plain, */*',
    ...:     'X-Requested-With': 'XMLHttpRequest'
    ...: })
In [35]: response
Out[35]: <Response [200]>
In [36]: response.text
Out[36]: '{"links":{"self":"/adapter-atlas/v3/query/node/e5b0e516-2b84-11e9-b860-83982be1b6a6"},"id":"e5b0e516-2b84-11e9-b860-83982be1b6a6","type":"CATALOGUE/COLLECTION","segmentId":"","segmentName":"default","childTypes":{"next_items":{"nodeTypes":["ASSET/PROGRAMME","CATALOGUE/SERIES"],"count":68},"items":{"nodeTypes":["ASSET/PROGRAMME","CATALOGUE/SERIES"],"count":2376},"curation-config":{"nodeTypes":["CATALOGUE/CURATIONCONFIG"],"count":1}},"attributes":{"childNodeTyp
...
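Since the endpoint returns JSON, you could also parse the body directly instead of reading response.text; a small sketch using two keys visible in the output above:

data = response.json()
print(data['id'], data['type'])  # e.g. 'e5b0e516-...' 'CATALOGUE/COLLECTION'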

Python: requests post login not succeed

Problem:
I tried to log in with requests.post to get the extra information in the "preise" section, but soup.select("div[id*='preise']") always returns a result indicating that the login did not succeed.
import requests
from bs4 import BeautifulSoup

payload = {
    "username": "email",
    "password": "password",
    "rememberMe": True
}
with requests.Session() as s:
    p = s.post("https://www.xxx", data=payload)
    print(s.cookies)
    response = s.get("https://www.xxx/yy")
    soup = BeautifulSoup(response.content.decode("utf8"), "html.parser")
    print(soup.select("div[id*='preise']"))
The cookieJar looks like below:
<RequestsCookieJar[<Cookie msid=5d7cbf9f-43b2-43dc-9f6a-c766c2f4e805 for www.xxx>]>
I tried:
adding some headers, but I didn't understand exactly what I should add, so that didn't work
doubting whether the login URL is the real login URL (reference: Why is my login not working with Python Requests?), but I couldn't find any other possible URL
printing the cookie jar to check if the login succeeded, which does return a value; so I am really confused about why I can't get the tag correctly
I know that I could retrieve the correct result using selenium, but I still wonder if this issue can be solved with requests alone.
Tried the solution from Tobias P. G.
Tried with a personal cookie
A couple of steps made this work for me:
Copy the headers from the login POST on the website.
Copy the encrypted password from that same POST.
Both are done by opening the developer tools and copying the information from the POST request found under Network. If you are unsure how to do that, please ask and I will provide a brief introduction.
import requests

s = requests.Session()
headers = {
    'Host': 'www.xxx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Accept': '*/*',
    'Accept-Language': 'da,en;q=0.7,en-US;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'application/x-www-form-urlencoded;',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Length': '70',
    'Origin': 'https://www.xxx',
    'Connection': 'keep-alive',
    'Referer': 'https://www.xxx/yy',
    'Cookie': 'cookie',  # replace 'cookie' with those from the manual post request
}
payload = {
    "username": "email",
    "password": "password",  # copy the encrypted version from the manual post request
    "rememberMe": "true"
}
url = 'https://www.xxx/yy'
r = s.post(url, headers=headers, data=payload)
print(r.reason)
p = s.get('https://www.xxx/yy')
print(p.content)
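To tie this back to the original goal, you could re-run the soup.select check from the question against the authenticated response (a sketch, assuming bs4 is installed):

from bs4 import BeautifulSoup

# Sketch: if the login worked, the "preise" divs should now be present.
soup = BeautifulSoup(p.content, "html.parser")
print(soup.select("div[id*='preise']"))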

Scrape XML response from ajax request with python

I'm trying to get the data that gets loaded into the chart on this page when hitting the max (time range) button. The data is loaded with an ajax request.
I inspected the request and tried to reproduce it with the requests Python library, but I'm only able to retrieve the 1-year data from the chart.
Here is the code I used:
import requests

r = requests.get("https://www.justetf.com/en/etf-profile.html?0-4.0-tabs-panel-chart-dates-ptl_max&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart&_=1576272593482")
r.content
I also tried to use Session:
from requests import Session

session = Session()
session.head('http://justetf.com')
response = session.get(
    url='https://www.justetf.com/en/etf-profile.html?0-4.0-tabs-panel-chart-dates-ptl_max&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart&_=1575929227619',
    data={
        "0-4.0-tabs-panel-chart-dates-ptl_max": "",
        "groupField": "none",
        "sortField": "ter",
        "sortOrder": "asc",
        "from": "search",
        "isin": "IE00B3VWN518",
        "tab": "chart",
        "_": "1575929227619"
    },
    headers={
        'Host': 'www.justetf.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
        'Accept': 'application/xml, text/xml, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Wicket-Ajax': 'true',
        'Wicket-Ajax-BaseURL': 'en/etf-profile.html?0&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart',
        'Wicket-FocusedElementId': 'id28',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive',
        'Referer': 'https://www.justetf.com/en/etf-profile.html?groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart',
        'Cookie': 'locale_=en; _ga=GA1.2.1297456970.1574289342; cookieconsent_status=dismiss; AWSALB=QMWHJxgfcpLXJLqX0i0FgBuLn+mpVHVeLRQ6upH338LdggA4/thXHT2vVWQX7pdBd1r486usZXgpAF8RpDsGJNtf6ei8e5NHTsg0hzVHR9C+Fj89AWuQ7ue+fzV2; JSESSIONID=ABB2A35B91751CA9B2D293F5A04505BE; _gid=GA1.2.1029531470.1575928527; _gat=1',
        'TE': 'Trailer'
    },
    cookies={
        "_ga": "GA1.2.1297456970.1574289342",
        "_gid": "GA1.2.1411779365.1574289342",
        "AWSALB": "5v+tPMgooQC0deJBlEGl2wVeUSmwVGJdydie1D6dAZSRAK5eBsmg+DQCdBj8t25YRytC5NIi0TbU3PmDcNMjiyFPTp1xKHgwNjZcDvMRePZjTxthds5DsvelzE2I",
        "JSESSIONID": "310F346AED94D1A345207A3489DCF83D",
        "locale_": "en"
    }
)
but I get this response
<ajax-response><redirect><![CDATA[/en/etf-profile.html?0&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart]]></redirect></ajax-response>
Why am I not getting the same XML response that I get in my browser when I hit MAX?
Okay, below is my solution for obtaining the data you seek:
url = "https://www.justetf.com/en/etf-profile.html"
querystring = {
# Modify this string to get the timeline you want
# Currently it is set to "max" as you can see
"0-1.0-tabs-panel-chart-dates-ptl_max":"",
"groupField":"none",
"sortField":"ter",
"sortOrder":"asc",
"from":"search",
"isin":"IE00B3VWN518",
"tab":"chart",
"_":"1576627890798"}
# Not all of these headers may be necessary
headers = {
'authority': "www.justetf.com",
'accept': "application/xml, text/xml, */*; q=0.01",
'x-requested-with': "XMLHttpRequest",
'wicket-ajax-baseurl': "en/etf-profile.html?0&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart",
'wicket-ajax': "true",
'wicket-focusedelementid': "id27",
'Connection': "keep-alive",
}
session = requests.Session()
# The first request won't return what we want but it sets the cookies
response = session.get( url, params=querystring)
# Cookies have been set now we can make the 2nd request and get the data we want
response = session.get( url, headers=headers, params=querystring)
print(response.text)
As a bonus, I have included a link to a repl.it where I actually parse the data and get each individual data point. You can find this here.
Let me know if that helps!
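The repl.it code isn't reproduced here, but as a rough sketch (assuming the response has the same <ajax-response> shape shown earlier in this question), the chart payload can be pulled out of the CDATA sections with a regex:

import re

# Sketch: extract the CDATA payloads embedded in the Wicket <ajax-response>.
cdata_blocks = re.findall(r'<!\[CDATA\[(.*?)\]\]>', response.text, re.DOTALL)
for block in cdata_blocks:
    print(block[:200])  # inspect the start of each payload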

Using Requests Post to login to this site not working

I know there are tons of threads and videos on how to do this; I've gone through them all and need a little advanced guidance.
I am trying to log into a webpage where I have an account so I can send a request to download a report.
First I send a GET request to the login page, then I send the POST request, but when I print(resp.content) I get back the code for the login page. I do get a [200] status, but I can't get to the index page: no matter what page I request after the POST, it keeps redirecting me back to the login page.
Here are a couple of things I'm not sure I did correctly:
For the header I just put everything that was listed when I inspected the page.
I'm not sure if I need to do something with the cookies?
Below is my code:
import requests
import urllib.parse

url = 'https://myurl.com/login.php'
next_url = 'https://myurl.com/index.php'
username = 'myuser'
password = 'mypw'
headers = {
    'Host': 'url.myurl.com',
    'Connection': 'keep-alive',
    'Content-Length': '127',
    'Cache-Control': 'max-age=0',
    'Origin': 'https://url.myurl.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Referer': 'https://url.myurl.com/login.php?redirect=1',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cookie': 'PHPSESSID=3rgtou3h0tpjfts77kuho4nnm3'
}
login_payload = {
    'XXX_login_name': username,
    'XXX_login_password': password,
}
login_payload = urllib.parse.urlencode(login_payload)
r = requests.Session()
r.get(url, headers=headers)
r.post(url, headers=headers, data=login_payload)
resp = r.get(next_url, headers=headers)
print(resp.content)
You don't need to send separate requests for authorization and for the file download; you can send a single POST with the credentials specified. Also, in most cases you don't need to send headers. In general your code should look like the following:
import requests
from requests.auth import HTTPBasicAuth

url_to_download = "http://some_site/download?id=100500"
response = requests.post(url_to_download, auth=HTTPBasicAuth('your_login', 'your_password'))
# response.content is bytes, so the file must be opened in binary mode
with open('C:\\path\\to\\save\\file', 'wb') as my_file:
    my_file.write(response.content)
There are a few more fields in the form data to post:
import requests

data = {"redirect": "1",
        "XXX_login_name": "your_username",
        "XXX_login_password": "your_password",
        "XXX_actionSUBMITLOGIN": "Login",
        "XXX_login_php": "1"}

with requests.Session() as s:
    s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"})
    r1 = s.get("https://eym.sicomasp.com/login.php")
    s.headers["cookie"] = r1.headers["Set-Cookie"]
    pst = s.post("https://eym.sicomasp.com/login.php", data=data)
    print(pst.history)
You may get redirected to index.php automatically after the POST; you can check pst.history and pst.content to see exactly what is happening.
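A quick way to see that redirect chain (a sketch, reusing the pst response from above):

# Sketch: each entry in .history is an intermediate (redirect) response.
for resp in pst.history:
    print(resp.status_code, resp.url)
print(pst.status_code, pst.url)  # the final destination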
So I figured out what my problem was, just in case anyone in the future has the same issue. I'm sure different websites have different requirements, but in this case the Cookie I was sending in the request header was blocking it. What I did was grab my cookie from the headers AFTER I logged in, update my headers, and then send the request. This is what ended up working (note also that the form data needs to be URL-encoded):
import requests
import urllib.parse

headers = {
    'Host': 'eym.sicomasp.com',
    'Content-Length': '62',
    'Origin': 'https://eym.sicomasp.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Referer': 'https://eym.sicomasp.com/login.php?redirect=1',
    'Cookie': 'PHPSESSID=vdn4er761ash4sb765ud7jakl0; SICOMUSER=31+147234553'
}  # Additional cookie information after logging in ^^^^
data = {
    'XXX_login_name': 'myuser',
    'XXX_login_password': 'mypw',
}
data = urllib.parse.urlencode(data)
with requests.Session() as s:
    s.headers.update(headers)
    resp = s.post('https://eym.sicomasp.com/index.php', data=data)
    print(resp.content)
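As a design note: instead of hard-coding the Cookie header, you can let a requests.Session store the cookies the server sets on login automatically; a sketch reusing the names from the code above:

# Sketch: let the session capture PHPSESSID/SICOMUSER from Set-Cookie itself.
with requests.Session() as s:
    s.headers.update({'User-Agent': headers['User-Agent']})
    s.post('https://eym.sicomasp.com/login.php', data=data)
    print(s.cookies.get_dict())  # cookies captured from the login response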
