I am trying to monitor a page for any updates. However, I need to keep the same session and cookies so I can't just send a whole new request.
How can I check for updates to the HTML within my current session? The page won't simply be updated in place; I will be redirected, but the URL remains the same.
Here is my current code:
import requests
url = 'xxx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
response = requests.get(url, headers=headers, allow_redirects=True)  # requests.get() has no 'config' parameter; connection keep-alive comes from using a Session
def get_status():
    html = response.text  # this should be the current HTML, not the HTML when I made the initial request
    if x in html:  # x stands in for the string being checked for
        status = "exists"
    else:
        status = "null"
    return status
print(get_status())
EDIT: I will be using a while loop to run this function every 5 seconds to check whether the status == "exists".
EDIT2: I tried to implement it via requests_html but I am not getting as many cookies as I should be:
import requests_html
from requests_html import HTMLSession
session = HTMLSession()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'})
r = session.get('x')
r.html.render(reload=False)
print(r.cookies.get_dict())
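One thing to check: r.cookies only contains the cookies set by that single HTTP response, while the session accumulates cookies across all requests and redirects. Also, cookies set by JavaScript during render() may not be copied back into the requests cookie jar. A quick comparison, reusing the session from above:

print(session.cookies.get_dict())  # the session's full jar, accumulated across requests
print(r.cookies.get_dict())        # only the cookies set by this one response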
However, I need to keep the same session and cookies so I can't just send a whole new request.
What you want to do here is open a session using
s = requests.Session()
response = s.get("http://www.google.com")
This will make sure cookies and certain other things persist across requests. See the documentation on Sessions for further details.
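For instance, a minimal demonstration that the cookie jar persists across requests (httpbin.org is just a stand-in endpoint):

import requests

s = requests.Session()
s.get("https://httpbin.org/cookies/set/example/1")  # the server sets a cookie
print(s.cookies.get_dict())                         # {'example': '1'} - kept on the session
r = s.get("https://httpbin.org/cookies")            # sent automatically on the next request
print(r.text)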
Since you simply want to check whether the returned HTML is exactly the same as in the previous request, save the first response.text outside your function and check whether the new response.text equals the one saved earlier.
If the website displays any content dynamically, this will of course not do the trick, but if you can check for a specific element in the DOM and compare it to the one from the previous request, this will work just fine.
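Putting that together with the 5-second loop from your edit, here is a minimal sketch of the whole idea; the url and the marker string are placeholders you would fill in:

import time
import requests

url = 'xxx'      # placeholder
marker = 'xxx'   # the string/element you are watching for

s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})

baseline = s.get(url, allow_redirects=True).text  # HTML from the first request

while True:
    current = s.get(url, allow_redirects=True).text  # re-fetch on the same session, cookies included
    if marker in current:
        print("exists")
        break
    if current != baseline:
        print("page changed")
    time.sleep(5)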
I found the related question How would I log into Instagram using BeautifulSoup4 and Requests, and how would I determine it on my own?, but its code
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime
link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
time = int(datetime.now().timestamp())
payload = {
    'username': 'login',
    'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:your_password',
    'queryParams': {},
    'optIntoOneTap': 'false'
}

with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
gives me an error with the csrftoken:
line 21, in <module>
csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
IndexError: list index out of range
and other posts on Stack Overflow don't work for me. I don't want to use Selenium.
TL;DR
Add a user-agent to your get request header on line 20:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Long answer
If we look at the error message you posted, we can start to dissect what's gone wrong. Line 21 is attempting to find a csrf_token attribute on the instagram login page.
Diagnostics
We can see from the error message that the list index is out of range, which in this case means that the list returned by re.findall (docs) is empty. This means that either:
your regex is wrong,
the html returned by your get request (docs) r = s.get(link) on line 20 doesn't contain a csrf_token attribute, or
the attribute doesn't exist in the source html at all.
If we visit the page and look at its html source, we can see that a csrf_token attribute is indeed present on line 261:
<script type="text/javascript">window._sharedData = {"config":{"csrf_token":"TOKEN HERE","viewer":null,"viewerId":null}}</script>
Note, I have excluded the rest of the code for brevity.
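As a sanity check on the first possibility, you can run the regex against that snippet directly; it extracts the token just fine, which points the finger at the returned html instead:

import re

sample = '<script type="text/javascript">window._sharedData = {"config":{"csrf_token":"TOKEN HERE","viewer":null,"viewerId":null}}</script>'
print(re.findall(r"csrf_token\":\"(.*?)\"", sample))  # ['TOKEN HERE']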
Now that we know it's present on the page, we can write the scraped html that you're receiving via your get request to a local file and inspect it:
r = s.get(link)
with open("csrf.html", "w") as f:
    f.write(r.text)
If you open that file and do a Ctrl+f for csrf_token, it's not present. This likely means that Instagram detected that you're accessing the page via a scraper and returned a modified version of the page.
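Equivalently, a quick programmatic check on the same response:

print("csrf_token" in r.text)  # False when the bot-flagged page is served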
The fix
In order to fix this, you need to add a user-agent to your request header, which essentially 'tricks' the page into thinking you're accessing it via a browser. This can be done by changing:
r = s.get(link)
to something like this:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Note, this is a random user agent from here.
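As a side note, you can set the header once on the session so every subsequent request sends it automatically:

s.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
r = s.get(link)  # the session header is applied here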
Notes
I appreciate that you don't want to use Selenium for your task, but you might find that the more dynamic the interactions you need, the harder they are to achieve with static scraping libraries like the requests module. Here are some good resources for learning Selenium in Python:
Selenium docs
Python Selenium Tutorial #1 - Web Scraping, Bots & Testing
I'm trying to make a web scraper with Python. I made it with Selenium, but it is really slow. Then I saw that I could speed up the project because a button makes a POST request.
import requests
from bs4 import BeautifulSoup
url = "http://vidtome.host/tnoz00am9j8p"
myobj = {
    'op': 'download1',
    'code': 'tnoz00am9j8p',
    'hash': 'the hash',
    'imhuman': 'Proceed to video'
}
x = requests.post(url, data=myobj)
print(x.text)
That's the code, and it works, but only the first time.
When I ran it the first time, it showed no error and printed out the page with the right changes, but when I ran it later it gave no error yet printed the page with no changes, as if it hadn't done anything.
How is that possible?
Requests is faster, but you cannot extract dynamically rendered content. However, this is probably not the issue here.
The problem is that you no longer have access to the website.
If it is a basic human-checking system, you could try adding a user agent to your request:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.68',
}
r = requests.get(url, headers=headers)
If that does not work, I would recommend looking into the data that you are passing. Maybe the server validates it and it contains expired values, the 'hash' field, for example.
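If the hash is the culprit, one possible approach is to GET the page first on a session and scrape a fresh value out of the form before posting. This is only a sketch: it assumes the page exposes the hash as a hidden input named 'hash', which is unverified.

import requests
from bs4 import BeautifulSoup

url = "http://vidtome.host/tnoz00am9j8p"

with requests.Session() as s:
    page = s.get(url)  # load the page first, like a browser would
    soup = BeautifulSoup(page.text, "html.parser")
    hash_field = soup.find("input", {"name": "hash"})  # hypothetical hidden form field
    myobj = {
        'op': 'download1',
        'code': 'tnoz00am9j8p',
        'hash': hash_field["value"] if hash_field else 'the hash',  # use the fresh value when present
        'imhuman': 'Proceed to video'
    }
    x = s.post(url, data=myobj)
    print(x.text)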
I'm trying to log into a website using Python requests. Unfortunately, it always shows this error when I print the response content:
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
For reference, my code:
from requests import Session
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test@phpzag.com','password':'test'}
s = requests.Session()
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
t=s.post(LOGIN_URL, data=payload, headers=user_agent)
r=s.get('https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',headers=user_agent,cookies=t.cookies.get_dict())
print(r.content)
May I know what is missing, and how I can get the HTML code of the welcome page from this?
UPDATE
I'm trying to make an API call after login authentication, but I'm not able to get the login authentication to succeed, hence I am not able to get the response of the API call. My guess is that it is failing due to multi-factor authentication. I need to know how I can implement this.
For example: www.abc.com is the URL of the website. The login is done through a JS form submission, hence the URL specified in the ajax part. On success, a third authentication party (Okta) also verifies the credentials and finally reaches the home page. Then I need to call the real API for my task.
But it is not working.
import requests
import sys

class Login:
    def sendRequestWithAuthentication(self, loginDetails, requestDetails):
        user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
        action_url = loginDetails.get('action_url')
        pay_load = loginDetails.get('payload')
        session_requests = requests.session()
        if action_url and pay_load:
            act_resp = session_requests.post(action_url, data=pay_load, headers=user_agent, verify=False, files=[])
            print(act_resp)
            auth_cookies = act_resp.cookies.get_dict()
            url, method, request_payload = requestDetails['url'], requestDetails['method'], requestDetails['payload']
            querystring = requestDetails['querystring']
            response = session_requests.get(url, headers=user_agent, cookies=auth_cookies, data=request_payload, params=querystring)
            print(response)
            return response.json()
In the above, the action URL is the API given in the ajax part, and in the second request the URL is the API address for the GET.
In short, may I know how I can implement multi-factor authentication with Python requests?
My doubts:
Do we need the cookies from the login-form page included in the login request?
How do I implement multi-factor authentication with Python requests? (Here we don't need a PIN or anything; it is done through RSA.) Is there any need of a certificate for login, as it now raises 'unable to validate the SSL certificate'?
Can you give a dummy example API that implements such a scenario?
No, you are making it too complex. This code worked:
import requests

login_url = "https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php"
welcome_url = "https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php"

payload = 'user_email=test@phpzag.com&password=test&login_button='
login_headers = {
    'x-requested-with': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded',  # it's urlencoded instead of form-data
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
}

s = requests.Session()
login = s.post(login_url, headers=login_headers, data=payload)  # POST the login form
welcome = s.get(welcome_url, headers=login_headers)             # session cookies carry the auth
print(welcome.text)
Result:
.....Hello, <br><br>Welcome to the members page.<br><br>
TL;DR
Change the part of your code that says data=payload to json=payload, and it should work.
Direct answer to your question
How [does one] implement [an] AJAX request using Python Requests?
You cannot do that. An AJAX request is specifically referring to a Javascript-based HTTP request. To quote from W3 school's AJAX introduction page, "AJAX = Asynchronous JavaScript And XML".
Indirect answer to your question
What I believe you're asking is how to perform auth/login HTTP requests using the popular Python package, requests. The short answer, unfortunately, and like most things, is that it depends. Various auth pages handle auth requests differently, so you might have to do different things to authenticate against a specific web service.
Based on your code
Based on your code, I'm going to assume the login page is looking for a POST request with the authentication details (e.g. credentials) in the form of a JSON object, and that the server's 406 response means you're sending data with an accept header that doesn't align with how the server wants to respond.
When using requests, the data parameter sends the data "raw": binary content goes as-is, and a dictionary is translated to standard HTML form data (e.g. key1=value1&key2=value2&key3=value3, which has the MIME type application/x-www-form-urlencoded; this is what requests sends for a dict when no content type has been specified). My educated guess, based on the fact that you put your credentials into a dictionary, is that the login form expects a POST request with a JSON-formatted body (most modern web apps do this), and that you were under the impression that setting the data parameter would produce a JSON object. This is a common gotcha/misconception with requests that has bitten me before. What you want instead is to pass the data using the json parameter.
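You can see the difference without hitting any real server by preparing both variants and inspecting what would go over the wire (httpbin.org is just a placeholder URL):

import requests

form_req = requests.Request('POST', 'https://httpbin.org/post', data={'a': 1}).prepare()
json_req = requests.Request('POST', 'https://httpbin.org/post', json={'a': 1}).prepare()

print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # a=1
print(json_req.headers['Content-Type'])  # application/json
print(json_req.body)                     # {"a": 1}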
Your code:
from requests import Session
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test@phpzag.com','password':'test'}
s = requests.Session()
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
t=s.post(LOGIN_URL, data=payload, headers=user_agent)
r=s.get('https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',headers=user_agent,cookies=t.cookies.get_dict())
print(r.content)
Fixed (and cleaned up) code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Test script to login to php web app.
"""
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {
    'user_email': 'test@phpzag.com',
    'password': 'test'
}
headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
session = requests.Session()
auth_response = session.post(
    url=LOGIN_URL,
    json=payload,  # <--- THIS IS THE IMPORTANT BIT. Note: data param changed to json param
    headers=headers
)
response = session.get(
    'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',
    headers=headers,
    cookies=auth_response.cookies.get_dict()  # TODO: not sure this is necessary, since the session object maintains cookies/session data across requests
)
print(response.content)
Check out this section of the requests documentation on POST requests; if you scroll down a bit from there, you'll see the docs discuss the GitHub API, which expects JSON, and how to handle that.
Auth can be tricky overall. Sometimes services want "basic auth", which requests expects you to pass as a tuple to the auth parameter; sometimes they want a bearer token / OAuth flow, which can get headache-inducingly complicated and annoying.
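For reference, a minimal sketch of both styles; the httpbin.org endpoints and the token are stand-ins:

import requests

# Basic auth: requests builds the Authorization header from the tuple.
r = requests.get('https://httpbin.org/basic-auth/user/pass', auth=('user', 'pass'))
print(r.status_code)  # 200 when the credentials match

# Bearer-token style: you usually set the Authorization header yourself.
r = requests.get('https://httpbin.org/bearer', headers={'Authorization': 'Bearer <token>'})
print(r.status_code)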
Hope this helps!
You are missing the user agent that the server (Apache?) requires.
Try this:
import requests
from requests import Session
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test@phpzag.com','password':'test'}
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
s = requests.Session()
x=s.get(URL, headers=user_agent)
x=s.post(LOGIN_URL, data=payload, headers=user_agent)
print(x.content)
print(x.status_code)
Take a look at Requests: Basic Authentication
import requests
requests.post(URL, auth=('user', 'pass'))
# If there are some cookies you need to send
cookies = dict(cookies_are='working')
requests.post(URL, auth=('user', 'pass'), cookies=cookies)
As the title above states, I am getting a 403 error. The URLs generated are valid; I can print them and then open them in my browser just fine.
I've got a user agent; it's the exact same one my browser sends when accessing the page I want to scrape, pulled straight from Chrome DevTools. I've tried using sessions instead of a straight request, I've tried using urllib, and I've tried using a generic requests.get.
Here's the code I'm using that 403s. Same result with requests.get etc.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}
session = requests.Session()
req = session.get(URL, headers=headers)  # URL is built elsewhere in the script
So yeah, I assume I'm not creating the user agent right, so the site can tell I am scraping. But I'm not sure what I'm missing, or how to find that out.
I got all the headers from DevTools and started removing them one by one. I found that only Accept-Language is needed: it doesn't need User-Agent, and it doesn't need a Session.
import requests
url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'
headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}
r = requests.get(url, headers=headers)
data = r.json()
print(data['docs'][0]['name'])
Result:
The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
I have taken a look at the login data on the forum I'm trying to log in to, but it still leaves me confused as to what information needs to be passed along through POST. Is it the html id fields? name fields? type?
Also, when you successfully log on, it shows a "Login Successful!" quick redirection page, then redirects you to index.php. The result I'm getting is that of index.php, but not as a logged-in user. I believe I'm passing the cookies through correctly; I just believe the wrong 'data' is being passed through at login.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"}
payload = {"action": "login",
           "username": "myusername",
           "password": "**********"}

with requests.Session() as s:
    s0 = s.post("http://minewind.com/forums/ucp.php?mode=login", data=payload, headers=headers)
    print(s.cookies)
    s1 = s.get("http://minewind.com/forums/index.php", cookies=s0.cookies, headers=headers)
    perty = BeautifulSoup(s1.content, "html.parser")  # explicit parser avoids the bs4 warning
    print(perty.prettify())  # prettify() returns the formatted HTML as a string
    for links in perty.find_all('a'):
        print(links.get('href'))
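For what it's worth, the fields a form posts are keyed by each input's name attribute, not its id or type. One way to see exactly what the login form expects is to scrape the input names off the form itself; a sketch, assuming the first form on the page is the login form:

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    login_page = s.get("http://minewind.com/forums/ucp.php?mode=login")
    form = BeautifulSoup(login_page.content, "html.parser").find("form")
    # Each (name, value) pair here is a candidate key for the POST payload.
    for field in form.find_all("input"):
        print(field.get("name"), "=", field.get("value"))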