Google scholar scraping with ip-rotator through AWS ApiGateway

I am receiving the error below. The code (George's method, https://stackoverflow.com/users/7173479/george) worked a couple of times at first and then crashed. I suspect the problem is in the HTTP configuration, but I am lost in the AWS documentation. I am working in a Jupyter notebook. Could anybody help?
import requests
from bs4 import BeautifulSoup
from requests_ip_rotator import ApiGateway

# Create gateway object and initialise in AWS
engine = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q={}&btnG='
gateway = ApiGateway(engine,
                     access_key_id="KEY", access_key_secret="SECRET_KEY")
gateway.start()

# Assign gateway to session
session = requests.Session()
session.mount(engine, gateway)

# Send request (IP will be randomised)
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
          'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
search_string = '{}+and+{}+and+{}+and+{}'.format('term1', 'term2', 'term3', 'term4')
url = engine.format(search_string)
print(url)
response = session.get(url, headers=header)
tree = BeautifulSoup(response.content, 'lxml')
result = tree.find('div', id='gs_ab_md')
print(response.status_code)
print(result.text)
print(len(result.text))
# Keep only the tokens that are numbers once '.' and ',' separators are removed
number = [int(s.replace('.', '').replace(',', '')) for s in result.text.split()
          if s.replace('.', '').replace(',', '').isdigit()]

# Delete gateways
gateway.shutdown()
=====================================
BadRequestException: An error occurred (BadRequestException) when calling the PutIntegration operation: Invalid HTTP endpoint specified for URI

The site parameter of the ApiGateway constructor in the requests-ip-rotator package must be just the site: the protocol, the domain name or IP address, and optionally the port. It cannot contain any other part of the URI, such as a path or query string.
If you change your constructor to something like this:
gateway = ApiGateway("https://scholar.google.com")
gateway.start()
It will construct the gateway endpoint correctly.
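A minimal sketch of how the pieces could fit together, assuming the search URL template from the question. The site root that goes to ApiGateway and session.mount is derived from the full URL, while the full URL is still used for the request itself. The names site_root and fetch_via_gateway are hypothetical helpers, not part of requests-ip-rotator, and fetch_via_gateway assumes AWS credentials are already configured.

```python
from urllib.parse import urlsplit


def site_root(url):
    """Return only the scheme and host[:port] of a URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def fetch_via_gateway(url, headers=None):
    """Hypothetical helper: fetch `url` through a gateway bound to its site root."""
    # Imported lazily so site_root() can be used without these packages.
    import requests
    from requests_ip_rotator import ApiGateway

    site = site_root(url)
    gateway = ApiGateway(site)  # assumption: AWS credentials are configured
    gateway.start()
    try:
        session = requests.Session()
        session.mount(site, gateway)  # mount on the site, not the full URL
        return session.get(url, headers=headers)
    finally:
        gateway.shutdown()  # always delete the gateways again


engine = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q={}&btnG='
print(site_root(engine))  # https://scholar.google.com
```

Only the last two lines run when the block is executed; fetch_via_gateway creates real AWS resources and is meant to be called deliberately.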

Related

Blob URLs with python requests

I'm trying to figure out why the request I send gives me this error:
requests.exceptions.InvalidSchema: No connection adapters were found for 'blob:https://192.168.56.108/7020557a-95f0-4560-a3d4-94e23bc3db4a'
In another thread, I read that this error is caused by a missing https, but my URL does contain it. Here is the code I wrote to send the request:
import requests

s = requests.Session()
url_image = 'blob:https://192.168.56.108/7020557a-95f0-4560-a3d4-94e23bc3db4a'
headers = {'Origin': 'https://192.168.56.108',
           'Referer': '',
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.27'}
response = s.get(url_image, headers=headers, stream=True, verify=False)
print(response.url)
I also read in another thread that blob URLs are generated by the browser once the page is loaded. So I tried sending a GET request to the page I would usually download from first, and then sending the request for the blob URL, but it still doesn't work. I suspect this is because the blob URL I used was not the one associated with the page I loaded (a new one would have been generated).
For a bit more context: I load a page with a graphic that I can download. To check what happens, I use the browser's network console. Each time I click to download the graphic, a GET request is made to a blob URL, and that URL changes with every download.
So my question is: how can I get the correct URL with Python requests, and why do I get the error above when sending the request to the blob URL?
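One way to see where the error comes from: a requests session picks a connection adapter by URL prefix, and by default it only has adapters for http:// and https://. The sketch below uses only the standard library to show that the URL from the question has the scheme blob, so neither prefix matches. A blob: URL refers to an object held in memory by the browser page that created it, so a separate HTTP client has nothing to fetch at that address.

```python
from urllib.parse import urlsplit

url_image = 'blob:https://192.168.56.108/7020557a-95f0-4560-a3d4-94e23bc3db4a'

parts = urlsplit(url_image)
print(parts.scheme)  # blob

# Prefixes for which a default requests.Session has adapters mounted.
default_prefixes = ('http://', 'https://')
has_adapter = url_image.lower().startswith(default_prefixes)
print(has_adapter)  # False
```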

Python: How to get URL when running a HTTP proxy server?

I am creating an HTTP Proxy Server that is able to retrieve the URL of the website requested by a user. I am only allowed to use a single file for my HTTP Proxy Server (I can't have multiple files).
Inside an infinite while loop, I am able to detect a connection and its address and receive a message from the client:
while True:
    conn, addr = created_socket.accept()
    data_received = conn.recv(1024)
    print(data_received)
When I run my server on a specified port and type the [IP Address]:[Port Number] into Chrome, I get the following result after printing data_received:
b'GET /www.google.com HTTP/1.1\r\nHost: 192.168.1.2:5050\r\nConnection: keep-alive\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: en-US,en;q=0.9\r\n\r\n'
Is there a systematic way to retrieve the URL (in this case, www.google.com)? Right now, I am hard-coding a constant buffer size (1024) for conn.recv. However, I was wondering whether there is a way to first retrieve the size of the client's message, store it in a variable, and then pass that variable as the buffer size parameter for recv.
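A minimal sketch of one possible approach, under two assumptions: the requested site always arrives as the path of the request line (as in the output above), and only the headers are needed. An HTTP request does not announce its total size up front, so instead of asking for the size the sketch keeps calling recv until it sees the blank line that ends the headers. The names recv_headers and requested_target are hypothetical.

```python
def recv_headers(conn, chunk_size=1024):
    """Read from a socket until the blank line that ends the HTTP headers."""
    data = b''
    while b'\r\n\r\n' not in data:
        chunk = conn.recv(chunk_size)
        if not chunk:  # client closed the connection
            break
        data += chunk
    return data


def requested_target(raw_request):
    """Return the request target from the first line, without the leading '/'."""
    request_line = raw_request.split(b'\r\n', 1)[0].decode('iso-8859-1')
    method, target, version = request_line.split(' ', 2)
    return target.lstrip('/')


sample = b'GET /www.google.com HTTP/1.1\r\nHost: 192.168.1.2:5050\r\n\r\n'
print(requested_target(sample))  # www.google.com
```

In the loop from the question, data_received = recv_headers(conn) would replace the single conn.recv(1024) call. A malformed request line raises ValueError here, which a real server would need to handle.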

Web Scraping TooManyRedirects: Exceeded 30 redirects. requests_ip_rotator

import requests
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

if __name__ == "__main__":
    # Create gateway object and initialise in AWS
    gateway = ApiGateway("https://spare.avspart.com", regions=EXTRA_REGIONS, access_key_id='my key', access_key_secret='my secret key')
    gateway.start(force=True)
    # Execute from random IP
    session = requests.Session()
    # session.max_redirects = 100
    session.mount("https://spare.avspart.com", gateway)
    # setting User-Agent header
    session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
    response = session.get("https://spare.avspart.com/catalog/case/64848/4534337/677993/")
    print(response.status_code)
    # Delete gateways
    gateway.shutdown()
I am trying to scrape the page "https://spare.avspart.com/catalog/case/64848/4534337/677993/" using requests-ip-rotator because I was blocked when using requests.get(), but when I try to access it I get the error TooManyRedirects: Exceeded 30 redirects.
I have read through most of the posts on this problem and tried various things, such as changing session.max_redirects and trying different types of headers, and I have also reached out to the library's creator. The accepted answer to a question about the same issue seems to solve the problem, but when I implement it in my code the issue persists.
It would be great if anyone has any recommendations for other things I can try.
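One thing that could be tried is following the redirects by hand to see which URLs the loop actually bounces between. The sketch below relies on the allow_redirects=False parameter of requests; trace_redirects is a hypothetical helper, and it only diagnoses the loop, it does not fix it.

```python
from urllib.parse import urljoin


def trace_redirects(session, url, limit=10, **kwargs):
    """Follow redirects by hand and return a list of (status_code, url) hops."""
    hops = []
    for _ in range(limit):
        response = session.get(url, allow_redirects=False, **kwargs)
        hops.append((response.status_code, url))
        location = response.headers.get('Location')
        if response.status_code not in (301, 302, 303, 307, 308) or not location:
            break
        url = urljoin(url, location)  # Location may be a relative URL
    return hops


# Usage with the session from the question (needs network access):
# for status, hop in trace_redirects(
#         session, "https://spare.avspart.com/catalog/case/64848/4534337/677993/"):
#     print(status, hop)
```

If the same URL shows up repeatedly in the output, the server is most likely waiting for something the request does not carry, such as a cookie or a particular header.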

How to implement ajax request using Python Request

I'm trying to log into a website using Python requests. Unfortunately, it always shows this error when I print the response content:
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
For reference, my code:
from requests import Session
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test#phpzag.com','password':'test'}
s = requests.Session()
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
t=s.post(LOGIN_URL, data=payload, headers=user_agent)
r=s.get('https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',headers=user_agent,cookies=t.cookies.get_dict())
print(r.content)
May I know what is missing, and how can I get the HTML of the welcome page?
UPDATE
I'm trying to make an API call after login authentication. However, the login authentication does not succeed, so I cannot get the response of the API call. I think it fails because of multi-factor authentication. I need to know how I can implement this.
For example, say www.abc.com is the URL of the website. The login is done through a JS form submission, so the URL is specified in the AJAX part. When that succeeds, a third-party authentication service (Okta) also verifies the credentials, and finally the home page is reached. Then I need to call the real API for my task.
But it is not working.
import requests
import sys

class Login:
    def sendRequestWithAuthentication(self, loginDetails, requestDetails):
        user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
        action_url = loginDetails['action_url'] if 'action_url' in loginDetails.keys() else None
        pay_load = loginDetails['payload'] if 'payload' in loginDetails.keys() else None
        session_requests = requests.session()
        auth_cookies = {}  # stays empty if no login request is made
        if action_url and pay_load:
            act_resp = session_requests.post(action_url, data=pay_load, headers=user_agent, verify=False, files=[])
            print(act_resp)
            auth_cookies = act_resp.cookies.get_dict()
        url, method, request_payload = requestDetails['url'], requestDetails['method'], requestDetails['payload']
        querystring = requestDetails['querystring']
        response = session_requests.get(url, headers=user_agent, cookies=auth_cookies, data=request_payload, params=querystring)
        print(response)
        return response.json()
In the code above, the action URL is the API given in the AJAX part, and the URL in the second request is the API address for that GET.
In short, may I know how I can implement multi-factor authentication with Python requests?
My doubts:
Do we need to include the cookies from the login form page in the login request?
How do I implement multi-factor authentication with Python requests? (Here we don't need a PIN or anything similar; it is done through RSA.) Is a certificate needed for login, since it now raises an error about being unable to validate the SSL certificate?
Could you give a dummy example API that implements this kind of scenario?

No, you are making it too complex. This code worked:
import requests

login_url = "https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php"
welcome_url = "https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php"
payload = 'user_email=test#phpzag.com&password=test&login_button='
login_headers = {
    'x-requested-with': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded',  # it's urlencoded instead of form-data
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
}
s = requests.Session()
login = s.post(login_url, headers=login_headers, data=payload)  # POST request
welcome = s.get(welcome_url, headers=login_headers)
print(welcome.text)
Result:
.....Hello, <br><br>Welcome to the members page.<br><br>

TL;DR
Change the part of your code that says data=payload to json=payload, and it should work.
Direct answer to your question
How [does one] implement [an] AJAX request using Python Requests?
You cannot do that. An AJAX request specifically refers to a JavaScript-based HTTP request. To quote from the W3Schools AJAX introduction page, "AJAX = Asynchronous JavaScript And XML".
Indirect answer to your question
What I believe you're asking is how to perform auth/login HTTP requests using the popular Python package requests. The short answer, unfortunately and like most things, is that it depends. Different auth pages handle auth requests differently, so you might have to do different things in order to authenticate against a specific web service.
Based on your code
Based on your code, I'm going to assume that the login page is looking for a POST request with the authentication details (e.g. credentials) in the form of a JSON object. I am also going by the server's response, a 406 (Not Acceptable) error, which generally means the request was sent in a way that does not line up with how the server wants to respond.
When using requests, the data parameter sends the data "raw": strings and bytes are sent as they are (as in the case of binary data), while a dictionary is translated to standard HTML form data (e.g. key1=value1&key2=value2&key3=value3; this form has the MIME type application/x-www-form-urlencoded, which is what requests sets as the Content-Type when data is a dictionary). My educated guess, based on the fact that you put your credentials into a dictionary, is that the login form expects a POST request with a JSON-formatted body (many modern web apps do this), and that you were under the impression that passing the dictionary as data would turn it into a JSON object. This is a common gotcha with requests that has bitten me before. What you want instead is to pass the dictionary using the json parameter.
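To make the difference concrete, here is a small standard-library sketch of the two encodings, using the payload from the question. It only approximates what requests puts in the body; requests additionally sets the Content-Type header to application/x-www-form-urlencoded for a dictionary passed as data and to application/json for json.

```python
import json
from urllib.parse import urlencode

payload = {'user_email': 'test#phpzag.com', 'password': 'test'}

# Roughly what data=payload puts in the request body (form encoding).
form_body = urlencode(payload)
print(form_body)  # user_email=test%23phpzag.com&password=test

# Roughly what json=payload puts in the request body (JSON encoding).
json_body = json.dumps(payload)
print(json_body)  # {"user_email": "test#phpzag.com", "password": "test"}
```

Which of the two a given login endpoint accepts depends on the server, so it is worth checking the request the browser sends in the network console.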
Your code:
from requests import Session
import requests
INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test#phpzag.com','password':'test'}
s = requests.Session()
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
t=s.post(LOGIN_URL, data=payload, headers=user_agent)
r=s.get('https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',headers=user_agent,cookies=t.cookies.get_dict())
print(r.content)
Fixed (and cleaned up) code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Test script to login to php web app.
"""
import requests

INDEX_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/index.php'
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php'  # Or whatever the login request url is
payload = {
    'user_email': 'test#phpzag.com',
    'password': 'test'
}
headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
session = requests.Session()
auth_response = session.post(
    url=LOGIN_URL,
    json=payload,  # <--- THIS IS THE IMPORTANT BIT. Note: data param changed to json param
    headers=headers
)
response = session.get(
    'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php',
    headers=headers,
    cookies=auth_response.cookies.get_dict()  # TODO: not sure this is necessary, since you're using the session object to initiate the request, so that should maintain the cookies/session data throughout the session...
)
print(response.content)
Check out the section of the requests documentation on POST requests. If you scroll down a bit from there, you'll see the docs talk about the GitHub API, which expects JSON, and how to handle that.
Auth can be tricky overall. Sometimes a service will want "basic auth", which requests expects you to pass as a tuple to the auth parameter; sometimes it will want a bearer token or an OAuth flow, which can get headache-inducingly complicated.
Hope this helps!

You are missing the User-Agent header that the server (Apache?) requires.
Try this:
import requests
from requests import Session
URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/welcome.php'
LOGIN_URL = 'https://phpzag.com/demo/ajax_login_script_with_php_jquery/login.php' # Or whatever the login request url is
payload = {'user_email': 'test#phpzag.com','password':'test'}
user_agent = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
s = requests.Session()
x=s.get(URL, headers=user_agent)
x=s.post(LOGIN_URL, data=payload, headers=user_agent)
print(x.content)
print(x.status_code)

Take a look at Requests: Basic Authentication
import requests
requests.post(URL, auth=('user', 'pass'))
# If there are some cookies you need to send
cookies = dict(cookies_are='working')
requests.post(URL, auth=('user', 'pass'), cookies=cookies)

Python, sending a request with headers and cookies

I am sending a Python request to a server that requires authentication in order to download a file. I am trying to send both a cookie and a header with the request. What is the right format for this request?
In the Chrome developer tools, I can see the request header as:
The Python request:
Session = requests.Session()
cookies = browser.get_cookies()
response = Session.get(url)
tt = "ASPSESSIONIDSGSRSATR"
cookie = {tt: Session.cookies.get_dict().get(tt, ""),
          cookies[2].get("name", ""): cookies[2].get("value", ""),
          cookies[0].get("name", ""): cookies[0].get("value", "")}
header = {"Host": "ecsxxxxxxxxxxxxxxxxxxxxx",
          "Connection": "keep-alive",
          "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
          "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36",
          "Referer": "https://ecsxxxxxxxxxxxxxxxxxxxxx",
          "Accept-Encoding": "gzip, deflate, sdch",
          "Accept-Language": "en-US,en;q=0.8,fr;q=0.6"}
response = Session.get(url, cookies=cookie, headers=header)

Comments: ... how does that accept the username/pword?
Using the Python Kerberos Module
Kerberos Basics
When setting up Kerberos authentication on a server, there are two basic modes of operation.
The simplest from a client implementation point of view just uses Basic Auth to pass a username and password to the server, which then checks them with the Kerberos realm.
Comments: it's a kerberos server
There is an optional requests Kerberos/GSSAPI authentication library:
This library adds optional Kerberos/GSSAPI authentication support and supports mutual authentication.
Basic GET usage:
import requests
from requests_kerberos import HTTPKerberosAuth
r = requests.get("http://example.org", auth=HTTPKerberosAuth())
SO: session auth in python
