I am scraping Google search results using the people_also_ask module. The module itself has no method for using proxies, so I manually added proxy support to it. When I got blocked by Google, I printed the response status and it said my IP address was banned from sending requests. The code I added to the people_also_ask module to use proxies is:
proxies = {
    'http': "http://username:password@ip:port"  # note '@' between credentials and host, not '#'
}
response = SESSION.get(URL, params=params, headers=HEADERS, proxies=proxies)
I know it is an illegal activity, but I mainly want to know why it happens, for educational purposes. I think the data-extraction code is irrelevant, so here is simple code that sends requests using the people_also_ask module:
import people_also_ask as paa
queries = ["how to boil eggs","how to make cake","price of poco f1","price of wooden table","best soap in us","how much tesla worth"]
for query in queries:
    questions = paa.get_related_questions(query, 40)
Note: The changes were made in the first function, named search(), in google.py of the people_also_ask module.
Note: I can search from the browser without any problem. Why does Google let me search from the browser but block the script?
The answer is quite simple. Although it is a proxy service, it doesn't guarantee 100% anonymity. When you send an HTTP GET request via the proxy server, the request sent by your program to the proxy server is:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Now, when the proxy server sends this request to the actual destination, it sends:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Via: 1.1 naxserver (squid/3.1.8)
X-Forwarded-For: 122.126.64.43
Cache-Control: max-age=18000
Connection: keep-alive
As you can see, the proxy puts your IP (in my case, 122.126.64.43) into the X-Forwarded-For HTTP header, so the website knows the request was sent on behalf of 122.126.64.43.
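You can verify what a destination actually sees through your proxy with an echo service such as httpbin.org. A quick sketch (the proxy URL is a placeholder):
import requests

# Placeholder proxy credentials, format: http://username:password@ip:port
proxies = {'http': 'http://username:password@ip:port'}

# httpbin echoes back the origin IP and the headers it received
resp = requests.get('http://httpbin.org/get', proxies=proxies)
print(resp.json()['origin'])   # the IP address the destination sees
print(resp.json()['headers'])  # check for X-Forwarded-For and Via here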
Read more about this header at: https://www.rfc-editor.org/rfc/rfc7239
If you want to host your own squid proxy server and want to disable setting X-Forwarded-For header, read: http://www.squid-cache.org/Doc/config/forwarded_for/
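Per that documentation, it is a one-line directive in squid.conf; as a sketch, delete strips the header entirely, while off replaces your IP with "unknown":
# in squid.conf
forwarded_for delete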
I don't take any credit for this answer; I copied it from the following post I found: Python Requests module - proxy not working
Normally, when sending an HTTP request, the actual traffic looks like this:
GET /abc?hello HTTP/1.1
Host: localhost:8080
User-Agent: python-requests/2.7
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
However, I would like to send URLs without the leading slash, for example:
GET abc?hello HTTP/1.1
GET ftp://abc?hello HTTP/1.1
I understand that's not compliant with the RFCs, but I just need to send such a request for testing purposes in Python.
I have checked requests, urllib, urllib2, and urllib3, but haven't figured out how to do it.
Can anyone help me out?
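One approach (a sketch; the host and port are placeholders): requests and the urllib family normalize the request line, so write the raw bytes over a plain socket instead.
import socket

# Hand-build the request line so nothing inserts the leading slash
raw = (b"GET abc?hello HTTP/1.1\r\n"
       b"Host: localhost:8080\r\n"
       b"Connection: close\r\n"
       b"\r\n")

with socket.create_connection(("localhost", 8080)) as s:
    s.sendall(raw)
    print(s.recv(4096).decode(errors="replace"))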
I'm using Python requests to send HTTP requests to www.fredmeyer.com.
I can't even get past an initial GET request to this domain. Doing a simple requests.get results in the connection hanging and never timing out. I've verified I have access to this domain and am able to run the request on my local machine. Can anyone replicate this?
The site seems to have some filtering enabled to prohibit bots or similar. The following HTTP request works currently with the site:
GET / HTTP/1.1
Host: www.fredmeyer.com
Connection: keep-alive
Accept: text/html
Accept-Encoding:
If the Connection header is removed or its value changed to close, it will hang. If the (empty) Accept-Encoding header is missing, it will also hang. If the Accept line is missing, it returns 403 Forbidden.
In order to access this site with requests the following currently works for me:
import requests

# The empty Accept-Encoding and suppressed User-Agent are deliberate:
# requests would otherwise fill them in, and the site then hangs or blocks.
headers = {'Accept': 'text/html', 'Accept-Encoding': '', 'User-Agent': None}
resp = requests.get('https://www.fredmeyer.com', headers=headers)
print(resp.text)
Note that the heuristics used by the site to detect bots might change, so this might stop working in the future.
I need to log into a website using Python, but the login page requires a session-ID cookie in the request header. Using Google developer tools along with a web client (hurl.it), I was able to determine the required format of the request header that is acceptable to the webserver:
Accept: */*
Accept-Encoding: gzip, deflate
Content-Length: 85
Content-Type: application/x-www-form-urlencoded
Cookie: www_amsterdam-dance-event_nl_session=l9Abno8a1UyHPof%2BOyVqk8BxHjesGMi78z6Ot0ZXCCbI%2BxVKqjm30ALTfW%2FR7yKcDaqfEtFOyysTrjIeU8lU5ylv1TOlW6GLHY8jDfKKWSULKsUUJiTh92DbvkuYBuE6zt%2FeLs44lDna6Nz3uMCOaSARN7gCpoSz0TOcFaes8Hk9q6FikP1F9e%2B%2FsMwfUP0RTA0Rc5gJFyJPxHXNCdn%2BT49mhHYnzoIWVlxGHhlaEkZX1PPsYx1xq0BCgpb0WnPViuiZiBnQY2nz%2BBO4Uur0WPNfpSSWZg5Qxz79nYeChlRe16JhYjVOdaiUhnfEvp1jM7h%2BCdR6cUeatd7HGbftRCjINDrVuPeyB5ltVihStmzKEjOmWetI0xNuaNswsPIKKuo%2BV6JFNfdLcA6h3iy1K8o%2FA49tKGMP2rmGe4e5Jec%3Df395212364d1ffc80cf95ebf5abf3b40f9dc6441;
User-Agent: runscope/0.1
email=******%40beatswitch.com&login_token=545a46230b291&password=*****&submission=
I have produced the following request using Python requests module:
POST /my-ade/login/ HTTP/1.1
Host: www.amsterdam-dance-event.nl
Content-Length: 85
Accept-Encoding: gzip,deflate
Accept: */*
User-Agent: runscope/0.1
Connection: keep-alive
Cookie: www_amsterdam-dance-event_nl_session=l9Abno8a1UyHPof%2BOyVqk8BxHjesGMi78z6Ot0ZXCCbI%2BxVKqjm30ALTfW%2FR7yKcDaqfEtFOyysTrjIeU8lU5ylv1TOlW6GLHY8jDfKKWSULKsUUJiTh92DbvkuYBuE6zt%2FeLs44lDna6Nz3uMCOaSARN7gCpoSz0TOcFaes8Hk9q6FikP1F9e%2B%2FsMwfUP0RTA0Rc5gJFyJPxHXNCdn%2BT49mhHYnzoIWVlxGHhlaEkZX1PPsYx1xq0BCgpb0WnPViuiZiBnQY2nz%2BBO4Uur0WPNfpSSWZg5Qxz79nYeChlRe16JhYjVOdaiUhnfEvp1jM7h%2BCdR6cUeatd7HGbftRCjINDrVuPeyB5ltVihStmzKEjOmWetI0xNuaNswsPIKKuo%2BV6JFNfdLcA6h3iy1K8o%2FA49tKGMP2rmGe4e5Jec%3Df395212364d1ffc80cf95ebf5abf3b40f9dc6441;
Content-Type: application/x-www-form-urlencoded
login_token=545a46230b291&password=*****&email=******%40beatswitch.com&submission=
When I load the former request header with hurl.it, everything works perfectly and the webserver lets me log in, but trying the almost-identical request with the same parameters fails in Python. While using Python's requests, the webserver presents an error page. Any help would be highly appreciated; I need a solution desperately.
EDIT:
Here is the code:
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Open the login page to get the session ID and login_token
loginURL = "https://www.amsterdam-dance-event.nl/my-ade/login/"
loginReq = session.get(loginURL)
loginSoup = BeautifulSoup(loginReq.text, 'html.parser')
loginToken = loginSoup.find('input', attrs={'name': 'login_token'})['value']
sessionID = loginReq.cookies['www_amsterdam-dance-event_nl_session']
cookie = 'www_amsterdam-dance-event_nl_session=' + sessionID

# Construct the headers and post the form to the webserver
headers = {'Content-Length': '85', 'Accept': '*/*', 'User-Agent': 'runscope/0.1', 'Content-Type': 'application/x-www-form-urlencoded', 'Accept-Encoding': 'gzip,deflate', 'Cookie': cookie}
payload = {'email': '*******@beatswitch.com', 'password': '********', 'login_token': loginToken, 'submission': ''}
loggedinReq = session.post(loginURL, headers=headers, data=payload)
I found the solution, thanks to Md. Mohsin. I was trying to handle the request headers and cookies manually, while the requests module can handle them by itself. So I removed the following line from the code and let requests take total control, and everything worked as intended:
headers = {'Content-Length': '85', 'Accept': '*/*', 'User-Agent': 'runscope/0.1', 'Content-Type': 'application/x-www-form-urlencoded', 'Accept-Encoding': 'gzip,deflate', 'Cookie': cookie}
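For reference, a minimal sketch of the corrected flow (same URL and form fields as above, placeholder credentials): with no manual headers, requests computes Content-Length and carries the session cookie itself.
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps the session cookie between requests

loginURL = "https://www.amsterdam-dance-event.nl/my-ade/login/"
loginSoup = BeautifulSoup(session.get(loginURL).text, 'html.parser')
loginToken = loginSoup.find('input', attrs={'name': 'login_token'})['value']

payload = {'email': 'user@example.com', 'password': 'secret',
           'login_token': loginToken, 'submission': ''}
# No Cookie or Content-Length headers: requests fills both in itself
loggedinReq = session.post(loginURL, data=payload)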
I have to make a basic proxy which intercepts the browser's requests and sends back a standard response. It doesn't work when I try to send the response to an HTTPS request. The code I'm using is:
# after the server socket starts listening
conn, addr = server.accept()
request = conn.recv(4096)
print(request)
# note the CRLF blank line separating the headers from the body
conn.send(b"HTTP/1.1 200 OK\r\n\r\n<p>Hello</p>")
conn.close()
Now, for HTTPS requests, e.g.:
Got request:
CONNECT www.google.com:443 HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0
Proxy-Connection: keep-alive
Connection: keep-alive
Host: www.google.com
I've tried sending the same response, but the browser shows "The connection was interrupted". The response has certainly been sent, though. Am I right in thinking that, to overcome this, I need to get an SSL certificate and send the response through an SSL socket?
(I'm not asking this because I'm too lazy to try it out, but setting up the certificate should take some time so I'd like to verify with someone who knows before wasting hours on a wrong hypothesis)
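For what it's worth, that hypothesis can be sketched as follows (assuming you have generated proxy-cert.pem and proxy-key.pem and the browser trusts that certificate; otherwise it will warn or abort): acknowledge the CONNECT first, then wrap the same client socket in TLS.
import ssl

# Acknowledge the tunnel request in plain text, with CRLF line endings
conn.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")

# Then speak TLS on the same socket, presenting our own certificate
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile="proxy-cert.pem", keyfile="proxy-key.pem")
tls_conn = ctx.wrap_socket(conn, server_side=True)

request = tls_conn.recv(4096)  # the decrypted HTTPS request
tls_conn.sendall(b"HTTP/1.1 200 OK\r\n\r\n<p>Hello</p>")
tls_conn.close()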
I've written a simple HTTP web server using Python, but I've noticed that when I connect to it, the HTML page appears in the browser window, yet the indicator in the Chrome tab continues to spin and the server receives empty strings. This continues until I click the 'X' to stop loading the page. Could someone please explain why this is happening and how to fix it? Also, if the HTTP headers are wrong or I'm missing important ones, please tell me. I found it very difficult to find information on HTTP headers and commands.
You can find the code here.
Link to image of network tab
Console output:
Socket created
Bound socket
Socket now listening
Connected with 127.0.0.1:55146
Connected with 127.0.0.1:55147
Received data: GET / HTTP/1.1
Host: localhost
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36
DNT: 1
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Parsing GET command
Client requested directory /index.html with HTTP version 1.1
html
/index.html
Reply headers:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Server: Max's Python Web Server/1.0
Cache-Control: max-age=600, public
Connected with 127.0.0.1:55148
Received data:
Received data:
Received data:
Received data:
The fault is in how you think about sockets: socket.recv will wait forever for data from clients, so you don't need a loop here. However, the amount you read in one call is limited by the recv parameter. If you want to allow requests of any size, you should detect the end of the data as the HTTP specification describes: when you are only waiting for headers, a double line feed (an empty line) marks their end, and the size of the body (for a POST request, for example) should be passed in the Content-Length header, as far as I know.
Your issue is the same as in this question: link
And read the HTTP specification if you want to build a correct HTTP server.
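A minimal sketch of that idea (the helper name is mine; it assumes no chunked transfer encoding and no timeouts):
import socket

def recv_http_request(conn: socket.socket) -> bytes:
    """Read one HTTP request: headers up to the empty line, then as much
    body as the Content-Length header announces."""
    data = b""
    # Headers end at the first blank line (CRLF CRLF)
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(4096)
        if not chunk:  # client closed the connection early
            return data
        data += chunk
    headers, _, body = data.partition(b"\r\n\r\n")
    # Find how many body bytes Content-Length promises
    length = 0
    for line in headers.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    # Keep reading until the whole body has arrived
    while len(body) < length:
        chunk = conn.recv(4096)
        if not chunk:
            break
        body += chunk
    return headers + b"\r\n\r\n" + body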