I'm using Python requests to send HTTP requests to www.fredmeyer.com.
I can't even get past an initial GET request to this domain: a simple requests.get results in the connection hanging and never timing out. I've verified that I have access to this domain and am able to run the request on my local machine. Can anyone replicate this?
The site seems to have some filtering enabled to prohibit bots or similar. The following HTTP request works currently with the site:
GET / HTTP/1.1
Host: www.fredmeyer.com
Connection: keep-alive
Accept: text/html
Accept-Encoding:
If the Connection header is removed or its value is changed to close, the request hangs. If the (empty) Accept-Encoding header is missing, it also hangs. If the Accept line is missing, the server returns 403 Forbidden.
In order to access this site with requests the following currently works for me:
import requests

# 'User-Agent': None tells requests to drop its default User-Agent header entirely
headers = {'Accept': 'text/html', 'Accept-Encoding': '', 'User-Agent': None}
resp = requests.get('https://www.fredmeyer.com', headers=headers)
print(resp.text)
Note that the heuristics used by the site to detect bots might change, so this might stop working in the future.
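Independent of the header details, it can also help to pass a timeout so the call fails fast instead of hanging indefinitely if the site's heuristics change again; a small sketch (the 10-second value is arbitrary):
resp = requests.get('https://www.fredmeyer.com', headers=headers, timeout=10)  # raises a Timeout exception instead of hanging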
Related
I am scraping search results from Google using the people_also_ask module. The module itself doesn't have a method for using proxies, but I manually added proxy support to the module. When I got blocked by Google, I printed the response status and it said my IP address was banned from sending requests. The code I added to the people_also_ask module to use proxies is:
proxies = {
    'http': 'http://username:password@ip:port'
}
response = SESSION.get(URL, params=params, headers=HEADERS, proxies=proxies)
I know it is an illegal activity, but I mainly want to know why it happens, for educational purposes. I think the code that extracts the data is irrelevant, so I am adding simple code that sends a request using the people_also_ask module:
import people_also_ask as paa

queries = ["how to boil eggs", "how to make cake", "price of poco f1", "price of wooden table", "best soap in us", "how much tesla worth"]
for query in queries:
    questions = paa.get_related_questions(query, 40)
Note: the changes were made in the first function, search(), in google.py of the people_also_ask module.
Note: I can run the same searches from my browser without any problem. Why does Google let me search from the browser but block the script?
The answer is quite simple. Although it is a proxy service, it doesn't guarantee 100% anonymity. When you send the HTTP GET request via the proxy server, the request sent by your program to the proxy server is:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Now, when the proxy server sends this request to the actual destination, it sends:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Via: 1.1 naxserver (squid/3.1.8)
X-Forwarded-For: 122.126.64.43
Cache-Control: max-age=18000
Connection: keep-alive
As you can see, the proxy puts your IP (in my case, 122.126.64.43) in the X-Forwarded-For HTTP header, and hence the website knows that the request was sent on behalf of 122.126.64.43.
Read more about this header at: https://www.rfc-editor.org/rfc/rfc7239
If you want to host your own squid proxy server and want to disable setting X-Forwarded-For header, read: http://www.squid-cache.org/Doc/config/forwarded_for/
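If you want to see exactly what a destination receives through your proxy, a quick sanity check is to request a public echo endpoint such as httpbin.org/headers via the proxy (a rough sketch; the proxy URL and credentials are placeholders):
import requests

proxies = {
    'http': 'http://username:password@proxy-host:3128',   # placeholder proxy
    'https': 'http://username:password@proxy-host:3128',
}
# httpbin echoes back the headers it received, so any X-Forwarded-For / Via added by the proxy shows up here
resp = requests.get('http://httpbin.org/headers', proxies=proxies, timeout=10)
print(resp.json())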
I don't take any credit for this answer; I copied it from the following post I found: Python Requests module - proxy not working.
Does anyone have any ideas on why the urllib2 version returns the webpage, while the Requests version returns a connection error:
[Errno 10060] A connection attempt failed because the connected party
did not properly respond after a period of time, or established
connection failed because connected host has failed to respond.
Urllib2 code (Working):
import urllib2

proxy = urllib2.ProxyHandler({'http': 'http://login:password@proxy1.com:80'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
wPage = urllib2.urlopen('http://www.google.com/')
print wPage.read()
Requests code (Not working - Errno 10060):
import requests
proxy = {"http": "http://login:password#proxy1.com:80"}
wPage = requests.get('http://www.google.com/', proxies=proxy)
print wPage.text
The requests version returns intranet webpages, but gives an error on internet pages.
I am running Python 2.7
* Edit *
Based on m170897017's suggestion, I looked for differences in the GET requests. The only difference was in Connection and Proxy-Connection.
Urllib2 version:
header: Connection: close
header: Proxy-Connection: close
Requests version :
header: Connection: Keep-Alive
header: Proxy-Connection: Keep-Alive
I forced the Requests version to close both of those connections by modifying the header
header = {
"Connection": "close",
"Proxy-Connection": "close"
}
The GET request for both now match, however the Requests version still does not work.
Try this:
import urllib2
proxy = urllib2.ProxyHandler({'http': '1.1.1.1:9090'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com/')
datum = response.read().decode("UTF-8")
response.close()
print datum
A little late... but for future reference, this line:
proxy = {"http": "http://login:password@proxy1.com:80"}
should also have a second key/value pair for https, even if it isn't going to be used.
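A minimal sketch of the corrected mapping, assuming the same proxy handles both schemes (host and credentials are placeholders):
import requests

proxies = {
    'http':  'http://login:password@proxy1.com:80',
    'https': 'http://login:password@proxy1.com:80',   # present even if only plain-HTTP URLs are fetched
}
wPage = requests.get('http://www.google.com/', proxies=proxies, timeout=10)
print(wPage.text)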
There is also an awesome package called proxy-requests that does something very similar:
pip3 install proxy-requests
https://pypi.org/project/proxy-requests/
I have to make a basic proxy which intercepts the browser's requests and sends back a standard response. It doesn't work when I try to send the response in reply to an HTTPS request. The code I'm using is:
#after the server socket starts listening
conn, addr = server.accept()
request = conn.recv(4096)
print request
conn.send(b"HTTP/1.1 200 OK\n\n<p>Hello</p>")
conn.close()
Now for https requests, e.g.:
Got request:
CONNECT www.google.com:443 HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0
Proxy-Connection: keep-alive
Connection: keep-alive
Host: www.google.com
I've tried sending the same response, but the browser shows "The connection was interrupted". The response has certainly been sent, though. Am I right in thinking that, to overcome this, I need to get an SSL certificate and send the response through an SSL socket?
(I'm not asking this because I'm too lazy to try it out, but setting up the certificate would take some time, so I'd like to verify with someone who knows before wasting hours on a wrong hypothesis.)
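For reference, the two usual ways a proxy can handle a CONNECT request look roughly like this (an untested sketch; "proxy.crt"/"proxy.key" are placeholder filenames, and the browser would have to be configured to trust that certificate):
import socket, ssl

def tunnel(conn, host, port):
    # Plain CONNECT tunnel: no certificate needed, the proxy just relays encrypted bytes.
    upstream = socket.create_connection((host, port))
    conn.send(b"HTTP/1.1 200 Connection established\r\n\r\n")
    # ...then shuttle data between conn and upstream (select/threads) until either side closes.

def answer_directly(conn):
    # Serving your own response instead requires completing a TLS handshake yourself,
    # which is why a certificate trusted by the browser is needed.
    conn.send(b"HTTP/1.1 200 Connection established\r\n\r\n")
    tls_conn = ssl.wrap_socket(conn, server_side=True, certfile="proxy.crt", keyfile="proxy.key")
    tls_conn.recv(4096)  # the browser's real GET arrives here, already encrypted
    tls_conn.send(b"HTTP/1.1 200 OK\r\n\r\n<p>Hello</p>")
    tls_conn.close()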
I'm working on a project that accesses a specific site, runs a search, and then filters and returns the value. The program logs in, saves the cookie with a cookie jar to authenticate the connection, and then runs the search. However, when I run the program it returns no results, and the packet header looks completely different from the browser's. What am I doing wrong that the search always returns no results?
Here is my code:
import cookielib, urllib, urllib2
file= open('results.txt', 'wb')
cj=cookielib.CookieJar()
opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [
    ('Referer', 'http:// site that runs the search/psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL'),
    ('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0'),
]  # set both headers in one list; assigning addheaders twice replaces the first assignment and drops the Referer
posts={'timezoneOffset':'180', 'userid':'user', 'pwd':'password', 'Submit':'Signon'}
data = urllib.urlencode(posts)
opens=opener.open('loginpage.com', data)
print cj
file.write(opens.read())
cjs=str(cj)
posts2 = urllib.urlencode({'ICType':'Panel', 'ICElementNum':0, 'ICStateNum':1, 'ICAction':'SRCH_ATD_TAP_WK_SRCH_PB', 'ICXPos':0, 'ICYPos':0, 'ICFocus':'', 'ICChanged':1, 'ICResubmit':0, 'ICFind':'', 'SRCH_ATD_TAP_WK_MSISDN_TAP':'', 'SRCH_ATD_TAP_WK_CNPJ_TAP':'', 'SRCH_ATD_TAP_WK_STATUS_RA_TAP':'', 'SRCH_ATD_TAP_WK_INTERACTION_ID':'', 'SRCH_ATD_TAP_WK_CASE_ID':48373914, 'SRCH_ATD_TAP_WK_PROTOCOLO_TAP':'', 'SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_INI_RA_TAP':'', 'SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP':'', 'SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP':0, 'SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP':0, 'SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP':'','SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP':''})
url2='searchpage.com'
opens2 = opener.open(url2, posts2)
body = opens2.read()  # renamed from str to avoid shadowing the built-in
print cj
file.write(body + cjs)
file.close()
It connects first to the login page to save the cookie, and then it connects to the search page. Again, this is only meant to be used on one site, so the connections and post data are very specific.
Again, this code doesn't return any results (after searching the response body, which contains the entire unfiltered page).
Here are the results I get when inspecting the requests with Wireshark. The first one is the search run normally in Firefox (including the post data sent), and the second one is my program running and automating the search for me.
POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: pt-BR,pt;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: site that runs the search/BANNER_TAP.SRCH_ATDO_TAP.GBL #note I wasn't able to create this header.
Cookie: SignOnDefault=my login id; PS_LOGINLIST=http:// siteroot; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=dpLmTCpY8vTmj4nMHbpyptPMdvphpRLR!841308261; ExpirePage=http:// siteroot/psp/p01ps1/; PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSfJDUA/BR2T3ekF0/cVhdJ7uJlpgAAAGIABVNkYXRhVnicHYpBCoAgFESfFi2jixRqYrgO2hbWvjN0vw7X5B94bxg+8BjbtBh09v05kJlxpGq1joOd0ksnGxc3KyUS9OSJjHIQPUtlYNLqK52Ya5Li+ABuIwtr; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:||||||; PS_360=PS_360_BO_ID_CUST!0!PS_360_CUST_SETID!!PS_360_BO_ID_CONT!0!PS_360_BO_ID_SITE!0!PS_360_CUST_ROLE!0!PS_360_CONT_ROLE!0!PS_360_BO_ID!0!PS_360_VIEW_OPTION!False; PS_TOKENEXPIRE=18_Feb_2014_00:04:41_GMT; HPTabName=DEFAULT
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 683
POST DATA: ICType=Panel&ICElementNum=0&ICStateNum=17&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&ICXPos=0&ICYPos=84&ICFocus=&ICChanged=1&ICResubmit=0&ICFind=&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_INTERACTION_ID=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=
POST /psc/p01ps1/EMPLOYEE/CRM/c/BANNER_TAP.SRCH_ATDO_TAP.GBL HTTP/1.1
Accept-Encoding: identity
Content-Length: 681
Host: siteroot
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
Connection: close
Cookie: PS_TOKEN=AAAAogECAwQAAQAAAAACvAAAAAAAAAAsAARTaGRyAgBOcQgAOAAuADEAMBSX+ZILWKx7oU/VKvJbVT8LbueJtwAAAGIABVNkYXRhVnicJYpLCoAwDAWnVVyKF1Hsh2rXgluluvcM3s/DGWNCZh6PALexVY1Bxj4fOzKBkaSW1LCzUVrRwcrJxUKJeHlyRHqxFzomZWCQZlYm5b9Z7gVtawtT; ExpirePage=siteroot; PS_LOGINLIST=siteroot; PS_TOKENEXPIRE=18_Feb_2014_00:08:09_GMT; brux0128-claro-com-br-7090-PORTAL-PSJSESSIONID=QG14TCkJK7PpfRtNH0CSCw9S1m6jtRR9!841308261; SignOnDefault=my login id; http%3a%2f%2fsiteroot%2fpsp%2fp01ps1%2femployee%2fcrm%2frefresh=list:
Content-Type: application/x-www-form-urlencoded
POST DATA: SRCH_ATD_TAP_WK_DATA_INI_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID4_TAP=0&ICResubmit=0&ICXPos=0&SRCH_ATD_TAP_WK_DATA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_PROTOCOLO_TAP=&SRCH_ATD_TAP_WK_SUBTIPO_CLI_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID3_TAP=0&ICAction=SRCH_ATD_TAP_WK_SRCH_PB&SRCH_ATD_TAP_WK_MOTIVO_ID5_TAP=0&ICElementNum=0&SRCH_ATD_TAP_WK_INTERACTION_ID=&ICType=Panel&SRCH_ATD_TAP_WK_STATUS_RA_TAP=&SRCH_ATD_TAP_WK_COMPANY_TYPE_TAP=&SRCH_ATD_TAP_WK_HORA_FIM_BAN_TAP=&SRCH_ATD_TAP_WK_MOTIVO_ID2_TAP=0&ICFind=&SRCH_ATD_TAP_WK_MOTIVO_ID1_TAP=0&SRCH_ATD_TAP_WK_HORA_INI_RA_TAP=&ICChanged=1&ICStateNum=1&ICYPos=0&ICFocus=&SRCH_ATD_TAP_WK_CASE_ID=48373914&SRCH_ATD_TAP_WK_MSISDN_TAP=&SRCH_ATD_TAP_WK_CNPJ_TAP=
(This is for personal use at the company I work at, to simplify a task that currently has to be done around 500 times manually. It is a site that registers protocols, and we need to search the protocols to check whether each one is closed or not; later I will import the list from Excel.)
Note that I don't send the additional headers, but if that could solve the problem I can add them. For some reason my post data comes out in a different order (though from what I understand about post data, that shouldn't make a difference), and the cookie information is also somewhat out of order, but that also shouldn't matter, I would assume, because the cookie info is handled much like a Python dictionary.
I've been breaking my head over this little piece of code, rewriting it several times over the past two weeks, and I still can't get it to return the search results.
It's also important to note that I won't be able to install a browser engine to execute the JavaScript, but I also don't think that's necessary, because the results of the search done in Firefox show up in Wireshark, so the page is downloaded with the results. I was able to get mechanize running, but I haven't been able to try it yet. If there is a way to automate Firefox (I don't remember which version at this moment) with Python, that is an option I'm open to.
One more thing: because I'm working on this project at work, I'm not able to use any Python package that has to be installed. I got mechanize to work because I opened and copied the files over without running setup.py. So, just to make things easier, assume I have no way to install libraries.
You don't have PS_360 set in your cookie. I'm not sure how essential this is, but the best strategy for these issues is to work toward requests that are identical step by step. Probably the first request that sets your cookies was already different, or your browser has cookie data from previous requests that you need to create manually for your request.
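One way to do that comparison (a rough sketch reusing the cj CookieJar from the code above) is to dump what the jar actually holds after the login request and diff it against the browser's Cookie header from the Wireshark capture:
for cookie in cj:
    print cookie.name + '=' + cookie.value  # look for missing entries such as PS_360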
I'm trying to automate form filling on a SharePoint site, but my Python script can't get past the authentication box that pops up when you type in the URL below.
from base64 import b64encode
import mechanize
url = 'http://moss.micron.com/MFG/ProbeTest/Lists/Manufacturing%20Requests/AllItems.aspx'
username = 'username'
password = 'password'
# I have had to add a carriage return ('%s:%s\n'), but
# you may not have to.
b64login = b64encode('%s:%s' % (username, password))
br = mechanize.Browser()
br.addheaders.append(
('Authorization', 'Basic %s' % b64login, )
)
br.open(url)
This results in the following error:
EDIT:
Here are the results of running wget on the requested page.
--2013-08-30 11:16:17-- http://moss.micron.com/MFG/ProbeTest/Lists/Manufacturing%20Requests/AllItems.aspx
Resolving moss.micron.com... 137.201.88.118
Connecting to moss.micron.com|137.201.88.118|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 401 Unauthorized
Server: Microsoft-IIS/7.0
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
X-Powered-By: ASP.NET
MicrosoftSharePointTeamServices: 12.0.0.6341
Date: Fri, 30 Aug 2013 17:16:17 GMT
Connection: keep-alive
Content-Length: 0
Authorization failed.
mechanize is respecting the site's robots.txt, which disallows your request.
You can tell your mechanize.Browser to ignore robots.txt before making the request via:
br.set_handle_robots(False)
Alternatively, edit your robots.txt to allow that sort of connection.
You could also set a custom User-Agent header in your mechanize.Browser, which would allow you to filter for it there.
See here for basic info about robots.txt.
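Putting those suggestions together, a minimal sketch (credentials and the User-Agent string are placeholders) could look like:
import mechanize
from base64 import b64encode

br = mechanize.Browser()
br.set_handle_robots(False)  # stop honouring robots.txt
br.addheaders = [
    ('User-Agent', 'my-sharepoint-script/1.0'),                      # custom UA so it can be filtered for
    ('Authorization', 'Basic %s' % b64encode('username:password')),  # same Basic auth header as before
]
response = br.open('http://moss.micron.com/MFG/ProbeTest/Lists/Manufacturing%20Requests/AllItems.aspx')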
If you can get to the site with a PC, download Fiddler2, which will let you see the transactions involved when you log in.
Edit: OK, obviously you have a PC.