Scrapy returns response status 400, but the browser response is OK? - python

I have a strange situation.
I have a link that works in all the browsers I currently have (Chrome, IE, Firefox).
I tried to crawl the page using Scrapy in Python, but I get response.status == 400.
I am using Tor + Polipo to crawl anonymously.
response.body is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head>
<title>Proxy error: 400 Couldn't parse URL.</title>
</head><body>
<h1>400 Couldn't parse URL</h1>
<p>The following error occurred while trying to access <strong>https://exmpale.com/blah</strong>:<br><br>
<strong>400 Couldn't parse URL</strong></p>
<hr>Generated Thu, 11 Dec 2014 13:55:38 UTC by Polipo on <em>localhost:8123</em>.
</body></html>
I'm just wondering why that would be. Could it be that a browser can get the page but Scrapy cannot?

Related

What is the correct 3-legged OAuth workflow in Python (example: ImmoScout API)? (How to get the request_token)

I am trying to access the ImmoScout24 web API for a data science project in Python, and I am stuck in the 3-legged authentication process. I googled the problem, but it is fairly specific, so maybe someone can help me.
I want to implement the workflow described at: https://api.immobilienscout24.de/api-docs/authentication/three-legged/#callback-url
To obtain the request_token (the first step of the authentication process), I tried the following approach.
The API credentials are stored in these two variables:
client_key
client_secret
The Python code looks as follows:
from requests_oauthlib import OAuth1Session
immoscout_api = OAuth1Session(client_key,
                              client_secret=client_secret)
request_token_url = 'http://rest.immobilienscout24.de/restapi/security/oauth/request_token'
fetch_response = immoscout_api.fetch_request_token(request_token_url)
I am getting an error in my Jupyter Notebook that looks like the following:
TokenRequestDenied: Token request failed with code 403, response was '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Bad request.
We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
<BR clear="all">
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: M_HHRf9VaNN9xFRqWlHWt2txfuIsBE5fe6siJACFUFjVWw20p91jLg==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>'.
Can somebody help me obtain the request token?

How do I limit the response data sent from the server?

I'm trying to make a lot of requests, but I only need part of the data near the start of each page's HTML. Right now I request the whole page every time, so looping over the pages uses a lot of network traffic. Can I request only a section of a page's HTML, with any module?
If you know how many bytes are enough, you can request a partial "range" of the resource:
curl -q http://www.example.com -i -H "Range: bytes=0-50"
HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Age: 506953
Cache-Control: max-age=604800
Content-Range: bytes 0-50/1256
...
Content-Length: 51
<!doctype html>
<html>
<head>
<title>Example Do%
See https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests
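The same range request can be sketched in Python with the standard library. The helper names `build_range_request` and `fetch_prefix` are made up for this illustration, and the 0-50 byte range simply mirrors the curl example. A server that does not support ranges will answer 200 with the full body instead of 206, so the sketch reports which of the two happened.

```python
import urllib.request


def build_range_request(url, first_byte=0, last_byte=50):
    """Build a GET request for bytes first_byte..last_byte (inclusive)."""
    if first_byte < 0 or last_byte < first_byte:
        raise ValueError("need 0 <= first_byte <= last_byte")
    headers = {"Range": f"bytes={first_byte}-{last_byte}"}
    return urllib.request.Request(url, headers=headers)


def fetch_prefix(url, last_byte=50, timeout=10):
    """Return (honoured, body); honoured is True only for a 206 reply."""
    request = build_range_request(url, 0, last_byte)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        honoured = response.status == 206
        # Cap the read so a server that ignores Range cannot send everything.
        body = response.read(last_byte + 1)
    return honoured, body


# Usage (needs network access):
# honoured, body = fetch_prefix("http://www.example.com", last_byte=50)
# print(honoured, len(body))
```

Capping the read at `last_byte + 1` bytes keeps the download small even when the Range header is ignored, although the server may still have started sending the full page.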

Cannot find the source of the data I need when crawling a website

I am writing a web crawler in Python. I ran into a problem while trying to find the source of the data I need.
The site I am crawling is https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League, and the data I want is shown below:
I can find this data by viewing the page source after the page has been fully loaded by Firefox:
DataStore.prime('standings', { stageId:15151, idx:0, field: 'overall'}, [[15151,32,'Manchester United',1,5,4,1,0,16,2,14,13,1,3,3,0,0,10,0,10,9,7,2,1,1,0,6,2,4,4,[[0,1190179,4,0,2,252,'England',2,'Premier League','2017/2018',32,29,'Manchester United','West Ham','Manchester United','West Ham',4,0,'w'] ......
I thought this data would be requested through AJAX, but I detected no such request in the web console.
Then I simulated the browser's behaviour (setting headers and cookies) and requested the HTML page, which returned:
<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05">
</script>
<script>
(function() {
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B7661722073746174757......";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body></html>
I created an .html file with the content above and opened it in Firefox, but the script does not seem to have executed. Now I don't know what to do. I need some help, thanks!
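The inline script in the returned page is not the data itself: it turns each pair of hex digits in `b` into a character and evals the result. As a rough illustration, the visible start of that string can be decoded offline in Python. The function name `decode_hex_payload` is invented here, and only the first bytes shown in the question are used, since the rest of the string is elided there.

```python
def decode_hex_payload(hex_text):
    """Decode a string of hex digit pairs the way the inline script does."""
    # Drop a dangling final digit, which appears when the string was cut off.
    usable = hex_text[: len(hex_text) - len(hex_text) % 2]
    # latin-1 maps every byte to the same code point, like String.fromCharCode.
    return bytes.fromhex(usable).decode("latin-1")


# First 12 byte pairs of the `b` variable quoted in the question.
visible_prefix = "7472797B766172207868723B"
print(decode_hex_payload(visible_prefix))  # prints: try{var xhr;
```

The decoded text is JavaScript, which suggests the response is a script-based challenge page rather than the standings markup, so a client that does not execute JavaScript would not see the DataStore.prime data in it.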

Python3: http.client with privoxy/TOR making bad requests

I'm trying to use TOR with http.client.HTTPConnection, but for some reason I keep getting weird responses from everything. I'm not really sure exactly how to explain, so here's an example of what I have:
import http.client
class Socket(http.client.HTTPConnection):
    def __init__(self, url):
        # Connect to the local privoxy instance and tunnel to the target host.
        super().__init__('127.0.0.1', 8118)
        super().set_tunnel(url)
        #super().__init__(url)
    def get(self, url = '/', params = {}):
        # util.params_to_query is a helper that is not shown in the question.
        params = util.params_to_query(params)
        if params:
            if url.find('?') == -1: url += '?' + params
            else: url += '&' + params
        self.request(
            'GET',
            url,
            '',
            {'Connection': 'Keep alive'}
        )
        return self.getresponse().read().decode('utf-8')
If I run this with:
sock = Socket('www.google.com')
print(sock.get())
I get:
<html><head><meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<title>301 Moved</title></head><body>
<h1>301 Moved</h1>
The document has moved
here.
</body></html>
Google is redirecting me to the url I just requested, except with the privoxy port. And it gets weirder - if I try https://check.torproject.org:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>Welcome to sergii!</title>
</head>
<body>
<h1>Welcome to sergii!</h1>
This is sergii, a system run by and for the Tor Project.
She does stuff.
What kind of stuff and who our kind sponsors are you might learn on
db.torproject.org.
<p>
</p><hr noshade=""/>
<font size="-1">torproject-admin</font>
</body>
</html>
If I don't try to use privoxy/TOR, I get exactly what your browser gets at http://www.google.com or http://check.torproject.org. I don't know what's going on here. I suspect the issue is with python because I can use TOR with firefox, but I don't really know.
Privoxy log reads:
2015-06-27 19:28:26.950 7f58f4ff9700 Request: www.google.com:80/
2015-06-27 19:30:40.360 7f58f4ff9700 Request: check.torproject.org:80/
TOR log has nothing useful to say.
This ended up being because I was connecting over http:// while those sites wanted https://. It works correctly for sites that accept plain http://.
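For sites that want https://, one option is to open the tunnel from an `http.client.HTTPSConnection`, so that the CONNECT request targets port 443 and TLS is negotiated through the proxy. This is only a sketch: the privoxy address 127.0.0.1:8118 is taken from the question, while the function name `open_https_via_proxy` and the 30 second timeout are assumptions.

```python
import http.client


def open_https_via_proxy(target_host, proxy_host="127.0.0.1", proxy_port=8118):
    """Prepare an HTTPS connection to target_host tunnelled through an HTTP proxy."""
    # The connection object points at the proxy; nothing is sent yet.
    conn = http.client.HTTPSConnection(proxy_host, proxy_port, timeout=30)
    # On first use, http.client sends "CONNECT target_host:443" to the proxy
    # and then performs the TLS handshake with target_host inside the tunnel.
    conn.set_tunnel(target_host, 443)
    return conn


# Usage (needs privoxy and Tor running locally):
# conn = open_https_via_proxy("check.torproject.org")
# conn.request("GET", "/")
# print(conn.getresponse().status)
# conn.close()
```

With the plain `HTTPConnection` used in the question, the tunnel defaults to port 80, which matches the `www.google.com:80/` entries in the privoxy log.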

Cannot fetch a web site with python urllib.urlopen() or any web browser other than Shiretoko

Here is the URL of the site I want to fetch
https://salami.parc.com/spartag/GetRepository?friend=jmankoff&keywords=antibiotic&option=jmankoff%27s+tags
When I fetch the site and display its contents with the following code:
sock = urllib.urlopen("https://salami.parc.com/spartag/GetRepository?friend=jmankoff&keywords=antibiotic&option=jmankoff's+tags")
html = sock.read()
sock.close()
soup = BeautifulSoup(html)
print soup.prettify()
I get the following output:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>
Error message
</title>
</head>
<body>
<h2>
Invalid input data
</h2>
</body>
</html>
I get the same result with urllib2 as well. Interestingly, this URL works only in the Shiretoko web browser v3.5.7 (by "works" I mean that it brings up the right page). When I feed this URL into Firefox 3.0.15 or Konqueror v4.2.2, I get exactly the same error page (with "Invalid input data"). I have no idea what causes this difference or how I can fetch this page using Python. Any ideas?
Thanks
If you look at the urllib2 documentation, it says:
urllib2.build_opener([handler, ...])
.....
If the Python installation has SSL support (i.e., if the ssl module can be imported), HTTPSHandler will also be added.
.....
You can try using urllib2 together with the ssl module. Alternatively, you can use httplib.
That's exactly what you get when you click on the link in a web browser. Maybe you are supposed to be logged in or have a cookie set or something.
I get the same message with Firefox 3.5.8 (Shiretoko) on Linux.
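One detail visible in the question is that the URL quoted first encodes the apostrophe as %27, while the string passed to urlopen contains a literal apostrophe. Whether that explains the "Invalid input data" page is not established in this thread, but building the query string with the standard library at least removes that difference. The sketch below uses Python 3's `urllib.parse` (the question uses Python 2), and the function name `build_repository_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://salami.parc.com/spartag/GetRepository"


def build_repository_url(friend, keywords, option):
    """Return the GetRepository URL with a percent-encoded query string."""
    # urlencode turns spaces into "+" and the apostrophe into "%27".
    query = urlencode({"friend": friend, "keywords": keywords, "option": option})
    return f"{BASE_URL}?{query}"


print(build_repository_url("jmankoff", "antibiotic", "jmankoff's tags"))
```

The printed URL matches the percent-encoded form given at the top of the question.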
