Get HTTP request message from Request object in Scrapy - python

I need to somehow extract plain HTTP request message from a Request object in Scrapy (so that I could, for example, copy/paste this request and run from Burp).
So given a scrapy.http.Request object, I would like to get the corresponding request message, for example:
POST /test/demo_form.php HTTP/1.1
Host: w3schools.com
name1=value1&name2=value2
Clearly I have all the information I need in the Request object, but reconstructing the message manually is error-prone, as I could miss some edge cases. My understanding is that Scrapy first converts this Request into a Twisted object, which then writes the headers and body into a TCP transport. So maybe there's a way to do something similar, but write to a string instead?
UPDATE
I could use the following code, which is based on http.py, to get an HTTP 1.0 request message. Is there a way to do something similar with HTTP 1.1 requests / http11.py, which is what's actually being sent? I would obviously like to avoid duplicating code from the Scrapy/Twisted frameworks as much as possible.
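from scrapy.core.downloader import webclient
from twisted.test.proto_helpers import StringTransport
# request is an existing scrapy.http.Request
factory = webclient.ScrapyHTTPClientFactory(request)
transport = StringTransport()
protocol = webclient.ScrapyHTTPPageGetter()
protocol.factory = factory
# makeConnection() triggers connectionMade(), which writes the request to the transport
protocol.makeConnection(transport)
request_message = transport.value()
print(request_message.decode("utf-8"))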

As scrapy is open source and also has plenty of extension points, this should be doable.
The requests are finally assembled and sent out in scrapy/core/downloader/handlers/http11.py in ScrapyAgent.download_request ( https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py#L270 )
If you place your hook there you can dump the request type, request headers, and request body.
To place your code there, you can either monkey patch ScrapyAgent.download_request, or subclass ScrapyAgent to do the request logging, subclass HTTP11DownloadHandler so that it uses your ScrapyAgent subclass, and then register that handler subclass for http / https requests under DOWNLOAD_HANDLERS in your project's settings.py (for details see: https://doc.scrapy.org/en/latest/topics/settings.html#download-handlers)
In my opinion this is the closest you can get to logging the requests going out without using a packet sniffer or a logging proxy (which might be a bit overkill for your scenario).
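Inside such a hook, the method, URL, headers and body of the request could be rendered as text roughly as follows. This is only a sketch using the standard library: format_request_message is a hypothetical helper, not part of Scrapy or Twisted, and its output approximates the message rather than reproducing the exact bytes Twisted writes (Twisted may add or reorder headers). In a real hook the arguments would come from request.method, request.url, request.headers and request.body.

```python
from urllib.parse import urlsplit


def format_request_message(method, url, headers, body=b""):
    """Approximate an HTTP/1.1 request message as bytes (hypothetical helper).

    headers is a list of (name, value) string pairs.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    lines = ["%s %s HTTP/1.1" % (method, target)]
    names = {name.lower() for name, _ in headers}
    # Add Host and Content-Length only if the caller did not supply them.
    if "host" not in names:
        lines.append("Host: %s" % parts.netloc)
    lines.extend("%s: %s" % (name, value) for name, value in headers)
    if body and "content-length" not in names:
        lines.append("Content-Length: %d" % len(body))

    # A blank line (CRLF CRLF) separates the header block from the body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("latin-1") + body


message = format_request_message(
    "POST",
    "http://w3schools.com/test/demo_form.php",
    [("Content-Type", "application/x-www-form-urlencoded")],
    b"name1=value1&name2=value2",
)
print(message.decode("latin-1"))
```

The result can be pasted into a tool such as Burp, but it should be treated as a starting point and compared against real traffic if exact fidelity matters.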

Related

Python http.client - What is the difference between request and putrequest?

The documentation I've found explaining http.client for Python seems a bit sparse. I want to use it over requests because requests has not worked for our project.
So, knowing that I'm using Python's http.client, I'm seeing again and again request and putrequest. Both methods are defined here under HTTPConnection.
HTTPConnection.request: This will send a request to the server using
the HTTP request method method and the selector url.
HTTPConnection.putrequest: This should be the first call after the
connection to the server has been made. It sends a line to the server
consisting of the method string, the url string, and the HTTP version
(HTTP/1.1). To disable automatic sending of Host: or Accept-Encoding:
headers (for example to accept additional content encodings), specify
skip_host or skip_accept_encoding with non-False values.
Also, the source code for both is defined in this file.
From what I've read, my guess is that request is a higher-level API than putrequest. Is that correct?
The Answer: request() is an abstracted version of multiple functions, putrequest() being one of them.
Although this is defined in the documentation, it's easy to skip over the line that answers this question.
This is pointed out in this line of the http.client documentation:
As an alternative to using the request() method described above, you can also send your request step by step, by using the four functions below.
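The relationship can be checked with a small experiment. The sketch below sends the same POST once through request() and once through the four step-by-step calls (putrequest, putheader, endheaders, send), then compares the bytes. RecordingSocket and capture are hypothetical helpers made up for this demo; assigning a stand-in object to conn.sock is a testing trick that relies on http.client only calling sendall on it, so no real connection is opened.

```python
from http.client import HTTPConnection


class RecordingSocket:
    """Stand-in socket (an assumption for this demo) that records bytes."""

    def __init__(self):
        self.sent = b""

    def sendall(self, data):
        self.sent += bytes(data)


def capture(send_request):
    """Run send_request against a connection whose socket only records output."""
    conn = HTTPConnection("example.com")
    conn.sock = RecordingSocket()  # pre-set socket, so connect() is never called
    send_request(conn)
    return conn.sock.sent


body = b"name1=value1&name2=value2"
content_type = "application/x-www-form-urlencoded"


def high_level(conn):
    # One call: request line, headers, Content-Length and body.
    conn.request("POST", "/test/demo_form.php", body=body,
                 headers={"Content-Type": content_type})


def step_by_step(conn):
    # The same request assembled by hand from the lower-level calls.
    conn.putrequest("POST", "/test/demo_form.php")
    conn.putheader("Content-Length", str(len(body)))
    conn.putheader("Content-Type", content_type)
    conn.endheaders()
    conn.send(body)


via_request = capture(high_level)
via_steps = capture(step_by_step)
print(via_request.decode("ascii"))
print("identical:", via_request == via_steps)
```

In other words, request() is a convenience wrapper; the step-by-step calls are mainly useful when you need control over individual headers, for example via skip_host or skip_accept_encoding.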

Documentation for Flask app object `get` and `post` class methods?

In the Flask documentation on testing (http://flask.pocoo.org/docs/testing/), it has a line of code
rv = self.app.get('/')
And below it, it mentions "By using self.app.get we can send an HTTP GET request to the application with the given path."
Where can the documentation be found for these direct access methods (I'm assuming that there's one for all of the restful methods)? Specifically, I'm wondering what sort of arguments they can take (for example, passing in data, headers, etc). Looking around on flask's documentation for a Flask object, it doesn't seem to list these methods, even though it uses them in the above example.
Alternatively, a knowledgeable individual could answer what I am trying to figure out: I'm trying to simulate sending a POST request to my server, as I would with the following line, if I were doing it over HTTP:
res = requests.post("http://localhost:%d/generate" % port,
                    data=json.dumps(payload),
                    headers={"content-type": "application/json"})
The above works when running a Flask app on the proper port. But I tried replacing it with the following:
res = self.app.post("/generate",
                    data=json.dumps(payload),
                    headers={"content-type": "application/json"})
And instead, the object I get in response is a 400 BAD REQUEST.
This is documented in the Werkzeug project, from which Flask gets the test client: Werkzeug's test client.
The test client does not issue HTTP requests, it dispatches requests internally, so there is no need to specify a port.
The documentation isn't very clear about support for JSON in the body, but it seems that if you pass a string and set the content type you should be fine, so I'm not sure why you get back a 400. I would check whether your /generate view function is invoked at all. A debugger should help you figure out where the 400 is coming from.
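For comparison, here is a minimal sketch of a JSON POST going through the test client successfully. The /generate view below is a made-up stand-in (the real view from the question is not shown), and the payload keys are arbitrary. It passes the body as a string and sets content_type, which is one of the keyword arguments the Werkzeug test client accepts.

```python
import json

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/generate", methods=["POST"])
def generate():
    # Hypothetical view: echo back the sorted keys of the JSON payload.
    payload = request.get_json()
    return jsonify(received=sorted(payload))


client = app.test_client()
res = client.post("/generate",
                  data=json.dumps({"size": 3, "seed": 7}),
                  content_type="application/json")
print(res.status_code, json.loads(res.data))
```

If a request like this works against a trivial view but the real view still returns 400, the rejection is most likely raised inside the view or by something it calls while reading the request.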

image download problem (python)

I am trying to download images with multiple threads in Python, with the number of concurrent threads capped at max_count.
Each time a download thread is started, I leave it alone and start another one. I want each download to finish within 5 seconds, meaning a download counts as failed if opening the URL takes longer than 5 seconds.
But how can I detect that and stop the failed thread?
Which version of Python are you using?
A code snippet would have helped too.
Since Python 2.6, urllib2.urlopen accepts a timeout argument.
Hope this helps. From the Python docs:
urllib2.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
Warning: HTTPS requests do not do any verification of the server’s certificate.
data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. urllib2 module sends HTTP/1.1 requests with Connection:close header included.
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.
This function returns a file-like object with two additional methods:
geturl(): return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info(): return the meta-information of the page, such as headers, in the form of an mimetools.Message instance (see Quick Reference to HTTP Headers)
Raises URLError on errors.
Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).
In addition, default installed ProxyHandler makes sure the requests are handled through the proxy when they are set.
Changed in version 2.6: timeout was added.
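Applied to the threaded download, the timeout means a stuck thread does not have to be stopped from outside: the blocked call raises an error and the thread finishes on its own. The sketch below uses urllib.request, the Python 3 successor of urllib2, and a thread pool capped at max_count workers. fetch and fetch_all are hypothetical helper names. Note that the timeout applies to each blocking socket operation (connecting, each read), so it is not a hard limit on the total download time.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=5.0):
    """Return the response body, or None if the download fails or times out."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return None


def fetch_all(urls, max_count=4, timeout=5.0):
    """Download urls with at most max_count threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_count) as pool:
        return list(pool.map(lambda url: fetch(url, timeout), urls))
```

A failed download shows up as None in the result list, so the caller can retry or skip it without tracking individual threads.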

PUT Variables Missing between Python and Tomcat

I'm trying to get a PUT request from Python into a servlet in Tomcat. The parameters are missing when I get into Tomcat.
The same code is happily working for POST requests, but not for PUT.
Here's the client:
import httplib
import urllib
lConnection = httplib.HTTPConnection('localhost:8080')
lHeaders = {"Content-type": "application/x-www-form-urlencoded",
            "Accept": "text/plain"}
lParams = {'Username':'usr', 'Password':'password', 'Forenames':'First','Surname':'Last'}
lConnection.request("PUT", "/my/url/", urllib.urlencode(lParams), lHeaders)
Once in the server, a request.getParameter("Username") is returning null.
Has anyone got any clues as to where I'm losing the parameters?
I tried your code and it seems that the parameters get to the server using that code. Tcpdump gives:
PUT /my/url/ HTTP/1.1
Host: localhost
Accept-Encoding: identity
Content-Length: 59
Content-type: application/x-www-form-urlencoded
Accept: text/plain
Username=usr&Password=password&Surname=Last&Forenames=First
So the request gets to the other side correctly; the problem must be either in the Tomcat configuration or in the code that reads the parameters.
I don't know what the Tomcat side of your code looks like, or how Tomcat processes and provides access to request parameters, but my guess is that Tomcat is not "automagically" parsing the body of your PUT request into nice request parameters for you.
I ran into the exact same problem using the built-in webapp framework (in Python) on App Engine. It did not parse the body of my PUT requests into request parameters available via self.request.get('param'), even though they were coming in as application/x-www-form-urlencoded.
You'll have to check on the Tomcat side to confirm this, though. You may end up having to access the body of the PUT request and parse out the parameters yourself.
Whether or not your web framework should be expected to automagically parse out application/x-www-form-urlencoded parameters in PUT requests (like it does with POST requests) is debatable.
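Parsing the body by hand is short in most languages. As an illustration of the idea in Python (the Tomcat side would do the equivalent in Java), the sketch below decodes the exact body captured by tcpdump above. parse_form_body is a hypothetical helper name, and UTF-8 is an assumed charset.

```python
from urllib.parse import parse_qs


def parse_form_body(body, charset="utf-8"):
    """Decode an application/x-www-form-urlencoded body into a dict."""
    pairs = parse_qs(body.decode(charset), keep_blank_values=True)
    # parse_qs returns a list of values per key; keep the first for brevity.
    return {name: values[0] for name, values in pairs.items()}


params = parse_form_body(
    b"Username=usr&Password=password&Surname=Last&Forenames=First"
)
print(params["Username"])
```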
I'm guessing here, but I think the problem is that PUT isn't meant to be used that way. The intent of PUT is to store a single entity, contained in the request, at the resource named by the request URI. What's all this stuff about user name and stuff?
Your Content-Type is application/x-www-form-urlencoded, which is a bunch of field contents. What PUT wants is something like an encoded file. You know, a single bunch of data it can store somewhere.

SOAP, Python, suds

Please recommend a library for working with SOAP in Python.
Right now I'm trying to use "suds", and I can't understand how to get the HTTP headers from the server reply.
Code example:
from suds.client import Client
url = "http://10.1.0.36/money_trans/api3.wsdl"
client = Client(url)
login_res = client.service.Login("login", "password")
The variable "login_res" contains the XML answer but not the HTTP headers, and I need to get the session id from them.
I think you actually want to look in the Cookie Jar for that.
client = Client(url)
login_res = client.service.Login("login", "password")
# look through the transport's cookie jar for a session cookie
for c in client.options.transport.cookiejar:
    if "sess" in str(c).lower():
        print("Session cookie: %s" % c)
I'm not sure. I'm still a SUDS noob, myself. But this is what my gut tells me.
The response from Ishpeck is on the right path. I just wanted to add a few things about the Suds internals.
The suds client is a big fat abstraction layer on top of a urllib2 HTTP opener. The HTTP client, cookiejar, headers, request and responses are all stored in the transport object. The problem is that none of this activity is cached or stored inside of the transport other than, maybe, the cookies within the cookiejar, and even tracking these can sometimes be problematic.
If you want to see what's going on when debugging, my suggestion would be to add this to your code:
import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('suds.client').setLevel(logging.DEBUG)
logging.getLogger('suds.transport').setLevel(logging.DEBUG)
Suds makes use of the native logging module and so by turning on debug logging, you get to see all of the activity being performed underneath including headers, variables, payload, URLs, etc. This has saved me tons of times.
Outside of that, if you really need to definitively track state on your headers, you're going to need to create a custom subclass of a suds.transport.http.HttpTransport object and overload some of the default behavior and then pass that to the Client constructor.
Here is a super-over-simplified example:
from suds.transport.http import HttpTransport, Reply, TransportError
from suds.client import Client
class MyTransport(HttpTransport):
    # custom stuff done here
    pass
mytransport_instance = MyTransport()
myclient = Client(url, transport=mytransport_instance)
I think the Suds library has poor documentation, so I recommend using Zeep instead. It's a SOAP client library for Python. Its documentation isn't perfect, but it is much clearer than the Suds docs.
