I want to get the info about a webpage using curl, but from within Python. So far I have this:
os.system("curl --head www.google.com")
If I run that, it prints out:
HTTP/1.1 200 OK
Date: Sun, 15 Apr 2012 00:50:13 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=3e39ad65c9fa03f3:FF=0:TM=1334451013:LM=1334451013:S=IyFnmKZh0Ck4xfJ4; expires=Tue, 15-Apr-2014 00:50:13 GMT; path=/; domain=.google.com
Set-Cookie: NID=58=Giz8e5-6p4cDNmx9j9QLwCbqhRksc907LDDO6WYeeV-hRbugTLTLvyjswf6Vk1xd6FPAGi8VOPaJVXm14TBm-0Seu1_331zS6gPHfFp4u4rRkXtSR9Un0hg-smEqByZO; expires=Mon, 15-Oct-2012 00:50:13 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
What I want to do is match the 200 in it using a regex (I don't need help with that), but I can't find a way to get all the text above into a string. How do I do that?
I tried info = os.system("curl --head www.google.com"), but info was just 0.
For some reason... I need to use curl (no pycurl, httplib2, ...); maybe this can help somebody:
import os
result = os.popen("curl http://google.es").read()
print result
Try this, using subprocess.Popen():
import subprocess
proc = subprocess.Popen(["curl", "--head", "www.google.com"], stdout=subprocess.PIPE)
(out, err) = proc.communicate()
print out
As stated in the documentation:
The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several other, older modules and functions, such as:
os.system
os.spawn*
os.popen*
popen2.*
commands.*
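If you are on Python 2.7 or newer, subprocess.check_output is an even shorter way to capture the same output; a minimal sketch (not part of the quoted documentation):
import subprocess

# Runs curl and returns its stdout as a string; raises CalledProcessError on a non-zero exit code.
out = subprocess.check_output(["curl", "--head", "www.google.com"])
print out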
import os
cmd = 'curl https://randomuser.me/api/'
os.system(cmd)
Result
{"results":[{"gender":"male","name":{"title":"mr","first":"çetin","last":"nebioğlu"},"location":{"street":"5919 abanoz sk","city":"adana","state":"kayseri","postcode":53537},"email":"çetin.nebioğlu#example.com","login":{"username":"heavyleopard188","password":"forgot","salt":"91TJOXWX","md5":"2b1124732ed2716af7d87ff3b140d178","sha1":"cb13fddef0e2ce14fa08a1731b66f5a603e32abe","sha256":"cbc252db886cc20e13f1fe000af1762be9f05e4f6372c289f993b89f1013a68c"},"dob":"1977-05-10 18:26:56","registered":"2009-09-08 15:57:32","phone":"(518)-816-4122","cell":"(605)-165-1900","id":{"name":"","value":null},"picture":{"large":"https://randomuser.me/api/portraits/men/38.jpg","medium":"https://randomuser.me/api/portraits/med/men/38.jpg","thumbnail":"https://randomuser.me/api/portraits/thumb/men/38.jpg"},"nat":"TR"}],"info":{"seed":"0b38b702ef718e83","results":1,"page":1,"version":"1.1"}}
You could use an HTTP client library in Python instead of shelling out to the curl command. In fact, there is a curl library that you can install (as long as you have a compiler on your OS).
Other choices are httplib2 (recommended), which is a fairly complete HTTP client that also supports caching, plain httplib, or a library named Requests.
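For example, a minimal sketch with httplib2 (assuming it is installed, e.g. via pip install httplib2; the URL is just an example):
import httplib2

h = httplib2.Http()
# A HEAD request returns the response headers and an empty body.
resp, content = h.request("http://www.google.com", "HEAD")
print resp.status  # e.g. 200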
If you really, really want to just run the curl command and capture its output, then you can do this with Popen in the built-in subprocess module, documented here: http://docs.python.org/library/subprocess.html
Well, there is an easier-to-read but messier way to do it. Here it is:
import os
outfile = ''  # put your file path here
os.system("curl --head www.google.com >> {x}".format(x=str(outfile)))  # Appends the command output to the log file (and creates it if it doesn't exist).
readOut = open("{z}".format(z=str(outfile)), "r")  # Opens the file in read mode.
for line in readOut:
    print line  # Prints each line of the file
readOut.close()  # Closes the file
os.system("del {c}".format(c=str(outfile)))  # This is optional; it just deletes the log file after use.
This should work properly for your needs. :)
Try this:
import httplib
conn = httplib.HTTPConnection("www.python.org")
conn.request("GET", "/index.html")
r1 = conn.getresponse()
print r1.status, r1.reason
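Since you are after the headers rather than the body, a small follow-up on the same response object (the header name below is just an example):
# The HTTPResponse object also exposes the headers directly.
print r1.getheaders()               # list of (name, value) tuples
print r1.getheader('content-type')  # a single header, or None if it is absent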
I am trying to write a client for some web app. At one point it sends the following Set-Cookie header:
JSESSIONID=1BDC39CBF91C299C3330963D1EEFE399; Path=; HttpOnly; Secure, JSESSIONID=E6FFF3B159AFB9575D47662FC70DC161; Path=/; Secure; HttpOnly, XSESSIONID=b163abe6-bd6c-4381-9f68-01eeaee15a6c; Path=/; Secure; HttpOnly
It is unclear to me what the meaning of having the same cookie name twice is, and what I am supposed to send back to the server.
I checked it with the following python code:
import sys
if sys.version_info.major == 2:
    from Cookie import SimpleCookie
else:
    from http.cookies import SimpleCookie
set_cookie = "JSESSIONID=1BDC39CBF91C299C3330963D1EEFE399; Path=; HttpOnly; Secure, JSESSIONID=E6FFF3B159AFB9575D47662FC70DC161; Path=/; Secure; HttpOnly, XSESSIONID=b163abe6-bd6c-4381-9f68-01eeaee15a6c; Path=/; Secure; HttpOnly"
cookies = SimpleCookie()
cookies.load(set_cookie)
for name in cookies.keys():
    print("{} = {}".format(name, cookies[name].value))
    for field in ['secure', 'httponly', 'path']:
        print(" {}: {}".format(field, cookies[name][field]))
The Python 2 code shows only one of the duplicate keys. The Python 3 version does not recognize the string at all.
$ /usr/bin/python cookie.py
XSESSIONID = b163abe6-bd6c-4381-9f68-01eeaee15a6c
secure: True
httponly: True
path: /
JSESSIONID = E6FFF3B159AFB9575D47662FC70DC161
secure: True
httponly: True
path: /
$ python3 cookie.py
So I would like to understand the meaning of having the same key twice, and what should be sent back to the server.
It would also be nice to understand why the Python 3 library disregards the whole string. What do I need to do to fix it?
I have a strange problem I've been trying to Google for several hours.
I've also tried solutions from similar topics on Stack Overflow, but still with no positive result:
How do I set cookies using Python urlopen?
Handling rss redirects with Python/urllib2
So the case is that I want to download a whole set of articles from some webpage. Its sub-links with the proper content differ by just one number, so I loop over the whole range (1 to 400,000) and write the HTML to files. What's important here is that this webpage needs the cookies to be re-sent in order to reach the proper URL, and after reading How to use Python to login to a webpage and retrieve cookies for later usage? I have this done.
But sometimes my script returns an error:
response = meth(req, response)
File "/usr/lib/python3.1/urllib/request.py", line 468, in http_response
'http', request, response, code, msg, hdrs)
....
File "/usr/lib/python3.1/urllib/request.py", line 553, in http_error_302 self.inf_msg + msg, headers, fp)
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
This problem is hard to reproduce because the script generally works fine, but it happens randomly after a few thousand iterations of the for loop.
Here is the curl output from the server:
$ curl -I "http://my.url/"
HTTP/1.1 200 OK
Date: Wed, 17 Oct 2012 10:14:13 GMT
Server: Apache/2.2.15 (Oracle)
X-Powered-By: PHP/5.3.3
Set-Cookie: Kuuxk=ae7s3isu2cEshhijte4nb1clk5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=UTF-8
Some folks suggested using mechanize or trying to catch the exception, but I have no clue how to do this; others said that the error is caused by wrong cookie handling, but I also tried to get and send cookies 'manually' using urllib2 and add_header('cookie', cookie), with a similar result.
I wonder if my for loop and maybe a too-short sleep might cause the script to fail sometimes.
Anyway, any help is appreciated.
edit:
In case this might work: how do I catch the exception and just ignore it?
edit:
Solved by simply ignoring this error. Now everything works fine.
I used:
try:
    # here open the url
    ...
except urllib.error.HTTPError:
    pass
around every place where I call urlopen.
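Put together, a rough sketch of what the whole loop might look like (Python 3 urllib, with a cookie-aware opener and a short pause between requests; the URL pattern and file names are hypothetical):
import time
import urllib.error
import urllib.request
from http.cookiejar import CookieJar

# A cookie-aware opener, so the session cookie is re-sent on every request.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(CookieJar()))

for article_id in range(1, 400001):
    url = "http://my.url/article/{0}".format(article_id)  # hypothetical URL pattern
    try:
        html = opener.open(url).read()
        with open("article_{0}.html".format(article_id), "wb") as f:
            f.write(html)
    except urllib.error.HTTPError:
        pass          # ignore the occasional bogus 302, as described above
    time.sleep(0.5)   # short pause between requests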
TO BE CLOSED.
Let me suggest another solution:
HTTP status code 302 means Found redirection (See: https://en.wikipedia.org/wiki/HTTP_302).
For example:
HTTP/1.1 302 Found
Location: http://www.iana.org/domains/example/
You can grab the Location header and try fetching this url.
There are 8 redirection status codes (301-308). You can check for the Location header whenever 301 <= status code <= 308.
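A minimal sketch of that idea with Python 3's urllib (the helper name is mine; it assumes the bogus redirect surfaces as an HTTPError, as in the traceback above):
import urllib.error
import urllib.parse
import urllib.request

def fetch_following_location(url):
    # Fetch url; if a 3xx comes back as an HTTPError, retry once via its Location header.
    try:
        return urllib.request.urlopen(url)
    except urllib.error.HTTPError as err:
        if 301 <= err.code <= 308:
            location = err.info().get('Location')
            if location:
                # Resolve a relative Location against the original URL.
                return urllib.request.urlopen(urllib.parse.urljoin(url, location))
        raise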
I would like an automated way to test how my app handles email, with attachments.
Firstly I modified my app (on App Engine) to log the contents of the request body for a received message (as sent through appspotmail). I copied these contents into a file called test_mail.txt
I figured I could post this file to imitate the inbound mail tester, something like so:
curl --header "Content-Type:message/rfc822" -X POST -d @test_mail.txt http://localhost:8080/_ah/mail/test@example.com
Whenever I do this, the message isn't properly instantiated, and I get an exception when I refer to any of the standard attributes.
Am I missing something in how I am using curl?
I run into the same problem using a simpler email, as posted by _ah/admin/inboundmail
MIME-Version: 1.0
Date: Wed, 25 Apr 2012 15:50:06 +1000
From: test@example.com
To: test@example.com
Subject: Hello
Content-Type: multipart/alternative; boundary=cRtRRiD-6434410
--cRtRRiD-6434410
Content-Type: text/plain; charset=UTF-8
There
--cRtRRiD-6434410
Content-Type: text/html; charset=UTF-8
There
--cRtRRiD-6434410--
Try --data-binary instead of -d as the flag for the input file. When I tried with your flags, it looked like curl stripped the carriage returns out of the input file, which meant the MIME parser choked on the POST data.
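That is, something like (same test file and local URL as in the question):
curl --header "Content-Type:message/rfc822" -X POST --data-binary @test_mail.txt http://localhost:8080/_ah/mail/test@example.com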
Black,
I noticed your program is written in Python. Why not use Twisted to create a tiny SMTP client?
Here are a few examples:
http://twistedmatrix.com/documents/current/mail/tutorial/smtpclient/smtpclient.html
I just want a better idea of what's going on here; I can of course "work around" the problem by using urllib2.
import urllib
import urllib2
url = "http://www.crutchfield.com/S-pqvJFyfA8KG/p_15410415/Dynamat-10415-Xtreme-Speaker-Kit.html"
# urllib2 works fine (foo.headers / foo.read() also behave)
foo = urllib2.urlopen(url)
# urllib throws errors though, what specifically is causing this?
bar = urllib.urlopen(url)
http://pae.st/AxDW/ shows this code in action with the exception/stacktrace. foo.headers and foo.read() work fine.
stu#sente.cc ~ $: curl -I "http://www.crutchfield.com/S-pqvJFyfA8KG/p_15410415/Dynamat-10415-Xtreme-Speaker-Kit.html"
HTTP/1.1 302 Object Moved
Cache-Control: private
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
Location: /S-FSTWJcduy5w/p_15410415/Dynamat-10415-Xtreme-Speaker-Kit.html
Server: Microsoft-IIS/7.5
Set-Cookie: SESSIONID=FSTWJcduy5w; domain=.crutchfield.com; expires=Fri, 22-Feb-2013 22:06:43 GMT; path=/
Set-Cookie: SYSTEMID=0; domain=.crutchfield.com; expires=Fri, 22-Feb-2013 22:06:43 GMT; path=/
Set-Cookie: SESSIONDATE=02/23/2012 17:07:00; domain=.crutchfield.com; expires=Fri, 22-Feb-2013 22:06:43 GMT; path=/
X-AspNet-Version: 4.0.30319
HostName: cws105
Date: Thu, 23 Feb 2012 22:06:43 GMT
Thanks.
This server is both non-deterministic and sensitive to HTTP version. urllib2 is HTTP/1.1, urllib is HTTP/1.0. You can reproduce this by running curl --http1.0 -I "http://www.crutchfield.com/S-pqvJFyfA8KG/p_15410415/Dynamat-10415-Xtreme-Speaker-Kit.html"
a few times in a row. You should see the output curl: (52) Empty reply from server occasionally; that's the error urllib is reporting. (If you re-issue the request a bunch of times with urllib, it should succeed sometimes.)
I solved the problem. I am simply using urllib now instead of urllib2, and everything works fine. Thank you all :)
I know how to use urllib to download a file. However, it's much faster, if the server allows it, to download several parts of the same file simultaneously and then merge them.
How do you do that in Python? If you can't do it easily with the standard lib, is there any lib that would let you do it?
Although I agree with Gregory's suggestion of using an existing library, it's worth noting that you can do this by using the Range HTTP header. If the server accepts byte-range requests, you can start several threads to download multiple parts of the file in parallel. This snippet, for example, will only download bytes 0..65535 of the specified file:
import urllib2
url = 'http://example.com/test.zip'
req = urllib2.Request(url, headers={'Range':'bytes=0-65535'})
data = urllib2.urlopen(req).read()
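To actually download several parts in parallel, a rough sketch along the same lines (hypothetical URL; the total size would normally come from the Content-Length of a HEAD request, shown next; no error handling):
import threading
import urllib2

url = 'http://example.com/test.zip'  # same example URL as above
part_size = 65536
total_size = 16542                   # in practice, read this from the Content-Length header
parts = {}

def fetch_part(index, start, end):
    req = urllib2.Request(url, headers={'Range': 'bytes=%d-%d' % (start, end)})
    parts[index] = urllib2.urlopen(req).read()

threads = []
for i, start in enumerate(range(0, total_size, part_size)):
    end = min(start + part_size - 1, total_size - 1)
    t = threading.Thread(target=fetch_part, args=(i, start, end))
    t.start()
    threads.append(t)
for t in threads:
    t.join()

data = ''.join(parts[i] for i in range(len(parts)))  # merge the ranges in order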
You can determine the remote resource size and see whether the server supports ranged requests by sending a HEAD request:
import urllib2
class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

url = 'http://sstatic.net/stackoverflow/img/sprites.png'
req = HeadRequest(url)
response = urllib2.urlopen(req)
response.close()
print response.headers
The above prints:
Cache-Control: max-age=604800
Content-Length: 16542
Content-Type: image/png
Last-Modified: Thu, 10 Mar 2011 06:13:43 GMT
Accept-Ranges: bytes
ETag: "c434b24eeadecb1:0"
Date: Mon, 14 Mar 2011 16:08:02 GMT
Connection: close
From that we can see that the file size is 16542 bytes ('Content-Length') and the server supports byte ranges ('Accept-Ranges: bytes').
PycURL can do it. PycURL is a Python interface to libcurl. It can be used to fetch objects identified by a URL from a Python program, similar to the urllib Python module. PycURL is mature, very fast, and supports a lot of features.
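For instance, a minimal sketch of the same ranged download with PycURL (assuming pycurl is installed and the server honours byte-range requests; the URL is just an example):
import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com/test.zip')
c.setopt(pycurl.RANGE, '0-65535')          # first 64 KiB only
c.setopt(pycurl.WRITEFUNCTION, buf.write)  # collect the body in memory
c.perform()
c.close()
data = buf.getvalue()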