Using urllib2 to fetch plain text, result isn't full - python

I'm writing a python script to parse jenkins job results. I'm using urllib2 to fetch consoleText, but the file that I receive isn't full. The code to fetch the file is:
data = urllib2.urlopen('http://<server>/job/<jobname>/<buildid>/consoleText')
lines = data.readlines()
And the number of lines I get is 2306, while the actual number of lines in the console log is 37521. I can check that by fetching the file via wget:
$ wget 'http://<server>/job/<jobname>/<buildid>/consoleText'
$ wc -l consoleText
37521
Why does urlopen not give me the full result?
UPDATE:
Using requests (as suggested by @svrist) instead of urllib2 doesn't have this problem, so I'm switching to it. My new code is:
data = requests.get('http://<server>/job/<jobname>/<buildid>/consoleText')
lines = [l for l in data.iter_lines()]
But I still have no idea why urllib2.urlopen doesn't work properly.

The Jenkins build log is returned using a chunked encoding response.
Transfer-Encoding: chunked
Based on a couple of other questions, it seems that urllib2 does not handle the entire response and, as you've observed, only returns the first chunk.
I also recommend using the requests package.
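As a minimal sketch (reusing the placeholder URL from the question), you can confirm the chunked encoding and read the full log with requests:
import requests

url = 'http://<server>/job/<jobname>/<buildid>/consoleText'
resp = requests.get(url, stream=True)
# Jenkins typically answers console-log requests with Transfer-Encoding: chunked.
print resp.headers.get('Transfer-Encoding')
# iter_lines() reassembles the chunks, so the whole log comes back.
lines = list(resp.iter_lines())
print len(lines)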

Related

Python 2.7 requests not reading complete body

I am trying to read the body of a web service call with the requests module in Python 2.7. The body of the result is CSV data.
The request is just a GET. I can run the url in Chrome and it returns a complete response of 200 rows. The url looks like:
http://localhost:8080/summary-report?endTime=9999999&startTime=0
When I try to get the data from Python, it retrieves only the first row of data. I look at the response header and see the length of the response is only 122 bytes, about the length of the first row. I have run it a number of times from Python and it is consistent in returning only the first row.
The code is only:
r = requests.get(url)
ans = r.content
Could it be some issue with the newline characters on each line of the CSV file?
Or because I am using localhost in the url?
I also tried it with 127.0.0.1 but see a similar behavior.
Possibly the ampersand is an issue in the URL?
You can't get dynamically generated content with requests.get() alone.
You need a JavaScript engine to parse and run the JavaScript code inside the page. Some options can be found below:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/

Python urllib download only some of a webpage?

I have a program where I need to open many webpages and download information in them. The information, however, is in the middle of the page, and it takes a long time to get to it. Is there a way to have urllib only retrieve x lines? Or, if nothing else, don't load the information afterwards?
I'm using Python 2.7.1 on Mac OS 10.8.2.
The returned object is a file-like object, and you can use .readline() to only read a partial response:
import urllib

resp = urllib.urlopen(url)
for i in range(10):
    line = resp.readline()
would read only 10 lines, for example. Note that this won't guarantee a faster response.
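If the interesting part is near the top of the page, another option (a sketch, assuming url is the page you are fetching and 64 KB is enough to cover it) is to read a fixed number of bytes, since the returned object also supports partial read():
import urllib

resp = urllib.urlopen(url)
# Read only the first 64 KB of the body; the remainder is left unread on the connection.
head = resp.read(64 * 1024)
resp.close()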

Python Urllib2 Reading only part of document

OK, this is driving me nuts.
I am trying to read from the Crunchbase API using Python's Urllib2 library. Relevant code:
api_url="http://api.crunchbase.com/v/1/financial-organization/venrock.js"
len(urllib2.urlopen(api_url).read())
The result is either 73493 or 69397. The actual length of the document is much longer. When I try this on a different computer, the length is either 44821 or 40725. I've tried changing the user-agent, using Urllib, increasing the time-out to a very large number, and reading small chunks at a time. Always the same result.
I assumed it was a server problem, but my browser reads the whole thing.
Python 2.7.2, OS X 10.6.8 for the ~40k lengths. Python 2.7.1 running as iPython for the ~70k lengths, OS X 10.7.3. Thoughts?
There is something kooky with that server. It might work if you, like your browser, request the file with gzip encoding. Here is some code that should do the trick:
import urllib2, gzip
api_url='http://api.crunchbase.com/v/1/financial-organization/venrock.js'
req = urllib2.Request(api_url)
req.add_header('Accept-encoding', 'gzip')
resp = urllib2.urlopen(req)
data = resp.read()
>>> print len(data)
26610
The problem then is to decompress the data.
from StringIO import StringIO
if resp.info().get('Content-Encoding') == 'gzip':
    g = gzip.GzipFile(fileobj=StringIO(data))
    data = g.read()
>>> print len(data)
183159
I'm not sure if this is a valid answer since it's a different module entirely, but using the requests module, I get a ~183k response:
import requests
url = r'http://api.crunchbase.com/v/1/financial-organization/venrock.js'
r = requests.get(url)
print len(r.text)
>>>183159
So if it's not too late into the project, check it out here: http://docs.python-requests.org/en/latest/index.html
edit: Using the code you provided, I also get a len of ~36k
Did a quick search and found this: urllib2 not retrieving entire HTTP response

Download file using partial download (HTTP)

Is there a way to download a huge and still-growing file over HTTP using the partial-download feature?
It seems that this code downloads the file from scratch every time it is executed:
import urllib
urllib.urlretrieve ("http://www.example.com/huge-growing-file", "huge-growing-file")
I'd like:
To fetch just the newly-written data
To download from scratch only if the source file becomes smaller (for example, it has been rotated).
It is possible to do a partial download using the Range header; the following will request a selected range of bytes:
req = urllib2.Request('http://www.python.org/')
req.headers['Range'] = 'bytes=%s-%s' % (start, end)
f = urllib2.urlopen(req)
For example:
>>> req = urllib2.Request('http://www.python.org/')
>>> req.headers['Range'] = 'bytes=%s-%s' % (100, 150)
>>> f = urllib2.urlopen(req)
>>> f.read()
'l1-transitional.dtd">\n\n\n<html xmlns="http://www.w3.'
Using this header you can resume partial downloads. In your case all you have to do is to keep track of already downloaded size and request a new range.
Keep in mind that the server needs to support this header for this to work.
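A minimal sketch of that idea (the function name is hypothetical; it assumes the server honors Range requests and that the remote file only grows):
import os
import urllib2

def fetch_new_data(url, local_path):
    # How much we already have on disk from previous runs.
    already = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    req = urllib2.Request(url)
    # Ask only for the bytes written since the last run.
    req.headers['Range'] = 'bytes=%d-' % already
    # If nothing new has been written, some servers answer 416 and urllib2
    # raises an HTTPError; wrap this call in try/except if that matters.
    resp = urllib2.urlopen(req)
    with open(local_path, 'ab') as f:
        f.write(resp.read())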
This is quite easy to do using TCP sockets and raw HTTP. The relevant request header is "Range".
An example request might look like:
import socket

mysock = socket.create_connection(("www.example.com", 80))
mysock.sendall(
    "GET /huge-growing-file HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Range: bytes=XXXX-\r\n"
    "Connection: close\r\n\r\n")
Where XXXX represents the number of bytes you've already retrieved. Then you can read the response headers and any content from the server. If the server returns a header like:
Content-Length: 0
You know you've got the entire file.
If you want to be particularly nice as an HTTP client you can look into "Connection: keep-alive". Perhaps there is a python library that does everything I have described (perhaps even urllib2 does it!) but I'm not familiar with one.
If I understand your question correctly, the file is not changing during download, but is updated regularly. If that is the question, rsync is the answer.
If the file is being updated continually including during download, you'll need to modify rsync or a bittorrent program. They split files into separate chunks and download or update the chunks independently. When you get to the end of the file from the first iteration, repeat to get the appended chunk; continue as necessary. With less efficiency, one could just repeatedly rsync.

Python urllib2 file upload problems

I'm currently trying to initiate a file upload with urllib2 and the urllib2_file library. Here's my code:
import sys
import urllib2_file
import urllib2
URL='http://aquate.us/upload.php'
d = [('uploaded', open(sys.argv[1:]))]
req = urllib2.Request(URL, d)
u = urllib2.urlopen(req)
print u.read()
I've placed this .py file in my My Documents directory and placed a shortcut to it in my Send To folder (the shortcut URL is ).
When I right click a file, choose Send To, and select Aquate (my python), it opens a command prompt for a split second and then closes it. Nothing gets uploaded.
I knew there was probably an error going on, so I typed the code into command-line Python, line by line.
When I ran the u=urllib2.urlopen(req) line, I didn't get an error;
instead, the cursor simply started blinking on a new line beneath that line. I waited a couple of minutes to see if something would happen but it just stayed like that. To get it to stop, I had to press ctrl+break.
What's up with this script?
Thanks in advance!
[Edit]
Forgot to mention -- when I ran the script without the request data (the file) it ran like a charm. Is it a problem with urllib2_file?
[edit 2]:
import MultipartPostHandler, urllib2, cookielib,sys
import win32clipboard as w
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),MultipartPostHandler.MultipartPostHandler)
params = {"uploaded" : open("c:/cfoot.js") }
a=opener.open("http://www.aquate.us/upload.php", params)
text = a.read()
w.OpenClipboard()
w.EmptyClipboard()
w.SetClipboardText(text)
w.CloseClipboard()
That code works like a charm if you run it through the command line.
If you're using Python 2.5 or newer, urllib2_file is both unnecessary and unsupported, so check which version you're using (and perhaps upgrade).
If you're using Python 2.3 or 2.4 (the only versions supported by urllib2_file), try running the sample code and see if you have the same problem. If so, there is likely something wrong either with your Python or urllib2_file installation.
EDIT:
Also, you don't seem to be using either of urllib2_file's two supported formats for POST data. Try using one of the following two lines instead:
d = ['uploaded', open(sys.argv[1])]
## --OR-- ##
d = {'uploaded': open(sys.argv[1])}
First, there's a third way to run Python programs.
From cmd.exe, type python myprogram.py. You get a nice log. You don't have to type stuff one line at a time.
Second, check the urllib2 documentation. You'll also need to look at urllib.
A Request requires a URL and a urlencoded buffer of data. From the documentation:
data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
You need to encode your data.
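As a minimal sketch of what the documentation describes (the field name and value here are illustrative, and note that a urlencoded POST sends text fields; it does not perform a multipart file upload):
import urllib
import urllib2

# Encode the form fields into application/x-www-form-urlencoded format.
params = urllib.urlencode({'uploaded': 'some value'})
# Passing a data argument makes urllib2 issue a POST instead of a GET.
req = urllib2.Request('http://aquate.us/upload.php', params)
resp = urllib2.urlopen(req)
print resp.read()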
If you're still on Python2.5, what worked for me was to download the code here:
http://peerit.blogspot.com/2007/07/multipartposthandler-doesnt-work-for.html
and save it as MultipartPostHandler.py
then use:
import urllib2, MultipartPostHandler
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
or if you need cookies:
import urllib2, MultipartPostHandler, cookielib
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
